It is important to know the integrity of your data and how it changes over time as sudden changes or deviations in aggregated values often indicate errors in data. Detecting such errors can be done by predicting a query result based on its historical values using linear regression. The predicted result is then compared to the actual query result within an error threshold. These predictions allow us to detect deviations in aggregated values such as counts, averages and sums. Whether the data has errors in the source database, is loaded poorly or has erroneous transformation rules – using a data prediction engine allows to ease this process by finding anomalies no matter where the data pipeline leaked.
For example, we can detect:
- Errors in daily data loads to data warehouse by analyzing row counts
- Faulty joins in user view logic by detecting sudden changes in row counts
- Missing records by comparing sum of invoices to previous months
Let’s say we want to test a staging area table to detect changes in the daily data loads. We can expect that:
- the table has more than 0 rows
- the row count growth is linear
- the changes do not vary more than ±5%
We can start testing by scheduling a simple COUNT(*) test with expected result of >0. Historical data can now be collected by saving test results with test execution timestamps.
SELECT COUNT(*) FROM <object>
Executing such a test on workdays at 08:30 would provide us with a history:
Timestamp |
Query result |
2021-02-15 08:30:00 |
11739897 |
2021-02-12 08:30:00 |
11723298 |
2021-02-11 08:30:00 |
11707545 |
2021-02-10 08:30:00 |
11692376 |
2021-02-09 08:30:00 |
11676459 |
2021-02-08 08:30:00 |
11659250 |
2021-02-05 08:30:00 |
11641212 |
2021-02-04 08:30:00 |
11625717 |
The math
We can use linear regression to predict the next result based on current timestamp:
[1]
where y = predicted result, α = intercept, β = slope and x = current timestamp in milliseconds.
The slope can be found by first creating two lists: x and y, where x contains timestamps transformed into milliseconds and y contains corresponding query results. The slope is then calculated by using formula:
[1]
where x̄ = average x and ȳ = average y.
The intercept can be calculated by multiplying average timestamp by slope and subracting it from average query result:
[1]
The results
Based on the current history, the latest result (11739897) would be within 0.08% of calculated prediction.
Let’s say there are changes in the source database which causes some of the data not to be loaded into the staging area table. This causes ~700k rows of the expected 11.7M to be missing and not available for reporting. Using linear regression we can predict an error of -6% which exceeds our previously set threshold of ±5%, allowing us to find and fix the error before users start using the data.
By automating the test execution process, prediction calculation and alerting responsible users, we can detect deviations in data whenever data is expected to change. Using LiTech DQM makes it simple to set up and manage. Learn about other features and request a free 30 day trial from
here.
10. March 2021