Data Quality Management in three steps
It is estimated that up to 50% of a data practitioner’s time is spent on data quality issues [1] and that an average of 25% of a company’s revenue is lost because of poor data [2]. So the problem with testing data quality is not really the “why” but the “how”: how does one test the quality of something that grows daily, changes constantly and is prone to errors? We divide validating data in relational databases into three steps.
1. Define expectations
The first and most difficult step in data validation is defining expectations for the data. Keep in mind that there is no point in testing something that will not be fixed if found erroneous. Expectations can come from different categories, such as architectural and business rules. Following architectural rules is usually part of building a data warehouse: it keeps the warehouse organized and assures quality to some extent, yet it is often overlooked during development and can cause problems later on. Verifying business rules is a vital part of data quality assurance, as errors can directly affect reports, delay business processes and even cripple other systems that depend on the data. For example, we can expect that:
- tables in the staging area have more than zero rows
- business layer views do not have any duplicate rows
- there are no columns with only empty values
- all customers that exist in CRM also exist in DWH (see the example query after this list)
- the sum of last month’s invoices is within ±20% of the average of the last six months
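As an illustration of the CRM expectation above, here is a minimal sketch of a corresponding test query. It assumes hypothetical crm.customers and dwh.customers tables keyed by customer_id that are reachable from one connection (e.g. via a linked server or federated schema). The query is expected to return zero rows.

-- Customers present in CRM but missing from DWH (hypothetical schema and column names)
SELECT c.customer_id
FROM crm.customers c
LEFT JOIN dwh.customers d ON d.customer_id = c.customer_id
WHERE d.customer_id IS NULL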
2. Generate test cases
Testing data in relational databases is done most efficiently by executing SQL queries and expecting predetermined results. Test cases are generated by taking the defined expectations and translating them into SQL. The predetermined result of the query is then compared to the actual result, which might be a simple “OK” or “NOK” string from a CASE statement, an aggregated value such as COUNT or SUM, or another dataset to be compared against.

Testing a database for empty tables can be done with a simple COUNT(*) check, where the expected result is larger than zero. While this test might never fail for some database objects, the results can still be analyzed over time to check for deviations in the data, since growth can be expected to be linear.
SELECT COUNT(*) FROM <object>
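The same check can be reduced to the uniform “OK”/“NOK” string mentioned above with a CASE expression, so that every test returns a result in the same shape regardless of what it measures; a minimal sketch:

SELECT CASE WHEN COUNT(*) > 0 THEN 'OK' ELSE 'NOK' END AS result
FROM <object>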
An object, especially one presented to users, should never have duplicate records. It can be checked by grouping rows by key columns and searching for records with more than one row. Such a query is expected to return zero rows.
SELECT COUNT(*) FROM (SELECT <key_columns> FROM <object> GROUP BY <key_columns> HAVING COUNT(*) > 1) AS duplicates
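For example, for a hypothetical business layer view customer_v keyed by customer_id, the generated test would be:

SELECT COUNT(*)
FROM (SELECT customer_id FROM customer_v GROUP BY customer_id HAVING COUNT(*) > 1) AS duplicates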
An object should not have columns that are empty for all records. Such columns can be found by counting the rows in which the column has a value: if that count is 0, the column contains only NULL values. This check should be run for every column in the object. The test query below returns the names of any columns that contain only NULL values, so the expected result is an empty string.
SELECT ''
<forEachColumn> || CASE WHEN COUNT(<column_name>) = 0 THEN '<column_name>;' ELSE '' END </forEachColumn>
FROM <object>
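For a hypothetical customers table with columns name and email, the template above would expand to:

SELECT ''
  || CASE WHEN COUNT(name) = 0 THEN 'name;' ELSE '' END
  || CASE WHEN COUNT(email) = 0 THEN 'email;' ELSE '' END
FROM customers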
3. Automate the test process
Automating the previously generated test cases reduces manual effort and makes it possible to test data systematically, scaling to thousands of tables with millions of records. We can use a DQM platform to automate:
- Generating test cases – tests can be created from reusable test cases with parameters that are automatically replaced by database metadata or by user input.
- Executing test cases – query executions can be scheduled to run on set days and times, or can be integrated into a data pipeline to be triggered before or after certain tasks.
- Analyzing test results – for some aggregated results it is wise to analyze outcomes over a period of time. Sudden changes or deviations in data often indicate errors and can be detected by using linear regression to predict a result and comparing it with the actual result within an error threshold (see the sketch after this list).
- Reporting results – erroneous data should be reported to the responsible users (ideally with instructions on how to react or fix the issue) by sending an automated test report via email, Slack, etc.
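As a sketch of the regression-based analysis, assume a hypothetical dqm_results table that stores one row count per test run (test_name, run_date, row_count) and a PostgreSQL-style dialect with the standard REGR_SLOPE/REGR_INTERCEPT aggregates. The latest result can then be compared against a linear prediction within a 10% threshold:

-- Fit a line through historical row counts, then flag the newest run
-- if it deviates more than 10% from the predicted value.
-- (For simplicity the latest point is included in the fit.)
WITH history AS (
  SELECT EXTRACT(EPOCH FROM run_date) / 86400 AS run_day, row_count
  FROM dqm_results
  WHERE test_name = 'row_count_customers'   -- hypothetical test name
),
fit AS (
  SELECT REGR_SLOPE(row_count, run_day) AS slope,
         REGR_INTERCEPT(row_count, run_day) AS intercept
  FROM history
),
latest AS (
  SELECT run_day, row_count
  FROM history
  ORDER BY run_day DESC
  LIMIT 1
)
SELECT CASE
         WHEN ABS(latest.row_count - (fit.slope * latest.run_day + fit.intercept))
              <= 0.1 * ABS(fit.slope * latest.run_day + fit.intercept)
         THEN 'OK' ELSE 'NOK'
       END AS result
FROM fit, latest;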
Data quality management with LiTech
Validating data in relational databases begins with defining expectations, continues with generating test cases and ends with automating the test process. LiTech DQM makes this simple to manage in a browser-based application with a built-in test generation, data profiling and comparison engine. It supports all commonly used relational databases and integrates with common enterprise technologies. A free 30-day trial is available here.
[1] https://hbr.org/2013/12/datas-credibility-problem
[2] https://sloanreview.mit.edu/article/seizing-opportunity-in-data-quality/