Why and how to implement object based data profiling

Why and how to implement object based data profiling

Profiling data is a great way to find anomalies and detect sudden changes in data. It creates a useful overview which helps to discover various data quality issues and find deviations. Collecting profiled results allows us to analyse them with each new entry and flag values for possible errors as soon as they appear in the data. This makes it possible to find anomalies without having any domain knowledge or expectations about the data. It is the main difference compared to testing, which requires defined expectations, while profiling creates the expectations based on its historical data. By testing, we can assume that a column has some values that are NOT NULL (hence the expectation of a result COUNT(*) > 0). By profiling, we can analyse how the count of NOT NULL’s changes over time and alert responsible users when the change is not within its threshold.

Sample rules

To start profiling the data, we should first define column based rules that allow us to collect data about the column. The rules should be based on data types, as certain rules can be assigned to some data types, but can’t be assigned to others. For example, we can create three groups:
  • numeric (bigint, decimal, double, float, int, integer, money, numeric, real, etc.)
  • string (char, character, name, string, text, tinytext, varchar, etc.)
  • date (date, datetime, timestamp, timestampz, etc.)
We can now start creating rules based on data types:
  • Maximum value (data types: numeric, date)
    SELECT MAX(<column>) FROM <object>
  • Minimum value (data types: numeric, date)
    SELECT MIN(<column>) FROM <object>
  • Average value (data types: numeric)
    SELECT AVG(<column>) FROM <object>
  • SUM of values (data types: numeric)
    SELECT SUM(<column>) FROM <object>
  • Average length (data types: string)
    SELECT AVG(LENGTH(<column>) FROM <object>
  • Maximum length (data types: string)
    SELECT MAX(LENGTH(<column>) FROM <object>
  • Minimum length (data types: string)
    SELECT MIN(LENGTH((<column>)) FROM <object>
  • Count of NULLs (data types: numeric, string, date)
    SELECT COUNT(*) FROM <object> WHERE <column> IS NULL
  • Count of NOT NULLs (data types: numeric, string, date)
    SELECT COUNT(*) FROM <object> WHERE <column> IS NOT NULL
  • Count of unique values (data types: numeric, string, date)
    SELECT COUNT(DISTINCT <column>) FROM <object>
Assigning rules to a column should be automated in a way to only assign rules where the column’s data type matches the rule’s data type. For an example, a column with datatype “int” would have rules:

Executing a rule over time creates a pattern, which allows us to detect any sudden changes in the data. For example, if the count of unique values in a column suddenly increases over the set threshold of 5%:

Profiling column values

Profiling the actual values in a column allows us to see how the values change over time and detect anomalies in column entries. A way to do it is to count rows for the top entries in the column and save them as time series data. We can then analyse each value over time using simple linear regression to detect sudden changes. Setting an error threshold allows analysing fluctuating data by triggering alerts if the change is too great.

Profiling objects

Data profiling should be done at object level, allowing to profile selected columns and automatically assign and execute rules for them. This also allows us to analyse changes at object level, for example, row count or count of duplicate rows. All rules related to an object creates a package which can then be scheduled to be executed and alert responsible users on errors.

22. June 2022