Data profiling

Data profiling

Profiling data is a great way to find anomalies and detect sudden changes in data. It creates a useful summary which helps to discover various data quality issues and find deviations. Collecting profiled results allows us to analyze them with each new entry and flag values for possible errors as soon as they appear in the data. This makes it possible to find anomalies without having any domain knowledge or expectations about the data.

Profiling vs testing the data

The main difference between profiling and testing the data is that testing requires defined expectations about the data, but with profiling the expectations are created based on the data history. By testing, we can assume that a column has some values that are NOT NULL (hence the expectation of a result COUNT(*) > 0). By profiling, we can analyze how the count of NOT NULL’s changes over time and alert responsible users when the change is not as expected.
For example, we can see the trend as the orange line, calculated using linear regression, predicting the next result to be ~19.8m, but the actual result is ~16.9m. This is an error of 2.9m rows and indicates there are far less NOT NULL values in the column as there should be.

Sample rules

To start profiling the data, we should first define column based rules that allow us to collect data about the column. The rules should be based on data types, as certain rules can be assigned to numeric data, but can’t be assigned to strings. For example, we can create three groups:
  • numeric (bigint, decimal, double, float, int, integer, money, numeric, real, etc.)
  • string (char, character, name, string, text, tinytext, varchar, etc.)
  • date (date, datetime, timestamp, timestampz, etc.)
Now we can start creating rules based on data types:
  • Maximum value (data types: numeric, date)
    SELECT MAX(<column>) FROM <object>
  • Minimum value (data types: numeric, date)
    SELECT MIN(<column>) FROM <object>
  • Average value (data types: numeric)
    SELECT AVG(<column>) FROM <object>
  • SUM of values (data types: numeric)
    SELECT SUM(<column>) FROM <object>
  • Maximum length (data types: string)
    SELECT MAX(LENGTH(<column>) FROM <object>
  • Minimum length (data types: string)
    SELECT MIN(LENGTH((<column>)) FROM <object>
  • Count of NULLs (data types: numeric, string, date)
    SELECT COUNT(*) FROM <object> WHERE <column> IS NULL
  • Count of NOT NULLs (data types: numeric, string, date)
    SELECT COUNT(*) FROM <object> WHERE <column> IS NOT NULL
  • Count of unique values (data types: numeric, string)
    SELECT COUNT(DISTINCT <column>) FROM <object>
Assigning rules to a column should be automated in a way to only assign rules where the column’s data type matches the rule’s data type. For an example, a column with datatype “int” would have rules:

Profiling column values

Profiling the actual values in a column allows us to see how the data changes over time and detect anomalies in column entries. A great way to do it is to count rows for top entries in the column and save them as time series data. We can then analyze each value over time using simple linear regression to detect sudden changes. Setting an error threshold allows analyzing fluctuating data by triggering alerts if the change is too great.

Profiling objects

08. January 2022