Stat Rules
A powerful technique is to access profile statistics in your rules. These are typically sub-second operations that do not require scanning or iterating. There are several cases where SQL struggles to support rules, such as: isNull but not "null count" or nullRatio or nullPercent. Or having access to types without doing crazy cast() operations. These are simplified below, for example fname.$type != 'String'
.
select * from @dataset where
fname.$type != 'String' AND $rowCount < 800
Dataset-level stat rules
Column-level stat rules
Column-level Stat Rule | Rule Example | Description |
---|---|---|
.$type | fname.$type != 'String' | Alert when fname is not a string. |
.$min | fname.$min > 'apple' | Lexicographical sort works for strings and numbers. |
.$minNum | age.$minNum > 13 |
Type casted to a numeric for simple number checks. |
.$mean | row_id.$mean > '4.500' | Alert when the mean is greater than a given value. |
.$max | fname.$max > 'apple' | Alert when the max is greater than a given value. |
.$maxNum | age.$maxNum > 13 | Alert when the numeric value falls outside an acceptable range. |
.$uniqueCount | id.$uniqueCount != $rowCount | Alert when the uniqueCount of a field doesn't match the rowCount. |
.$uniqueRatio | gender.$uniqueRatio between .4 and .6 | Alert when the ratio of uniqueCounts of a given field doesn't match the rowCount. |
.$nullRatio | lname.$nullRatio not between .4 and .6 | Alert when the ratio of nulls no longer falls within acceptable range. |
.$nullPercent | lname.$nullPercent not between 40 and 60 | Alert when the percent of nulls no longer falls within acceptable range |
.$nullCount | lname.$nullCount >= 1 | Test for a single null. |
.$emptyRatio | nc.$emptyRatio > 0.2 | Alert when the ratio of empties no longer falls within acceptable range. |
.$emptyPercent | nc.$emptyPercent > 20 | Alert when the percent of empties no longer falls within an acceptable range. |
.$emptyCount | Alert when the emptyCounts of a field no longer fall within an acceptable range. | |
.$mixedTypeRatio | nc.$mixedTypeRatio > 0.2 | Alert when the ratio of mixed data types no longer falls within an acceptable range. For example, Strings and Ints in the same field. |
.$mixedTypePercent | nc.$mixedTypeRatio > 20 |
Alert when the percent of mixed data types no longer falls within an acceptable range. |
.$mixedTypeCount | id.$mixedTypeCount >= 1 | Alerts when the mixed data typeCount no longer falls within an acceptable range. For example, Strings and Ints in the same field. |
Distribution Rule
There is a common case in DQ where you want to know the distribution of a column's value. Consider gender. It can be expected that a column named gender consists of roughly 40-60% males and roughly 40-60% females if the data set is large and represents the population. This can be difficult to express in plain SQL, but it is very easy with the below syntax.
gender['Male'].$uniquePercent between 40 and 60
Column Value-level | Rule |
---|---|
.$uniqueCount | credit_rating['FAIR'].$uniqueCount > 7 |
.$uniquePercent | credit_rating['GOOD'].uniquePercent between 40 and 60 |
Known Limitations
- The $daysWithoutData stat rule has a limitation where rules that are not named "daysWithoutData" do not display their Pulse View charts when you drill into the rule on the Findings page. A possible workaround for this is to name your stat rule exactly as it appears here: daysWithoutData.
- You cannot combine stat rules or distribution rules with regex rules in the same rule. For example,
car_vin rlike '$[asdf][0-9]' and car_vin.$uniqueCount