Add layers
This page shows the available DQ layers of a DQ job and how to configure them. Select from one of the options below depending on whether you are using Pullup or Pushdown mode.
- Pullup
- Pushdown
- AdaptiveRules
- Shapes
- Outliers
- Patterns
- Dupes
About AdaptiveRules
AdaptiveRules are data quality checks that use machine learning to automatically observe and adapt to changes in numeric representations of data over time and downscore any values outside defined boundaries. When they occur, observations from AdaptiveRules are recorded under the Behavior tab of the Findings page. If you do not want AdaptiveRules findings to appear on the Findings page, deselect all AdaptiveRules options in this step.
Configuring AdaptiveRules
The following table shows details about Collibra DQ's AdaptiveRules, their sub-types and how to configure them.
Tip You can prevent Collibra DQ from including specific behavioral checks by deselecting each AdaptiveRule option you wish to exclude. This does not turn off dataset profiling.
AdaptiveRules | Description | Default On |
---|---|---|
Availability | Observe changes to the row count and loading time in your table. | |
Row Count
|
Monitor the row count change in your table. Select this option to include row count change observations in your DQ check. |
Yes |
Loading Time
|
Monitor loading time changes. Select this option to include loading time change observations in your DQ check. |
No |
Distribution |
Observe the number of unique values in a table. |
|
Uniqueness
|
Monitor a column's cardinality within the range of previous DQ jobs. Select this option to include uniqueness observations in your DQ check. |
Yes |
Completeness | Observe columns in your table containing null values or empty fields. | |
Null Values
|
Monitor columns for null values. Select this option to include null value observations in your DQ check. |
Yes |
Empty Values
|
Monitor columns for empty data. Select this option to include empty value observations in your DQ check. |
Yes |
Conformity | Observe columns with values that fall outside of the normal range. | |
Min
|
Monitor columns with min values outside the normal range. Select this option to include min value observations in your DQ check. |
No |
Mean
|
Monitor columns with mean values outside the normal range. Select this option to include mean value observations in your DQ check. |
No |
Max
|
Monitor columns with max values outside the normal range. Select this option to include max value observations in your DQ check. |
No |
Schema | ||
Data Type Check (Integer, String, Date)
|
Monitors columns that shift from one data type to another. | No |
Schema Change (Add, Alter, Delete)
|
Monitors schema evolution, such as columns that are added or dropped from a dataset. | Yes |
About Shapes
Shapes are rare or inconsistent data formats in string columns that Collibra DQ can detect automatically or you can configure manually.
Configuring Shapes
Automatic shape detection
The Shapes layer is set to Auto by default to automatically discover possible shape issues. Auto shape detection depends on the row count of your dataset to determine the percentage an identified shape adheres to the shape format that Collibra DQ identifies as normal. Anything less than that percentage triggers a break. The following table provides a breakdown of occurrence percentage as determined by row count.
Row Count | Percentage |
---|---|
< 100 | 2.0 |
< 1,000 | 1.0 |
< 10,000 | 0.1 |
< 100,000 | 0.05 |
< 1,000,000 | 0.01 |
< 10,000,000 | 0.001 |
< 100,000,000 | 0.0001 |
Other | 0.00001 |
Manually configuring shape detection
Note With the exception of Data Shape Granular, these options are only configurable when the toggle is set to Manual.
- Select the Manual option.
- Select the option next to the column name to include that column in the Shapes layer. You can select as many columns as there are available.
- Optionally click , then click Configuration for additional Shapes configuration options.
The Shapes Configuration modal appears.- Optionally configure Manual Options and Data Shape Granular. The following table shows the available configuration options.
- Click Save. Your configurations save and the Shapes Configuration modal closes.
Option Description Default Value Manual Options Occurrences Only shows shapes below the given percentage threshold.
Drag the slider or enter a value in the input field to set the threshold to a value between 0.001 and 5.
0.001 Format per Column Only shows columns with less than the given number of formats. For example, when there are 30 formats and the Format per Column value is set to 5, then only 5 columns display. 20 Character Length Only shows Shape issues with fewer than the given number of characters. For example, if the value is set to 9, then all values less than 9 will show. 12 Data Shape Granular Data Shape Granular Checks the length of alphanumeric fields and checks for numbers and letters independently.
Auto
Note The Auto setting leaves Data Shape Granular turned off by default. To apply a data shape granular check, select On.
-
Click Save.
Using Outliers
Outliers are values that differ significantly from the rest of the data and may indicate bad or incorrect data. Numerical outliers are detected with the IQR and box plot methods.
Adding Outliers
- From the Outliers tab in the Add Layers workflow, click Add Outlier.
The Outlier editor displays. - Select the required options.
- Click Save.
Column | Description | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The column(s) used to detect outliers. You can select multiple columns for outlier detection. Required. Select one or more checkboxes next to the name of the columns to include them in Outlier analysis. |
|||||||||||||
Column |
The name of the column in your table or view. Click the Assign chip in a Key column to uniquely identify the rows of the column. Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis. |
||||||||||||
Type |
The type of data in a given column. This column is autofilled and cannot be edited or selected. The following is a list of the data types:
|
||||||||||||
Key |
The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, outliers are grouped to their assigned column when detected. Optional. Optionally click the Assign chip to uniquely identify the rows of a column. Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis. |
||||||||||||
Date |
The date or time dimension column(s) that are used to define time bins and processing intervals. Tip When Assigned, you can configure the Lookback in the Outliers Configuration dialog. Note You can assign only one column as the Date column. |
Tip Repeat this process to add more Outliers.
Outliers Configuration
Optionally click the three dots menu icon to configure Quartiles, Lookback, and determine whether or not to detect for Categorical outliers. The following table shows the configuration options and their descriptions:
Option | Description |
---|---|
Quartile | |
Quartile 1
|
The boundaries beyond which low (Q1) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.01 and 0.45. |
Quartile 3
|
The boundaries beyond which high (Q3) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.55 and 0.99. |
Lookback |
The lookback period of your outlier scan. Visit the Lookback documentation to learn more. Note The options in this section are only available when Date is Assigned in the Add Outliers step. |
Lookback Period
|
The period used to look back through the dataset using the date column. Use the slider or enter a value in the text box between 0 and 30. |
Interval
|
The unit of measurement of the lookback period, such as DAY. Select an option from the dropdown menu. |
By
|
Select an option from the dropdown menu. |
Categorical | Select this option to detect for categorical outliers on specific string type columns. |
Advanced Options
Optionally click near Advanced Options to adjust outlier detection sensitivity. The following table shows the available options.
Option | Description |
---|---|
Column |
The name of the column in your table or view. You cannot edit this option. |
Sensitivity |
The outlier detection sensitivity. Click and drag the slider to adjust outlier detection sensitivity from Low to High for a given column. |
Unit |
The unit of outlier detection sensitivity. Select an option from the dropdown menu. |
About Patterns
Patterns detect similarities among string values across columns and downscore any observations.
Configuring Patterns
- From the Patterns tab in the Add Layers workflow, click Add Patterns.
The Patterns editor displays. - Enter the required information.
- Click Save.
- Repeat these steps for as many pattern detection configurations as you want to add.
Option | Description |
---|---|
The column(s) used to detect patterns. You can select multiple columns for pattern detection. Required. Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the pattern analysis. |
|
Column | The name of the column in your table or view. |
Type |
The type of data in a given column. This column is auto-filled and cannot be edited or selected. Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake. |
Key |
The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, patterns are grouped to their assigned column when detected. Optional. Optionally click the Assign chip to uniquely identify the rows of a column. Note You can assign only one column as the Key column and it cannot be the column you select for pattern analysis. |
Date |
The date or time dimension column(s) that are used to define time bins and processing intervals. Tip When Assigned, you can configure the Lookback in the Patterns Configuration dialog. Note You can assign only one column as the Date column. |
Configuring lookback
Optionally click the three dots menu icon to configure Lookback. When you do not configure lookback, the default lookback period is 5 days of data. The following table details the lookback configuration options:
Option | Description |
---|---|
Lookback |
The lookback period of your patterns scan. Visit the Lookback documentation to learn more. Note The options in this section are only available when Date is Assigned in the Add Patterns step. |
Lookback Period
|
The period used to look back through the dataset using the date column. Use the slider or enter a value in the text box between 0 and 30. |
Interval
|
The unit of measurement of the lookback period, such as DAY. Select an option from the dropdown menu. |
About Dupes
Dupes are values that match other existing values in columns and can be set to detect exact and fuzzy matches. Both exact and fuzzy support case-insensitive matching.
Exact matches
Exact matches detect entries that are potentially equivalent despite formatting issues. For example, Collibra DQ flags the column values John Doe, JOHN DOE, and john doe as exact match dupes of each other when Exact Match is on and Case Sensitive is off.
Note If Case Sensitive is on, then the values in the previous example are not equivalent.
Fuzzy matches
Fuzzy matches detect entries that are potentially equivalent despite misspellings, common spelling variations, and more. For example, Collibra DQ flags the column values John Doe, Johnny Doe, and Johhnn Doe as fuzzy match dupes of each other. The fuzzy match algorithm complexity is O² and compares every two fields in the dataset.
Configuring Dupes
- From the Dupes tab in the Add Layers workflow, click Add Dupes.
The Dupes editor displays. - Enter the required information.
- Click Save.
Option | Description |
---|---|
The column(s) used to detect dupes. You can select multiple columns for dupes detection. Required. Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the Dupes analysis. |
|
Column | The name of the column in your table or view. |
Type |
The type of data in a given column. This column is auto-filled and cannot be edited or selected. Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake. |
Case Sensitive |
Detects exact duplicates that are dependent on the casing format of values. When switched on, Collibra DQ includes case sensitivity in the dupes detection and flags findings as exact matches. When switched off, Collibra DQ does not include case sensitivity in the dupes detection and flags findings as exact matches. |
Exact Match |
When switched on, Collibra DQ detects entries that may be potential matches despite formatting issues. Note Exact match findings are case insensitive by default. Ensure that Case Sensitive is switched off for Exact Match to work properly. |
Dupes Score |
The percentage match Collibra DQ will detect as fuzzy matches. Drag the scoring slider to define the match percentage limits. |
- AdaptiveRules
- Shapes
- Outliers
- Patterns
- Dupes
About AdaptiveRules
AdaptiveRules are data quality checks that use machine learning to automatically observe and adapt to changes in numeric representations of data over time and downscore any values outside defined boundaries. When they occur, observations from AdaptiveRules are recorded under the Behavior tab of the Findings page. If you do not want AdaptiveRules findings to appear on the Findings page, deselect all AdaptiveRules options in this step.
Configuring AdaptiveRules
The following table shows details about Collibra DQ's AdaptiveRules, their sub-types and how to configure them.
Tip You can prevent Collibra DQ from including specific behavioral checks by deselecting each AdaptiveRule option you wish to exclude. This does not turn off dataset profiling.
AdaptiveRules | Description | Default On |
---|---|---|
Availability | Observe changes to the row count and loading time in your table. | |
Row Count
|
Monitor the row count change in your table. Select this option to include row count change observations in your DQ check. |
Yes |
Loading Time
|
Monitor loading time changes. Select this option to include loading time change observations in your DQ check. |
No |
Distribution |
Observe the number of unique values in a table. |
|
Uniqueness
|
Monitor a column's cardinality within the range of previous DQ jobs. Select this option to include uniqueness observations in your DQ check. |
Yes |
Completeness | Observe columns in your table containing null values or empty fields. | |
Null Values
|
Monitor columns for null values. Select this option to include null value observations in your DQ check. |
Yes |
Empty Values
|
Monitor columns for empty data. Select this option to include empty value observations in your DQ check. |
Yes |
Conformity | Observe columns with values that fall outside of the normal range. | |
Min
|
Monitor columns with min values outside the normal range. Select this option to include min value observations in your DQ check. |
No |
Mean
|
Monitor columns with mean values outside the normal range. Select this option to include mean value observations in your DQ check. |
No |
Max
|
Monitor columns with max values outside the normal range. Select this option to include max value observations in your DQ check. |
No |
Schema | ||
Data Type Check (Integer, String, Date)
|
Monitors columns that shift from one data type to another. | No |
Schema Change (Add, Alter, Delete)
|
Monitors schema evolution, such as columns that are added or dropped from a dataset. | Yes |
About Shapes
Shapes are rare or inconsistent data formats in string columns that Collibra DQ can detect automatically or you can configure manually.
Configuring Shapes
Automatic shape detection
The Shapes layer is set to Auto by default to automatically discover possible shape issues. Auto shape detection depends on the row count of your dataset to determine the percentage an identified shape adheres to the shape format that Collibra DQ identifies as normal. Anything less than that percentage triggers a break. The following table provides a breakdown of occurrence percentage as determined by row count.
Row Count | Percentage |
---|---|
< 100 | 2.0 |
< 1,000 | 1.0 |
< 10,000 | 0.1 |
< 100,000 | 0.05 |
< 1,000,000 | 0.01 |
< 10,000,000 | 0.001 |
< 100,000,000 | 0.0001 |
Other | 0.00001 |
Manually configuring shape detection
Note With the exception of Data Shape Granular, these options are only configurable when the toggle is set to Manual.
- Select the Manual option.
- Select the option next to the column name to include that column in the Shapes layer. You can select as many columns as there are available.
- Optionally click , then click Configuration for additional Shapes configuration options.
The Shapes Configuration modal appears.- Optionally configure Manual Options and Data Shape Granular. The following table shows the available configuration options.
- Click Save. Your configurations save and the Shapes Configuration modal closes.
Option Description Default Value Manual Options Occurrences Only shows shapes below the given percentage threshold.
Drag the slider or enter a value in the input field to set the threshold to a value between 0.001 and 5.
0.001 Format per Column Only shows columns with less than the given number of formats. For example, when there are 30 formats and the Format per Column value is set to 5, then only 5 columns display. 20 Character Length Only shows Shape issues with fewer than the given number of characters. For example, if the value is set to 9, then all values less than 9 will show. 12 Data Shape Granular Data Shape Granular Checks the length of alphanumeric fields and checks for numbers and letters independently.
Auto
Note The Auto setting leaves Data Shape Granular turned off by default. To apply a data shape granular check, select On.
-
Click Save.
Using Outliers
Outliers are values that differ significantly from the rest of the data and may indicate bad or incorrect data. Numerical outliers are detected with the IQR and box plot methods.
Adding Outliers
- From the Outliers tab in the Add Layers workflow, click Add Outlier.
The Outlier editor displays. - Select the required options.
- Click Save.
Column | Description | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The column(s) used to detect outliers. You can select multiple columns for outlier detection. Required. Select one or more checkboxes next to the name of the columns to include them in Outlier analysis. |
|||||||||||||
Column |
The name of the column in your table or view. Click the Assign chip in a Key column to uniquely identify the rows of the column. Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis. |
||||||||||||
Type |
The type of data in a given column. This column is autofilled and cannot be edited or selected. The following is a list of the data types:
|
||||||||||||
Key |
The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, outliers are grouped to their assigned column when detected. Optional. Optionally click the Assign chip to uniquely identify the rows of a column. Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis. |
||||||||||||
Date |
The date or time dimension column(s) that are used to define time bins and processing intervals. Tip When Assigned, you can configure the Lookback in the Outliers Configuration dialog. Note You can assign only one column as the Date column. |
Tip Repeat this process to add more Outliers.
Outliers Configuration
Optionally click the three dots menu icon to configure Quartiles, Lookback, and determine whether or not to detect for Categorical outliers. The following table shows the configuration options and their descriptions:
Option | Description |
---|---|
Quartile | |
Quartile 1
|
The boundaries beyond which low (Q1) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.01 and 0.45. |
Quartile 3
|
The boundaries beyond which high (Q3) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.55 and 0.99. |
Lookback |
The lookback period of your outlier scan. Visit the Lookback documentation to learn more. Note The options in this section are only available when Date is Assigned in the Add Outliers step. |
Lookback Period
|
The period used to look back through the dataset using the date column. Use the slider or enter a value in the text box between 0 and 30. |
Interval
|
The unit of measurement of the lookback period, such as DAY. Select an option from the dropdown menu. |
By
|
Select an option from the dropdown menu. |
Categorical | Select this option to detect for categorical outliers on specific string type columns. |
Advanced Options
Optionally click near Advanced Options to adjust outlier detection sensitivity. The following table shows the available options.
Option | Description |
---|---|
Column |
The name of the column in your table or view. You cannot edit this option. |
Sensitivity |
The outlier detection sensitivity. Click and drag the slider to adjust outlier detection sensitivity from Low to High for a given column. |
Unit |
The unit of outlier detection sensitivity. Select an option from the dropdown menu. |
Warning Patterns are not yet available for Pushdown. Patterns will be available in an upcoming Collibra DQ release.
About Patterns
Patterns detect similarities among string values across columns and downscore any observations.
About Dupes
Dupes are values that match other existing values in columns and can be set to detect exact and fuzzy matches. Both exact and fuzzy support case-insensitive matching.
Exact matches
Exact matches detect entries that are potentially equivalent despite formatting issues. For example, Collibra DQ flags the column values John Doe, JOHN DOE, and john doe as exact match dupes of each other when Exact Match is on and Case Sensitive is off.
Note If Case Sensitive is on, then the values in the previous example are not equivalent.
Fuzzy matches
Fuzzy matches detect entries that are potentially equivalent despite misspellings, common spelling variations, and more. For example, Collibra DQ flags the column values John Doe, Johnny Doe, and Johhnn Doe as fuzzy match dupes of each other. The fuzzy match algorithm complexity is O² and compares every two fields in the dataset.
Configuring Dupes
- From the Dupes tab in the Add Layers workflow, click Add Dupes.
The Dupes editor displays. - Enter the required information.
- Click Save.
Option | Description |
---|---|
The column(s) used to detect dupes. You can select multiple columns for dupes detection. Required. Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the Dupes analysis. |
|
Column | The name of the column in your table or view. |
Type |
The type of data in a given column. This column is auto-filled and cannot be edited or selected. Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake. |
Case Sensitive |
Detects exact duplicates that are dependent on the casing format of values. When switched on, Collibra DQ includes case sensitivity in the dupes detection and flags findings as exact matches. When switched off, Collibra DQ does not include case sensitivity in the dupes detection and flags findings as exact matches. |
Exact Match |
When switched on, Collibra DQ detects entries that may be potential matches despite formatting issues. Note Exact match findings are case insensitive by default. Ensure that Case Sensitive is switched off for Exact Match to work properly. |
Dupes Score |
The percentage match Collibra DQ will detect as fuzzy matches. Drag the scoring slider to define the match percentage limits. |