Add layers

This page shows the available DQ layers of a DQ job and how to configure them. Select from one of the options below.

AdaptiveRules
Shapes
Outliers
Patterns
Dupes

About Adaptive Rules

Adaptive Rules are data quality checks that use machine learning to automatically observe and adapt to changes in numeric representations of data over time and downscore any values outside defined boundaries. When they occur, observations from Adaptive Rules are recorded under the Behavior tab of the Findings page. If you do not want Adaptive Rules findings to appear on the Findings page, deselect all Adaptive Rules options in this step.

Configuring Adaptive Rules

The following table shows details about Collibra DQ's Adaptive Rules, their sub-types and how to configure them.

Tip You can prevent Collibra DQ from including specific behavioral checks by deselecting each AdaptiveRule option you wish to exclude. This does not turn off dataset profiling.

Adaptive Rules	Description	Default On
Availability	Observe changes to the row count and loading time in your table.
Row Count	Monitor the row count change in your table. Select this option to include row count change observations in your DQ check.	Yes
Loading Time	Monitor loading time changes. Select this option to include loading time change observations in your DQ check.	No
Distribution	Observe the number of unique values in a table.
Uniqueness	Monitor a column's cardinality within the range of previous DQ jobs. Select this option to include uniqueness observations in your DQ check.	Yes
Completeness	Observe columns in your table containing null values or empty fields.
Null Values	Monitor columns for null values. Select this option to include null value observations in your DQ check.	Yes
Empty Values	Monitor columns for empty data. Select this option to include empty value observations in your DQ check.	Yes
Conformity	Observe columns with values that fall outside of the normal range.
Min	Monitor columns with min values outside the normal range. Select this option to include min value observations in your DQ check.	No
Mean	Monitor columns with mean values outside the normal range. Select this option to include mean value observations in your DQ check.	No
Max	Monitor columns with max values outside the normal range. Select this option to include max value observations in your DQ check.	No
Schema	Schema checks are on by default and cannot be configured.
Data Type Check (Integer, String, Date)	Monitors columns that shift from one data type to another.	Yes
Schema Change (Add, Alter, Delete)	Monitors schema evolution, such as columns that are added or dropped from a dataset.	Yes

About Shapes

Shapes are rare or inconsistent data formats in string columns that Collibra DQ can detect automatically or you can configure manually.

Configuring Shapes

Automatic shape detection

The Shapes layer is set to Auto by default to automatically discover possible shape issues. Auto shape detection depends on the row count of your dataset to determine the percentage an identified shape adheres to the shape format that Collibra DQ identifies as normal. Anything less than that percentage triggers a break. The following table provides a breakdown of occurrence percentage as determined by row count.

Row Count	Percentage
< 100	2.0
< 1,000	1.0
< 10,000	0.1
< 100,000	0.05
< 1,000,000	0.01
< 10,000,000	0.001
< 100,000,000	0.0001
Other	0.00001

Manually configuring shape detection

Note With the exception of Data Shape Granular, these options are only configurable when the toggle is set to Manual.

Select the Manual option.
Select the option next to the column name to include that column in the Shapes layer.

You can select as many columns as there are available.

Optionally click

, then click

Configuration for additional Shapes configuration options.
The Shapes Configuration modal appears.

Optionally configure Manual Options and Data Shape Granular. The following table shows the available configuration options.

Option	Description	Default Value
Manual Options
Occurrences	Only shows shapes below the given percentage threshold. Drag the slider or enter a value in the input field to set the threshold to a value between 0.001 and 5.	0.001
Format per Column	Only shows columns with less than the given number of formats. For example, when there are 30 formats and the Format per Column value is set to 5, then only 5 columns display.	20
Character Length	Only shows Shape issues with fewer than the given number of characters. For example, if the value is set to 9, then all values less than 9 will show.	12
Data Shape Granular
Data Shape Granular	Checks the length of alphanumeric fields and checks for numbers and letters independently.	Auto Note The Auto setting leaves Data Shape Granular turned off by default. To apply a data shape granular check, select On.

Click Save.

Your configurations save and the Shapes Configuration modal closes.

Click Save.

Using Outliers

Outliers are values that differ significantly from the rest of the data and may indicate bad or incorrect data. Numerical outliers are detected with the IQR and box plot methods.

Adding Outliers

From the Outliers tab in the Add Layers workflow, click Add Outlier.
The Outlier editor displays.
Select the required options.

Column

Description

The column(s) used to detect outliers. You can select multiple columns for outlier detection. Required.

Select one or more checkboxes next to the name of the columns to include them in Outlier analysis.

Column

The name of the column in your table or view.

Click the Assign chip in a Key column to uniquely identify the rows of the column.

Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis.

Type

The type of data in a given column. This column is autofilled and cannot be edited or selected.

The following is a list of the data types:

Data type	Description
Date	A date.
DOUBLE	A normal-size floating point number.
VARCHAR	A VARIABLE length string that can contain letters, numbers, and special characters.
text	A variable width character string.
numeric	Fixed precision and scale numbers.

Key

The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, outliers are grouped to their assigned column when detected. Optional.

Optionally click the Assign chip to uniquely identify the rows of a column.

Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis.

Date

The date or time dimension column(s) that are used to define time bins and processing intervals.

Tip When Assigned, you can configure the Lookback in the Outliers Configuration dialog.

Note You can assign only one column as the Date column.

Click Save.

Tip Repeat this process to add more Outliers.

Outliers Configuration

Optionally, click to configure Quartiles, Lookback, and determine whether or not to detect for Categorical outliers. The following table shows the configuration options and their descriptions:

Option	Description
Quartile
Quartile 1	The boundaries beyond which low (Q1) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.01 and 0.45.
Quartile 3	The boundaries beyond which high (Q3) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.55 and 0.99.
Lookback	The lookback period of your outlier scan. Visit the Lookback documentation to learn more. Note The options in this section are only available when Date is Assigned in the Add Outliers step.
Lookback Period	The period used to look back through the dataset using the date column. Use the slider or enter a value in the text box between 0 and 30.
Interval	The unit of measurement of the lookback period, such as DAY. Select an option from the dropdown menu.
By	Select an option from the dropdown menu.
Categorical	Select this option to detect for categorical outliers on specific string type columns.

Advanced Options

Optionally click near Advanced Options to adjust outlier detection sensitivity. The following table shows the available options.

Option	Description
Column	The name of the column in your table or view. You cannot edit this option.
Sensitivity	The outlier detection sensitivity. Click and drag the slider to adjust outlier detection sensitivity from Low to High for a given column.
Unit	The unit of outlier detection sensitivity. Select an option from the dropdown menu.

Option

Description

Column

The name of the column in your table or view. You cannot edit this option.

Sensitivity

The outlier detection sensitivity.

Click and drag the slider to adjust outlier detection sensitivity from Low to High for a given column.

Unit

The unit of outlier detection sensitivity.

Select an option from the dropdown menu.

Warning Patterns are not yet available in Pushdown mode.

About Patterns

Patterns detect similarities among string values across columns and downscore any observations.

Configuring Patterns

From the Patterns tab in the Add Layers workflow, click Add Patterns.
The Patterns editor displays.
Enter the required information.

Option	Description
	The column(s) used to detect patterns. You can select multiple columns for pattern detection. Required. Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the pattern analysis.
Column	The name of the column in your table or view.
Type	The type of data in a given column. This column is auto-filled and cannot be edited or selected. Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake.
Key	The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, patterns are grouped to their assigned column when detected. Optional. Optionally click the Assign chip to uniquely identify the rows of a column. Note You can assign only one column as the Key column and it cannot be the column you select for pattern analysis.
Date	The date or time dimension column(s) that are used to define time bins and processing intervals. Tip When Assigned, you can configure the Lookback in the Patterns Configuration dialog. Note You can assign only one column as the Date column.

Click Save.
Repeat these steps for as many pattern detection configurations as you want to add.

Configuring lookback

Optionally, click to configure Lookback. When you do not configure lookback, the default lookback period is 5 days of data. The following table details the lookback configuration options:

Option	Description
Lookback	The lookback period of your patterns scan. Visit the Lookback documentation to learn more. Note The options in this section are only available when Date is Assigned in the Add Patterns step.
Lookback Period	The period used to look back through the dataset using the date column. Use the slider or enter a value in the text box between 0 and 30.
Interval	The unit of measurement of the lookback period, such as DAY. Select an option from the dropdown menu.

Option

Description

Lookback

The lookback period of your patterns scan. Visit the Lookback documentation to learn more.

Note The options in this section are only available when Date is Assigned in the Add Patterns step.

Lookback Period

The period used to look back through the dataset using the date column.

Use the slider or enter a value in the text box between 0 and 30.

Interval

The unit of measurement of the lookback period, such as DAY.

Select an option from the dropdown menu.

About Dupes

Dupes are values that match other existing values in columns and can be set to detect exact and fuzzy matches. Both exact and fuzzy support case-insensitive matching.

Exact matches

Exact matches detect entries that are potentially equivalent despite formatting issues. For example, Collibra DQ flags the column values John Doe, JOHN DOE, and john doe as exact match dupes of each other when Exact Match is on and Case Sensitive is off.

Note If Case Sensitive is on, then the values in the previous example are not equivalent.

Fuzzy matches

Fuzzy matches detect entries that are potentially equivalent despite misspellings, common spelling variations, and more. For example, Collibra DQ flags the column values John Doe, Johnny Doe, and Johhnn Doe as fuzzy match dupes of each other. The fuzzy match algorithm complexity is O² and compares every two fields in the dataset.

Configuring Dupes

From the Dupes tab in the Add Layers workflow, click Add Dupes.
The Dupes editor displays.
Enter the required information.

Option	Description
	Required. The column(s) used to detect dupes. You can select multiple columns for dupes detection. Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the Dupes analysis.
Column	The name of the column in your table or view.
Type	The type of data in a given column. This column is auto-filled and cannot be edited or selected. Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake.
Case Sensitive	Detects exact duplicates that are dependent on the casing format of values. When switched on, Collibra DQ includes case sensitivity in the dupes detection and flags findings as exact matches. When switched off, Collibra DQ does not include case sensitivity in the dupes detection and flags findings as exact matches.
Exact Match	When switched on, Collibra DQ detects entries that may be potential matches despite formatting issues. Note Exact match findings are case insensitive by default. Ensure that Case Sensitive is switched off for Exact Match to work properly.
Dupes Score	The percentage match Collibra DQ will detect as fuzzy matches. Drag the scoring slider to define the match percentage limits.

Click Save.