Add layers

This page shows the available DQ layers of a DQ job and how to configure them. Select from one of the options below depending on whether you are using Pullup or Pushdown mode.

About AdaptiveRules

AdaptiveRules are data quality checks that use machine learning to automatically observe and adapt to changes in numeric representations of data over time and downscore any values outside defined boundaries. When they occur, observations from AdaptiveRules are recorded under the Behavior tab of the Findings page. If you do not want AdaptiveRules findings to appear on the Findings page, deselect all AdaptiveRules options in this step.

Configuring AdaptiveRules

The following table shows details about Collibra DQ's AdaptiveRules, their sub-types and how to configure them.

Tip You can prevent Collibra DQ from including specific behavioral checks by deselecting each AdaptiveRule option you wish to exclude. This does not turn off dataset profiling.

AdaptiveRules Description Default On
Availability Observe changes to the row count and loading time in your table.
Row Count

Monitor the row count change in your table.

Select this option to include row count change observations in your DQ check.

Yes
Loading Time

Monitor loading time changes.

Select this option to include loading time change observations in your DQ check.

No
Distribution

Observe the number of unique values in a table.

Uniqueness

Monitor a column's cardinality within the range of previous DQ jobs.

Select this option to include uniqueness observations in your DQ check.

Yes
Completeness Observe columns in your table containing null values or empty fields.
Null Values

Monitor columns for null values.

Select this option to include null value observations in your DQ check.

Yes
Empty Values

Monitor columns for empty data.

Select this option to include empty value observations in your DQ check.

Yes
Conformity Observe columns with values that fall outside of the normal range.
Min

Monitor columns with min values outside the normal range.

Select this option to include min value observations in your DQ check.

No
Mean

Monitor columns with mean values outside the normal range.

Select this option to include mean value observations in your DQ check.

No
Max

Monitor columns with max values outside the normal range.

Select this option to include max value observations in your DQ check.

No
Schema  
Data Type Check (Integer, String, Date)
Monitors columns that shift from one data type to another. No
Schema Change (Add, Alter, Delete)
Monitors schema evolution, such as columns that are added or dropped from a dataset. Yes

About Shapes

Shapes are rare or inconsistent data formats in string columns that Collibra DQ can detect automatically or you can configure manually.

Configuring Shapes

Automatic shape detection

The Shapes layer is set to Auto by default to automatically discover possible shape issues. Auto shape detection depends on the row count of your dataset to determine the percentage an identified shape adheres to the shape format that Collibra DQ identifies as normal. Anything less than that percentage triggers a break. The following table provides a breakdown of occurrence percentage as determined by row count.

Row Count Percentage
< 100 2.0
< 1,000 1.0
< 10,000 0.1
< 100,000 0.05
< 1,000,000 0.01
< 10,000,000 0.001
< 100,000,000 0.0001
Other 0.00001

Manually configuring shape detection

Note With the exception of Data Shape Granular, these options are only configurable when the toggle is set to Manual.

  1. Select the Manual option.
  2. Select the option next to the column name to include that column in the Shapes layer.
  3. You can select as many columns as there are available.
  4. Optionally click , then click Configuration for additional Shapes configuration options.
    The Shapes Configuration modal appears.
    1. Optionally configure Manual Options and Data Shape Granular. The following table shows the available configuration options.
    2. OptionDescriptionDefault Value
      Manual Options
      Occurrences

      Only shows shapes below the given percentage threshold.

      Drag the slider or enter a value in the input field to set the threshold to a value between 0.001 and 5.

      0.001
      Format per ColumnOnly shows columns with less than the given number of formats. For example, when there are 30 formats and the Format per Column value is set to 5, then only 5 columns display.20
      Character LengthOnly shows Shape issues with fewer than the given number of characters. For example, if the value is set to 9, then all values less than 9 will show.12
      Data Shape Granular
      Data Shape Granular

      Checks the length of alphanumeric fields and checks for numbers and letters independently.

      Auto

      Note The Auto setting leaves Data Shape Granular turned off by default. To apply a data shape granular check, select On.

    3. Click Save.
    4. Your configurations save and the Shapes Configuration modal closes.
  5. Click Save.

Using Outliers

Outliers are values that differ significantly from the rest of the data and may indicate bad or incorrect data. Numerical outliers are detected with the IQR and box plot methods.

Adding Outliers

  1. From the Outliers tab in the Add Layers workflow, click Add Outlier.
    The Outlier editor displays.
  2. Select the required options.
  3. Column Description
    checkbox

    The column(s) used to detect outliers. You can select multiple columns for outlier detection. Required.

    Select one or more checkboxes next to the name of the columns to include them in Outlier analysis.

    Column

    The name of the column in your table or view.

    Click the Assign chip in a Key column to uniquely identify the rows of the column.

    Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis.

    Type

    The type of data in a given column. This column is autofilled and cannot be edited or selected.

    The following is a list of the data types:

    Data type Description
    Date A date.
    DOUBLE A normal-size floating point number.
    VARCHAR A VARIABLE length string that can contain letters, numbers, and special characters.
    text A variable width character string.
    numeric Fixed precision and scale numbers.
    Key

    The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, outliers are grouped to their assigned column when detected. Optional.

    Optionally click the Assign chip to uniquely identify the rows of a column.

    Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis.

    Date

    The date or time dimension column(s) that are used to define time bins and processing intervals.

    Tip When Assigned, you can configure the Lookback in the Outliers Configuration dialog.

    Note You can assign only one column as the Date column.

  4. Click Save.

Tip Repeat this process to add more Outliers.

Outliers Configuration

Optionally click the three dots menu icon to configure Quartiles, Lookback, and determine whether or not to detect for Categorical outliers. The following table shows the configuration options and their descriptions:

Option Description
Quartile  
Quartile 1
The boundaries beyond which low (Q1) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.01 and 0.45.
Quartile 3
The boundaries beyond which high (Q3) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.55 and 0.99.
Lookback

The lookback period of your outlier scan. Visit the Lookback documentation to learn more.

Note The options in this section are only available when Date is Assigned in the Add Outliers step.

Lookback Period

The period used to look back through the dataset using the date column.

Use the slider or enter a value in the text box between 0 and 30.

Interval

The unit of measurement of the lookback period, such as DAY.

Select an option from the dropdown menu.

By
Select an option from the dropdown menu.
Categorical Select this option to detect for categorical outliers on specific string type columns.

Advanced Options

Optionally click near Advanced Options to adjust outlier detection sensitivity. The following table shows the available options.

Option Description
Column

The name of the column in your table or view. You cannot edit this option.

Sensitivity

The outlier detection sensitivity.

Click and drag the slider to adjust outlier detection sensitivity from Low to High for a given column.

Unit

The unit of outlier detection sensitivity.

Select an option from the dropdown menu.

About Patterns

Patterns detect similarities among string values across columns and downscore any observations.

Configuring Patterns

  1. From the Patterns tab in the Add Layers workflow, click Add Patterns.
    The Patterns editor displays.
  2. Enter the required information.
  3. Option Description
    checkbox

    The column(s) used to detect patterns. You can select multiple columns for pattern detection. Required.

    Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the pattern analysis.

    Column The name of the column in your table or view.
    Type

    The type of data in a given column. This column is auto-filled and cannot be edited or selected.

    Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake.

    Key

    The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, patterns are grouped to their assigned column when detected. Optional.

    Optionally click the Assign chip to uniquely identify the rows of a column.

    Note You can assign only one column as the Key column and it cannot be the column you select for pattern analysis.

    Date

    The date or time dimension column(s) that are used to define time bins and processing intervals.

    Tip When Assigned, you can configure the Lookback in the Patterns Configuration dialog.

    Note You can assign only one column as the Date column.

  4. Click Save.
  5. Repeat these steps for as many pattern detection configurations as you want to add.

Configuring lookback

Optionally click the three dots menu icon to configure Lookback. When you do not configure lookback, the default lookback period is 5 days of data. The following table details the lookback configuration options:

Option Description
Lookback

The lookback period of your patterns scan. Visit the Lookback documentation to learn more.

Note The options in this section are only available when Date is Assigned in the Add Patterns step.

Lookback Period

The period used to look back through the dataset using the date column.

Use the slider or enter a value in the text box between 0 and 30.

Interval

The unit of measurement of the lookback period, such as DAY.

Select an option from the dropdown menu.

About Dupes

Dupes are values that match other existing values in columns and can be set to detect exact and fuzzy matches. Both exact and fuzzy support case-insensitive matching.

Exact matches

Exact matches detect entries that are potentially equivalent despite formatting issues. For example, Collibra DQ flags the column values John Doe, JOHN DOE, and john doe as exact match dupes of each other when Exact Match is on and Case Sensitive is off.

Note If Case Sensitive is on, then the values in the previous example are not equivalent.

Fuzzy matches

Fuzzy matches detect entries that are potentially equivalent despite misspellings, common spelling variations, and more. For example, Collibra DQ flags the column values John Doe, Johnny Doe, and Johhnn Doe as fuzzy match dupes of each other. The fuzzy match algorithm complexity is O² and compares every two fields in the dataset.

Configuring Dupes

  1. From the Dupes tab in the Add Layers workflow, click Add Dupes.
    The Dupes editor displays.
  2. Enter the required information.
  3. Option Description
    checkbox

    The column(s) used to detect dupes. You can select multiple columns for dupes detection. Required.

    Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the Dupes analysis.

    Column The name of the column in your table or view.
    Type

    The type of data in a given column. This column is auto-filled and cannot be edited or selected.

    Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake.

    Case Sensitive

    Detects exact duplicates that are dependent on the casing format of values.

    When switched on, Collibra DQ includes case sensitivity in the dupes detection and flags findings as exact matches.

    When switched off, Collibra DQ does not include case sensitivity in the dupes detection and flags findings as exact matches.

    Exact Match

    When switched on, Collibra DQ detects entries that may be potential matches despite formatting issues.

    Note Exact match findings are case insensitive by default. Ensure that Case Sensitive is switched off for Exact Match to work properly.

    Dupes Score

    The percentage match Collibra DQ will detect as fuzzy matches.

    Drag the scoring slider to define the match percentage limits.

  4. Click Save.

About AdaptiveRules

AdaptiveRules are data quality checks that use machine learning to automatically observe and adapt to changes in numeric representations of data over time and downscore any values outside defined boundaries. When they occur, observations from AdaptiveRules are recorded under the Behavior tab of the Findings page. If you do not want AdaptiveRules findings to appear on the Findings page, deselect all AdaptiveRules options in this step.

Configuring AdaptiveRules

The following table shows details about Collibra DQ's AdaptiveRules, their sub-types and how to configure them.

Tip You can prevent Collibra DQ from including specific behavioral checks by deselecting each AdaptiveRule option you wish to exclude. This does not turn off dataset profiling.

AdaptiveRules Description Default On
Availability Observe changes to the row count and loading time in your table.
Row Count

Monitor the row count change in your table.

Select this option to include row count change observations in your DQ check.

Yes
Loading Time

Monitor loading time changes.

Select this option to include loading time change observations in your DQ check.

No
Distribution

Observe the number of unique values in a table.

Uniqueness

Monitor a column's cardinality within the range of previous DQ jobs.

Select this option to include uniqueness observations in your DQ check.

Yes
Completeness Observe columns in your table containing null values or empty fields.
Null Values

Monitor columns for null values.

Select this option to include null value observations in your DQ check.

Yes
Empty Values

Monitor columns for empty data.

Select this option to include empty value observations in your DQ check.

Yes
Conformity Observe columns with values that fall outside of the normal range.
Min

Monitor columns with min values outside the normal range.

Select this option to include min value observations in your DQ check.

No
Mean

Monitor columns with mean values outside the normal range.

Select this option to include mean value observations in your DQ check.

No
Max

Monitor columns with max values outside the normal range.

Select this option to include max value observations in your DQ check.

No
Schema  
Data Type Check (Integer, String, Date)
Monitors columns that shift from one data type to another. No
Schema Change (Add, Alter, Delete)
Monitors schema evolution, such as columns that are added or dropped from a dataset. Yes

About Shapes

Shapes are rare or inconsistent data formats in string columns that Collibra DQ can detect automatically or you can configure manually.

Configuring Shapes

Automatic shape detection

The Shapes layer is set to Auto by default to automatically discover possible shape issues. Auto shape detection depends on the row count of your dataset to determine the percentage an identified shape adheres to the shape format that Collibra DQ identifies as normal. Anything less than that percentage triggers a break. The following table provides a breakdown of occurrence percentage as determined by row count.

Row Count Percentage
< 100 2.0
< 1,000 1.0
< 10,000 0.1
< 100,000 0.05
< 1,000,000 0.01
< 10,000,000 0.001
< 100,000,000 0.0001
Other 0.00001

Manually configuring shape detection

Note With the exception of Data Shape Granular, these options are only configurable when the toggle is set to Manual.

  1. Select the Manual option.
  2. Select the option next to the column name to include that column in the Shapes layer.
  3. You can select as many columns as there are available.
  4. Optionally click , then click Configuration for additional Shapes configuration options.
    The Shapes Configuration modal appears.
    1. Optionally configure Manual Options and Data Shape Granular. The following table shows the available configuration options.
    2. OptionDescriptionDefault Value
      Manual Options
      Occurrences

      Only shows shapes below the given percentage threshold.

      Drag the slider or enter a value in the input field to set the threshold to a value between 0.001 and 5.

      0.001
      Format per ColumnOnly shows columns with less than the given number of formats. For example, when there are 30 formats and the Format per Column value is set to 5, then only 5 columns display.20
      Character LengthOnly shows Shape issues with fewer than the given number of characters. For example, if the value is set to 9, then all values less than 9 will show.12
      Data Shape Granular
      Data Shape Granular

      Checks the length of alphanumeric fields and checks for numbers and letters independently.

      Auto

      Note The Auto setting leaves Data Shape Granular turned off by default. To apply a data shape granular check, select On.

    3. Click Save.
    4. Your configurations save and the Shapes Configuration modal closes.
  5. Click Save.

Using Outliers

Outliers are values that differ significantly from the rest of the data and may indicate bad or incorrect data. Numerical outliers are detected with the IQR and box plot methods.

Adding Outliers

  1. From the Outliers tab in the Add Layers workflow, click Add Outlier.
    The Outlier editor displays.
  2. Select the required options.
  3. Column Description
    checkbox

    The column(s) used to detect outliers. You can select multiple columns for outlier detection. Required.

    Select one or more checkboxes next to the name of the columns to include them in Outlier analysis.

    Column

    The name of the column in your table or view.

    Click the Assign chip in a Key column to uniquely identify the rows of the column.

    Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis.

    Type

    The type of data in a given column. This column is autofilled and cannot be edited or selected.

    The following is a list of the data types:

    Data type Description
    Date A date.
    DOUBLE A normal-size floating point number.
    VARCHAR A VARIABLE length string that can contain letters, numbers, and special characters.
    text A variable width character string.
    numeric Fixed precision and scale numbers.
    Key

    The key column whose rows have unique identifiers. This column provides context by grouping each column by the key column. When assigned, outliers are grouped to their assigned column when detected. Optional.

    Optionally click the Assign chip to uniquely identify the rows of a column.

    Note You can assign only one column as the Key column and it cannot be the column you select for outlier analysis.

    Date

    The date or time dimension column(s) that are used to define time bins and processing intervals.

    Tip When Assigned, you can configure the Lookback in the Outliers Configuration dialog.

    Note You can assign only one column as the Date column.

  4. Click Save.

Tip Repeat this process to add more Outliers.

Outliers Configuration

Optionally click the three dots menu icon to configure Quartiles, Lookback, and determine whether or not to detect for Categorical outliers. The following table shows the configuration options and their descriptions:

Option Description
Quartile  
Quartile 1
The boundaries beyond which low (Q1) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.01 and 0.45.
Quartile 3
The boundaries beyond which high (Q3) outliers are detected using the IQR formula. Use the slider or enter a value in the text box between 0.55 and 0.99.
Lookback

The lookback period of your outlier scan. Visit the Lookback documentation to learn more.

Note The options in this section are only available when Date is Assigned in the Add Outliers step.

Lookback Period

The period used to look back through the dataset using the date column.

Use the slider or enter a value in the text box between 0 and 30.

Interval

The unit of measurement of the lookback period, such as DAY.

Select an option from the dropdown menu.

By
Select an option from the dropdown menu.
Categorical Select this option to detect for categorical outliers on specific string type columns.

Advanced Options

Optionally click near Advanced Options to adjust outlier detection sensitivity. The following table shows the available options.

Option Description
Column

The name of the column in your table or view. You cannot edit this option.

Sensitivity

The outlier detection sensitivity.

Click and drag the slider to adjust outlier detection sensitivity from Low to High for a given column.

Unit

The unit of outlier detection sensitivity.

Select an option from the dropdown menu.

Warning Patterns are not yet available for Pushdown. Patterns will be available in an upcoming Collibra DQ release.

About Patterns

Patterns detect similarities among string values across columns and downscore any observations.

About Dupes

Dupes are values that match other existing values in columns and can be set to detect exact and fuzzy matches. Both exact and fuzzy support case-insensitive matching.

Exact matches

Exact matches detect entries that are potentially equivalent despite formatting issues. For example, Collibra DQ flags the column values John Doe, JOHN DOE, and john doe as exact match dupes of each other when Exact Match is on and Case Sensitive is off.

Note If Case Sensitive is on, then the values in the previous example are not equivalent.

Fuzzy matches

Fuzzy matches detect entries that are potentially equivalent despite misspellings, common spelling variations, and more. For example, Collibra DQ flags the column values John Doe, Johnny Doe, and Johhnn Doe as fuzzy match dupes of each other. The fuzzy match algorithm complexity is O² and compares every two fields in the dataset.

Configuring Dupes

  1. From the Dupes tab in the Add Layers workflow, click Add Dupes.
    The Dupes editor displays.
  2. Enter the required information.
  3. Option Description
    checkbox

    The column(s) used to detect dupes. You can select multiple columns for dupes detection. Required.

    Select at least one and no more than 10 checkboxes next to the name of the columns to include them in the Dupes analysis.

    Column The name of the column in your table or view.
    Type

    The type of data in a given column. This column is auto-filled and cannot be edited or selected.

    Collibra DQ supports all standard data types. Collibra DQ also supports some custom data types depending on the data source, for example, Snowflake.

    Case Sensitive

    Detects exact duplicates that are dependent on the casing format of values.

    When switched on, Collibra DQ includes case sensitivity in the dupes detection and flags findings as exact matches.

    When switched off, Collibra DQ does not include case sensitivity in the dupes detection and flags findings as exact matches.

    Exact Match

    When switched on, Collibra DQ detects entries that may be potential matches despite formatting issues.

    Note Exact match findings are case insensitive by default. Ensure that Case Sensitive is switched off for Exact Match to work properly.

    Dupes Score

    The percentage match Collibra DQ will detect as fuzzy matches.

    Drag the scoring slider to define the match percentage limits.

  4. Click Save.