Apply a tolerance threshold to a data quality rule (beta)

Important This feature is currently available as a public beta option. For more information about beta features, see Betas at Collibra.

The rule tolerance option on the Rule Workbench allows you to define the percentage of rule breaks that must be found to affect rule scores and statuses, and for any condition alerts tied to the rule to send.

By setting a rule tolerance threshold, you can reduce the number of "breaking" rule results displayed on the Findings page. This helps minimize the number of alert notifications that are sent due to less significant breaking rules. As a result, you can focus on the most critical data quality problems while reducing irrelevant alerts, making it easier to identify actual data quality issues and keeping your inbox or notification feed less cluttered.

In this topic

Prerequisites

You have:

  • ROLE_ADMIN, ROLE_DATASET_RULES, or ROLE_VIEW_DATA
  • A user with ROLE_ADMIN has set the RULE_TOLERANCE_BETA setting to TRUE on the Application Configuration Settings page.
  • An existing dataset against which you want to write a rule.

Adding a rule tolerance threshold

  1. Click the Rules link on the metadata bar on a dataset-level page or open the Dataset Rules page of a dataset by clicking the Rules wrench icon in the sidebar menu, then clicking Rule Builder.
  2. The Dataset Rules page opens.
  3. Follow the steps to add a rule from the Rule Workbench.
  4. Show Rule Workbench steps
  5. There are two ways to open the Rule Workbench.
    1. From an existing rule on the Rules tab, click Actions, then click Edit.
      The Rule Workbench page opens with the preferences of your existing rule.
    2. Click Add Rule in the upper right corner to create a new rule.
      The Rule Workbench page opens.
  6. Select a rule type from the Type dropdown menu.
  7. Enter a name for your rule in the Rule Name input field.
  8. Important Rule names cannot contain spaces.

  9. Select a primary column from the Primary Col dropdown menu.
  10. Note Primary Column is only required when you use a Data Type, Data Class, or Template rule.

  11. Optionally click to add a secondary dataset.
    1. Enter the name of your secondary dataset in the search bar when the Add Secondary Dataset dialog appears, then click Add Secondary Dataset.
  12. Enter your SQL query in the Rule (required) input field.
  13. Click Cogwheel icon in the upper right corner of the Rule Workbench.
  14. The Rule Details dialog appears.
  15. In the Tolerance input field, enter or select a value between 0 and 100.
  16. Note The value 0 is the same as not adding a rule tolerance threshold at all. In other words, any percentage of rule breaks will be considered breaking.

    Show screenshotimage of a filter query on the rule workbench
  17. Click Submit.
  18. Click Save to apply the rule to your dataset.
  19. Rerun your DQ Job.
  20. The results of your run are available on the Rules tab of the Findings page.

    Note Rules that contain a tolerance threshold are marked with an asterisk to the right of the Points value. For example, 0*.

Viewing rule tolerance results

When you run a dataset with rules that include tolerance thresholds, only the rules with percentages of breaking records greater than their tolerance thresholds affect the data quality score on the Findings page. When a rule contains a percentage of breaking records under its tolerance threshold, an asterisk displays next to a value of 0 in the Points column of the Rules tab.

To illuminate the scoring and alerting impact of various rule thresholds, consider a scenario where there are two identical rules that query a total of 100 records and a tolerance threshold of 50 is set for both rules.

In the first rule, 75 records break the conditions of the rule, meaning 75% of the total records are breaking. In this case, even with a tolerance threshold set, 75 points are deducted from the overall data quality score, because 75% exceeds the tolerance threshold of 50%. This rule is considered "Breaking" and, if configured, an alert will be sent.

However, in the second rule, only 43% of all records queried break its conditions. Because 43% is below the tolerance threshold of 50%, the rule is considered "Passing," no points are deducted from the data quality score, and alerts associated with the rule are suppressed.

The screenshot and table below further illustrate a different scenario where two identical rules query a column with and without a rule tolerance threshold applied.

rule tolerance on findings page

The following table provides an explanation of the values present in each relevant column in the above screenshot of the Rules tab on the Findings page.

  Points Tolerance Perc Breaking, Passing, and Total Records Status
With Tolerance

Because a percentage of 43 is under the tolerance threshold of 50, no points (0*) are being deducted from the data quality score.

Note Rule scores that are impacted by tolerance are marked with an asterisk to the right of the Points value. For example, 0*.

A tolerance threshold of 50% means that any percentage of breaking records less than or equal to it will be ignored in the score calculation. In this case, 43% of the records are breaking, but they do not impact the scoring calculation because of the rule threshold of 50%.

The values in these columns remain unchanged with or without a tolerance value, because the actual percentages need to be available for informational purposes.

The status is listed as Passing because the percentage of breaking records (43%) is under the tolerance threshold of 50%.

Without Tolerance

Because 43 percent of records are breaking and a tolerance threshold is not set, 43 points are being deducted from the data quality score.

A tolerance threshold is not being imposed on the rule.

Since a rule tolerance threshold is not set, the rule is considered breaking because the percentage of breaking records is greater than 0.