Create a data quality rule

This section shows how you can design and build a SQL rule on the Rule Workbench.

Steps

  1. Open the Dataset Rules page in one of two ways.
    1. Click the Rules wrench icon in the sidebar menu, then click Rule Builder.
    2. On the upper right corner of the Findings page, click the Rules link in the metadata box.
      image showing navigation to the rule workbench from the findings page
    The Dataset Rules page opens.
  2. Open the Rule Workbench in one of two ways.
    1. From an existing rule on the Rules tab, click Actions, then click Edit.
      The Rule Workbench page opens with the preferences of your existing rule.
    2. Click Add Rule in the upper right corner to create a new rule.
      The Rule Workbench page opens.
  3. Select a rule type from the Type dropdown menu.
    Option Description
    User Input User Input rules are custom SQL statements that can be either short or long form, depending on the complexity.
    Simple Rule (SQLG)

    Simple rules are SQL statements that use just the condition and primary column name. For example, AMOUNT > 100

    Simple Rules use Spark SQL syntax.

    Freeform SQL (SQLF)

    Freeform rules are complete SQL statements that are commonly used when more complex SQL rules are necessary. Freeform Rules can include multiple conditions, table joins, common table expressions (CTEs), and window statements, among other operations.

    Freeform Rules use Spark SQL syntax.

    Native SQL

    Native rules use SQL statements with functions or expressions that may be specific to the underlying connection or database (Database SQL). For example, a Native SQL rule written on a dataset whose data resides in Snowflake uses the SQL query syntax native to Snowflake.

    Remote files are not eligible for Native SQL rules.

    Note All Pushdown rules use the SQL dialect native to the specific database of your source data. Because all Pushdown rules technically use Native (Database) SQL, the Native SQL option is not available for Pushdown jobs on the Rule Workbench. To check whether a rule conforms to the SQL dialect of the source database, click Validate after you write your query.

    Data Type Data Type rules check for nulls, empty values, or specific data types.
    Empty Check
    Checks whether the target column contains empty values.
    Null Check
    Checks whether the target column contains NULL values.
    Date Check
    Checks whether the target column contains only DATE values.
    Int Check
    Checks whether the target column contains only integer values.
    Double Check
    Checks whether the target column contains only DOUBLE values.
    String Check
    Checks whether the target column contains only STRING values.
    Mixed Type Check

    Checks the data type of the field.

    Note Mixed data type rules are not supported in Pushdown.

    Data Class See the Data Class page for a list of out-of-the-box Data Class rules.
    Template See the Template page for a list of out-of-the-box Template rules.
  4. Enter a name for your rule in the Rule Name input field.
    Important Rule names cannot contain spaces.

  5. Select a primary column from the Primary Col dropdown menu.
    Note Primary Column is only required when you use a Data Type, Data Class, or Template rule.

  6. Optionally click to add a secondary dataset.
    1. Enter the name of your secondary dataset in the search bar when the Add Secondary Dataset dialog appears, then click Add Secondary Dataset.
  7. Write your SQL query. When you select the Freeform rule type, as shown in the GIF below, the Workbench pre-populates the beginning of the query for you. For example, SELECT * FROM @<dataset name>. As you type, the Workbench provides suggested options with common SQL operators and column names relative to your primary and secondary datasets. For sample Simple and Freeform queries, see the sketch after these steps.
    sample query sql builder gif

    Tip For a complete list of Spark SQL operators and functions available for use on SQLG and SQLF rule types, see the official Spark documentation.

  8. Optionally add a filter query in the Filter input field.
  9. Click Validate to validate your expression.
  10. Optionally click Beautify to format your SQL query in a more readable layout.
  11. Click Save to apply your rule to your dataset.
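
The following sketches illustrate the difference between the Simple (SQLG) and Freeform (SQLF) syntax described in the rule type table. They are minimal examples, not generated product output: the dataset references (@orders, @approved_customers) and the column names (AMOUNT, amount, customer_id) are hypothetical placeholders, and the secondary dataset reference is assumed to follow the same @<dataset name> convention shown for the pre-populated query. Both sketches assume Spark SQL syntax.

  -- Simple (SQLG): only the condition, evaluated against the primary column.
  AMOUNT > 100

  -- Freeform (SQLF): a complete Spark SQL statement. @<dataset name> refers to the
  -- dataset the rule runs against, and a secondary dataset can be joined in.
  -- This sketch selects order records over 100 with no matching approved customer.
  SELECT o.*
  FROM @orders o
  LEFT JOIN @approved_customers c
    ON o.customer_id = c.customer_id
  WHERE o.amount > 100
    AND c.customer_id IS NULL

After writing a query like the Freeform sketch above, click Validate to confirm that it conforms to the expected SQL dialect before you save the rule.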

Configuring Rule Details

To configure rule details, click the Cogwheel icon in the upper right corner of the Workbench to open the Rule Details modal.

The following table shows the optional settings when you set up a rule in the Rule Workbench.

Setting Description
DQ Dimension

Associate your rule with a DQ Dimension, as shown in the following table. This can be especially useful when you use the Collibra Data Intelligence Platform integration and custom reporting.

Dimension Description
Completeness Completeness refers to the degree to which all expected data is present in a dataset. Measured as the percentage of columns that have neither EMPTY nor NULL values.
Timeliness Timeliness refers to the degree to which data is relevant and up-to-date at a certain point in time.
Uniqueness Uniqueness refers to the degree to which each entity is represented by only one record. Measured by the cardinality of columns in your dataset.
Validity Validity refers to the degree to which data conforms to its defining constraints or conditions, which can include data type, range, or format.
Consistency Consistency refers to the degree to which data is free of differing, contradicting, or conflicting entries.
Accuracy Accuracy refers to the degree to which data correctly reflects its intended values.

Note Tagging rules with custom dimensions in the Metastore is not supported.

Category You can create and apply custom categories to your rule to group similar rules together. For example, you can create categories for "Rules for missing data" or "Rules to identify format issues".
Description Give a detailed description of your rule.
Preview Limit (Rows)

The number of available rows to preview when you drill into a rule record.

The default value is 6 records and the maximum value is 500.

Note Rule Break Preview is only supported for Freeform and Simple rules.

Run Time Limit (Minutes) The maximum number of minutes any rule associated with a dataset can take to process before it times out and proceeds to the next activity in the job. If you set this value higher than the current maximum Run Time Limit of any rule associated with your dataset, it becomes the new maximum timeout limit.
Score
Score Description
Low Downscores 1 point for every 1% of records that violate your rule. For example, if 25 of 100 records break your rule, 25 points are deducted from the total data quality score.
Medium Downscores 5 points for every 1% of records that violate your rule. For example, if 2 of 100 records break your rule, 10 points are deducted from the total data quality score.
High Downscores 25 points for every 1% of records that violate your rule. For example, if 3 of 100 records break your rule, 75 points are deducted from the total data quality score.
Scoring Type
Type Description
Percent Default. Deducts points according to the point threshold you set against every percentage of records in violation of your rule. For example, if 25 of 100 records break your rule, and the values you set in both the Points and Percent input fields are 1, then a total of 25 points are deducted from your overall data quality score.
Absolute Deducts points according to the scoring threshold you set in the Points input field. When your rule breaks, the scoring model deducts the total number of points you set. For example, if you set the value of the Points input field to 10 and your rule breaks, then a total of 10 points are deducted from your overall data quality score.
Points A value from 0 to 100, representing the number of points deducted from the total quality score when a record violates your rule.
Percent

When you use the Percent scoring type, this can be a value from 0 to 100. It represents the percentage of breaking records out of the total number of rows. A worked example of the scoring arithmetic follows this table.

Note If you select Absolute as the scoring type, then this option is not available.
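
As an illustration of the scoring arithmetic described in the Score and Scoring Type settings above, and not of the product's internal implementation, the following Spark SQL sketch reproduces the three documented Score examples. The column names (severity, points_per_percent, breaking_records, total_records) are hypothetical; the inline values match the examples in the Score table.

  -- Hypothetical worked example: points deducted = points per 1% x percentage of breaking records.
  SELECT
    severity,
    points_per_percent,
    breaking_records,
    total_records,
    points_per_percent * (breaking_records * 100.0 / total_records) AS points_deducted
  FROM VALUES
    ('Low',    1, 25, 100),  -- 1 point  x 25% breaking = 25 points deducted
    ('Medium', 5,  2, 100),  -- 5 points x 2% breaking  = 10 points deducted
    ('High',  25,  3, 100)   -- 25 points x 3% breaking = 75 points deducted
  AS scores(severity, points_per_percent, breaking_records, total_records)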

Managing a Livy Session

You can preview the columns in your dataset to help you create SQL queries for rules on Pullup jobs. To do this, ensure that you allocate ample resources for Livy to run efficiently. Livy starts when you click Run Result Preview and has 1 core, 1 GB of memory, and 1 worker by default. For large datasets that require increased processing resources, you may need to modify your Livy session to improve the loading time of your Result Preview.

Note When you use Result Preview, there is a 500-row limit on the preview, regardless of the number of rows in the results of your query.

  1. Click Manage Session.
    The Manage Livy Session modal appears.
  2. Enter the required information.
    Setting Description
    Cores The number of executor cores for Spark processing. The minimum value is 1.
    Memory The amount of memory allocated for Spark processing. Allocate at least 1 GB of memory per Livy pod.
    Workers

    The number of executors available in your Livy session. The minimum value is 1.

    When you increase the number of workers, an additional Livy session is created for each additional worker you specify. For example, if you increase your workers from 1 to 3, then 2 additional Livy sessions are created for a total of 3.

    Tip Adjust the Livy resources based on the number of rows included in your query. For queries that include a large number of rows, you may find that you need to increase the memory and workers.

  3. Select the option to acknowledge that all cached preview data will be lost when you update your Livy session.
    Important To update your Livy session, you must select this option.

  4. Click Update.

Note Livy is not required to use the Result Preview feature for Pushdown.