Working with link IDs

Link ID is an out-of-the-box feature that allows you to link the findings of a DQ Job back to the source record, or key, for remediation outside the application. You can specify the link ID when selecting columns during the job creation process. Depending on whether you run jobs in Pullup or Pushdown, the link ID requirements may differ, as shown in the following table.

Processing method Link ID required for break record archival?
Pushdown

No

Important Optional for rules; required for dupes, outliers, and shapes.

Pullup

Yes

Importance of uniqueness

The link ID should be unique and should not contain any null values. It is most commonly the primary key. Depending on the dataset, you may need to select multiple columns to create a composite primary key. Otherwise, if the link ID is not unique, the DQ job may not be able to isolate the specific break records.

Link ID example using sales data

Suppose the dataset contains the following sales data:

NAME TRDATE SALES
John 2022-05-01 1000
John 2022-05-02 2000
John 2022-05-03 1000
John 2022-05-04 2000
John 2022-05-05 0
Steve 2022-05-01 300
Steve 2022-05-02 400
Steve 2022-05-03 200
Steve 2022-05-04 500
Steve 2022-05-05 10000

In this example, the rule specifies that SALES = 0 is a break record. Notice that this condition exists for "John" on "2022-05-05".

Suppose the link ID is set to only one column, NAME, which is not a unique value. In this case, the rule correctly identifies a break record, however, the DQ Job cannot identify the row that constitutes the break record. It can only determine the NAME of the sales associate with the break record "John".

If the link ID is set to both NAME and TRDATE, which together constitutes a unique key, the job can correctly identify the row that constitutes the break record.

Link ID example using NYSE data

In this example, a NYSE dataset has the columns CLOSE and SYMBOL selected as link ID columns. The following rule queries allow you to view rule break records:

  • # both link ID columns are included in the SELECT statement

    SELECT CLOSE, SYMBOL FROM ...     
  • # SELECT * statement includes all columns, including link IDs

    SELECT * FROM ...     

To view the rule break records, you can specify the link ID columns in your SELECT query. Alternatively, SELECT * FROM also allows you to view rule break records because it includes all dataset columns in your query.

Conversely, when CLOSE and SYMBOL are the only specified link ID columns in the same NYSE dataset, the following query will not display rule break records in the Rule Breaks dialog box on the Findings page:

  • # only one of the link ID columns included in the SELECT statement

    SELECT close FROM ...     

Data Quality & Observability Classic supports one or many primary key columns in your dataset for record linkage to your original table, file, or data frame. If your primary key column contains many columns, use a comma to separate them.

No personal data is stored in the Metastore when using link ID (unless, of course, the column used as the link ID contains PII). The Metastore only stores:

  • The rule applied to your DQ Job.
  • The dataset used for your DQ Job.
  • The column of reference.
  • The link ID.

For more on sensitive information, go to the Data Class and Sensitivity Labels documentation.

Set up link IDs

  1. Create a DQ job as described in Create a DQ Job. Be sure to assign link IDs when selecting your columns.
  2. Before running the job, open the settings to configure the desired break record archive behavior. Depending on the type of job, follow the steps provided based on the type of job:

View break records

You can view rule break records using one of two methods:

  • View break records from the Rules tab of the Findings page.
  • Export a CSV using the Actions button on the Rules tab.

Rules with break records have associated link IDs that link back to the original dataset.

View break records from the Findings page

  1. On the Findings page, open the Rules tab.
  2. Click the Actions button and select Rule Breaks.
  3. The Rule Breaks dialog box opens.
Note You can also use the Rule Discovery feature to apply sensitivity labels to data classes and trigger breaks for all the records that do not match your link ID. The Rule Discovery feature is described in more detail on the Rules tab of the Rule Exploring data quality findings page.

Export a CSV with link IDs

  1. On the Findings page, open the Rules tab.
  2. Click the Actions button and select Rule Breaks.
  3. The Rule Breaks dialog box opens.
  4. Click Download CSV.
Note You can also export a .xlsx file that lists the link IDs set up for your dataset from the Export tab on the Findings page.

Activity Usage

Activity Supported Description
SHAPE Yes One example of each shape issue will have a link back to the corrupt record for remediation.
OUTLIER Yes Each outlier will have a link back to the detected record for remediation. If you apply a limit you will only get the limited amount. Not available on categorical.
DUPE No Each duplicate or fuzzy match will have a link back to the record for remediation.
SOURCE Partial Each source record that has a cell value that doesn't match to the target will have a link for remediation. SOURCE will not have links for row counts and schema as these are not record level findings.
RULE Yes

Break records for Freeform and Simple rule types will be stored (any records that did meet the condition of the RULE will be provided with the linkID columns). These are stored as delimited strings in the rule_breaks table along with the dataset, run_id and rule name. Please note that, when using Freeform SQL, the linkID columns should be part of the select statement. LinkID columns should be unique identifiers.

When incorporating secondary datasets of the same or different connection type into your rule, please ensure that your rule structure supports data preview visibility. In cases in which no preview records appear, consider refining the query to reference only the required columns (for example, the link ID from the primary dataset) instead of using broad SELECT * statements across both datasets.

BEHAVIOR No This class of data change is when a section of your data is drifting from its normal tendency there is no 1 record to link.
SCHEMA No This class of data change is at a schema/dataset level there are no records to link.
RECORD Partial In some cases when a record is added or removed it may be available for linking.
PATTERN No Patterns are not always a direct link.

What's next?