Working with Link IDs

Link ID is an out-of-the-box feature that lets you link the findings of a DQ Job back to the source record, or key, for remediation outside the application. Link IDs are required for archiving break records. Specify the link ID when creating a DQ job and selecting columns.

Importance of uniqueness

The link ID should be unique and should not contain any null values. It is most commonly the primary key. Depending on the dataset, you may need to select multiple columns to create a composite primary key. Otherwise, if the link ID is not unique, the DQ job may not be able to isolate the specific break records.

Link ID example using sales data

Suppose the dataset contains the following sales data:

NAME TRDATE SALES
John 2022-05-01 1000
John 2022-05-02 2000
John 2022-05-03 1000
John 2022-05-04 2000
John 2022-05-05 0
Steve 2022-05-01 300
Steve 2022-05-02 400
Steve 2022-05-03 200
Steve 2022-05-04 500
Steve 2022-05-05 10000

In this example, the rule specifies that (SALES=0) is a break record. Notice that this condition exists for "John" on "2022-05-05".

Suppose the link ID is set to only one column, NAME, which is not a unique value. In this case, the rule correctly identifies a break record, however, the DQ job cannot identify the row that constitutes the break record. It can only determine the NAME of the sales associate with the break record ("John").

If the link ID is set to both NAME and TRDATE, which together constitutes a unique key, the job can correctly identify the row that constitutes the break record.

Link ID example using NYSE data

In this example, a NYSE dataset with the columns CLOSE and SYMBOL selected as link ID columns. The following rule queries allow you to view rule break records:

  • SELECT CLOSE, SYMBOL FROM ...     # both link ID columns are included in the SELECT statement
  • SELECT * FROM ...     # SELECT * statement includes all columns, including link IDs

To view the rule break records, specify the link ID columns in your SELECT query. Alternatively, SELECT * FROM also allows you to view rule break records because it includes all dataset columns in your query.

Conversely, when CLOSE and SYMBOL are the only specified link ID columns in the same NYSE dataset, the following query will not display rule break records in the Rule Breaks modal on the Findings page:

  • SELECT close FROM ...     # only one of the link ID columns included in the SELECT statement

Collibra DQ supports one or many primary key columns in your dataset for record linkage to your original table, file, or data frame. If your primary key column contains many columns, use a comma to delineate.

No personal data is stored in the Metastore when using link ID (unless, of course, the column used as the link ID contains PII). The Metastore only stores:

  • The rule applied to your DQ Job.
  • The dataset used for your DQ Job.
  • The column of reference.
  • The link ID.

For further reading on sensitive information, see the Data Class and Sensitivity Labels documentation.

Set up link IDs

  1. Create a DQ job as described in Create a DQ Job. Be sure to assign link IDs when selecting your columns.

  2. Before running the job, open the settings to configure the desired break record archive behavior. Depending on the type of job, follow the steps provided based on the type of job:

View break records

You can view rule break records using one of two methods:

  • View break records from the Rules tab of the Findings page.
  • Export a CSV using the Actions button on the Rules tab.

Rules with break records have associated link IDs that link back to the original dataset.

View break records from the Findings page

  1. On the Findings page, open the Rules tab.

  2. Click the Actions button and select Rule Breaks. The Rule Breaks dialog box opens.

Note You can also use the Rule Discovery feature to apply sensitivity labels to data classes and trigger breaks for all the records that do not match your link ID. The Rule Discovery feature is described in more detail on the Rules tab of the Rule Exploring data quality findings page.

Export a CSV with link IDs

  1. On the Findings page, open the Rules tab.

  2. Click the Actions button and select Rule Breaks. The Rule Breaks dialog box opens.

  3. Click Download CSV.

Note You can also export a .xlsx file that lists the link IDs set up for your dataset from the Export tab on the Findings page.

Activity Usage

Activity Supported Description
SHAPE Yes One example of each shape issue will have a link back to the corrupt record for remediation.
OUTLIER Yes Each outlier will have a link back to the detected record for remediation. If you apply a limit you will only get the limited amount. Not available on categorical.
DUPE No Each duplicate or fuzzy match will have a link back to the record for remediation.
SOURCE Partial Each source record that has a cell value that doesn't match to the target will have a link for remediation. SOURCE will not have links for row counts and schema as these are not record level findings.
RULE Yes Break records for Freeform and Simple rule types will be stored (any records that did meet the condition of the RULE will be provided with the linkID columns). These are stored as delimited strings in the rule_breaks table along with the dataset, run_id and rule name. Please note that, when using Freeform SQL, the linkID columns should be part of the select statement. LinkID columns should be unique identifiers.
BEHAVIOR No This class of data change is when a section of your data is drifting from its normal tendency there is no 1 record to link.
SCHEMA No This class of data change is at a schema/dataset level there are no records to link.
RECORD Partial In some cases when a record is added or removed it may be available for linking.
PATTERN No Patterns are not always a direct link.