DQ Job Link ID
Link ID is an out-of-the-box feature that lets you link the findings of a DQ Job back to the source record, or key, for remediation outside the application. The link ID should be unique and is most commonly the primary key. Composite primary key is also supported.
To view the rule break records, any link ID columns included in your dataset must be the only columns specified in your SELECT
query. Alternatively, SELECT * FROM
also allows you to view rule break records because it includes all dataset columns in your query.
For example, a NYSE dataset with the columns "close" and "exch" selected as link ID columns, the following queries allow you to view rule break records:
-
SELECT close, exch FROM ... # both link ID columns are included in the SELECT statement
-
SELECT * FROM ... # SELECT * statement includes all columns, including link IDs
Conversely, when "close" or "exch" are the only specified link ID columns in the same NYSE dataset, the following query will not display rule break records in the Rule Breaks modal on the Findings page:
-
SELECT close FROM ... # only one of the link ID columns included in the SELECT statement
Collibra DQ supports one or many primary key columns in your dataset for record linkage to your original table, file, or data frame. If your primary key column contains many columns, use a comma to delineate.
Combining link ID and Run Discovery
To combine the features of link ID and Run Discovery, first enable link ID and then use Rule Discovery on the Rules tab of the Findings page. This lets you apply sensitivity labels to data classes and trigger breaks for all the records that do not match your link ID.
Link ID and the Metastore
No personal data is stored in the Metastore when using link ID. The Metastore only stores:
- The rule applied to your DQ Job.
- The dataset used for your DQ Job.
- The column of reference.
- The link ID.
For further reading on sensitive information, see the Data Class and Sensitivity Labels documentation.
Viewing break records
You can view rule break records from the Rules tab on the Findings page. Rules with break records have associated link IDs that link back to the original dataset. All remediation for data quality issues is performed outside of Collibra DQ.
Steps
- From Explorer, select a data source.Follow the first 2 steps on creating DQ Jobs.
- From the Job Creator step, click Partial Scan.
- In the Select Columns step, click Assign in the Link ID column to assign a column in your Job as a link ID.
- Click Assign in the Key column to assign one or more columns in your Job as the primary key(s). You can create a composite key, by selecting multiple columns.
- Optionally select your rows.
- Optionally add transforms.
- Optionally create a mapping.
- Optionally add data quality monitors.
- Review the sizing of your Job.
- Review your Job.
- Run your Job.
- Open the Jobs page and click the name of your DQ Job from the list.
- Click the Rules link in the metadata bar at the top of the page. The Dataset Rules page opens.
- Add a rule.
- Re-run your Job.
- Open the Jobs page and click the name of your DQ Job from the list.
The Findings page of your DQ Job opens. - Click the Rules tab.
- In the far right Actions column, click Actions. The Rule Breaks dialog appears.
- View or download the rule breaks, observing that the data from the column you assigned as the Link ID column in Step 3 above is present in the linkId column of the CSV file and Sample File Preview.
Note All Simple (SQLG) and Freeform (SQLF) rules are eligible for link ID.
Notebook
val opt = new OwlOptions()
opt.runId = "2018-02-24"
opt.dataset = "orders"
opt.linkId = Array("transaction_id", "trans_time")
Command Line
./owlcheck -ds orders \
-rd "2018-02-24" \
-linkid transaction_id,trans_time
Note For rules to use linkID, the columns need to be present in the select statement (either select * or select specific column names). All Simple rules are eligible for linkID and Freeform rules need to contain the columns in the projection part of the SQL statement.
Activity Usage
Activity | Supported | Description |
---|---|---|
SHAPE | Yes | One example of each shape issue will have a link back to the corrupt record for remediation. |
OUTLIER | Yes | Each outlier will have a link back to the detected record for remediation. If you apply a limit you will only get the limited amount. Not on categorical. |
DUPE | Yes | Each duplicate or fuzzy match will have a link back to the record for remediation. |
SOURCE | Partial | Each source record that has a cell value that doesn't match to the target will have a link for remediation. SOURCE will not have links for row counts and schema as these are not record level findings. |
RULE | Yes | Break records for Freeform and Simple rule types will be stored (any records that did not meet the condition of the RULE will be provided with the linkID columns). These are stored as delimited strings in the rule_breaks table along with the dataset, run_id and rule name. Please note when using Freeform SQL the linkID columns should be part of the select statement. LinkID columns should be unique identifiers. |
BEHAVIOR | No | This class of data change is when a a section of your data is drifting from its normal tendency there is no 1 record to link. |
SCHEMA | No | This class of data change is at a schema/dataset level there are no records to link. |
RECORD | Partial | In some cases when a record is added or removed it may be available for linking. |
PATTERN | No | Patterns are not always a direct link. This item is still under performance review. |
Notebook API Example
+------------+----------+-------+-------+-----+-----------------+---------------+
| dataset| runId|fieldNm| format|count| percent| transaction_id|
+------------+----------+-------+-------+-----+-----------------+---------------+
| order |2018-02-24| fname|xxxx'x.| 1|7.142857142857142|t-1232 |
+------------+----------+-------+-------+-----+-----------------+---------------+
owl.getShapesDF
Rest API Example
When supplying a linkID, Collibra DQ naturally excludes this field from most activities, meaning a unique ID or primary key column can not be duplicative or it would not be the primary key. Because of this, it is not evaluated for duplicates. The same is true for Outliers and Shapes, as a large sequence number or other variations might trigger a false positive when this column is denoted to be simply for the purpose of linking uniquely back to the source. If you also want to evaluate this column and link it, create a derived column with a different name and Collibra DQ will naturally handle both cases.
owl.getShapes
owl.getDupes
owl.getOutliers
owl.getRuleBreaks
owl.getSourceBreaks
getRules()
----Rules----
+-----------------+----------+--------------------+------------------+------+
| dataset| runId| ruleNm| ruleValue|linkId|
+-----------------+----------+--------------------+------------------+------+
|dataset_outlier_3|2018-02-24| fname_like_Kirk|fname like 'Kirk' | c-41|
|dataset_outlier_3|2018-02-24| fname_like_Kirk|fname like 'Kirk' | c-42|
|dataset_outlier_3|2018-02-24| fname_like_Kirk|fname like 'Kirk' | c-43|
|dataset_outlier_3|2018-02-24| fname_like_Kirk|fname like 'Kirk' | c-44|
|dataset_outlier_3|2018-02-24| fname_like_Kirk|fname like 'Kirk' | c-45|
|dataset_outlier_3|2018-02-24|if_email_is_valid...| email| c-31|
|dataset_outlier_3|2018-02-24|if_email_is_valid...| email| c-33|
|dataset_outlier_3|2018-02-24|if_zip_is_valid_Z...| zip| c-40|
+-----------------+----------+--------------------+------------------+------+
getDupes()
First split on ~~ then if you have a multiple part key split on ~|.
----Dupes----
+-----------------+----------+-----+--------------------+----------+
| dataset| runId|score| key| linkId|
+-----------------+----------+-----+--------------------+----------+
|dataset_outlier_3|2018-02-24| 100|9ec828d5194fa397b...|c-45~~c-36|
|dataset_outlier_3|2018-02-24| 100|1f96274d1d10c9f77...|c-45~~c-35|
|dataset_outlier_3|2018-02-24| 100|051532044be286f99...|c-45~~c-44|
|dataset_outlier_3|2018-02-24| 100|af2e96921ae53674a...|c-45~~c-43|
|dataset_outlier_3|2018-02-24| 100|ad6f04bf98b38117a...|c-45~~c-42|
|dataset_outlier_3|2018-02-24| 100|1ff7d50a7a9d07d02...|c-45~~c-41|
|dataset_outlier_3|2018-02-24| 100|6ed858ed1f4178bb0...|c-45~~c-40|
|dataset_outlier_3|2018-02-24| 100|d2903703b348fb4cb...|c-45~~c-39|
|dataset_outlier_3|2018-02-24| 100|24bf54412de1e720d...|c-45~~c-38|
|dataset_outlier_3|2018-02-24| 100|7a7ce0beb41b39564...|c-45~~c-37|
+-----------------+----------+-----+--------------------+----------+
getRuleBreaks()
The getRuleBreaks endpoint retrieves all broken records within your dataset. There is no size limit to this API.
----Rule-Breaks----
+-----------------+----------+--------------------+------+
| dataset| runId| ruleNm|linkId|
+-----------------+----------+--------------------+------+
|dataset_outlier_3|2018-02-24| fname_like_Kirk| c-41|
|dataset_outlier_3|2018-02-24| fname_like_Kirk| c-42|
|dataset_outlier_3|2018-02-24| fname_like_Kirk| c-43|
|dataset_outlier_3|2018-02-24| fname_like_Kirk| c-44|
|dataset_outlier_3|2018-02-24| fname_like_Kirk| c-45|
|dataset_outlier_3|2018-02-24|if_email_is_valid...| c-31|
|dataset_outlier_3|2018-02-24|if_email_is_valid...| c-33|
|dataset_outlier_3|2018-02-24|if_zip_is_valid_Z...| c-40|
+-----------------+----------+--------------------+------+