Archiving break records from Pushdown jobs
Collibra Data Quality & Observability Pushdown uses SQL queries native to the data source in order to connect directly to the source system and submit the data quality workload. This gives Collibra DQ the unique opportunity to store break records that are typically written to the PostgreSQL Metastore in a database or schema specifically allocated for Collibra DQ. By archiving break records alongside the database or schema, you can:
- Eliminate network traffic costs for better performance.
- Reduce storage limit restrictions.
- Improve security by generating break records that never leave the source system.
- Enhance your reporting abilities by removing the need to migrate break records from the PostgreSQL Metastore to your data source.
- Join datasets against your break records.
Note When Archive Break Records is enabled, a limited number of break records are still stored in the PostgreSQL Metastore to power the DQ web application. To prevent break record storage in the Metastore, open the settings modal on the Explorer page of your dataset and turn on Data Preview from Source. This prevents records from leaving your data source and removes any others from the Metastore.
The Archive Break Records component uses a workload statement in the form of a data quality query to generate break records that get pushed into the source system for processing. Instead of returning a result set, the workload statement redirects its output to a table in your data source schema or database that is specifically allocated for Collibra DQ. The records written to the data source table contain source record linkage details that can be used to identify the source system record. These are called Link IDs and are required to write break records to the data source. Additionally, these records contain metadata generated by Collibra DQ that you may find useful when reviewing your break records.
When a workload is complete, a query is generated and stored in the data source, which allows you to easily access your break records. A results sample of break records may be optionally retrieved and used to generate a limited data preview of break records in the DQ Metastore.
Setting up break record archival
To automatically write break records to a database or schema in your data source, you first need to configure your connection's output location and enable Archive Break Records from the Settings modal on the Explorer page.
Compatible Pushdown connections
The following Pushdown connections support the archival of break records:
- Snowflake
- Databricks
- BigQuery
- Athena
- Redshift
- Trino
- SQL Server
- SAP HANA
Prerequisites
- You have a Pushdown connection.
- The service account user has write access to the database or schema to archive break records.
Important
Data sources may have different combinations of roles and permissions to allow the creation of break record tables and the ability to write records to them. Therefore, we recommend you review the connection requirements specific to your data source when you use Archive Break Records.
Specify the archive location for the connection
- Open Connection Management and edit your connection.
- In the Edit Connection dialog box, select Archive Break Records.
- Specify the schema to archive break records in the Archive Break Records Location field. Keep in mind the following tips for selecting the schema:
- If you enter an incorrect schema, Collibra DQ automatically uses the default output location, PUBLIC, instead.
- Because it is optional to specify a database, if you only specify the schema, the database defaults to your current database.
- You can enter database.schema when your schema resides outside of the current database. For example, db1.PUBLIC, db2.marketing, and db3.sales.
- After you select the archive location, Collibra DQ creates the following tables in the database to store the break records:
- COLLIBRA_DQ_BREAKS
- COLLIBRA_DQ_DUPLICATES
- COLLIBRA_DQ_OUTLIERS
- COLLIBRA_DQ_RULES
- COLLIBRA_DQ_SHAPES
Create the job
-
Create a DQ job as described in Create a DQ Job.
- In the Select Columns step, assign a Link ID to a column.
- In the lower left corner, click
Settings.
- Select the Archive Break Records option. By default, turning on Archive Break Records enables the export of break records for all available finding types.
- Click Save.
- Set up and run your DQ job.
- When a record breaks, its metadata exports automatically to the data source.
Important To get break records, it's important to specify a Link ID. The column you assign as Link ID should not contain NULL values and its values should be unique, most commonly the primary key. Composite primary key is also supported.
Viewing break record details
You can archive break records from the following finding types of Pushdown jobs:
Finding type | Database table |
---|---|
Rules | COLLIBRA_DQ_RULES |
Outliers | COLLIBRA_DQ_OUTLIERS |
Shapes | COLLIBRA_DQ_SHAPES |
Duplicates | COLLIBRA_DQ_DUPLICATES |
The records for each of these finding types contain a unique set of metadata associated with the particular break type. When Archive Break Records is turned on, Collibra DQ sends this metadata to the break record table in the data source.
Rules
Column name | Type | Default | Description |
---|---|---|---|
key | long | auto increment | The automatically generated number that identifies your break record. |
job_uuid | uuid | NOT NULL | The automatically generated number that uniquely identifies your job. For example, 19745. |
dataset | varchar | NOT NULL | The name of the dataset. |
rule_name | varchar | NOT NULL | The name of the rule. |
exception | varchar | NULL | The exception message of a rule that is neither Passing nor Breaking. |
link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18. |
Outliers
Column name | Type | Default | Description |
---|---|---|---|
key | long | auto increment | The automatically generated number that identifies your break record. |
job_uuid | uuid | NOT NULL | The automatically generated number that uniquely identifies your job. For example, 19745. |
dataset | varchar | NOT NULL | The name of the dataset. |
id | int | NOT NULL | The job ID associated with the break record. |
key_column | varchar | NULL | When a key column is assigned, this is the name of that column. |
key_value | varchar | NULL | When a key column is assigned, this is the value in that column. |
value_column | varchar | NOT NULL | The name of the column that contains the outlier. |
value | varchar | NOT NULL | The value of the outlier finding. |
date_value | date | NULL | When a date column is assigned, this is the date column value. |
type | varchar | NOT NULL | Indicates whether the outlier finding is categorical or numerical. |
median | varchar | NULL | The median of the value of a numerical outlier. |
lb | double | NULL | The lower bound of the values of a numerical outlier. |
ub | double | NULL | The upper bound of the values of a numerical outlier. |
confidence | int | NOT NULL | The confidence score of the outlier finding. |
is_outlier | boolean | NULL | Shows whether a finding is an outlier. |
is_historical | boolean | NULL | Shows whether a finding is a historical outlier. |
is_topn | boolean | NULL | Shows whether a finding is a topN value. |
is_botn | boolean | NULL | Shows whether a finding is a bottomN value. |
source_date | varchar | NULL | The date when an outlier occurred. |
frequency | long | NULL | The number of occurrences of an outlier. |
percentile | double | NULL | The threshold that helps identify extreme outlier values. |
link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18. |
Shapes
Column name | Type | Default | Description |
---|---|---|---|
key | long | auto increment | The automatically generated number that identifies your break record. |
job_uuid | uuid | NOT NULL | The automatically generated number that uniquely identifies your job. For example, 19745. |
dataset | varchar | NOT NULL | The name of the dataset. |
column_name | varchar | NOT NULL | The name of the column with shape findings. |
shape_type | varchar | NOT NULL | The data type of the shape finding. For example, string. |
shape_value | varchar | NOT NULL | The column value of the shape finding. |
shape_count | int | NOT NULL | The number of columns that conform to the shape that Collibra DQ discovered. |
shape_rate | double | NOT NULL | The percentage a shape conforms to the format that Collibra DQ identifies as normal. |
shape_length | int | NOT NULL | The format of the shape. For example, "xxx & xxxx" could represent the street "3rd & Main" |
link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18. |
Duplicates
Column name | Type | Default | Description |
---|---|---|---|
key | long | auto increment | The automatically generated number that identifies your break record. |
updts | timestamp | default | The timestamp of when the job ran. |
job_uuid | varchar | NOT NULL | The automatically generated number that uniquely identifies your job. For example, 19745. |
dataset | varchar | NOT NULL | The name of the dataset that contains a possible duplicate. |
run_id | timestamp | NOT NULL | The timestamp of when the job ran. |
key_id | integer | NOT NULL | A unique identifier for each duplicate record that is unique even for similar duplicates found in other runs. Each time a job run, new key values are created for each duplicate record. |
dupe1 | varchar | NOT NULL | The name of the column that Collibra DQ identifies as a duplicate value of dupe2. |
dupe1_link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18. |
dupe2 | varchar | NULL | The name of the column that Collibra DQ identifies as a duplicate value of dupe1. |
dupe2_link_id | varchar | NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18. |
score | integer | NOT NULL | The duplicate score of the column where Collibra DQ detects a potential duplicate. |
Unlike in Pullup mode, when Collibra DQ detects a duplicate value in a Pushdown job with Link IDs assigned, all columns included in the source query of the row where the break is detected display on the Findings page. In Pullup mode, only Link ID columns display on the Findings page when a duplicate value is detected, because the data_preview Metastore table only stores Link ID columns.
Limitations
- Rule break record archival for complex rules, such as those with secondary datasets, is not supported. Rules that reference secondary datasets cannot share column names, because Collibra DQ cannot programmatically identify the table from which these columns come.
- A workaround is to avoid tables with these shared column names or alias on the columns.