Archiving break records from Pushdown jobs

Collibra Data Quality & Observability Pushdown connects directly to the source system and submits the data quality workload using SQL queries native to the data source. This gives Collibra DQ the opportunity to store break records, which are typically written to the PostgreSQL Metastore, in a database or schema in the source system specifically allocated for Collibra DQ. By archiving break records alongside your source data, you can:

  • Eliminate network traffic costs for better performance.
  • Reduce storage limit restrictions.
  • Improve security by generating break records that never leave the source system.
  • Enhance your reporting abilities by removing the need to migrate break records from the PostgreSQL Metastore to your data source.
  • Join datasets against your break records.

Note When Archive Break Records is enabled, a limited number of break records are still stored in the PostgreSQL Metastore to power the DQ web application. To prevent break records from being stored in the Metastore, open the Settings modal on the Explorer page of your dataset and turn on Data Preview from Source. This prevents records from leaving your data source and removes any previously stored break records from the Metastore.

The Archive Break Records component uses a workload statement in the form of a data quality query to generate break records that get pushed into the source system for processing. Instead of returning a result set, the workload statement redirects its output to a table in your data source schema or database that is specifically allocated for Collibra DQ. The records written to the data source table contain source record linkage details that can be used to identify the source system record. These are called Link IDs and are required to write break records to the data source. Additionally, these records contain metadata generated by Collibra DQ that you may find useful when reviewing your break records.
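The Link ID examples later in this article (for example, DST~|2018-01-18) suggest that the values of multiple Link ID columns are concatenated with a delimiter into a single link_id value. As an illustrative sketch only, assuming the ~| delimiter seen in that example (the helper names here are not a Collibra DQ API), composing and splitting such a value might look like:

```python
# Illustrative sketch: composing a composite Link ID from multiple key
# columns. The "~|" delimiter is an assumption based on the documented
# example value "DST~|2018-01-18"; this is not Collibra DQ code.
LINK_ID_DELIMITER = "~|"

def compose_link_id(*column_values: str) -> str:
    """Join the Link ID column values for one source row."""
    return LINK_ID_DELIMITER.join(column_values)

def split_link_id(link_id: str) -> list[str]:
    """Recover the individual column values from a stored link_id."""
    return link_id.split(LINK_ID_DELIMITER)

# An NYSE row with Link ID columns "Symbol" and "Trade Date":
link_id = compose_link_id("DST", "2018-01-18")
print(link_id)                  # DST~|2018-01-18
print(split_link_id(link_id))   # ['DST', '2018-01-18']
```

Because the link_id is the only bridge back to the source record, the uniqueness and NOT NULL requirements on Link ID columns described below are what make this round trip reliable.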

When a workload is complete, a query is generated and stored in the data source, which allows you to easily access your break records. A results sample of break records may be optionally retrieved and used to generate a limited data preview of break records in the DQ Metastore.

pushdown archive break record diagram

Setting up break record archival

To automatically write break records to a database or schema in your data source, you first need to configure your connection's output location and enable Archive Break Records from the Settings modal on the Explorer page.

Compatible Pushdown connections

The following Pushdown connections support the archival of break records:

  • Snowflake
  • Databricks
  • BigQuery
  • Athena
  • Redshift
  • Trino
  • SQL Server
  • SAP HANA

    Note Because Pushdown for BigQuery is currently available as a Public beta offering, only rule break records can be archived at this time.


Prerequisites

  • You have an active Pushdown connection.
  • The service account user has write access to archive break records to the database or schema.

    Important 
    Data sources may have different combinations of roles and permissions for creating break record tables and writing records to them. Therefore, we recommend that you review the connection requirements specific to your data source when you use Archive Break Records.

Steps

  1. Click the icon and then click Connections.
    The Connections Management page opens.
  2. Select a Pushdown-compatible data source.
    The Add/Edit Connection modal opens.
  3. Click Existing Connections at the bottom of the connection template of your data source, then select your connection.
  4. Select the Archive Break Records option.
    The Archive Break Records Location field appears below it.
  5. In the Archive Break Records Location field, specify a schema output location in your data source to which break records are sent.

    Note If you enter an incorrect schema, Collibra DQ automatically uses the default output location, PUBLIC, instead.

    Tip 
    You can enter database.schema when your schema resides outside of the current database. For example, db1.PUBLIC, db2.marketing, and db3.sales. Because specifying a database is optional, if you only specify the schema, the database defaults to your current database.

    Example screenshot...archive break records connection setup example

  6. Click Save.
  7. Open explorer icon Explorer and connect to your Pushdown data source.
  8. Select a scanning method.
  9. In the Select Columns step, assign a Link ID to a column.

    Important To get break records, you must assign a Link ID. The column you assign as a Link ID should not contain NULL values, and its values should be unique.

  10. In the lower-left corner, click Settings.
  11. Select the Archive Break Records option.
    By default, turning on Archive Break Records enables the export of break records for all available layers.

    Example screenshot...archive break records settings

  12. Click Save.
  13. Set up and run your DQ job.
    When a record breaks, its metadata is automatically exported to the data source.

Viewing break record details

You can archive break records from the following layers of Pushdown jobs:

  • Rules
  • Outliers
  • Shapes
  • Duplicates

The records for each of these layers contain a unique set of metadata associated with the particular break type. When Archive Break Records is turned on, Collibra DQ sends this metadata to the break record table in the data source.

Rules

Column Name | Type | Default | Description
key | long | auto increment | The automatically generated number that identifies your break record.
job_uuid | uuid | NOT NULL | The automatically generated identifier that uniquely identifies your job. For example, 19745.
dataset | varchar | NOT NULL | The name of the dataset.
rule_name | varchar | NOT NULL | The name of the rule.
exception | varchar | NULL | The exception message of a rule that is neither Passing nor Breaking.
link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
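Because each rule break row carries the link_id of the offending source record, you can join the archived break table back to the source table to inspect the full rows behind the findings. The table and column names in this sqlite3 sketch are hypothetical (Collibra DQ creates the actual archive table in your configured output location); it only illustrates the join pattern:

```python
import sqlite3

# Hypothetical tables: "nyse" as the source table and "rule_breaks"
# standing in for the archived rules break table described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nyse (symbol TEXT PRIMARY KEY, close_price REAL);
    INSERT INTO nyse VALUES ('DST', -1.0), ('ABC', 42.5);

    -- Columns mirror the documented rules break record schema.
    CREATE TABLE rule_breaks (
        key       INTEGER PRIMARY KEY AUTOINCREMENT,
        job_uuid  TEXT NOT NULL,
        dataset   TEXT NOT NULL,
        rule_name TEXT NOT NULL,
        exception TEXT,
        link_id   TEXT NOT NULL
    );
    INSERT INTO rule_breaks (job_uuid, dataset, rule_name, link_id)
    VALUES ('c0ffee', 'nyse_dataset', 'positive_price', 'DST');
""")

# Join breaks back to the source rows they point at via link_id.
rows = conn.execute("""
    SELECT b.rule_name, s.symbol, s.close_price
    FROM rule_breaks b
    JOIN nyse s ON s.symbol = b.link_id
""").fetchall()
print(rows)  # [('positive_price', 'DST', -1.0)]
```

This is the kind of join that archiving makes possible without first migrating break records out of the Metastore.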

Outliers

Column Name | Type | Default | Description
key | long | auto increment | The automatically generated number that identifies your break record.
job_uuid | uuid | NOT NULL | The automatically generated identifier that uniquely identifies your job. For example, 19745.
dataset | varchar | NOT NULL | The name of the dataset.
id | int | NOT NULL | The job ID associated with the break record.
key_column | varchar | NULL | When a key column is assigned, this is the name of that column.
key_value | varchar | NULL | When a key column is assigned, this is the value in that column.
value_column | varchar | NOT NULL | The name of the column that contains the outlier.
value | varchar | NOT NULL | The value of the outlier finding.
date_value | date | NULL | When a date column is assigned, this is the date column value.
type | varchar | NOT NULL | Indicates whether the outlier finding is categorical or numerical.
median | varchar | NULL | The median of the value of a numerical outlier.
lb | double | NULL | The lower bound of the values of a numerical outlier.
ub | double | NULL | The upper bound of the values of a numerical outlier.
confidence | int | NOT NULL | The confidence score of the outlier finding.
is_outlier | boolean | NULL | Shows whether a finding is an outlier.
is_historical | boolean | NULL | Shows whether a finding is a historical outlier.
is_topn | boolean | NULL | Shows whether a finding is a topN value.
is_botn | boolean | NULL | Shows whether a finding is a bottomN value.
source_date | varchar | NULL | The date when an outlier occurred.
frequency | long | NULL | The number of occurrences of an outlier.
percentile | double | NULL | The threshold that helps identify extreme outlier values.
link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
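For numerical outliers, the lb and ub columns record the expected value range, so a finding flags a value that falls outside [lb, ub]. A small illustrative check of that relationship (this is not Collibra DQ code, just a restatement of the column semantics):

```python
# Illustrative only: mirrors the lb/ub semantics of numerical outlier
# break records; not part of Collibra DQ itself.
def is_numerical_outlier(value: float, lb: float, ub: float) -> bool:
    """A value outside the expected [lb, ub] range is an outlier."""
    return value < lb or value > ub

# Expected trading volume between 1,000 and 50,000 shares:
print(is_numerical_outlier(75_000.0, 1_000.0, 50_000.0))  # True
print(is_numerical_outlier(20_000.0, 1_000.0, 50_000.0))  # False
```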

Shapes

Column Name | Type | Default | Description
key | long | auto increment | The automatically generated number that identifies your break record.
job_uuid | uuid | NOT NULL | The automatically generated identifier that uniquely identifies your job. For example, 19745.
dataset | varchar | NOT NULL | The name of the dataset.
column_name | varchar | NOT NULL | The name of the column with shape findings.
shape_type | varchar | NOT NULL | The data type of the shape finding. For example, string.
shape_value | varchar | NOT NULL | The column value of the shape finding.
shape_count | int | NOT NULL | The number of columns that conform to the shape that Collibra DQ discovered.
shape_rate | double | NOT NULL | The percentage a shape conforms to the format that Collibra DQ identifies as normal.
shape_length | int | NOT NULL | The format of the shape. For example, "xxx & xxxx" could represent the street "3rd & Main".
link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.

Duplicates

Column Name | Type | Default | Description
key | long | auto increment | The automatically generated number that identifies your break record.
updts | timestamp | default | The timestamp of when the job ran.
job_uuid | varchar | NOT NULL | The automatically generated identifier that uniquely identifies your job. For example, 19745.
dataset | varchar | NOT NULL | The name of the dataset that contains a possible duplicate.
run_id | timestamp | NOT NULL | The timestamp of when the job ran.
key_id | integer | NOT NULL | A unique identifier for each duplicate record; it is unique even across similar duplicates found in other runs. Each time a job runs, new key values are created for each duplicate record.
dupe1 | varchar | NOT NULL | The name of the column that Collibra DQ identifies as a duplicate value of dupe2.
dupe1_link_id | varchar | NOT NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
dupe2 | varchar | NULL | The name of the column that Collibra DQ identifies as a duplicate value of dupe1.
dupe2_link_id | varchar | NULL | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
score | integer | NOT NULL | The duplicate score of the column where Collibra DQ detects a potential duplicate.

Unlike in Pullup mode, when Collibra DQ detects a duplicate value in a Pushdown job with Link IDs assigned, the Findings page displays all columns included in the source query for the row where the break is detected. In Pullup mode, only the Link ID columns display on the Findings page when a duplicate value is detected, because the data_preview Metastore table stores only Link ID columns.
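Because a duplicates break record carries a link ID for both sides of the pair (dupe1_link_id and dupe2_link_id), each finding can be resolved back to the two source rows it compares. A minimal sketch of that lookup, using hypothetical data keyed by composite link IDs:

```python
# Illustrative sketch: resolving both sides of a duplicate finding back
# to source rows via dupe1_link_id / dupe2_link_id. The rows and the
# "~|" composite link ID format are hypothetical examples.
source_rows = {
    "DST~|2018-01-18": {"symbol": "DST", "trade_date": "2018-01-18"},
    "DST~|2018-01-19": {"symbol": "DST", "trade_date": "2018-01-19"},
}

duplicate_break = {
    "dupe1_link_id": "DST~|2018-01-18",
    "dupe2_link_id": "DST~|2018-01-19",
    "score": 95,
}

pair = (source_rows[duplicate_break["dupe1_link_id"]],
        source_rows[duplicate_break["dupe2_link_id"]])
print(pair)
```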

Limitations

  • Rule break record archival for complex rules, such as those with secondary datasets, is not supported. Rules that reference secondary datasets cannot share column names, because Collibra DQ cannot programmatically identify the table from which these columns come.
    • As a workaround, avoid tables that share column names, or alias the shared columns.
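The aliasing workaround can be illustrated with plain SQL: when two tables share a column name, qualifying and aliasing each reference makes the owning table unambiguous. This sqlite3 sketch uses hypothetical table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Both tables share the ambiguous column name "amount".
    CREATE TABLE orders   (id INTEGER, amount REAL);
    CREATE TABLE payments (id INTEGER, amount REAL);
    INSERT INTO orders   VALUES (1, 100.0);
    INSERT INTO payments VALUES (1, 90.0);
""")

# Qualifying each column with its table alias and renaming it removes
# the ambiguity about which table a column comes from.
rows = conn.execute("""
    SELECT o.amount AS order_amount,
           p.amount AS payment_amount
    FROM orders o
    JOIN payments p ON p.id = o.id
    WHERE o.amount <> p.amount
""").fetchall()
print(rows)  # [(100.0, 90.0)]
```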