Archiving break records from Pushdown jobs

Collibra Data Quality & Observability Pushdown uses SQL queries native to the data source to connect directly to the source system and submit the data quality workload. This allows Collibra DQ to store break records, which are typically written to the PostgreSQL Metastore, in a database or schema specifically allocated for Collibra DQ. By archiving break records in this database or schema, alongside your source data, you can:

  • Eliminate network traffic costs for better performance.
  • Reduce storage limit restrictions.
  • Improve security by generating break records that never leave the source system.
  • Enhance your reporting abilities by removing the need to migrate break records from the PostgreSQL Metastore to your data source.
  • Join datasets against your break records.

Note When Archive Break Records is enabled, a limited number of break records are still stored in the PostgreSQL Metastore to power the DQ web application. To prevent break record storage in the Metastore entirely, open the Settings modal on the Explorer page of your dataset and turn on Data Preview from Source. This prevents records from leaving your data source and removes any break records already stored in the Metastore.

The Archive Break Records component uses a workload statement in the form of a data quality query to generate break records that get pushed into the source system for processing. Instead of returning a result set, the workload statement redirects its output to a table in your data source schema or database that is specifically allocated for Collibra DQ. The records written to the data source table contain source record linkage details that can be used to identify the source system record. These are called Link IDs and are required to write break records to the data source. Additionally, these records contain metadata generated by Collibra DQ that you may find useful when reviewing your break records.
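
Collibra DQ generates and submits the workload statement itself, so you never write it by hand. The sketch below is only a rough illustration of the redirect pattern it describes, using a hypothetical source table EX_NYSE, a hypothetical rule condition, and placeholder values; it is not the SQL that Collibra DQ actually generates.

    -- Illustration only; Collibra DQ builds and submits the real workload statement.
    -- EX_NYSE, the rule condition, and the literal values are hypothetical.
    INSERT INTO COLLIBRA_DQ_RULES (job_uuid, dataset, rule_name, link_id)
    SELECT
        '4f6a2e1c-0b7d-4c1e-9a52-3d8f0e6b1a90',          -- job identifier supplied by Collibra DQ
        'EX_NYSE',                                        -- dataset name
        'CLOSE_PRICE_POSITIVE',                           -- rule name
        SYMBOL || '~|' || CAST(TRADE_DATE AS VARCHAR)     -- Link ID columns joined with the ~| delimiter
    FROM EX_NYSE
    WHERE CLOSE_PRICE < 0;                                -- rows that break the rule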

When a workload is complete, a query is generated and stored in the data source, which allows you to easily access your break records. Optionally, a sample of break records can be retrieved and used to generate a limited data preview of break records in the DQ Metastore.

Diagram: Pushdown archive break record flow

Setting up break record archival

To automatically write break records to a database or schema in your data source, you first need to configure your connection's output location and enable Archive Break Records from the Settings modal on the Explorer page.

Compatible Pushdown connections

The following Pushdown connections support the archival of break records:

  • Snowflake
  • Databricks
  • BigQuery
  • Athena
  • Redshift
  • Trino
  • SQL Server
  • SAP HANA

    Note Because Pushdown for BigQuery is currently available as a Public beta offering, only rule break records can be archived at this time.


Prerequisites

  • You have an active Pushdown connection.
  • The service account user has write access to the database or schema where break records are archived.

    Important 
    Data sources may have different combinations of roles and permissions to allow the creation of break record tables and the ability to write records to them. Therefore, we recommend you review the connection requirements specific to your data source when you use Archive Break Records.

Specify the archive location for the Connection

  1. Open Connection Management and edit your connection.
  2. In the Edit Connection dialog box, select Archive Break Records.
  3. Specify the schema to archive break records in the Archive Break Records Location field. Keep in mind the following tips for selecting the schema:
    • If you enter an incorrect schema, Collibra DQ automatically uses the default output location, PUBLIC, instead.
    • Specifying a database is optional; if you specify only a schema, the database defaults to your current database.
    • You can enter database.schema when your schema resides outside of the current database. For example, db1.PUBLIC, db2.marketing, and db3.sales.
    • After you select the archive location, Collibra DQ creates the following tables in the database to store the break records:
      • COLLIBRA_DQ_BREAKS
      • COLLIBRA_DQ_DUPLICATES
      • COLLIBRA_DQ_OUTLIERS
      • COLLIBRA_DQ_RULES
      • COLLIBRA_DQ_SHAPES
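
After you save the connection, you can confirm that these archive tables were created in the location you specified. The statement below is a Snowflake-style check and assumes a hypothetical schema named DQ_ARCHIVE; adjust the syntax for your own data source.

    -- Snowflake-style check; DQ_ARCHIVE is a hypothetical schema name.
    SHOW TABLES LIKE 'COLLIBRA_DQ_%' IN SCHEMA DQ_ARCHIVE;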

Example screenshot: Archive Break Records connection setup

Create the job

  1. Create a DQ job as described in Create a DQ Job.

  2. In the Select Columns step, assign a Link ID to a column.

    Important To get break records, you must specify a Link ID. The column you assign as the Link ID should not contain NULL values and its values should be unique; it is most commonly the primary key. Composite primary keys are also supported.

  3. In the lower left corner, click Settings.
  4. Select the Archive Break Records option. By default, turning on Archive Break Records enables the export of break records for all available layers.

    Example screenshot: Archive Break Records settings

  5. Click Save.
  6. Set up and run your DQ job. When a record breaks, its metadata is automatically exported to the data source. You can confirm the export with a query such as the sketch below.
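
To confirm the export, you can query the archive tables directly after the job completes. A minimal check, assuming a hypothetical dataset named EX_NYSE and run against the schema configured as the archive location, might look like this:

    -- EX_NYSE is a hypothetical dataset name.
    SELECT rule_name, COUNT(*) AS break_count
    FROM COLLIBRA_DQ_RULES
    WHERE dataset = 'EX_NYSE'
    GROUP BY rule_name;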

Viewing break record details

You can archive break records from the following layers of Pushdown jobs:

Layer      | Database table
Rules      | COLLIBRA_DQ_RULES
Outliers   | COLLIBRA_DQ_OUTLIERS
Shapes     | COLLIBRA_DQ_SHAPES
Duplicates | COLLIBRA_DQ_DUPLICATES

The records for each of these layers contain a unique set of metadata associated with the particular break type. When Archive Break Records is turned on, Collibra DQ sends this metadata to the corresponding break record table in the data source.

Rules

Column Name | Type    | Default        | Description
key         | long    | auto increment | The automatically generated number that identifies your break record.
job_uuid    | uuid    | NOT NULL       | The automatically generated number that uniquely identifies your job. For example, 19745.
dataset     | varchar | NOT NULL       | The name of the dataset.
rule_name   | varchar | NOT NULL       | The name of the rule.
exception   | varchar | NULL           | The exception message of a rule that is neither Passing nor Breaking.
link_id     | varchar | NOT NULL       | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
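
Because link_id carries the source record linkage, you can join rule break records back to the source table. The sketch below assumes a hypothetical source table EX_NYSE with Link ID columns SYMBOL and TRADE_DATE, concatenated with the ~| delimiter shown in the example above.

    -- EX_NYSE, SYMBOL, and TRADE_DATE are hypothetical names.
    SELECT r.rule_name, s.*
    FROM COLLIBRA_DQ_RULES r
    JOIN EX_NYSE s
      ON r.link_id = s.SYMBOL || '~|' || CAST(s.TRADE_DATE AS VARCHAR)
    WHERE r.dataset = 'EX_NYSE';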

Outliers

Column Name   | Type    | Default        | Description
key           | long    | auto increment | The automatically generated number that identifies your break record.
job_uuid      | uuid    | NOT NULL       | The automatically generated number that uniquely identifies your job. For example, 19745.
dataset       | varchar | NOT NULL       | The name of the dataset.
id            | int     | NOT NULL       | The job ID associated with the break record.
key_column    | varchar | NULL           | When a key column is assigned, this is the name of that column.
key_value     | varchar | NULL           | When a key column is assigned, this is the value in that column.
value_column  | varchar | NOT NULL       | The name of the column that contains the outlier.
value         | varchar | NOT NULL       | The value of the outlier finding.
date_value    | date    | NULL           | When a date column is assigned, this is the date column value.
type          | varchar | NOT NULL       | Indicates whether the outlier finding is categorical or numerical.
median        | varchar | NULL           | The median of the value of a numerical outlier.
lb            | double  | NULL           | The lower bound of the values of a numerical outlier.
ub            | double  | NULL           | The upper bound of the values of a numerical outlier.
confidence    | int     | NOT NULL       | The confidence score of the outlier finding.
is_outlier    | boolean | NULL           | Shows whether a finding is an outlier.
is_historical | boolean | NULL           | Shows whether a finding is a historical outlier.
is_topn       | boolean | NULL           | Shows whether a finding is a topN value.
is_botn       | boolean | NULL           | Shows whether a finding is a bottomN value.
source_date   | varchar | NULL           | The date when an outlier occurred.
frequency     | long    | NULL           | The number of occurrences of an outlier.
percentile    | double  | NULL           | The threshold that helps identify extreme outlier values.
link_id       | varchar | NOT NULL       | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
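
The bound and flag columns make it straightforward to filter, for example, numerical findings that fall outside their expected range. A minimal sketch, assuming the hypothetical dataset name EX_NYSE:

    -- EX_NYSE is a hypothetical dataset name.
    SELECT value_column, value, lb, ub, confidence, link_id
    FROM COLLIBRA_DQ_OUTLIERS
    WHERE dataset = 'EX_NYSE'
      AND type = 'NUMERICAL'      -- the exact literal may differ in your environment; check your data
      AND is_outlier = TRUE
    ORDER BY confidence DESC;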

Shapes

Column Name  | Type    | Default        | Description
key          | long    | auto increment | The automatically generated number that identifies your break record.
job_uuid     | uuid    | NOT NULL       | The automatically generated number that uniquely identifies your job. For example, 19745.
dataset      | varchar | NOT NULL       | The name of the dataset.
column_name  | varchar | NOT NULL       | The name of the column with shape findings.
shape_type   | varchar | NOT NULL       | The data type of the shape finding. For example, string.
shape_value  | varchar | NOT NULL       | The column value of the shape finding.
shape_count  | int     | NOT NULL       | The number of columns that conform to the shape that Collibra DQ discovered.
shape_rate   | double  | NOT NULL       | The percentage a shape conforms to the format that Collibra DQ identifies as normal.
shape_length | int     | NOT NULL       | The format of the shape. For example, "xxx & xxxx" could represent the street "3rd & Main".
link_id      | varchar | NOT NULL       | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
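
To review which formats Collibra DQ flagged for each column, you can list the shape findings directly. A minimal sketch, again assuming the hypothetical dataset name EX_NYSE:

    -- EX_NYSE is a hypothetical dataset name.
    SELECT column_name, shape_value, shape_count, shape_rate
    FROM COLLIBRA_DQ_SHAPES
    WHERE dataset = 'EX_NYSE'
    ORDER BY column_name, shape_rate;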

Duplicates

Column Name   | Type      | Default        | Description
key           | long      | auto increment | The automatically generated number that identifies your break record.
updts         | timestamp | default        | The timestamp of when the job ran.
job_uuid      | varchar   | NOT NULL       | The automatically generated number that uniquely identifies your job. For example, 19745.
dataset       | varchar   | NOT NULL       | The name of the dataset that contains a possible duplicate.
run_id        | timestamp | NOT NULL       | The timestamp of when the job ran.
key_id        | integer   | NOT NULL       | A unique identifier for each duplicate record that is unique even for similar duplicates found in other runs. Each time a job runs, new key values are created for each duplicate record.
dupe1         | varchar   | NOT NULL       | The name of the column that Collibra DQ identifies as a duplicate value of dupe2.
dupe1_link_id | varchar   | NOT NULL       | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
dupe2         | varchar   | NULL           | The name of the column that Collibra DQ identifies as a duplicate value of dupe1.
dupe2_link_id | varchar   | NULL           | The link ID value. For example, if from an NYSE dataset you assign one column called "Symbol" and another column called "Trade Date" as Link IDs, then you might see a value in a format such as DST~|2018-01-18.
score         | integer   | NOT NULL       | The duplicate score of the column where Collibra DQ detects a potential duplicate.

When Collibra DQ detects a duplicate value in a Pushdown job with Link IDs assigned, all columns included in the source query of the row where the break is detected display on the Findings page. In Pullup mode, by contrast, only the Link ID columns display on the Findings page when a duplicate value is detected, because the data_preview Metastore table stores only Link ID columns.
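
Because each duplicate pair carries a Link ID for both sides, you can pull both source rows of a suspected pair back from the source table for comparison. A minimal sketch, assuming the hypothetical source table EX_NYSE with Link ID columns SYMBOL and TRADE_DATE:

    -- EX_NYSE, SYMBOL, and TRADE_DATE are hypothetical; the ~| delimiter matches the Link ID examples above.
    SELECT d.key_id, d.score, s1.*, s2.*
    FROM COLLIBRA_DQ_DUPLICATES d
    JOIN EX_NYSE s1
      ON d.dupe1_link_id = s1.SYMBOL || '~|' || CAST(s1.TRADE_DATE AS VARCHAR)
    LEFT JOIN EX_NYSE s2
      ON d.dupe2_link_id = s2.SYMBOL || '~|' || CAST(s2.TRADE_DATE AS VARCHAR)
    WHERE d.dataset = 'EX_NYSE'
    ORDER BY d.score DESC;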

Limitations

  • Rule break record archival is not supported for complex rules, such as those with secondary datasets. Rules that reference secondary datasets cannot share column names across the tables involved, because Collibra DQ cannot programmatically identify the table from which a shared column comes.
    • As a workaround, avoid tables with these shared column names, or alias the shared columns, as shown in the sketch below.
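
For example, when a rule joins the primary dataset to a secondary dataset and both tables contain a column named TRADE_DATE, you can qualify and alias the shared columns so they resolve unambiguously. The snippet below is a hypothetical illustration of that aliasing pattern, not a complete rule definition; the table and column names are placeholders.

    -- Hypothetical tables and columns; the aliases p and s disambiguate the shared TRADE_DATE column.
    SELECT p.SYMBOL,
           p.TRADE_DATE AS primary_trade_date,
           s.TRADE_DATE AS secondary_trade_date
    FROM EX_NYSE p
    JOIN EX_NYSE_REFERENCE s
      ON p.SYMBOL = s.SYMBOL
    WHERE p.TRADE_DATE <> s.TRADE_DATE;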