Archiving break records from Pullup jobs

The ability to archive break records lets you automatically export CSV files containing the break records of DQ jobs on JDBC connections to external storage, such as Amazon S3 buckets. By offloading break records to cloud storage, you have more control over how you manage the results of your DQ jobs, and you can store the data in the supported remote connection of your choice. Archiving also relieves pressure on the PostgreSQL metastore: when jobs with link IDs back to their source return break records, the volume of data in those records can overload the rule_breaks table in the metastore and lead to database crashes.

You can send break records to the following remote file storage providers: Amazon S3, ADLS, and Azure Blob.

Diagram: architecture of Pullup external break record storage.

Important When break record archiving is turned on, rule break records are no longer written to the PostgreSQL metastore.

Note When you use the Spark 3.2.2 standalone installation that comes with the AWS Marketplace installation, be aware that the included hadoop-aws-3.2.1.jar is incompatible with the AWS S3 Archive feature. Replace hadoop-aws-3.2.1.jar with hadoop-aws-3.3.1.jar in the spark/jars directory; you can obtain the file from Apache Downloads. If you have difficulty locating the JAR file, contact your CS or SE for assistance.
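
Before enabling the AWS S3 Archive feature on such an installation, you can confirm which hadoop-aws JAR is currently on the Spark classpath. The following is a minimal sketch in Python; the default SPARK_HOME path is an assumption, so adjust it to your installation.

    import glob
    import os

    # The default path is an assumption; point SPARK_HOME at your Spark 3.2.2 install.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")

    # List any hadoop-aws JARs currently in spark/jars.
    for jar in sorted(glob.glob(os.path.join(spark_home, "jars", "hadoop-aws-*.jar"))):
        print(jar)

    # If hadoop-aws-3.2.1.jar is still listed, remove it and place hadoop-aws-3.3.1.jar
    # (downloaded from Apache Downloads) in the same directory.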

Break record details

Pullup jobs can currently export the break records of rules to your cloud storage service as a CSV file, so you can manage them outside of Collibra Data Quality & Observability.

The following table shows an example of the rule record metadata included in the export file; a short sketch for reading such a file follows the table.

Dataset                  Run Id                        Rule Name        Link Id
ORACLE_DQUSER.NYSE_001   Tue Jan 16 00:00:00 GMT 2018  SQLG_VOLUME_001  ES~|2079300.0000
ORACLE_DQUSER.NYSE_001   Tue Jan 16 00:00:00 GMT 2018  SQLG_VOLUME_001  ESE~|252600.0000
ORACLE_DQUSER.NYSE_001   Tue Jan 16 00:00:00 GMT 2018  SQLG_VOLUME_001  ESL~|410200.0000
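
Because the export is a plain CSV file, you can work with it using standard tooling once you download it from your storage service. The following is a minimal sketch using Python's csv module; the file name is hypothetical and the column headers are assumed to match the metadata shown above.

    import csv

    # breaks.csv is a hypothetical local copy of an exported break record file.
    with open("breaks.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumed to match the metadata in the table above.
            print(row["Dataset"], row["Run Id"], row["Rule Name"], row["Link Id"])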

Prerequisites

The following table shows the available external storage options and the requirements for each.

Storage option Prerequisites
Amazon S3
  • An Amazon S3 connection.
  • Read and write access on your Amazon S3 bucket (a minimal access check appears after this table).
  • Minimum required bucket permissions:
    {
        "Version": "YYYY-MM-DD",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:ListStorageLensConfigurations",
                    "s3:ListAccessPointsForObjectLambda",
                    "s3:GetAccessPoint",
                    "s3:PutAccountPublicAccessBlock",
                    "s3:GetAccountPublicAccessBlock",
                    "s3:ListAllMyBuckets",
                    "s3:ListAccessPoints",
                    "s3:PutAccessPointPublicAccessBlock",
                    "s3:ListJobs",
                    "s3:PutStorageLensConfiguration",
                    "s3:ListMultiRegionAccessPoints",
                    "s3:CreateJob"
                ],
                "Resource": "*"
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::YOURS3BUCKETNAME",
                    "arn:aws:s3:::YOURS3BUCKETNAME/*"
                ]
            }
        ]
    }
ADLS
  • An ADLS connection.
  • Read and write access on your ADLS container.
Azure Blob
  • An Azure Blob connection.
  • Read and write access on your Azure Blob container.
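
A quick way to confirm the read and write access required for an Amazon S3 archive connection is to round-trip a small object through the bucket. The following is a minimal sketch using boto3; the bucket name and object key are placeholders, and it assumes your AWS credentials are already configured in your environment. A similar round-trip against your ADLS or Azure Blob container confirms the corresponding prerequisites for those connections.

    import boto3

    bucket = "YOURS3BUCKETNAME"            # placeholder: your archive bucket
    key = "dq-archive-access-check.txt"    # hypothetical test object

    s3 = boto3.client("s3")
    s3.put_object(Bucket=bucket, Key=key, Body=b"access check")  # verifies write access
    s3.get_object(Bucket=bucket, Key=key)                        # verifies read access
    s3.delete_object(Bucket=bucket, Key=key)                     # removes the test object
    print(f"Read and write access confirmed on {bucket}")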

Steps

You can set up break record archiving in either of the two ways described below.

  1. From Explorer, connect to a data source.
  2. In the Scope Select step, click Add Link Back to Source and assign a Link ID to a column.
  3. Finish setting up your DQ job, then click Save/Run.
  4. On the Config tab, click DQ Job.
  5. Select the Archive Breaks option.
  6. Select an external storage output option from the Archive Connection dropdown menu.
  7. Click Estimate Job.
  8. Click Run.
  9. When a record breaks, its metadata is automatically exported to your external storage service.

Alternatively:

  1. From Explorer, connect to a data source.
  2. Select a scanning method.
  3. In the Select Columns step, assign a Link ID to a column.
  4. In the lower left corner, click the gear icon (Settings).
  5. Under the Data Quality Job section, select the Archive Breaking Records option.
  6. From the Archive Breaking Records dropdown menu, select the external storage option to which break records are sent.
  7. Click Save.
  8. Set up and run your DQ job.
  9. When a record breaks, its metadata is automatically exported to your external storage service. For one way to list the archived files, see the sketch after the note below.

Important Ensure that a column is assigned as the Link ID for external break record archival to work properly.
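
After a job with break record archiving enabled completes, you can browse the exported CSV files directly in your storage service. The following is a minimal sketch for an Amazon S3 archive connection using boto3; the bucket name and prefix are assumptions, because the actual location depends on how your archive connection is configured.

    import boto3

    bucket = "YOURS3BUCKETNAME"       # placeholder: your archive bucket
    prefix = "dq-break-records/"      # assumed prefix; depends on your archive connection

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])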