Archiving break records from Pullup jobs
Archiving break records lets you automatically export CSV files containing the break records of DQ jobs on JDBC connections to external storage containers, such as Amazon S3 buckets. By offloading break records to cloud storage, you have more control over how you manage the results of your DQ jobs, and you can store data in the supported remote connection of your choice. It also relieves pressure on the PostgreSQL Metastore: when jobs with link IDs back to their source return break records, the volume of data in those records can overload the PostgreSQL Metastore rule_breaks table and lead to database crashes.
You can send break records to the following remote file storage providers:
- Amazon S3
- ADLS
- Azure Blob
- Google Cloud Storage (GCS)
Important When Archive Breaking Records is turned on, rule break records are no longer written to the PostgreSQL Metastore.
Note When you use the Spark 3.2.2 standalone installation that comes with the AWS Marketplace deployment, be aware that the bundled hadoop-aws-3.2.1.jar is incompatible with the AWS S3 archive feature. Replace hadoop-aws-3.2.1.jar with hadoop-aws-3.3.1.jar in the spark/jars directory; you can obtain the file from the following link: Apache Downloads.
Contact your CS or SE for assistance if you encounter any difficulties locating the jar file.
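As a rough sketch of that jar swap, the following Python script removes the old jar and downloads the replacement. The SPARK_HOME path and the Maven Central mirror URL are assumptions; adjust them to your environment.

```python
# Hypothetical helper for swapping the hadoop-aws jar in a standalone
# Spark 3.2.2 installation. The jars path and download URL are assumptions.
import pathlib
import urllib.request

SPARK_JARS = pathlib.Path("/opt/spark/jars")  # assumed SPARK_HOME/jars
OLD_JAR = SPARK_JARS / "hadoop-aws-3.2.1.jar"
NEW_JAR = SPARK_JARS / "hadoop-aws-3.3.1.jar"
# Maven Central mirror of the Apache artifact (assumed URL)
URL = ("https://repo1.maven.org/maven2/org/apache/hadoop/"
       "hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar")

if OLD_JAR.exists():
    OLD_JAR.unlink()  # remove the incompatible jar
urllib.request.urlretrieve(URL, str(NEW_JAR))  # download the replacement
print(f"Installed {NEW_JAR.name} in {SPARK_JARS}")
```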
Break record details
Pullup jobs can currently export the break records of rules to your cloud storage service as a CSV file so that you can manage them outside of Collibra Data Quality & Observability.
This table shows an example of the rule record metadata included in the export file.
| Dataset | Run Id | Rule Name | Link Id |
|---|---|---|---|
| ORACLE_DQUSER.NYSE_001 | Tue Jan 16 00:00:00 GMT 2018 | SQLG_VOLUME_001 | ES~\|2079300.0000 |
| ORACLE_DQUSER.NYSE_001 | Tue Jan 16 00:00:00 GMT 2018 | SQLG_VOLUME_001 | ESE~\|252600.0000 |
| ORACLE_DQUSER.NYSE_001 | Tue Jan 16 00:00:00 GMT 2018 | SQLG_VOLUME_001 | ESL~\|410200.0000 |
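If you want to work with an archived file programmatically, the following sketch reads one exported CSV from S3 and splits the Link Id field. The bucket name, object key, and exact column headers are assumptions based on the example above; check the files your jobs actually produce.

```python
# A minimal sketch of consuming an archived break-record CSV from S3.
# Bucket, key, and column headers are assumptions.
import csv
import io

import boto3  # pip install boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-dq-archive", Key="breaks/NYSE_001.csv")
text = obj["Body"].read().decode("utf-8")

for row in csv.DictReader(io.StringIO(text)):
    # The Link Id field pairs the link value with the offending metric
    # value, separated by "~|" (e.g. "ES~|2079300.0000").
    link_id, _, value = row["Link Id"].partition("~|")
    print(row["Dataset"], row["Rule Name"], link_id, value)
```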
Prerequisites
The following table shows the available external storage options and the requirements for each.
| Storage option | Prerequisites |
|---|---|
| Amazon S3 | |
| ADLS | |
| Azure Blob | |
| Google Cloud Storage (GCS) | |
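Before enabling the feature, it can help to confirm that the credentials Collibra DQ will use can actually write to the target location. A minimal pre-flight check for S3, assuming hypothetical bucket and prefix names:

```python
# Hypothetical pre-flight check: can these credentials write to the
# target S3 bucket? Bucket and prefix names are placeholders.
import boto3
from botocore.exceptions import ClientError

BUCKET, PREFIX = "my-dq-archive", "breaks/"

s3 = boto3.client("s3")
try:
    s3.head_bucket(Bucket=BUCKET)  # bucket reachable with these creds?
    s3.put_object(Bucket=BUCKET, Key=PREFIX + "_probe", Body=b"")
    s3.delete_object(Bucket=BUCKET, Key=PREFIX + "_probe")
    print("Write access confirmed")
except ClientError as err:
    print(f"Check credentials/policy: {err}")
```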
Steps
1. From Explorer, connect to a data source.
2. Select a scanning method.
3. In the Select Columns step, assign a Link ID to a column.
4. In the lower left corner, click Settings.
5. Under the Data Quality Job section, select the Archive Breaking Records option.
6. From the Archive Breaking Records dropdown menu, select the external storage option to send break records to.
7. Click Save.
8. Set up and run your DQ job.

When a record breaks, its metadata is automatically exported to your external storage service. You can verify the export with the sketch after the note below.
Important Ensure that a column is assigned as the Link ID for external break record archival to work properly.
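To confirm that break-record CSVs landed in the archive location after a job run, you can list the objects under the export prefix. A minimal sketch, again assuming hypothetical bucket and prefix names:

```python
# List archived break-record files after a job run.
# Bucket and prefix are assumptions.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-dq-archive", Prefix="breaks/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])
```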