DQ job settings

You can use the Settings modal to fine-tune your DQ job.

Using the Settings modal

Click Settings below the Run button to open the Settings modal.

Option Description
Profile
Profile String Length

Ensures string-type data fits within the predefined schemas of its target data sources.

Data Analysis
Relationship Analysis

Relationship analysis lets you discover relationships in your data and measures the strength of those relationships with a correlation matrix.

Relationship Analysis is set to Auto by default.

Histogram Analysis

Segments data from your DQ Job with histograms.

Histogram Analysis is set to Auto by default.

AdaptiveRules
Data Lookback

The number of past DQ Job runs for your learning model to analyze.

The default value is 10.

Learning Phase

The minimum number of DQ Job runs required before behavioral scoring begins to calculate.

The default value is 4.

Parallel JDBC
Override Partition Column

Splits your selected column evenly for parallel JDBC Spark load execution.

Select the checkbox option, then select a column from the dropdown menu.

The default option when Override Partition Column is selected is OWLAUTOJDBC.

No. of Partitions

A partition is an even split of the total number of rows in your record. For large DQ jobs, increasing the number of partitions can improve performance and increase processing efficiency.

Drag the slider or enter a value between 2 and 20 in the input field.

Example If the row count of your table is 10 million, set the number of partitions to 10 to divide the record evenly into 10 partitioned blocks of 1 million rows. The job then executes the 10 blocks in parallel.
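
To make these two settings concrete, the following minimal PySpark sketch shows the kind of partitioned JDBC read they control. The connection URL, table, credentials, and partition column are illustrative assumptions, not values from your environment.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("parallel-jdbc-sketch").getOrCreate()

  # Read a roughly 10-million-row table in 10 parallel partitions.
  df = (
      spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://example-host:5432/sales")  # hypothetical source
      .option("dbtable", "public.orders")                          # hypothetical table
      .option("user", "dq_reader")                                 # hypothetical credentials
      .option("password", "secret")
      .option("partitionColumn", "order_id")  # corresponds to Override Partition Column
      .option("lowerBound", "1")              # min and max of the partition column,
      .option("upperBound", "10000000")       # divided into even strides
      .option("numPartitions", "10")          # corresponds to No. of Partitions
      .load()
  )

  print(df.rdd.getNumPartitions())  # expect 10 parallel read partitions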

Data Quality Job
Metastore Host
The PostgreSQL metastore connection URL that determines which Metastore to use to register and record the results of your job.
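
Example A PostgreSQL connection URL typically takes the form jdbc:postgresql://metastore-host:5432/dq, where the host, port, and database name are placeholders for your environment.
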
Logging

Indicates the log level.

Select an option from the dropdown menu.

The default is Info.

Additional Lib
A directory path to include any additional drivers or jars in the classpath.
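
Example A path such as /opt/owl/drivers/postgres (a hypothetical install location) would make any additional jars in that directory available on the classpath.
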
Union LookBack Min. Row

Sets a minimum row count that preceding scans must meet to be included in the historical context of a scan with union lookback configured.

If you are using union lookback, enter a value based on the number of recorded rows from previous scans.

Example To exclude scans that recorded fewer than 10 rows from the historical load context, enter a value of 10.

Archive Breaking Records

The external storage container to which rule break records are exported in CSV format.

Select the checkbox option, then select an archive location from the dropdown menu.

Important When Archive Breaking Records is turned on, rule break records are no longer written to the PostgreSQL Metastore.

Note For more information, see the Archive Breaking Records section.

Check header
Excludes schema findings from the results of a DQ job. This is useful when your schema contains special characters in its column names.
Core Fetch Mode
Overrides the -q option on the command line by adding -corefetchmode, which allows the core to fetch the query from the load options table.
Option Description
Profile
Profiling

Creates a baseline sketch of your table or file over time.

Profiling is on by default.

Advanced Profile
Determines whether a string field contains various string numerics, calculates TopN, BottomN, and TopN Shapes, and detects the scale and precision of double fields.
Data Analysis
Relationship Analysis

Lets you discover relationships in your data and measures the strength of those relationships with a correlation matrix.

Relationship Analysis is set to Auto by default.

Histogram Analysis

Segments data from your DQ Job with histograms.

Histogram Analysis is set to Auto by default.

AdaptiveRules
Data Lookback

The number of past DQ Job runs for your learning model to analyze.

The default value is 10.

Learning Phase

The minimum number of DQ Job runs required before behavioral scoring begins to calculate.

The default value is 4.

Archive Break Records
Data Preview from Source

Prevents data preview records from being stored in the PostgreSQL Metastore. When you select this option:

  • All data preview records are removed from the PostgreSQL Metastore and remain only in your data source.
  • When you view data preview records in the web application, the state of the records reflects how they currently appear in your data source.

This option strengthens security by completely removing sensitive data from the PostgreSQL Metastore.

Archive Dupes Break Records

Allows dupe break records to be stored in the source system instead of the PostgreSQL Metastore.

Archive Outliers Break Records
Allows outlier break records to be stored in the source system instead of the PostgreSQL Metastore.
Archive Rules Break Records
Allows rule break records to be stored in the source system instead of the PostgreSQL Metastore.
Archive Shapes Break Records
Allows shapes break records to be stored in the source system instead of the PostgreSQL Metastore.
Source Output Schema

An alternative destination schema in which to create tables for break records storage, instead of the schema provided in the connection. This can be specified as either database.schema or schema alone, and requires write access to the source output schema location.
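
Example If your connection defaults to a schema named public, you might enter sales_db.dq_breaks or dq_breaks (hypothetical names) to create the break record tables there instead.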

Logging
SQL Logging

Switches logging for all SQL queries on and off.

SQL logging for all jobs is off by default.

Pushdown
No. of Connections

The maximum number of connections to the data source to run your DQ Job. Using multiple connections lets your DQ Job execute queries in parallel.

The default value is 10, which is determined by the maxconcurrentjobs limit on the Admin Limits page.

If you use the default of 10, then running 10 DQ Jobs opens 100 connections (10 DQ Jobs multiplied by 10 connections each) to the data source.

Because the maximum number of parallel connections varies by data source, account for the specific constraints of your data source when setting the number of connections. Also consider the number of connections already open on your data source and the number of DQ Jobs you intend to run in parallel.
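
As a rough planning aid, the sketch below checks a planned configuration against a source connection limit; the limit and job counts are illustrative assumptions.

  # Hypothetical capacity check before raising No. of Connections.
  source_max_connections = 200  # assumed limit on the data source side
  parallel_dq_jobs = 10         # jobs you plan to run at once
  connections_per_job = 10      # the No. of Connections setting

  total_open = parallel_dq_jobs * connections_per_job  # 100 connections here
  if total_open > source_max_connections:
      print("Planned jobs would exceed the source connection limit; "
            "lower No. of Connections or run fewer jobs in parallel.")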

Tip 
When you run multiple DQ Jobs in parallel against a large data warehouse, you can increase the number of available connections for the DQ Job to which you want to grant processing priority in the warehouse.

Other scenarios where you may consider increasing the number of connections include:

  • The Job log shows that a particular activity is taking an unusually long time to process.
  • A DQ Job fails with an exception message referring to the failure of a particular DQ layer, such as outliers.
  • Memory consumption is peaking in the data source.
  • You are not getting the maximum throughput of your data warehouse.
  • Your data warehouse is answering queries but not processing the DQ Job.

No. of Threads

The maximum number of threads a DQ layer can run in parallel. Use this parameter to divide the number of open connections between DQ layers. For instance, when a DQ Job has 10 connections and 2 DQ layers to process, you can set the number of threads to 5 to distribute the Job load evenly.

The default value is 2.

Tip 
Consider adjusting the number of threads based on the total number of DQ layers in your DQ Job. For example, if your DQ Job includes checks for duplicates, outliers, rules, and shapes, increasing the number of available threads may improve the run time.

Run Date
Date Format

The run date format (${rd} or ${rdEnd}) substituted on the command line at runtime. The option you select should match the date or datetime format of the timeslice column that you specified in the Time Slice option of the Select rows step.
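
Example With a date format of yyyy-MM-dd, a query filter written as WHERE load_date = '${rd}' would run as WHERE load_date = '2024-01-15' when the job's run date is January 15, 2024; the column name and date are illustrative.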