DQ job settings
You can use the Settings modal to fine-tune your DQ job.
Using the Settings modal
Click Settings below the Run button to open the Settings modal.
- Pullup
- Pushdown
Option | Description |
---|---|
Profile | |
Profile String Length
|
Ensures string-type data fits within the predefined schemas of its target data sources. |
Data Analysis | |
Relationship Analysis
|
Relationship analysis lets you discover relationships in your data and measures the strength of those relationships with a correlation matrix. Relationship Analysis is set to Auto by default. |
Histogram Analysis
|
Segments data from your DQ Job with histograms. Histogram Analysis is set to Auto by default. |
AdaptiveRules | |
Data Lookback
|
The number of past DQ Job runs for your learning model to analyze. The default value is 10. |
Learning Phase
|
The minimum number of DQ Job runs required before behavioral scoring begins to calculate. The learning phase determines the baseline on the Profile page. The default value is 4. |
Parallel JDBC | |
Override Partition Column
|
Splits your selected column evenly for parallel JDBC Spark load execution. Select the checkbox option, then select a column from the dropdown menu. The default option when Override Partition Column is selected is OWLAUTOJDBC. |
No. of Partitions
|
A partition is an even split of the total number of rows in your record. For large DQ jobs, increasing the number of partitions can improve performance and processing efficiency. Drag the slider or enter a value between 2 and 20 in the input field. Example: If the row count of your table is 10 million, set the number of partitions to 10 to divide the record evenly into 10 partitioned blocks of 1 million rows. The job then executes the 10 blocks in parallel. |
Data Quality Job | |
Metastore Host
|
The PostgreSQL metastore connection URL that determines which Metastore to use to register and record the results of your job. |
Logging
|
Indicates the log level. Select an option from the dropdown menu. The default is Info. |
Additional Lib
|
A directory path to include any additional drivers or jars in the classpath. |
Union LookBack Min. Row
|
Sets the minimum number of rows that a preceding scan must have recorded to be included in the historical context of a scan with union lookback configured. If you are using union lookback, enter a value based on the number of recorded rows from previous scans. Example: To exclude scans that recorded fewer than 10 rows from the historical load context, enter a value of 10. |
Archive Breaking Records
|
The external storage container to which rule break records are exported in CSV format. Select the checkbox option, then select an archive location from the dropdown menu. Important: When Archive Breaking Records is turned on, rule break records are no longer written to the PostgreSQL metastore. Note: For more information, see the Archive Breaking Records section. |
Check header
|
Excludes schema findings from the results of a DQ job. Use this option when your schema contains special characters in its column names. |
Core Fetch Mode
|
Overrides the -q option by adding -corefetchmode to the command line, which allows the core to fetch the query from the load options table. |
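The No. of Partitions and Union LookBack Min. Row settings above both come down to simple arithmetic on row counts. The following Python sketch illustrates that arithmetic only. It is not how the DQ engine is implemented, and the names `partition_sizes`, `Scan`, and `lookback_context` are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass


def partition_sizes(total_rows: int, partitions: int) -> list[int]:
    """Split total_rows into `partitions` near-equal blocks (illustration only)."""
    # The settings field accepts values between 2 and 20.
    if not 2 <= partitions <= 20:
        raise ValueError("partitions must be between 2 and 20")
    base, extra = divmod(total_rows, partitions)
    # The first `extra` blocks take one additional row, so sizes differ by at most 1.
    return [base + 1 if i < extra else base for i in range(partitions)]


@dataclass
class Scan:
    """Hypothetical record of a previous scan: an identifier and its row count."""
    run_id: str
    row_count: int


def lookback_context(scans: list[Scan], min_rows: int) -> list[Scan]:
    """Keep only the scans that recorded at least `min_rows` rows."""
    return [scan for scan in scans if scan.row_count >= min_rows]


if __name__ == "__main__":
    # 10 million rows over 10 partitions gives 10 blocks of 1 million rows each.
    print(partition_sizes(10_000_000, 10))

    # With a minimum of 10, the scan that recorded 5 rows is left out.
    history = [Scan("run-1", 5), Scan("run-2", 10), Scan("run-3", 250)]
    print([scan.run_id for scan in lookback_context(history, 10)])
```

A calculation like this can help you estimate how many rows each parallel block handles before you raise the partition count, and which earlier scans a given minimum row value would leave out.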
Option | Description |
---|---|
Profile | |
Profiling
|
Creates a baseline sketch of your table or file over time. Profiling is on by default. |
Advanced Profile
|
Determines whether a string field contains various string numerics, calculates TopN, BottomN, and TopN Shapes, and detects the scale and precision of double fields. |
Data Analysis | |
Relationship Analysis
|
Lets you discover relationships in your data and measures the strength of those relationships with a correlation matrix. Relationship Analysis is set to Auto by default. |
Histogram Analysis
|
Segments data from your DQ Job with histograms. Histogram Analysis is set to Auto by default. |
AdaptiveRules | |
Data Lookback
|
The number of past DQ Job runs for your learning model to analyze. The default value is 10. |
Learning Phase
|
The minimum number of DQ Job runs required before behavioral scoring begins to calculate. The default value is 4. |
Archive Break Records | |
Data Preview from Source
|
Prevents data preview records from being stored in the PostgreSQL Metastore. This option strengthens security by completely removing sensitive data from the PostgreSQL Metastore. |
Archive Dupes Break Records
|
Allows the storage of dupe break records to the source system instead of the PostgreSQL Metastore. |
Archive Outliers Break Records
|
Allows the storage of outlier break records to the source system instead of the PostgreSQL Metastore. |
Archive Rules Break Records
|
Allows the storage of rule break records to the source system instead of the PostgreSQL Metastore. |
Archive Shapes Break Records
|
Allows the storage of shapes break records to the source system instead of the PostgreSQL Metastore. |
Source Output Schema
|
An alternative destination schema in which to create tables for break record storage, instead of the schema provided in the connection. Specify either database.schema or schema. This option requires write access to the source output schema location. |
Logging | |
SQL Logging
|
Switches logging for all SQL queries on and off. SQL logging for all jobs is off by default. |
Pushdown | |
No. of Connections
|
The maximum number of connections to the data source that your DQ Job can use. Using multiple connections lets your DQ Job execute queries in parallel. The default value is 10. This default value is determined by the maxconcurrentjobs limit on the Admin Limits page. If you use the default of 10, 10 DQ Jobs open 100 connections (10 DQ Jobs multiplied by 10 connections) to the data source. Because the maximum number of parallel connections varies depending on the data source, account for the specific constraints of your data source when setting the number of connections. In particular, consider the number of open connections your data source allows and the number of DQ Jobs you intend to run in parallel. |
No. of Threads
|
The maximum number of threads a DQ layer can run in parallel. Use this parameter to divide the number of open connections between DQ layers. For instance, when a DQ Job has 10 connections and 2 DQ layers to process, you can set the number of threads to 5 to distribute the job load evenly. The default value is 2. |
Run Date | |
Date Format
|
The run date format ( |