DQ job settings

You can use the Settings modal to fine-tune your DQ job.

Using the Settings modal

Click Settings below the Run button to open the Settings modal.

Option Description
Profile
Profile String Length

Ensures string-type data fits within the predefined schemas of its target data sources.

Data Analysis
Relationship Analysis

Relationship analysis lets you discover relationships in your data and measures the strength of those relationships with a correlation matrix.

Relationship Analysis is set to Auto by default.

Histogram Analysis

Segments data from your DQ Job with histograms.

Histogram Analysis is set to Auto by default.

AdaptiveRules
Data Lookback

The number of past DQ Job runs for your learning model to analyze.

The default value is 10.

Learning Phase

The minimum number of DQ Job runs required before behavioral scoring begins to calculate.

The default value is 4.

Parallel JDBC
Override Partition Column

Splits your selected column evenly for parallel JDBC Spark load execution.

Select the checkbox option, then select a column from the dropdown menu.

The default option when Override Partition Column is selected is OWLAUTOJDBC.

No. of Partitions

A partition is an even split of the total number of rows in your record. For large DQ jobs, increasing the number of partitions can improve performance and increase processing efficiency.

Drag the slider or enter a value between 2 and 20 in the input field.

Example If the row count of your table is 10 million, set the number of partitions to 10 to divide the record evenly into 10 partitioned blocks of 1 million rows. The job then executes the 10 blocks in parallel.
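
To make these two settings concrete, the following minimal PySpark sketch shows the kind of partitioned JDBC read they control. The connection URL, table, credentials, and partition column are illustrative assumptions, not values from your environment.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("parallel-jdbc-sketch").getOrCreate()

  # Read a roughly 10-million-row table in 10 parallel partitions.
  df = (
      spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://example-host:5432/sales")  # hypothetical source
      .option("dbtable", "public.orders")                          # hypothetical table
      .option("user", "dq_reader")                                 # hypothetical credentials
      .option("password", "secret")
      .option("partitionColumn", "order_id")  # corresponds to Override Partition Column
      .option("lowerBound", "1")              # min and max of the partition column,
      .option("upperBound", "10000000")       # divided into even strides
      .option("numPartitions", "10")          # corresponds to No. of Partitions
      .load()
  )

  print(df.rdd.getNumPartitions())  # expect 10 parallel read partitions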

Data Quality Job
Metastore Host
The PostgreSQL metastore connection URL that determines which Metastore to use to register and record the results of your job.
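
Example A PostgreSQL connection URL typically takes the form jdbc:postgresql://metastore-host:5432/dq, where the host, port, and database name are placeholders for your environment.
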
Logging

Indicates the log level.

Select an option from the dropdown menu.

The default is Info.

Additional Lib
A directory path to include any additional drivers or jars in the classpath.
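
Example A path such as /opt/owl/drivers/postgres (a hypothetical install location) would make any additional jars in that directory available on the classpath.
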
Union LookBack Min. Row

Sets a minimum row count that preceding scans must meet to be included in the historical context of a scan with union lookback configured.

If you are using union lookback, enter a value based on the number of recorded rows from previous scans.

Example To exclude scans that recorded fewer than 10 rows from the historical load context, enter a value of 10.

Archive Breaking Records

The external storage container to which rule break records are exported in CSV format.

Select the checkbox option, then select an archive location from the dropdown menu.

Important When Archive Breaking Records is turned on, rule break records are no longer written to the PostgreSQL Metastore.

Note For more information, see the Archive Breaking Records section.

Check header
Excludes schema findings from the results of a DQ job. This is useful when your schema contains special characters in its column names.
Core Fetch Mode
Overrides the -q option on the command line by adding -corefetchmode, which allows the core to fetch the query from the load options table.
Option Description
Profile
Profiling

Creates a baseline sketch of your table or file over time.

Profiling is on by default.

Advanced Profile
Determines whether a string field contains various string numerics, calculates TopN, BottomN, and TopN Shapes, and detects the scale and precision of double fields.
Data Analysis
Relationship Analysis

Lets you discover relationships in your data and measures the strength of those relationships with a correlation matrix.

Relationship Analysis is set to Auto by default.

Histogram Analysis

Segments data from your DQ Job with histograms.

Histogram Analysis is set to Auto by default.

AdaptiveRules
Data Lookback

The number of past DQ Job runs for your learning model to analyze.

The default value is 10.

Learning Phase

The minimum number of DQ Job runs required before behavioral scoring begins to calculate.

The default value is 4.

Archive Break Records
Data Preview from Source

Prevents data preview records from being stored in the PostgreSQL Metastore. When you select this option:

  • All data preview records are removed from the PostgreSQL Metastore and remain only in your data source.
  • When you view data preview records in the web application, the state of the records reflects how they currently appear in your data source.

This option strengthens security by completely removing sensitive data from the PostgreSQL Metastore.

Archive Dupes Break Records

Allows dupe break records to be stored in the source system instead of the PostgreSQL Metastore.

Archive Outliers Break Records
Allows outlier break records to be stored in the source system instead of the PostgreSQL Metastore.
Archive Rules Break Records
Allows rule break records to be stored in the source system instead of the PostgreSQL Metastore.
Archive Shapes Break Records
Allows shapes break records to be stored in the source system instead of the PostgreSQL Metastore.
Source Output Schema

An alternative destination schema in which to create tables for break records storage, instead of the schema provided in the connection. This can be specified as either database.schema or schema alone, and requires write access to the source output schema location.
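
Example If your connection defaults to a schema named public, you might enter sales_db.dq_breaks or dq_breaks (hypothetical names) to create the break record tables there instead.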

Logging
SQL Logging

Switches logging for all SQL queries on and off.

SQL logging for all jobs is off by default.

Pushdown
No. of Connections

The maximum number of connections to the data source to run your DQ Job. Using multiple connections lets your DQ Job execute queries in parallel.

The default value is 10, which is determined by the maxconcurrentjobs limit on the Admin Limits page.

If you use the default of 10, then running 10 DQ Jobs opens 100 connections (10 DQ Jobs multiplied by 10 connections each) to the data source.

Because the maximum number of parallel connections varies by data source, account for the specific constraints of your data source when setting the number of connections. Also consider the number of connections already open on your data source and the number of DQ Jobs you intend to run in parallel.
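
As a rough planning aid, the sketch below checks a planned configuration against a source connection limit; the limit and job counts are illustrative assumptions.

  # Hypothetical capacity check before raising No. of Connections.
  source_max_connections = 200  # assumed limit on the data source side
  parallel_dq_jobs = 10         # jobs you plan to run at once
  connections_per_job = 10      # the No. of Connections setting

  total_open = parallel_dq_jobs * connections_per_job  # 100 connections here
  if total_open > source_max_connections:
      print("Planned jobs would exceed the source connection limit; "
            "lower No. of Connections or run fewer jobs in parallel.")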

Tip 
When you run multiple DQ Jobs in parallel against a large data warehouse, you can increase the number of available connections for the DQ Job to which you want to grant processing priority in the warehouse.

Other scenarios where you may consider increasing the number of connections include:

  • The Job log shows that a particular activity is taking an unusually long time to process.
  • A DQ Job fails with an exception message referring to the failure of a particular DQ layer, such as outliers.
  • Memory consumption is peaking in the data source.
  • You are not getting the maximum throughput of your data warehouse.
  • Your data warehouse is answering queries but not processing the DQ Job.

No. of Threads

The maximum number of threads a DQ layer can run in parallel. Use this parameter to divide the number of open connections between DQ layers. For instance, when a DQ Job has 10 connections and 2 DQ layers to process, you can set the number of threads to 5 to distribute the Job load evenly.

The default value is 2.

Tip 
Consider adjusting the number of threads based on the total number of DQ layers in your DQ Job. For example, if your DQ Job includes checks for duplicates, outliers, rules, and shapes, increasing the number of available threads may improve the run time.

Run Date
Date Format

The run date format (${rd} or ${rdEnd}) substituted on the command line at runtime. The option you select should match the date or datetime format of the timeslice column that you specified in the Time Slice option of the Select rows step.
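
Example With a date format of yyyy-MM-dd, a query filter written as WHERE load_date = '${rd}' would run as WHERE load_date = '2024-01-15' when the job's run date is January 15, 2024; the column name and date are illustrative.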