Add Secondary Datasets
You can add one or more secondary datasets to a Data Quality Rule. Secondary datasets are useful when you want to compare one table against another table.
For optimal performance, if possible, define and scan a view instead of adding a secondary dataset.
If the dataset is via a Pull Up connection, you can add it from either the same connection type or from another connection type (also referred to as a cross-connection rule). If the dataset is via a Pushdown connection, you can only add secondary datasets from the same connection.
A typical use case is to verify the validity of a column by comparing it to a similar column in another table. For example, a health insurance provider may use proprietary codes that are maintained in a main list. In this case, you would compare the code values in a daily activity table with the code values in the main list.
To compare the id
column in one table to the id
column in another table, the rule may appear as follows:
select * from @table1 where id
not in ( select id from @table2 )
To compare multiple columns from one table to another, the rule may appear as follows:
select id, app_id, email, guid_num from @table1
EXCEPT
select id, app_id, email, guid_num from @table2
There is a known limitation where a Cartesian join which produces a greater number of rows than the initial dataset shows a score greater than 100%. There is not a workaround other than the query for the DQ job be the query that produces that larger number of rows.
For additional information on writing SQL queries, including using Joins, go to Freeform SQL.
Adding a secondary dataset from the same connection
You can add one or more secondary datasets from the same connection type in both Pull Up and Pushdown connection types. This feature is for convenience purposes and not required for rule processing.
- In the Rule Workbench, click the
icon and enter the name of the dataset in the Add Secondary Dataset dialog. For more information, go to Create a data quality rule.
- Add your SQL rule.
- Click Save.
Adding a secondary dataset from a different connection
Pullup connections support adding secondary datasets from a different connection. This feature is not currently available for Pushdown connections.
- In the Rule Workbench, click the
icon and enter the name of the dataset in the Add Secondary Dataset dialog. For more information, go to Create a data quality rule.
- Add your SQL rule.
- Click Save.
- Update the command line to load the drivers for the other connection type.
- Display Data Findings, and open the Job tab.
- Select the Command Line tab.
- Add the following parameter to the command line:
-addlib "/opt/owl/drivers/bigquery/"
Referencing secondary datasets
When you create a rule with secondary datasets, keep in mind the following points in order to avoid missing data set exceptions:
- Ensure that the name of your first data set matches the name of the main data set on which your rule is created.
- The column names in your table must be exact matches with the syntax of your query.
Note If the column name uses CAPS, then your query must also use CAPS.
When you run a Job that references a secondary dataset, a temporary Job to load the secondary dataset appears on the Jobs page. This Job temporarily changes the dataset name while the Spark process runs. When you click the temporary Job after its status updates to Finished, the Findings page may show Rows: undefined in the Profile details. If this happens, you can refresh the Jobs page until the Job reverts to its actual name, and then click the Job to view the results.