Create a Data Quality Job
Data Quality Jobs regularly monitor specific data to identify issues before they affect your business. A Data Quality Job includes a scope query to select data from one or more tables, and various settings to define how and when to run the job with specific monitors. Data Quality Jobs allow you to:
- Conduct immediate and scheduled data quality checks on your data.
- Create data profiles of your data.
- Apply automatic monitoring to track the evolution of your data.
- Configure automated notifications to send to users when certain conditions are met.
Important A Data Quality Job is not an asset; however, it is presented in Collibra in a way that closely resembles an asset.
Step | Description | Required? |
---|---|---|
1. Select the data to monitor | Select a table from Monitoring Overview, then create a Data Quality Job on it. | |
2. Add monitors | Select the data quality monitors to apply to your job. | |
3. Set a run schedule | Schedule a Data Quality Job to run automatically at a specified frequency. | |
4. Add notifications | Configure notifications to send to job owners when a job meets certain conditions. | |
5. Review and run | Review the SQL query or JSON payload, then save and run your Data Quality Job. | |
Prerequisites
- The DQ Pushdown Edge capability is enabled for your data source.
- You have a global role with the Product Rights > Data Quality global permission or the following permissions based on the actions you want to perform:
- To create quick monitoring jobs, you must have the Data Quality Manager resource role with the Data Quality Job > Create resource permission.
- To schedule jobs, you must have the Data Quality Editor or Data Quality Manager resource role with the Data Quality Job > Schedule resource permission.
- To run quick monitoring jobs, you must have the Data Quality Editor or Data Quality Manager resource role with the Data Quality Job > Run resource permission.
- If you want to use notifications, ensure that the users who need to receive notifications have a Collibra user account.
Steps
0 Open the Data Quality Job stepper
- From Monitoring Overview, expand your data source, then expand a schema within it. The available tables within the schema expand.
- Click the table where you want to run a Data Quality Job. The table opens on the right side of the page.
- Click + Data Quality Job in the upper-right corner of the page. The Create Data Quality Job stepper opens to the Data step.
Note If a table does not yet have monitoring, click + Data Quality Job in the center of the page, to the right of the data explorer.
1 Select the data to include in the scan
- Optionally, give your Data Quality Job a unique name in the Job Name field. The Job Name cannot contain special characters other than hyphens (-), periods (.), and underscores (_).
- Optionally, click Select columns to include specific columns from the selected table in your Data Quality Job.
Tip All columns in your table are included in the Data Quality Job by default. If you want to run a Data Quality Job on the entire table, you do not need to modify the default column selections.
- In the right pane, select the checkbox next to each column you want to remove from your Data Quality Job.
- Click the arrow button to move the selected columns from the right pane to the left pane and remove them from your Data Quality Job.
- Optionally, click the opposite arrow button to move selected columns in the left pane back to the right pane and include them in your Data Quality Job.
- Optionally, select Sample, then enter a specific value in the Sample rows field to include that number of rows in the sample.
- Click Next.
The Monitors step opens.
Tip The Job Name field uses a default name, following a schemaName.tableName_number format. A number at the end of the name, such as schemaTest.tableExample_2, indicates that another Data Quality Job called schemaTest.tableExample already exists.
Note When there are more than 50 columns in your table, the pane is paginated.
Tip When there are multiple available columns to select in one of the panes, you can select all the columns you want to move to the other pane, then click the arrow button to move them in bulk.
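The default job-name scheme described in the tip above (schemaName.tableName, with a numeric suffix on collision) can be sketched as follows. This is an illustrative helper, not Collibra's implementation; the name is generated server-side.

```python
def default_job_name(schema: str, table: str, existing: set[str]) -> str:
    """Build a schemaName.tableName default job name, appending _2, _3, ...
    when a job with that name already exists (illustrative sketch)."""
    base = f"{schema}.{table}"
    if base not in existing:
        return base
    n = 2
    while f"{base}_{n}" in existing:
        n += 1
    return f"{base}_{n}"
```

For example, when `schemaTest.tableExample` already exists, the next default name would be `schemaTest.tableExample_2`.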
2 Add monitors
After selecting the data to include in the scan, you need to add monitors to your Data Quality Job. All Data Quality Jobs include data type and schema change checks.
- Data type
- Schema evolution
- Row count
- Uniqueness
- Null values
- Empty fields
- Optionally, select Allow descriptive statistics in profiling to show the results of checks for min, median, max, Q1 and Q3 values from numeric columns in the UI.
- Select the monitors to include or exclude in your Data Quality Job.
- Optionally, click Advanced, then drag the sliders or enter a value in the fields to configure the settings.
- Click Next.
The Schedule step opens.
Warning Allowing descriptive statistics in profiling may include sensitive values if they are present in the columns you include in the scan.
Monitor | Description | Default on? |
---|---|---|
Row count | Tracks changes to the total number of rows. | Yes |
Uniqueness | Tracks changes in the number of distinct values across all columns. | Yes |
Null values | Detects changes in the number of NULL values in all columns. | Yes |
Empty fields | Detects changes in the number of empty values in numeric columns. | Yes |
Min value | Detects changes in the lowest value in numeric columns. | No |
Mean value | Detects changes in the average value in numeric columns. | No |
Max value | Detects changes in the highest value in numeric columns. | No |
Execution time | Tracks changes in the execution time of the data job. | No |

Note If you do not change the default options, the row count, uniqueness, null values, and empty fields monitors are included in your Data Quality Job.
Setting name | Description | Default setting |
---|---|---|
Data Lookback | The number of runs to train the adaptive monitoring for expected values. | 10 (runs) |
Learning Phase | The minimum number of runs to train the adaptive monitoring before it can begin to suggest if the adaptive rules are breaking or passing. | 4 (runs) |
3 Set a run schedule
- Select the automated run frequency of the Data Quality Job from the Repeat drop-down list. The available run start options differ depending on the selected Repeat option.
- Optionally, configure the compute resourcing.
- Optionally, click Advanced. The Run date offset, Connections, and Threads settings expand.
- Select an option from the Run date offset drop-down list.
- In the Connections option, click and drag the slider or enter a value between 1 and 50 in the field. The default setting is 10.
- In the Threads option, click and drag the slider or enter a value between 1 and 10 in the field. The default setting is 2.
- Optionally, include a time slice to run the Data Quality Job on a moving subset of time, or enable back runs to start Data Quality Job runs on historical data.
- View the row count distribution of the date column you selected for your time slice as a chart or as a table. Click the table icon to switch to the table display.
- Select an option from the drop-down list to display the distribution in days, weeks, or months.
- Click any of the bars in the chart to apply your selection to the start date.
- Click Next. The Notifications step opens.
Repeat options | Description | Run start options |
---|---|---|
Never | Your Data Quality Job doesn't run automatically along a schedule. No additional Repeat configuration is required. Note: When you select Never, the job runs only once, when you initiate the job run. | None |
Hourly | Your Data Quality Job runs every hour, at the minute you set in the Run start minute option. | Run start minute: Enter a value between 00 and 59 in mm format, or click the time picker. |
Daily | Your Data Quality Job runs every day, at the time you set in the Run start time option. | Run start time: Enter a value between 01:00 and 12:59 in hh:mm format, or click the time picker. Note: Because run start time uses a 12-hour clock, you need to select AM or PM from the drop-down list. |
Weekly on | Your Data Quality Job runs every week, on the day(s) of the week and time you set in the Run start time option. | Run days: Select one or more days on which you want your Data Quality Job to run automatically at the designated run start time. Run start time: Enter a value between 01:00 and 12:59 in hh:mm format, or click the time picker. Note: Because run start time uses a 12-hour clock, you need to select AM or PM from the drop-down list. |
Weekdays | Your Data Quality Job runs every day, except Saturday and Sunday, at the time you set in the Run start time option. | Run start time: Enter a value between 01:00 and 12:59 in hh:mm format, or click the time picker. Note: Because run start time uses a 12-hour clock, you need to select AM or PM from the drop-down list. |
Monthly | Your Data Quality Job runs every month, on either the first or last day of the month, or on the calendar day number and time you set. | Run day: Use the drop-down list to run your Data Quality Job monthly, on the first or last day of the month. Alternatively, select a specific date from the Day number drop-down list to run your Data Quality Job every month on a fixed date. Run start time: Enter a value between 01:00 and 12:59 in hh:mm format, or click the time picker. Note: Because run start time uses a 12-hour clock, you need to select AM or PM from the drop-down list. |
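The Repeat and run start settings above amount to computing the next run time from the current time. A minimal sketch of that computation for the Hourly and Daily cases follows; it uses a 24-hour time string for simplicity, whereas the Collibra UI uses a 12-hour clock with AM/PM, and the scheduler itself is internal to the product.

```python
from datetime import datetime, timedelta

def next_hourly_run(now: datetime, run_minute: int) -> datetime:
    """Next run for an Hourly schedule: the given minute of the current hour,
    or of the next hour if that moment has already passed (sketch)."""
    candidate = now.replace(minute=run_minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate

def next_daily_run(now: datetime, run_time: str) -> datetime:
    """Next run for a Daily schedule given an 'HH:MM' 24-hour run time (sketch)."""
    hh, mm = map(int, run_time.split(":"))
    candidate = now.replace(hour=hh, minute=mm, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```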
Setting name | Description | Default setting |
---|---|---|
Run date offset | Allows jobs to run on data from a time range other than the scheduled date. It dynamically defines the date inserted into your job query as the run date (rd). For example, to use the last day of the previous month as the run date, select Last of previous month from the drop-down list. This configuration inserts the last day of the previous month into your job query as the run date. Example: If the current month is January and you select Last of previous month as the run date offset, the new run date is set to December 31. The options correspond to your selection in the Repeat option. For example, if you select Monthly in the Repeat option, the Run date offset options reflect a monthly cadence. Note: This option is not available when the Repeat option is set to Never. | Scheduled date |
Connections | Refers to the number of simultaneous connections to the database. While increasing the number of connections can improve the processing speed of a job, it can also impact compute resourcing. | 10 (connections) |
Threads | Refers to the number of concurrent threads for the Data Quality Job. While increasing the number of threads can improve the processing speed of a job, it can also impact compute resourcing. | 2 (threads) |
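The Run date offset options are simple date arithmetic. A minimal sketch of the Last of previous month option, matching the January → December 31 example above:

```python
from datetime import date, timedelta

def last_of_previous_month(run_date: date) -> date:
    """'Last of previous month' offset: the day before the first of the
    current month (sketch of one Run date offset option)."""
    return run_date.replace(day=1) - timedelta(days=1)
```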
Option | Description |
---|---|
Time slice | A time slice filter is a dynamic filter that scans a range of dates or times to include in a job run. It allows you to perform incremental Data Quality Job runs on data over a moving subset of time. Time slice uses the current time stamp as your run date to isolate the latest time slice data from your table and runs a Data Quality Job on that portion for that day. For example, a run from the previous day checks for yesterday's date and groups it as a time slice. As long as a column that includes today's date exists in the table, a run from the current day will create a time slice for today's date and include it in the run. Note: Enter a number greater than 0, then use the time slice drop-down list to change the time slice to hours, weeks, or months. If you skip this step, the time slice format defaults to days. |
Date column | A column in your data with a date or date time format. The date column is used as the run date (rd) in your job's SQL query. |
Convert to date data type | Transforms a column from a non-date format to a synthetic date column. You can use this option when your date column is not in YYYY-MM-DD or YYYY-MM-DD HH:MM:SS date time format. Using your data source's SQL syntax, enter SQL to cast the date column into a date format. |
Data type of time slice column | Select whether the data type of the time slice column you choose in the Date column uses a datetime or timestamp format, such as YYYY-MM-DD HH:MM or Wed Sep 23, 1998 05:06 UTC (as shown in the Show distribution option below), or a traditional date format, such as YYYY-MM-DD. |
Back runs | Back runs allow you to create one historical DQ Job run for each interval of the selected time bin. For instance, to create a monthly back run for a certain number of past months, set the back run value to the number of months for which you want historical data and select Months from the time bin drop-down list. Example: To create one back run for each of the last 5 months, set the back run value to 5 and the time bin to Months. When you select this option, the back run value field, time bin drop-down list, and Show distribution button appear. Enter a number greater than 0 in the back run value field, then select an option from the time bin drop-down list to change the back run grouping to days, months, or years. |
Show distribution | When you click Show distribution, the row count distribution of the selected date column loads. The chart display is selected by default; when the table display is selected, you can click the chart icon to switch back. |
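The time slice and back run mechanics above can be sketched in a few lines. This is an illustrative Python sketch only: the SQL predicate shape, the column name, and the day-based binning are assumptions, not Collibra's actual query generation.

```python
from datetime import date, timedelta

def time_slice_filter(date_column: str, run_date: date, days: int = 1) -> str:
    """Build an illustrative SQL predicate selecting the moving time slice
    that ends on the run date (column name and SQL syntax are assumptions)."""
    start = run_date - timedelta(days=days)
    return (f"{date_column} > DATE '{start.isoformat()}' "
            f"AND {date_column} <= DATE '{run_date.isoformat()}'")

def back_run_dates(run_date: date, count: int, bin_days: int) -> list[date]:
    """One historical run date per interval of the selected time bin,
    most recent first (sketch; Collibra computes these internally)."""
    return [run_date - timedelta(days=bin_days * i) for i in range(1, count + 1)]
```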
4 Add notifications
- Select the notification options.
- Optionally, in the Global custom message input field, enter a custom message to include with each notification.
- Optionally, switch on Individual custom messages in the upper-right corner, and enter a unique message for each applicable notification option.
- Click the Recipients drop-down list and select a user to add them to the list of recipients, or click the remove icon to the right of a recipient's username to remove them.
- Click Next. The Review step opens.
Notification option | Sends a notification to the specified recipients when... | Default on |
---|---|---|
Job failed | Your Data Quality Job does not complete a run. | |
Job completed | Your Data Quality Job completes a run. | |
Rows <= | The number of rows is less than or equal to the value you set. Enter a value greater than 0. | |
Runs without data >= | The number of runs where your Data Quality Job contains no data is greater than or equal to the value you set. Enter a value greater than 0. | |
Days without data >= | The number of days without data is greater than or equal to the value you set. Enter a value greater than 0. | |
Score <= | The data quality score is lower than or equal to the value you set. Enter a value between 0 and 100. | |
Run time minutes > | The run time of your job exceeds the value you set, in minutes. Enter a value greater than 0. | |
Note When individual custom messages are enabled, they override the global custom message.
Note The drop-down list of users includes all active users in Collibra.
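The threshold conditions in the table above can be sketched as a simple evaluation over a finished run. The field and threshold names below are illustrative, not Collibra's API, and only a subset of the notification options is shown.

```python
def due_notifications(run: dict, thresholds: dict) -> list[str]:
    """Return the notification options triggered by a finished run
    (illustrative sketch; field names are assumptions)."""
    due = []
    # Job failed / Job completed are mutually exclusive outcomes of a run.
    due.append("Job failed" if run.get("failed") else "Job completed")
    if run["rows"] <= thresholds.get("rows_lte", -1):
        due.append("Rows <=")
    if run["score"] <= thresholds.get("score_lte", -1):
        due.append("Score <=")
    if run["runtime_minutes"] > thresholds.get("runtime_gt", float("inf")):
        due.append("Run time minutes >")
    return due
```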
5 Review and run
The review step provides a summary of the settings of your Data Quality Job.
- Review the SQL query or JSON payload.
- Click Save and run now. The job is submitted to the jobs queue and appears in the Job column of its table of origin in the Monitoring Overview page.
Note The code under the Query and JSON tabs is a read-only preview of your job query.
You can open the Job Details page via the Monitoring Overview page. While the job runs, the Job Details page shows a Running status.
Note If information is incorrect or missing, such as an incorrect data type in a date column, clicking Save and run now returns you to the step with the issue. You must resolve it before you can run the job.
You can now review the details of your Data Quality Job on the Job Details page.