Core concepts

Data Quality & Observability offers many out-of-the-box tools to assess the quality of your data and help you gain confidence in it.

Data Quality Job

A Data Quality Job is a group of columns from one or more tables that are evaluated by specific monitors and profiling actions. These monitors help you ensure the data within the job's scope is accurate and reliable for reporting and analysis.

Important A Data Quality Job is not an asset; however, it is presented in Collibra in a way that closely resembles an asset.

A Data Quality Job consists of the following components:

  • Scope query
  • Schedule
  • Filters:
    • Time slice
    • Row
    • Limit (sample size)
  • Logs
  • Profile
  • Monitors
  • Permissions

Data quality score

The data quality score is an aggregated percentage between 0 and 100 that summarizes the integrity of your data. A score of 100 indicates that Data Quality & Observability has not detected any quality issues, or that such issues are being suppressed. When a score meets the out-of-the-box or custom criteria to trigger a notification, Collibra sends a notification to the assigned recipients.

Depending on the scoring threshold, which consists of predetermined scoring ranges, a data quality score falls into one of the following scoring classifications:

  • Passing: A data quality score higher than or equal to the uppermost scoring threshold. The out-of-the-box passing range is 90-100.

    Important A passing score does not guarantee the absence of data quality issues. We recommend that you always review the results of Data Quality Jobs for any underlying issues.

  • Warning: A data quality score between the passing and failing thresholds. The out-of-the-box warning range is 76-89.
  • Failing: A data quality score lower than or equal to the lowermost scoring threshold. The out-of-the-box failing range is 0-75.

    Important Failing scores clearly indicate potential data quality issues, making it essential to notify recipients so they can investigate and take further action.

Example The data quality scores in the following screenshot reflect the various out-of-the-box scoring classifications as they are shown in the run history chart on the Monitors tab of a Data Quality Job.

screenshot of score chart

  • In the first segment, the score is 100. Because this is a passing score, no notifications are sent to assigned recipients, even when score-based notifications are enabled.
  • The second and third segments both show a failing score of 0. If score-based notifications are enabled, notifications are sent to assigned recipients so they can investigate the potential data quality issues and take further action.
  • The fourth segment shows a warning score of 85. Because the out-of-the-box limit for sending notifications to assigned recipients is a score of 75 or lower, no notification is sent. If you want to be notified of warning scores, consider aligning the score notification limit with the upper bound of the warning range in your scoring threshold.
  • In the fifth segment, the score of 92 falls within the passing score range. Similar to the first segment, no notifications are sent to assigned recipients; however, we still recommend that you review the Data Quality Job results for any potential issues.

Tip You can adjust the scoring thresholds to meet your organization's needs.
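
Example The following Python sketch illustrates how the out-of-the-box classification and score-based notification behave, using the default ranges (passing 90-100, warning 76-89, failing 0-75) and the default notification limit of 75 described above. The function names and the standalone code are illustrative only; they are not part of Data Quality & Observability, which applies these rules for you.

```python
def classify_score(score: float, passing_min: int = 90, failing_max: int = 75) -> str:
    """Classify a data quality score using the out-of-the-box ranges.

    Passing: score >= passing_min (default 90-100)
    Warning: between the failing and passing thresholds (default 76-89)
    Failing: score <= failing_max (default 0-75)
    """
    if score >= passing_min:
        return "Passing"
    if score <= failing_max:
        return "Failing"
    return "Warning"


def should_notify(score: float, notification_limit: int = 75) -> bool:
    """Score-based notifications fire when the score is at or below the limit."""
    return score <= notification_limit


# The five segments from the example run history above:
for score in (100, 0, 0, 85, 92):
    print(score, classify_score(score), "notify" if should_notify(score) else "no notification")
```

Because the thresholds are adjustable, changing the parameters above mirrors what happens when you customize your scoring threshold or score notification limit.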

The data quality score in the following screenshot appears in the About panel on the right side of the Job Details page. This score represents the latest job run and may change over time as the results of your Data Quality Job runs evolve.

screenshot of data quality score on "about" panel

Data profiling

Data profiling provides a detailed analysis of your data's behavior and trends over time. It forms the foundation of a robust data quality and observability strategy.

When you run a Data Quality Job, the scan results include insights such as column-level statistics, charts, and other data quality and observability metrics. These insights help you identify common patterns and emerging trends, providing a better understanding of the structure and quality of your data.

Tip For more information on data profiling metrics, go to About the data profile.

Terms Description
Statistics
The profile observability statistics of your column.
Min
The minimum string length when the data type is string.
Median
The median value. This is "N/A" when the data type is string or date.
Max
The maximum string length when the data type is string.
1st Quartile
The 25th percentile of the data in the Data Quality Job.
3rd Quartile
The 75th percentile of the data in the Data Quality Job.
Min/Max Scale
The normalization of the data into values between the minimum value of 0 and the maximum of 2.
Min/Max Precision
The minimum and maximum number of numeric places after the decimal point for columns containing decimal type data.
Top Shapes

The string data type cell values that appear the most and least frequently in the column. This helps you identify inconsistent data types or unexpected string values, recognize frequently recurring patterns, and understand the distribution of data in the column.

The following data types are detected:

  • Custom Shape
  • Double
  • Int
  • Number
  • String
Defined Type
The type of data as defined by the data source.
Inferred Type
The data type detected by Data Quality & Observability when evaluating the values contained in the column. When the inferred data type does not match the defined type, Data Quality & Observability marks it as a mismatch with an indicator.
Completeness

The percentage of cells in a column that contain values identified as actual values, null, or empty. Valid values are considered complete, whereas nulls and empties are considered incomplete.

Complete values are:

  • Numerical data observed in a numerical column.
  • Non-numerical data observed in a non-numerical column.
  • Non-numerical data observed in a numerical column.
  • Numerical data observed in a non-numerical column.

Null fields do not contain a value at all.

Empty values are string values with a length of zero.

Values
The number of unique values in a column.
Values +/- %
The percentage change in the number of complete values in a column from the current Data Quality Job run in comparison to the baseline count.
Nulls
The number of null fields in a column.
Nulls +/- %
The percentage change in the number of null fields in a column from the current Data Quality Job run in comparison to the baseline count.
Empties
The number of rows in the column that contain no value.
Empties +/- %
The percentage change in the number of rows in the column that contain no value from the current Data Quality Job run in comparison to the baseline count.
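
Example As a rough illustration of how several of these metrics are calculated, the following sketch profiles a single column of values. The function name, input format, and baseline handling are assumptions made for the example; Data Quality & Observability computes these statistics for you when a Data Quality Job runs.

```python
import statistics

def profile_numeric_column(values, baseline_count=None):
    """Illustrative calculation of a handful of the profile metrics above."""
    present = [v for v in values if v is not None and v != ""]   # complete values
    nulls = sum(1 for v in values if v is None)                  # null fields
    empties = sum(1 for v in values if v == "")                  # zero-length strings

    # Completeness: percentage of cells that contain an actual value.
    completeness = 100 * len(present) / len(values)

    # 1st Quartile = 25th percentile, Median = 50th, 3rd Quartile = 75th.
    q1, median, q3 = statistics.quantiles(present, n=4)

    # Min/Max Scale: here the values are normalized onto a 0-1 range for
    # illustration; the scale used by the product may differ.
    lo, hi = min(present), max(present)
    scaled = [(v - lo) / (hi - lo) for v in present]

    # Values +/- %: change in the number of complete values against the baseline run.
    change_pct = None
    if baseline_count:
        change_pct = 100 * (len(present) - baseline_count) / baseline_count

    return {"completeness": completeness, "q1": q1, "median": median, "q3": q3,
            "min": lo, "max": hi, "scaled": scaled, "values_change_pct": change_pct,
            "nulls": nulls, "empties": empties}

print(profile_numeric_column([10, 20, None, 40, 50, 30], baseline_count=4))
```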

Data quality monitors and dimensions

Data quality monitors are out-of-the-box or user-defined SQL queries that provide observational insights into the quality and reliability of your data. Each monitor is associated with a default data quality dimension. Data quality dimensions categorize data quality findings to help communicate the types of issues detected.

Monitor types

Monitor name Description
Schema change

Schema evolution changes, such as columns that are added, updated, or deleted.

By default, schema change is assigned to the Integrity dimension.

Data type check

Changes to the inferred data type for a given column.

By default, data type checks are assigned to the Validity dimension.

Row count

Tracks changes to the number of rows in the Data Quality Job.

By default, row count is assigned to the Completeness dimension.

Uniqueness

Detects changes in the number of distinct values in each column.

By default, uniqueness is assigned to the Duplication dimension.

Null values

Detects changes in the number of null values in all columns.

By default, null values are assigned to the Completeness dimension.

Empty fields

Finds changes in the number of empty values in all columns.

By default, empty fields are assigned to the Completeness dimension.

Min value

Detects changes in the lowest value in numeric columns.

By default, min value is assigned to the Accuracy dimension.

Max value

Detects changes in the highest value in numeric columns.

By default, max value is assigned to the Accuracy dimension.

Mean value

Detects changes in the average value in numeric columns.

By default, mean value is assigned to the Accuracy dimension.

Execution time

Tracks changes in the execution time of the Data Quality Job.

By default, execution time is assigned to the Consistency dimension.
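
Example Conceptually, each of these monitors compares the current run's observation against a previous run or baseline and flags significant changes. The following sketch shows that idea for a handful of metrics; the data structures, metric names, and tolerance are assumptions made for the example, and actual monitors are SQL-based checks that run inside Data Quality & Observability.

```python
def detect_changes(previous: dict, current: dict, tolerance_pct: float = 10.0):
    """Flag metrics whose value moved more than tolerance_pct between two runs.

    `previous` and `current` map a metric name (for example "row_count" or
    "nulls.email") to the value observed in that run.
    """
    findings = []
    for metric, prev_value in previous.items():
        curr_value = current.get(metric)
        if curr_value is None:
            findings.append((metric, "metric missing in current run"))
            continue
        if prev_value == 0:
            pct = float("inf") if curr_value != 0 else 0.0
        else:
            pct = 100 * (curr_value - prev_value) / prev_value
        if abs(pct) > tolerance_pct:
            findings.append((metric, f"changed by {pct:+.1f}%"))
    return findings

previous_run = {"row_count": 10_000, "nulls.email": 12, "distinct.country": 42}
current_run = {"row_count": 7_400, "nulls.email": 13, "distinct.country": 41}
for metric, finding in detect_changes(previous_run, current_run):
    print(metric, finding)   # only the large row count drop is flagged
```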

Dimension types and their associated monitors

Each out-of-the-box data quality dimension is associated with one or more monitors, as described in the following table.

Dimension Description
Accuracy

The degree to which data correctly reflects its intended values.

Monitors associated with the Accuracy dimension: Min, max, and mean values.

Completeness

The degree to which data is present. Completeness refers to the percentage of cells in a column that contain an actual value rather than a NULL or EMPTY value.

Monitors associated with the Completeness dimension: Row count, null values, and empty fields.

Consistency

The degree to which data contains differing, contradicting, or conflicting entries.

Monitor associated with the Consistency dimension: Execution time.

Integrity

The legitimacy of data across formats and as it's managed over time. It ensures that all data in a database can be traced and connected to related data.

Monitor associated with the Integrity dimension: Schema change.

Validity

The degree to which data conforms to its defining constraints or conditions, which can include data type, range, or format.

Monitor associated with the Validity dimension: Data type check.

Duplication

The degree to which data contains only one record of how an entity is identified. This refers to the cardinality of the columns in your dataset.

Monitor associated with the Duplication dimension: Uniqueness.

Notifications

Notifications are emails that alert users to data quality events and issues as they occur. To trust your data, it is essential to stay informed of potential data quality issues as soon as they are detected. You have full control over how notifications are configured.

You can configure notifications to alert relevant stakeholders to issues with a physical data asset, such as a failed job or no data being returned for a certain number of days despite successful job runs.
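
Example As an illustration of the kinds of conditions a notification can cover, the following sketch evaluates a run history and returns the reasons an alert would be sent. The class, the score limit of 75, and the three-run window are assumptions made for the example; in practice you configure these conditions in Data Quality & Observability rather than in code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunResult:
    succeeded: bool
    score: Optional[int]   # data quality score for the run, if it produced one
    rows_scanned: int      # 0 means the run returned no data

def notification_reasons(history: list,
                         score_limit: int = 75,
                         empty_runs_limit: int = 3) -> list:
    """Return the reasons the latest run should trigger an email, if any."""
    reasons = []
    latest = history[-1]
    if not latest.succeeded:
        reasons.append("job failed")
    elif latest.score is not None and latest.score <= score_limit:
        reasons.append(f"score {latest.score} is at or below {score_limit}")
    # Successful runs that keep returning no data are also worth an alert.
    recent = history[-empty_runs_limit:]
    if len(recent) == empty_runs_limit and all(r.succeeded and r.rows_scanned == 0 for r in recent):
        reasons.append(f"no data returned for {empty_runs_limit} consecutive runs")
    return reasons

history = [RunResult(True, 98, 0), RunResult(True, 97, 0), RunResult(True, 100, 0)]
print(notification_reasons(history))   # ['no data returned for 3 consecutive runs']
```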

Tip For more information about how to set up notifications, go to Create a Data Quality Job.

Schedule

Schedules trigger a series of monitors to run automatically against a Data Quality Job at specified intervals. You can set schedules to run hourly, daily, weekly on specific days and times, on weekdays, or monthly. Schedules ensure automated data quality coverage without the need for manual intervention on a given date.
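
Example As an illustration of the supported cadences, the following sketch expresses them as cron-style expressions. Cron syntax is only a common way to describe such intervals; schedules in Data Quality & Observability are configured when you create the job, not necessarily as cron expressions.

```python
# Common cron-style expressions for the interval types described above
# (minute hour day-of-month month day-of-week). The names and times are
# arbitrary examples chosen for this illustration.
schedules = {
    "hourly":            "0 * * * *",     # at the top of every hour
    "daily_06_00":       "0 6 * * *",     # every day at 06:00
    "weekly_mon_06_00":  "0 6 * * 1",     # every Monday at 06:00
    "weekdays_06_00":    "0 6 * * 1-5",   # Monday through Friday at 06:00
    "monthly_1st_06_00": "0 6 1 * *",     # the 1st of every month at 06:00
}

for name, expression in schedules.items():
    print(f"{name:<20} {expression}")
```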

Tip For more information about how to schedule a series of monitors to run automatically, go to Create a Data Quality Job.

What's next?

Before diving into the full Data Quality Job creation process, we recommend checking out some other resources to help you get started.