Overview: Data Quality & Observability Classic core concepts

Data Quality & Observability Classic offers many out-of-the-box tools to assess the quality of your data and help you gain confidence in it.

Data quality jobs and datasets

A dataset is the definition of the data quality job. It maps certain objects, such as custom SQL queries and monitors, to your data.

A data quality job, or DQ Job, is the execution of a dataset definition. The dataset definition, along with other metadata, is sent as a bundle of code directly to the data source when using Pushdown mode or to the Apache Spark compute engine when using Pullup mode for processing. A DQ Job executes a single dataset definition, and regardless of how many times it runs—manually or according to an automated schedule—it still represents the same dataset.

In simple terms, a dataset is the core collection of data included in the package of code that constitutes a DQ Job.
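
The dataset-versus-job relationship can be sketched in code. This is an illustrative model only; the class and field names are hypothetical, not part of the product:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Dataset:
    """The definition: what to scan and how (illustrative sketch)."""
    name: str
    query: str  # e.g. a custom SQL query mapped to the data

@dataclass(frozen=True)
class DQJob:
    """One execution of a dataset definition on a given run date."""
    dataset: Dataset
    run_date: date

orders = Dataset(name="orders", query="SELECT * FROM orders")
monday = DQJob(dataset=orders, run_date=date(2024, 1, 1))
tuesday = DQJob(dataset=orders, run_date=date(2024, 1, 2))

# Two distinct runs, but one and the same dataset definition.
assert monday != tuesday
assert monday.dataset == tuesday.dataset
```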

Schedule

A schedule allows you to automatically run your DQ Job on the days you specify. Scheduling options include:

  • Daily
  • Monthly
  • Quarterly
Example You can view the relationship between a dataset and a DQ Job in the following table. The table shows a DQ Job that runs on a schedule every weekday but not on weekends. While the dataset remains the same for each run, each run is a new job, distinguished by its run date: the date the job runs, which is part of the dataset definition. This distinction is important because the run date of a scheduled job changes to reflect the day the job runs rather than remaining static.
| Sunday | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday |
| --- | --- | --- | --- | --- | --- | --- |
| The DQ Job is not scheduled to run automatically on this day. | monday job | tuesday job | wednesday job | thursday job | friday job | The DQ Job is not scheduled to run automatically on this day. |
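
Under the assumptions of the example above, a weekday-only schedule amounts to a simple calendar check, sketched here in Python (illustrative, not product scheduling code):

```python
from datetime import date

WEEKDAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4

def runs_on(day: date) -> bool:
    """True if a weekday-only schedule would trigger a run on this day."""
    return day.weekday() in WEEKDAYS

assert runs_on(date(2024, 1, 1))       # a Monday: the job runs
assert not runs_on(date(2024, 1, 6))   # a Saturday: no automatic run
```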

Tip For more information on scheduling, go to Scheduling a DQ Job to learn how to set an automated run schedule, or Schedule Restrictions to learn how to block certain days or times from automated runs.

Connections

You can create a connection between Data Quality & Observability Classic and your data source to allow Data Quality & Observability Classic to run DQ Jobs on your data. Data Quality & Observability Classic provides out-of-the-box support for many common JDBC and file-based data sources. It uses secure authentication methods supported by each data source to access data within their databases. You can also create a connection to a JDBC data source that is not officially supported by Data Quality & Observability Classic via the Generic JDBC option.
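
As a rough illustration of what a JDBC connection definition carries, here is a hypothetical example. The field names and URL follow standard JDBC conventions (PostgreSQL in this case) and are not the product's actual configuration schema:

```python
# Hypothetical connection settings; the driver class and URL format follow
# standard JDBC conventions, not a product-specific API.
connection = {
    "name": "warehouse",
    "driver": "org.postgresql.Driver",
    "url": "jdbc:postgresql://db.example.com:5432/analytics",
    "user": "dq_service",
    # Credentials are normally supplied via a secure authentication method
    # supported by the data source, never hard-coded.
}
assert connection["url"].startswith("jdbc:")
```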

Tip For more information on how to connect to the supported JDBC or file-based data sources, go to Supported JDBC data sources or Supported remote files.

Processing methods

Data Quality & Observability Classic uses two methods to process DQ Jobs: Pushdown and Pullup. The processing method of your data is determined by how your connection is set up.

Pushdown

In Pushdown, DQ Jobs are submitted directly to Pushdown-compatible data sources, such as Databricks, SAP HANA, or Snowflake, where their processing occurs entirely inside the SQL data warehouse.
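
Conceptually, Pushdown means the metric computation is expressed as SQL and executed inside the warehouse, so only aggregates leave the data source. A minimal sketch, assuming a hypothetical null-count check (not the SQL the product actually generates):

```python
def null_count_sql(table: str, columns: list) -> str:
    """Compose one aggregate query so all counting happens in the warehouse."""
    exprs = ", ".join(
        f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}_nulls"
        for c in columns
    )
    return f"SELECT COUNT(*) AS row_count, {exprs} FROM {table}"

sql = null_count_sql("orders", ["customer_id", "total"])
assert "FROM orders" in sql and "customer_id_nulls" in sql
```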

Tip For more information on Pushdown processing, go to Pushdown processing.

Pullup

In Pullup, all of the processing executes inside the Apache Spark compute engine. Spark reads the source data from your database, then partitions and sorts it according to the parameters you specified when you created the DQ Job. The profile results of the DQ Job are then recorded in the Metastore.
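
The partitioning step can be pictured as splitting a key range into contiguous slices, one per Spark partition. This is a simplified sketch of that idea; the engine's actual logic is internal to the product:

```python
def partition_ranges(lower: int, upper: int, num_partitions: int):
    """Split [lower, upper) into contiguous ranges, one per partition."""
    step = (upper - lower) // num_partitions
    bounds = [lower + i * step for i in range(num_partitions)] + [upper]
    return list(zip(bounds[:-1], bounds[1:]))

# Four partitions over a key range of 0..100.
assert partition_ranges(0, 100, 4) == [(0, 25), (25, 50), (50, 75), (75, 100)]
```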

Tip For more information on Pullup processing, go to Pullup processing.

Findings

The Findings page is a dashboard that shows the results and health of a data quality job run. It lets you explore the details of a job run and gives you the ability to drill down into the various data quality dimensions to better understand your dataset.

Data Quality & Observability Classic profiles the data and builds a model for each dataset it scans. This helps Data Quality & Observability Classic understand what 'normal' means within the context of each dataset. As the data changes, the definition of 'normal' also changes. Instead of requiring you to adjust rule settings, Data Quality & Observability Classic continually adjusts the model. This approach enables Data Quality & Observability Classic to provide automated, enterprise-grade data quality coverage that removes the need to write dozens or even hundreds of rules per dataset.

Tip For more information on data quality findings, go to Findings.

Data quality score

The data quality score is an aggregated percentage between 0 and 100 that summarizes the integrity of your data. A score of 100 indicates that Data Quality & Observability Classic has not detected any quality issues, or that such issues are being suppressed.

Depending on the scoring threshold, which consists of predetermined scoring ranges, a data quality score falls into one of the following scoring classifications:

  • Passing: A data quality score higher than or equal to the upper-most scoring threshold. The out-of-the-box passing range is 90-100.

    Important A passing score does not guarantee the absence of data quality issues. We recommend that you always review the results of DQ Jobs for any underlying issues.

  • Warning: A data quality score that is technically passing but falls between the passing and failing thresholds. Scores in this range may indicate potential data quality issues, and you may want to configure a score-based alert for them. The out-of-the-box warning range is 76-89.
  • Failing: A data quality score lower than or equal to the lower-most scoring threshold. The out-of-the-box failing range is 0-75.

    Important Failing scores clearly indicate potential data quality issues, making it essential to configure alerts so recipients can investigate and take further action.
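
The classification above can be sketched with the out-of-the-box thresholds from this section (a simplified illustration, not the product's scoring code):

```python
PASSING_MIN = 90   # out-of-the-box passing range is 90-100
WARNING_MIN = 76   # out-of-the-box warning range is 76-89

def classify(score: int) -> str:
    """Map a data quality score to its scoring classification."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= PASSING_MIN:
        return "passing"
    if score >= WARNING_MIN:
        return "warning"
    return "failing"

assert classify(100) == "passing"
assert classify(82) == "warning"
assert classify(75) == "failing"
```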

Example The data quality scores in the following screenshot reflect the three scoring classifications in Data Quality & Observability Classic: passing, warning, and failing.

screenshot of score chart

Behaviors

Data Quality & Observability Classic uses machine learning to learn from column-level profiling to create adaptive rules. These rules contribute to the overall behavior score. Adaptive rules automatically observe and adapt to changes in numeric data representations over time. They down-score any values outside defined boundaries.
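
One simple way to picture an adaptive rule is a boundary derived from recent history, such as mean ± k standard deviations. The product's actual model is more sophisticated, but this sketch conveys the idea of boundaries that shift as the data changes:

```python
from statistics import mean, stdev

def adaptive_bounds(history, k=3.0):
    """Derive 'normal' boundaries from observed history (mean ± k·stdev).

    Illustrative only; not the product's actual learning model.
    """
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s

row_counts = [1000, 1010, 990, 1005, 995]   # recent daily row counts
low, high = adaptive_bounds(row_counts)

assert low < 1000 < high        # a typical value falls within bounds
assert not (low < 5000 < high)  # a sudden spike falls outside and is down-scored
```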

Tip For more information on behaviors on the Findings page, go to Behaviors, or go to the Adaptive Rules tab on Add layers to learn how to configure them as part of a DQ Job.

Data quality monitors

Data quality monitors are out-of-the-box or user-defined SQL queries that provide observational insights into the quality and reliability of your data.

When Data Quality & Observability Classic observes a change or anomaly, it records the finding for each data quality monitor and includes a numeric indicator next to the tab where it detects the issue. You can drill down into the various data quality monitor tabs on the Findings page to better understand the quality and reliability of your dataset.

You can configure monitors for:

  • Custom SQL rules based on your business needs.
  • Numerical and categorical outliers.
  • Cross-column relationships between string value patterns.
  • Row count, schema, and cell value differences between source and target data.
  • Missing records.
  • Schema changes.
  • Duplicate records.
  • Data shape format anomalies.
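
For instance, a duplicate-records check like the one a monitor performs can be sketched as follows (illustrative only; the product runs such checks as SQL against your data source):

```python
from collections import Counter

def duplicate_records(rows):
    """Return the records that appear more than once."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]

rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c")]
assert duplicate_records(rows) == [(1, "a")]
```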

Tip For more information on monitors on the Findings page, go to Data quality monitors, or go to Add layers to learn how to configure them as part of a DQ Job.

Assignments

Assignments allow you to assign DQ Job observations from the Findings page to other users for review. This helps determine whether the observations are legitimate and is an important step in remediating problematic records outside of Data Quality & Observability Classic.

Tip For more information on assigning observations to users, go to Working with Assignments.

Profile

Data profiling provides a detailed analysis of your data's behavior and trends over time. It forms the foundation of a robust data quality and observability strategy.

When you run a DQ Job, the scan results include insights such as column-level statistics, charts, and other data quality and observability metrics. These insights help you identify common patterns and emerging trends, providing a better understanding of the structure and quality of your data.
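
A handful of the column-level statistics a profile typically reports can be sketched as follows (simplified; the product computes many more metrics than this):

```python
def profile_column(values):
    """Compute a few basic column-level profiling statistics."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_rate": (len(values) - len(present)) / len(values),
        "distinct_count": len(set(present)),
        "min": min(present),
        "max": max(present),
    }

stats = profile_column([10, 20, None, 20, 30])
assert stats["null_rate"] == 0.2
assert stats["distinct_count"] == 3
assert (stats["min"], stats["max"]) == (10, 30)
```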

Tip For more information on data profiling metrics, go to Profile.

Rules

Data quality rules are custom or out-of-the-box SQL conditions that help you verify whether your data meets your organization's business requirements. Detecting deviations from standards is essential for regulatory compliance, identifying unusual or inaccurate records, and maintaining the overall health of your organization's data. While Data Quality & Observability Classic automatically performs various behavioral observations and creates adaptive rules based on its findings, you can write specific compliance-related rules or import your own.

The Rule Workbench is where you can create custom rules that fit your business needs. You can use the AI SQL Assistant for Data Quality, write your own SQL, and add out-of-the-box or custom data classes and templates to get the job done.
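
In the product, rules are expressed as SQL conditions; for illustration, here is an equivalent hypothetical rule written as a Python predicate that flags breaking records (the rule itself is an invented example):

```python
# A hypothetical business rule: every email must be present and contain "@".
# In SQL terms: email IS NOT NULL AND email LIKE '%@%'
def email_rule(row):
    email = row.get("email")
    return email is not None and "@" in email

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "not-an-email"},
]
breaks = [r["id"] for r in rows if not email_rule(r)]
assert breaks == [2, 3]  # the breaking records the rule would surface
```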

Tip For more information on rules, go to DQ Rules, or go to Create a data quality rule to learn how to create new rules.

Alerts

Alerts are emails or webhooks that notify users of data quality events and issues as they occur. To trust your data, it is essential to stay informed of potential data quality issues as soon as they are detected. You have full control over how alerts are configured.

You can configure alerts to notify relevant stakeholders of issues, such as a failed job or no data returning for a certain number of days despite successful job runs.
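
A score-based alert of that kind boils down to a threshold check that produces a notification payload. The following is a hedged sketch (field names are hypothetical, and no actual email or webhook call is made):

```python
from typing import Optional

def build_alert(job: str, score: int, threshold: int = 76) -> Optional[dict]:
    """Return a webhook-style payload when the score falls below the threshold."""
    if score >= threshold:
        return None  # nothing to report
    return {
        "job": job,
        "score": score,
        "message": f"DQ Job '{job}' scored {score}, below threshold {threshold}",
    }

assert build_alert("orders", 92) is None           # passing run: no alert
assert build_alert("orders", 60)["score"] == 60    # failing run: alert payload
```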

Tip For more information on alerting, go to Alerts.

Global Search

The global Search option appears at the top of every Data Quality & Observability Classic page and lets you search for datasets across all of Data Quality & Observability Classic. Datasets with missing run IDs also appear in global search results.

Global Search bar

When you enter your search criteria and select a dataset, options appear to view the Profile or Findings page for that dataset.

Global Search bar with results

Application Information

The Application Information pop-up shows your environment information, including the environment name, application version, JDK version, Spark version, and Vertex model.

To access this information, click the Help icon in the top-right of the Home page. In this pop-up, you can:

  • Click Get Environment Details to copy the application information. This makes it easier to share environment details for support purposes.

  • Click Show Web Audit Logs to view the most recent key function web logs in a table. See View Audit Web Logs for more information.