About Unified Data Classification

Note: If you're using a Collibra Cloud site, go to the Collibra Cloud site documentation to check if your data source is supported.

Unified Data Classification (UDC) is a Collibra data classification method on Edge. It is based on data classes that you can configure and fully customize to suit your needs.

Steps overview: Data classification

Data classification is available for registered JDBC data sources and for Databricks Unity Catalog and Dataplex Universal Catalog data sources integrated via Edge.

  1. Integrate the Databricks Unity Catalog or Dataplex Universal Catalog data sources.
     To allow for classification, add a JDBC connection in the Databricks Unity Catalog synchronization or Dataplex Universal Catalog capability. During synchronization, a Catalog Data Classification capability is created automatically if it does not already exist. As a result, you do not need to create a separate Catalog Data Classification capability to classify data from integrations.
     For more information, go to Steps: Integrate Databricks Unity Catalog via Edge or Steps: Integrate Google Dataplex Universal Catalog via Edge.
  2. Make sure your environment is set up for Unified Data Classification via Edge.
  3. Configure your data classes in UDC.
  4. Classify the synchronized data.
     You can do this manually or start the automatic data classification process.

Understanding the automatic data classification process

The following image shows, at a high level, how the automatic classification process works.

Image of the data classification flow showing the various steps in the process

  1. A user starts the automatic classification process from a Database, Schema, Table, or Column asset.
  2. Catalog classification in Collibra receives the request and queries the knowledge graph to get a list of all the columns that need to be classified. If the classification starts from an asset other than a Column asset, all the child Column assets in the hierarchy are classified.
  3. The classification job is submitted to the Edge or Collibra Cloud site and includes the list of columns to classify.
  4. The classification capability gets the valid data classes, that is, the data classes that are enabled and contain one or more classification rules.
  5. The classification capability checks if sample data must be retrieved from the data source. Sample data is required only if the data classes include sample-based classification rules.
    1. If no sample data is needed:
      1. For each column, the classification capability compares the metadata to each data class and calculates a confidence score.
      2. Classifications with a confidence score higher than 0 are returned to Collibra. If a data class specifies a minimum confidence threshold above 0, then the classification is returned only if that threshold is reached.
    2. If sample data is needed:
      1. The classification capability checks if the required sample data is available in the Edge cache.
      2. If no sample data is available, the classification capability requests sample data from the data source through the defined Edge connection. In this case:
        • The classification capability requests up to 1,000 rows from the table.
        • The rows are sampled randomly so that the sample data is representative.
          If the data source supports it, source-driven random sampling is applied.
        • The sample data is temporarily stored in the Edge cache.

          Note: The sampling feature also stores sample data temporarily in the Edge cache. However, this does not impact the automatic data classification process, and vice versa.

      3. For each column, the classification capability compares the sample data to each data class and calculates a confidence score.
        • Each sample is evaluated against the data classification rules of a data class. A sample counts as a match if any of the rules match it, and the confidence score for that data class increases in proportion to the number of matching samples.
        • Classifications with a confidence score higher than 0 are returned to Collibra. If a data class defines a minimum confidence threshold above 0, then the classification is returned only if that threshold is reached.
  6. Finally, Catalog classification compares the results to the classifications stored in the repository. If necessary, the classifications are updated.

    Note: Classifications in accepted or rejected states are never updated.
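
The scoring, threshold, and update behavior described in steps 4 to 6 can be illustrated with a minimal Python sketch. It is not Collibra code and does not use any Collibra API. The names DataClass, confidence, classify_column, and update_stored are hypothetical, sample-based rules are modeled as regular expressions, the confidence score is assumed to be a percentage from 0 to 100, and the "suggested" state name is an assumption (only the accepted and rejected states are named above).

```python
import re
from dataclasses import dataclass


@dataclass
class DataClass:
    """Hypothetical stand-in for a data class; not the Collibra data model."""
    name: str
    patterns: list              # sample-based rules, modeled here as regexes
    min_confidence: float = 0.0  # assumed to be a percentage (0 to 100)
    enabled: bool = True


def confidence(samples, data_class):
    """Return the percentage of samples matched by at least one rule."""
    if not samples:
        return 0.0
    rules = [re.compile(p) for p in data_class.patterns]
    matches = sum(1 for s in samples if any(r.fullmatch(s) for r in rules))
    return 100.0 * matches / len(samples)


def classify_column(samples, data_classes):
    """Score one column against every valid data class."""
    results = {}
    for dc in data_classes:
        # Valid data classes are enabled and contain at least one rule.
        if not dc.enabled or not dc.patterns:
            continue
        score = confidence(samples, dc)
        # Keep scores above 0 that also reach the minimum threshold, if any.
        if score > 0 and score >= dc.min_confidence:
            results[dc.name] = score
    return results


def update_stored(stored, results):
    """Apply new results; accepted or rejected entries are left untouched.

    `stored` maps a data class name to a (state, score) tuple. The
    "suggested" state name is an assumption made for this sketch.
    """
    updated = dict(stored)
    for name, score in results.items():
        state, _ = updated.get(name, ("suggested", 0.0))
        if state in ("accepted", "rejected"):
            continue
        updated[name] = ("suggested", score)
    return updated


samples = ["a@x.com", "b@y.org", "n/a", "c@z.net"]
email_rule = r"[^@\s]+@[^@\s]+\.[a-z]+"
classes = [
    DataClass("Email", [email_rule]),
    DataClass("Email (strict)", [email_rule], min_confidence=80.0),
    DataClass("Digits", [r"\d+"]),
    DataClass("Disabled", [email_rule], enabled=False),
]
results = classify_column(samples, classes)
stored = {"Email": ("accepted", 50.0)}
print(results)
print(update_stored(stored, results))
```

In this example, three of the four samples match the email rule, so only the data class without a threshold is returned. The data class with a threshold of 80 is filtered out, and the stored accepted classification keeps its original score.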

Helpful resources

Follow the Unified Data Classification e-learning course on Collibra University.