About Unified Data Classification

Important

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

Note If you're using a Collibra Cloud site, go to the Collibra Cloud site documentation to check if your data source is supported.

Unified Data Classification (UDC) is a Collibra data classification method on Edge. It is based on data classes that you can configure and fully modify depending on your own needs.

Important To merge data classes, you need to use the latest UI.

The Unified Data Classification method is enabled by default for all new environments working on Edge starting from release 2024.02, and for all existing environments from 2024.07. It has replaced the old data classification via Edge and data classification via the Cloud Data Classification Platform. A migration process is available to migrate data from old data classification methods to the Unified Data Classification method. UDC is the only supported data classification method.
The Unified Data Classification method works via Edge and requires specific setup.
Because the data doesn't leave your organization's network, the automatic data classification process is secure. The samples used during the automatic data classification process are temporarily added to the Edge site cache. They are not transferred to Collibra.
The method relies on classification rules specified for each data class.
This means that the classification doesn't rely on machine learning, which makes issues and changes more transparent. Using classification rules also provides high flexibility and allows for customizations.
Note The method remembers rejected data class suggestions: if you have rejected a data class for an asset, that data class is not suggested for the asset again. Also, once a data classification has been accepted for a column, it is not automatically updated when you run the data classification process again.
We deliver optional out-of-the-box data classes.
This means you decide which out-of-the-box data classes you want to use. It also allows you to adjust the provided data classes to your own needs, such as changing the name or classification rules.
You can start a separate classification process for a specific asset via a dedicated Classify button.
UDC is also available via REST APIs: Data Classification REST API v2, Data Class Management REST API v1, Data Class Import REST API v1.
Important The Data Classification REST API v1 ClassificationMatches endpoints are also still valid and can be used by UDC. The other endpoints in this API are deprecated.

Tip You can take training courses and watch videos via Collibra University.

Data classification steps

Data Classification is available for registered data sources and for integrated Databricks Unity Catalog assets.

Step	Description
1	Register a data source via Edge and synchronize one or more schemas. For Databricks, set up and integrate Databricks Unity Catalog.
2	Make sure your environment is set up for Unified Data Classification via Edge.
3	Configure your data classes in UDC.
4	Classify the synchronized data. You can do this manually or start the automatic data classification process.

Understanding the automatic data classification process

The following image shows, at a high level, how the automatic classification process works.

Image of the data classification flow showing the various steps in the process

A user starts the automatic classification process from a Database, Schema, Table, or Column asset.
Catalog classification in Collibra receives the request and queries the knowledge graph to get a list of all the columns that need to be classified. If the classification starts from an asset other than a Column asset, all the child Column assets in the hierarchy are classified.
The classification job is submitted to the Edge or Collibra Cloud site and includes the list of columns to classify.
The classification capability gets the valid data classes, that is, the enabled data classes that contain one or more classification rules.
The classification capability checks if sample data must be retrieved from the data source. Sample data is required only if the data classes include sample-based classification rules.
1. If no sample data is needed:
  1. For each column, the classification capability compares the metadata to each data class and calculates a confidence score.
  2. Classifications with a confidence score higher than 0 are returned to Collibra. If a data class specifies a minimum confidence threshold above 0, then the classification is returned only if that threshold is reached.
2. If sample data is needed:
  1. The classification capability checks if the required sample data is available in the Edge cache.
  2. If no sample data is available, the classification capability requests sample data from the data source through the defined Edge connection. In this case:
    - The classification capability requests up to 1,000 rows from the table.
    - The sample data is taken randomly to ensure representative data.
      If supported for the data source, source-driven random sampling is applied.
    - The sample data is temporarily stored in the Edge cache.
      Note The sampling feature also stores sample data temporarily in the Edge cache. However, this does not impact the automatic data classification process, and vice versa.
  3. For each column, the classification capability compares the sample data to each data class and calculates a confidence score.
    - Each sample is evaluated against the classification rules of a data class. A sample counts as a match if any of the rules match it, and the confidence score for the data class increases in proportion to the number of matching samples.
    - Classifications with a confidence score higher than 0 are returned to Collibra. If a data class defines a minimum confidence threshold above 0, then the classification is returned only if that threshold is reached.
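
The sample-based scoring described in the steps above can be sketched as follows. This is a minimal illustration, not Collibra's implementation: the `DataClass` type, the use of regular expressions as classification rules, the 0 to 100 percentage scale, and the `confidence` and `classify_column` functions are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataClass:
    """Hypothetical data class: a name, regex rules, and a minimum threshold."""
    name: str
    rules: list[str] = field(default_factory=list)  # assumed: rules are regular expressions
    min_confidence: float = 0.0                     # assumed: percentage, 0 to 100
    enabled: bool = True


def confidence(samples: list[str], data_class: DataClass) -> float:
    """Return the percentage of samples matched by at least one rule."""
    if not samples:
        return 0.0
    patterns = [re.compile(rule) for rule in data_class.rules]
    matching = sum(1 for s in samples if any(p.fullmatch(s) for p in patterns))
    return 100.0 * matching / len(samples)


def classify_column(samples: list[str], data_classes: list[DataClass]) -> dict[str, float]:
    """Score a column against every valid data class and keep qualifying results."""
    results = {}
    for dc in data_classes:
        if not (dc.enabled and dc.rules):
            continue  # only enabled data classes with at least one rule are valid
        score = confidence(samples, dc)
        # A score of 0 is never returned; a threshold above 0 must be reached.
        if score > 0 and score >= dc.min_confidence:
            results[dc.name] = score
    return results


samples = ["a@example.com", "b@example.org", "n/a", "c@example.net"]
data_classes = [
    DataClass("Email", [r"[^@\s]+@[^@\s]+\.[a-z]+"], min_confidence=50.0),
    DataClass("Year", [r"(19|20)\d{2}"]),
    DataClass("Placeholder", [r"n/a"], min_confidence=80.0),
]
print(classify_column(samples, data_classes))  # {'Email': 75.0}
```

In this example, three of the four samples match the hypothetical Email rule, so the score of 75 clears the threshold of 50. The Placeholder class matches one sample, but its score of 25 stays below its threshold of 80, so it is not returned.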
Finally, Catalog classification compares the results to the classifications stored in the repository. If necessary, the classifications are updated.
Note Classifications in accepted or rejected states are never updated.
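
The final comparison step and the note above can be sketched as follows. The state names `suggested`, `accepted`, and `rejected`, the dictionary layout keyed by a (column, data class) pair, and the `reconcile` function are hypothetical. The source does not say what happens to an existing suggestion that no longer appears in the new results, so this sketch simply leaves it in place.

```python
from dataclasses import dataclass

LOCKED_STATES = {"accepted", "rejected"}  # assumed state names


@dataclass
class Classification:
    state: str         # "suggested", "accepted", or "rejected" (assumed names)
    confidence: float


def reconcile(stored, results):
    """Merge new results into stored classifications without touching locked ones.

    Both arguments are dictionaries keyed by a (column, data class) pair.
    """
    merged = dict(stored)
    for key, score in results.items():
        current = merged.get(key)
        if current is not None and current.state in LOCKED_STATES:
            continue  # accepted or rejected classifications are never updated
        merged[key] = Classification("suggested", score)
    return merged


stored = {
    ("customer.email", "Email"): Classification("accepted", 75.0),
    ("customer.phone", "Year"): Classification("rejected", 40.0),
    ("customer.zip", "Postal Code"): Classification("suggested", 60.0),
}
results = {
    ("customer.email", "Email"): 90.0,      # ignored: already accepted
    ("customer.phone", "Year"): 55.0,       # ignored: previously rejected
    ("customer.zip", "Postal Code"): 85.0,  # updated: still a suggestion
    ("customer.name", "Full Name"): 70.0,   # added: new suggestion
}
merged = reconcile(stored, results)
```

Keeping the rejected entries in the stored set is what lets the process avoid suggesting a rejected data class for the same asset again.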