About Unified Data Classification
Unified Data Classification is a Collibra data classification method on Edge. It is based on data classes that you can configure and fully modify depending on your own needs.
To merge data classes, you need to use the latest UI.
- The Unified Data Classification method is enabled by default for all new environments working on Edge starting from release 2024.02, and for all existing environments from 2024.07. It has replaced the old data classification via Edge and data classification via the Cloud Data Classification Platform. A migration process is available to migrate data from old data classification methods to the Unified Data Classification method. Unified Data Classification is the only supported data classification method.
- The Unified Data Classification method works via Edge and requires specific setup.
Because the data doesn't leave your organization's network, the automatic data classification process is secure. The samples used during the automatic data classification process are temporarily added to the Edge site cache. They are not transferred to Collibra.
- The method relies on classification rules specified for each data class.
This means that the classification doesn't rely on machine learning, which makes issues and changes more transparent. Using classification rules also provides high flexibility and allows for customizations.
Note The method remembers rejected data class suggestions: if you reject a data class for an asset, that data class is not suggested for the asset again. Also, once a data classification has been accepted for a column, it is not automatically updated if you run the data classification process again.
- We deliver optional out-of-the-box data classes.
This means you decide which out-of-the-box data classes you want to use. It also allows you to adjust the provided data classes to your own needs, such as changing the name or classification rules.
- You can start a separate classification process for a specific asset via a dedicated Classify button.
- Unified Data Classification is also available via REST APIs: Data Classification REST API v2, Data Class Management REST API v1, Data Class Import REST API v1.
Important The Data Classification REST API v1 ClassificationMatches endpoints are also still valid and can be used by Unified Data Classification. The other endpoints in this API are deprecated.
Tip You can follow a training and watch videos via Collibra University.
Data classification steps
Step | Description |
---|---|
1 | Register a data source via Edge. |
2 | Synchronize one or more schemas. |
3 | Make sure your environment is set up for Unified Data Classification via Edge. |
4 | Configure your data classes in Unified Data Classification. |
5 | Classify the synchronized data. You can do this manually |
Understanding the automatic data classification process
The following image demonstrates on a high level how the automatic classification system works.
- A user starts the automatic classification process, which can be started from a Database, Schema, Table, or Column asset.
- Catalog classification in Collibra receives the request and queries the knowledge graph to get a list of all the columns that need to be classified. If the classification started from an asset other than a Column asset, all the child Column assets in the hierarchy will be classified.
- The classification job is submitted to the Edge site. The job contains the list of columns to classify.
- The classification capability gets the valid data classes. Valid data classes contain one or more classification rules and are enabled.
- The classification capability checks if necessary sample data is available in the Edge cache.
- If no sample data is available in the Edge cache, the classification capability requests samples from the data source via the defined Edge connection. In that case:
- The classification capability requests up to 1,000 rows from the table.
- The samples are taken randomly to obtain representative data.
If the data source supports it, source-driven random sampling (push-down sampling) is applied.
- The samples are stored in the Edge cache.
Note The sampling feature also stores sample data temporarily in the Edge cache. However, this does not impact the automatic data classification process, and vice versa.
- For each column, the classification capability compares the sample data to each data class and calculates a confidence score.
- Each sample is compared to the data classification rules for a data class. A sample counts as a match if any of the rules matches it, and the confidence score for that data class increases in proportion to the number of matching samples.
- All classifications with a confidence score higher than 0 are returned to Collibra. If a data class defines a minimum confidence threshold above 0, then the classification is returned only if that threshold is reached.
- Catalog classification compares the result of the classification process to the classifications stored in the repository and updates the stored classifications if needed.
Note Classifications in accepted or rejected states are never updated.
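The scoring and update steps described above can be illustrated with a minimal Python sketch. It is not Collibra's implementation: the `DataClass` type, the use of regular expressions as classification rules, the status names, and the exact formula (confidence as the fraction of matching samples, between 0 and 1) are all assumptions made for illustration. The sketch also assumes that stored suggestions missing from a new run are left untouched, which the description above does not specify.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class DataClass:
    """Hypothetical stand-in for a data class; not Collibra's actual model."""
    name: str
    patterns: list[str] = field(default_factory=list)  # rules, modeled as regexes (assumption)
    enabled: bool = True
    min_confidence: float = 0.0  # minimum confidence threshold as a fraction (assumption)

    def is_valid(self) -> bool:
        # Valid data classes are enabled and contain one or more rules.
        return self.enabled and len(self.patterns) > 0

    def matches(self, sample: str) -> bool:
        # A sample matches the data class if any of its rules matches.
        return any(re.fullmatch(p, sample) for p in self.patterns)


def confidence(samples: list[str], data_class: DataClass) -> float:
    """Fraction of samples matching the data class (assumed formula)."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if data_class.matches(s))
    return hits / len(samples)


def classify_column(samples: list[str], data_classes: list[DataClass]) -> dict[str, float]:
    """Return scores above 0 that also reach the data class's own threshold."""
    results = {}
    for dc in data_classes:
        if not dc.is_valid():
            continue
        score = confidence(samples, dc)
        if score > 0 and score >= dc.min_confidence:
            results[dc.name] = score
    return results


def merge(stored: dict[str, dict], new_scores: dict[str, float]) -> dict[str, dict]:
    """Apply new results, never touching accepted or rejected classifications."""
    merged = {name: dict(entry) for name, entry in stored.items()}
    for name, score in new_scores.items():
        entry = merged.get(name)
        if entry is not None and entry["status"] in ("accepted", "rejected"):
            continue  # accepted and rejected classifications are never updated
        merged[name] = {"status": "suggested", "confidence": score}
    return merged


samples = ["a@example.com", "b@example.org", "not-an-email", "c@example.net"]
classes = [
    DataClass("Email", [r"[^@\s]+@[^@\s]+\.[a-z]+"]),
    DataClass("Digits", [r"\d+"], min_confidence=0.5),
    DataClass("Disabled", [r".*"], enabled=False),
]
scores = classify_column(samples, classes)
print(scores)  # {'Email': 0.75}

stored = {"Email": {"status": "rejected", "confidence": 0.4}}
merged = merge(stored, {"Email": 0.75, "Phone": 0.2})
print(merged["Email"]["status"])  # rejected
```

In this sketch, three of the four samples match the hypothetical Email rule, so that data class scores 0.75. The Digits class scores 0 and is dropped, and the disabled class is skipped because it is not valid. The rejected Email classification keeps its stored state even though the new run scored it higher.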