About Data Classification

Data classification is the process of assigning a data class to a column, so that you can see what kinds of data you have and where that data resides. Examples of data classes are name, phone number, and web browser.
You can classify data manually or automatically.

About Automatic Data Classification

In Collibra Data Intelligence Platform, automatic data classification is a feature that analyzes and predicts the content of registered data sources based on a subset of the data itself. In other words, automatic data classification suggests a data class for individual columns without human input.

Note: Automatic data classification looks only at structured data. Unstructured data is out of scope.
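To make the idea of suggesting a data class from a subset of column values more concrete, here is a minimal sketch of rule-based classification. It is not Collibra's implementation: the `RULES` patterns, the `suggest_data_class` function, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical rules: data class name -> pattern that a whole value must match.
RULES = {
    "Phone number": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
    "Email address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def suggest_data_class(sample: list[str], threshold: float = 0.8) -> Optional[str]:
    """Suggest a data class for a column from a sample of its values.

    Returns the class whose rule matches the largest share of non-empty
    sample values, or None if no rule reaches the threshold.
    """
    values = [v.strip() for v in sample if v and v.strip()]
    if not values:
        return None

    best_name, best_share = None, 0.0
    for name, pattern in RULES.items():
        # Share of sample values that fully match this rule.
        share = sum(1 for v in values if pattern.fullmatch(v)) / len(values)
        if share > best_share:
            best_name, best_share = name, share

    return best_name if best_share >= threshold else None
```

In this sketch, a column is only given a suggestion when most of the sampled values match a single rule, which is why a sample rather than the full column can be enough to propose a data class.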

Methods to automatically classify data

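Multiple methods are available to automatically classify data:
  • Unified Data Classification
  • Old Classification via Edge
  • Classification via the Cloud Data Classification Platform

The following comparison shows the differences between the methods.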

Availability
  • Unified Data Classification: This method is the default Edge data classification method for all new environments from 2024.02.
  • Old Classification via Edge: This method is available only for environments that existed before 2024.02. It will reach end of life in 2024, once migration processes to the new data classification method exist.
  • Classification via the Cloud Data Classification Platform: This method is available only for environments that existed before 2024.02, and will reach end of life together with Jobserver in 2024. A migration process will become available in a following Collibra release.

Enable
  • Unified Data Classification: You need to enable the feature in Collibra Console and add the data classification capability to the relevant Edge connections. Note: This feature is enabled by default in new environments.
  • Old Classification via Edge: You have enabled data classification on Edge. Data classification is part of the profiling capability of an Edge site. If you have access to Edge, profiling and classification are available.
  • Classification via the Cloud Data Classification Platform: You have set up and enabled the Cloud Data Classification Platform in Collibra Console.

Sample data
  • Unified Data Classification: Data is classified on the Edge site. Sample data is not stored in Collibra cloud.
  • Old Classification via Edge: Data is classified on the Edge site. Sample data is not stored in Collibra cloud.
  • Classification via the Cloud Data Classification Platform: The Cloud Data Classification Platform requires sample data, which must be stored in your Collibra environment.

Anonymization
  • Unified Data Classification: N/A
  • Old Classification via Edge: N/A
  • Classification via the Cloud Data Classification Platform: The Cloud Data Classification Platform uses profiling and sample data to classify. As a result, you cannot classify your data when it is anonymized.

Automatic or manual start of the data classification
  • Unified Data Classification: Automatic data classification can be triggered manually from a column, table, schema or database.
  • Old Classification via Edge: Automatic data classification is triggered automatically after the profiling process on an Edge site.
  • Classification via the Cloud Data Classification Platform: Automatic data classification can be triggered manually from every table, schema or database.

Retraining
  • Unified Data Classification: This method is not based on a machine learning process, so there is no model to retrain. It is based on classification rules. The method does remember rejected data class suggestions: if you reject a data class for an asset, that data class is not suggested for the asset again.
  • Old Classification via Edge: Data classification via Edge does not retrain the classification model. This means that:
      • Your feedback is only stored and is not used for improving classification.
      • The classification process does not take user-defined classes into account. However, you can create them and assign them manually.
  • Classification via the Cloud Data Classification Platform: The Cloud Data Classification Platform stores your classification selections, along with the associated sample data. This allows the classification model to be retrained to improve future classification predictions.
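
The behavior described for Unified Data Classification, where a rejected data class is not suggested again for the same asset, can be pictured with a small sketch. This is an illustration only, not how Collibra stores rejections: the `RejectionMemory` class, its method names, and the asset identifiers are hypothetical.

```python
from collections import defaultdict


class RejectionMemory:
    """Remembers which data classes were rejected for which asset (hypothetical helper)."""

    def __init__(self) -> None:
        # asset identifier -> set of rejected data class names
        self._rejected = defaultdict(set)

    def reject(self, asset_id: str, data_class: str) -> None:
        """Record that a suggested data class was rejected for an asset."""
        self._rejected[asset_id].add(data_class)

    def filter(self, asset_id: str, suggestions: list[str]) -> list[str]:
        """Drop suggestions that were previously rejected for this asset."""
        blocked = self._rejected.get(asset_id, set())
        return [s for s in suggestions if s not in blocked]


memory = RejectionMemory()
memory.reject("column-1", "Phone number")
remaining = memory.filter("column-1", ["Phone number", "Name"])
```

Note that the rejection is tracked per asset in this sketch, so the same data class can still be suggested for other columns.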