About Data Classification

Data classification is the process of assigning a data class to a column, so that you can easily see what kinds of data you have and where that data resides. Examples of data classes are name, phone number, and web browser.
You can classify data manually or automatically.

The automatic data classification process is not available in on-premises environments. You can, however, manually classify your data.

About Automatic Data Classification

In Collibra Data Intelligence Cloud, automatic data classification is a feature that analyzes and predicts the content of registered data sources based on a subset of the data itself. In other words, automatic data classification suggests a data class for individual columns without human input.

Note: Automatic data classification looks only at structured data. Unstructured data is out of scope.
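To make the idea of predicting a data class from a subset of column values more concrete, here is a minimal sketch in Python. It is not how Collibra classifies data: it replaces the trained classification model with a few hand-written regular expressions, and the names `PATTERNS`, `suggest_data_class`, and `threshold`, as well as the "email address" class, are assumptions made for illustration only.

```python
import re

# Hypothetical patterns for illustration only. A real classifier relies on
# a trained model rather than hand-written rules like these.
PATTERNS = {
    "phone number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
    "email address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def suggest_data_class(sample_values, threshold=0.8):
    """Suggest a data class for a column from a sample of its values.

    Returns (data_class, score), where score is the fraction of non-empty
    sample values matching the best pattern. Returns (None, score) when no
    pattern reaches the threshold.
    """
    values = [v.strip() for v in sample_values if v and v.strip()]
    if not values:
        return None, 0.0

    best_class, best_score = None, 0.0
    for data_class, pattern in PATTERNS.items():
        score = sum(1 for v in values if pattern.match(v)) / len(values)
        if score > best_score:
            best_class, best_score = data_class, score

    if best_score < threshold:
        return None, best_score
    return best_class, best_score


if __name__ == "__main__":
    column_sample = ["+1 555 010 9999", "(020) 7946 0958", "555-0100"]
    print(suggest_data_class(column_sample))
```

The sketch works on a sample of values rather than on the full column, which mirrors the point above that the suggestion is based on a subset of the data.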

Methods to automatically classify data

Data can be classified via the Cloud Data Classification Platform or via Edge.

You can also use the Catalog Data Classification REST API to add data classes, assign data classes to assets, import existing data classifications, start the classification via the Cloud Data Classification Platform, and so on.

Important: A new data classification method on Edge, Unified Data Classification, is available in beta testing.

The following comparison shows the differences between classification via Edge and classification via the Cloud Data Classification Platform.

Availability
  • Classification via Edge: You have enabled data classification on Edge. Data classification is part of the profiling capability of an Edge site. If you have access to Edge, profiling and classification are available.
  • Classification via the Cloud Data Classification Platform: You have set up and enabled the Cloud Data Classification Platform in Collibra Console.

Sample data
  • Classification via Edge: Data classification via Edge classifies data on the Edge site. Sample data is not stored in the Collibra cloud.
  • Classification via the Cloud Data Classification Platform: The Cloud Data Classification Platform requires sample data, which must be stored in your Collibra environment.

Anonymization
  • Classification via Edge: Profiling and classification are performed via an Edge site in your environment. The data is anonymized before it is sent to Collibra Data Intelligence Cloud.
  • Classification via the Cloud Data Classification Platform: The Cloud Data Classification Platform uses profiling and sample data to classify data. As a result, you cannot classify your data when it is anonymized.

Automatic or manual start of the data classification
  • Classification via Edge: Data classification is automatically triggered after the profiling process on an Edge site.
  • Classification via the Cloud Data Classification Platform: Data classification must be manually triggered from each table, schema, or database.

Retraining
  • Classification via Edge: Data classification via Edge does not retrain the classification model. This means that:
      • Your feedback is only stored and is not used for improving classification.
      • The classification process does not take user-defined classes into account. However, you can create them and assign them manually.
  • Classification via the Cloud Data Classification Platform: The Cloud Data Classification Platform stores your classification selections, along with the associated sample data. This allows the classification model to be retrained to improve future classification predictions.
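
The differences described above can be condensed into a small decision helper. The following Python sketch is only an illustration of that reasoning and is not part of any Collibra product or API. The `Requirements` fields and the `candidate_methods` function are hypothetical names, and the rules encode only the sample data, anonymization, and retraining points stated in this section.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Sample data may be stored in your Collibra environment.
    can_store_samples_in_cloud: bool
    # The data is anonymized before it leaves your environment.
    data_is_anonymized: bool
    # Your feedback should be used to retrain the classification model.
    needs_model_retraining: bool


def candidate_methods(req: Requirements) -> list[str]:
    """Return the classification methods compatible with the requirements."""
    methods = []
    # Edge classifies on the Edge site but does not retrain the model.
    if not req.needs_model_retraining:
        methods.append("Edge")
    # The cloud platform needs stored sample data that is not anonymized.
    if req.can_store_samples_in_cloud and not req.data_is_anonymized:
        methods.append("Cloud Data Classification Platform")
    return methods


if __name__ == "__main__":
    strict = Requirements(
        can_store_samples_in_cloud=False,
        data_is_anonymized=True,
        needs_model_retraining=False,
    )
    print(candidate_methods(strict))
```

An empty result means that the stated requirements conflict, for example requiring retraining while also keeping all sample data out of the cloud.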