About Automatic Data Classification

In Collibra Data Intelligence Cloud, Automatic Data Classification is a feature that analyzes a subset of the data in registered data sources and predicts what kind of content they contain. This helps you understand what kinds of data you have and where that data resides. In other words, data classification automatically (with no human input) assigns “class” values to individual columns to identify what kind of data each column contains. Examples of data classes are name, address, phone number, and web browser.

Note: Automatic Data Classification looks only at structured data. Unstructured data is out of scope.
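To make the idea of column-level classification concrete, the following is a minimal sketch that assigns a class to a column based on a sample of its values. It is not the classification model used by Collibra: the patterns, the `classify_column` function, the `threshold` parameter, and the sample columns are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for two of the example classes named above.
# A real classifier would be far more sophisticated than regular expressions.
PATTERNS = {
    "phone number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
    "web browser": re.compile(r"^(chrome|firefox|safari|edge|opera)\b", re.IGNORECASE),
}


def classify_column(sample_values, threshold=0.8):
    """Return the best-matching class for a column sample, or None.

    A class is assigned only if at least `threshold` of the non-empty
    sample values match its pattern.
    """
    values = [v for v in sample_values if v]
    if not values:
        return None
    best_class, best_score = None, 0.0
    for name, pattern in PATTERNS.items():
        score = sum(1 for v in values if pattern.match(v)) / len(values)
        if score > best_score:
            best_class, best_score = name, score
    return best_class if best_score >= threshold else None


# Hypothetical sample of a structured table: column name -> sampled values.
columns = {
    "contact": ["+1 555 010 0199", "555-010-0142", "(555) 010-0175"],
    "browser": ["Chrome", "Firefox", "Safari"],
    "notes": ["call back", "left voicemail", "ok"],
}
predictions = {name: classify_column(sample) for name, sample in columns.items()}
```

In this sketch, only a sample of each column is inspected, which mirrors the idea of predicting content from a subset of the data. Columns whose values match no pattern well enough remain unclassified.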

Why automatic data classification?

When you have ingested data in Data Catalog, the data classification process automatically identifies data structures within that data. As a result, you need less time to learn what kind of data you have ingested.

Methods to classify data

Data can be classified via the Cloud Data Classification Platform or via Edge.

You can also use the Catalog Data Classification REST API to add data classes, assign data classes to assets, import existing data classifications, start the classification via the Cloud Data Classification Platform, and so on.

The following table shows the differences between classification via Edge and classification via the Cloud Data Classification Platform.

Part of process	Classification via Edge	Classification via the Cloud Data Classification Platform
Availability	You have enabled data classification on Edge. Data classification is part of the profiling capability of an Edge site. If you have access to Edge, profiling and classification are available.	You have set up and enabled the Cloud Data Classification Platform in Collibra Console.
Sample data	Data classification via Edge classifies data on the Edge site. Sample data is not stored in Collibra cloud.	The Cloud Data Classification Platform requires sample data, which must be stored in your Collibra environment.
Anonymization	Profiling and classification are performed via an Edge site in your environment. The data is anonymized before it is sent to Collibra Data Intelligence Cloud.	The Cloud Data Classification Platform uses profiling and sample data to classify. As a result, you cannot classify your data when it is anonymized.
Automatic or manual start of the data classification	Data classification is automatically triggered after the profiling process on an Edge site.	Data classification must be manually triggered from every table, schema or database.
Retraining	Data classification via Edge does not retrain the classification model. This means that your feedback is stored but is not used to improve classification, and that the classification process does not take user-defined classes into account. However, you can create user-defined classes and assign them manually.	The Cloud Data Classification Platform stores your classification selections, along with the associated sample data. This allows the classification model to be retrained to improve future classification predictions.
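
As a rough aid for choosing between the two methods, the differences in the table above can be encoded as a small lookup. This is only an illustrative sketch: the `Method` class, its field names, and the `matching_methods` helper are hypothetical and are not part of any Collibra API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """Hypothetical summary of one classification method, based on the table above."""
    name: str
    stores_sample_data_in_cloud: bool
    starts_automatically: bool
    retrains_model: bool


METHODS = [
    # Edge: sample data stays on the Edge site, classification is triggered
    # after profiling, and the model is not retrained.
    Method("Edge", False, True, False),
    # Cloud platform: sample data is stored in the Collibra environment,
    # classification is triggered manually, and the model can be retrained.
    Method("Cloud Data Classification Platform", True, False, True),
]


def matching_methods(**requirements):
    """Return the names of the methods that satisfy every given requirement."""
    return [
        m.name
        for m in METHODS
        if all(getattr(m, key) == value for key, value in requirements.items())
    ]
```

For example, `matching_methods(retrains_model=True)` returns only the Cloud Data Classification Platform, while asking for both `stores_sample_data_in_cloud=False` and `retrains_model=True` returns an empty list, because neither method offers that combination.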