About automatic data classification

In Collibra Data Intelligence Cloud, automatic data classification is a feature that analyzes a subset of the data in your registered data sources and predicts what that data contains, so that you can quickly see what kinds of data you have and where they reside. In other words, data classification automatically (with no human input) assigns “class” values to individual columns of data to identify what kind of data each column contains. Examples of data classes are “name”, “address”, “phone number” and “web browser”.
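
The idea of assigning a class to a column based on a sample of its values can be illustrated with a minimal sketch. This is not Collibra's classification model or API: the `PATTERNS` table, the `classify_column` function, the “email address” class and the 0.8 threshold are all hypothetical and chosen only for illustration, and simple pattern matching stands in for whatever prediction method the product actually uses.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; this is NOT how Collibra classifies data.
PATTERNS = {
    "phone number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
    "email address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def classify_column(sample_values, threshold=0.8):
    """Return the class matched by at least `threshold` of the non-empty
    sample values, or None if no class reaches that share."""
    values = [v.strip() for v in sample_values if v and v.strip()]
    if not values:
        return None

    # Count how many sample values match each class pattern.
    counts = Counter()
    for value in values:
        for data_class, pattern in PATTERNS.items():
            if pattern.match(value):
                counts[data_class] += 1

    if not counts:
        return None

    best_class, hits = counts.most_common(1)[0]
    return best_class if hits / len(values) >= threshold else None
```

For example, calling `classify_column` on a sample of values that all look like phone numbers returns “phone number”, while a sample in which only a minority of values match any pattern returns `None`, leaving the column unclassified.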

Why automatic data classification?

After you ingest data in Data Catalog, the data classification process automatically identifies data structures within that data. As a result, you spend less time learning what kind of data you have ingested.

Data classification via the Data Classification Platform vs. Edge

The following table shows the differences between classification via the Data Classification Platform and classification via Edge.

| Part of process | Classification via the Data Classification Platform | Classification via Edge |
| --- | --- | --- |
| Availability | You have enabled Data Classification in Collibra Console. | Data classification is a part of the profiling capability of an Edge site. If you have access to Edge, profiling and classification are available. |
| Sample data | The Data Classification Platform requires sample data that needs to be stored in your Collibra environment. | Data classification via Edge classifies data on the Edge site. Sample data is no longer stored in Collibra cloud. |
| Anonymization | The Data Classification Platform uses profiling and sample data to classify. As a result, you cannot classify your data when it is anonymized. | Profiling and classification are performed via an Edge site in the customer's environment. The data is anonymized before it is sent to Collibra Data Intelligence Cloud. |
| Automatic vs. manual classification | Data classification must be manually triggered from every table, schema or database. | Data classification is automatically triggered after the profiling process on an Edge site. |
| Retraining | The Data Classification Platform stores your classification selections, along with the associated sample data, to retrain the Classification Model and improve future classification predictions. | Currently, data classification on Edge does not retrain the classification model to improve future classification predictions. |