About Automatic Data Classification
In Collibra Data Intelligence Cloud, Automatic Data Classification is a feature that analyzes a subset of the data in your registered data sources and predicts what kind of content that data holds. This helps you gain insight into what kinds of data you have and where they reside. In other words, data classification automatically, with no human input, assigns “class” values to individual columns to identify the kind of data each column contains. Examples of data classes are name, address, phone number, and web browser.
Note: Automatic Data Classification only analyzes structured data. Unstructured data is out of scope.
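To make the idea of assigning a class to a column from a sample of its values more concrete, here is a minimal Python sketch. It is not how Collibra implements classification: the regular expressions, the `threshold` parameter, and the function names `classify_column` and `classify_table` are all assumptions made for illustration, and a real classifier would rely on a trained model rather than hand-written rules.

```python
import re

# Hypothetical patterns for two of the example data classes. These rules are
# illustrative assumptions, not Collibra's classification logic.
PATTERNS = {
    "phone number": re.compile(r"^\+?[\d\s().-]{7,20}$"),
    "web browser": re.compile(r"^(?:Mozilla|Opera)/\d"),
}


def classify_column(sample_values, threshold=0.8):
    """Return the best-matching data class for a column sample, or None."""
    # Ignore missing and blank values so they do not dilute the match ratio.
    values = [str(v).strip() for v in sample_values if v is not None]
    values = [v for v in values if v]
    if not values:
        return None

    best_class, best_ratio = None, 0.0
    for data_class, pattern in PATTERNS.items():
        matches = sum(1 for v in values if pattern.match(v))
        ratio = matches / len(values)
        if ratio > best_ratio:
            best_class, best_ratio = data_class, ratio

    # Only assign a class when enough of the sample agrees with it.
    return best_class if best_ratio >= threshold else None


def classify_table(sample_rows):
    """Classify every column of a table given as a list of dict rows."""
    columns = {}
    for row in sample_rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: classify_column(vals) for name, vals in columns.items()}
```

Passing `classify_table` a few sample rows returns a dictionary that maps each column name to a predicted class, or to `None` when no pattern matches enough of the sample. The threshold reflects a design choice: a column is labeled only when most sampled values agree, which keeps a few stray matches from mislabeling a free-text column.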
Why automatic data classification?
After you ingest data in Data Catalog, the data classification process automatically identifies data structures within that data. As a result, it takes less time to learn what kind of data you have ingested.
Methods to classify data
Data can be classified via the Cloud Data Classification Platform or via Edge.
You can also use the Catalog Data Classification REST API to add data classes, assign data classes to assets, import existing data classifications, start the classification via the Cloud Data Classification Platform, and so on.
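As a rough illustration of calling a REST API of this kind from a script, the following Python sketch prepares a JSON POST request without sending it. Everything specific in it is an assumption: the function name `build_classification_request`, the endpoint path, the payload fields, and the `Authorization` header scheme are placeholders, not the documented Catalog Data Classification REST API. Consult the API reference for your environment for the actual endpoints and request bodies.

```python
import json
import urllib.request


def build_classification_request(base_url, endpoint_path, payload, auth_header):
    """Build (but do not send) a JSON POST request.

    endpoint_path and payload are supplied by the caller; this sketch does
    not know the real Collibra endpoints or request schemas.
    """
    url = base_url.rstrip("/") + "/" + endpoint_path.lstrip("/")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Assumption: the caller passes a ready-made Authorization value.
            "Authorization": auth_header,
        },
    )


# Hypothetical usage; the path and payload below are placeholders only.
request = build_classification_request(
    "https://your-instance.example.invalid",
    "/placeholder/classification-endpoint",
    {"assetId": "00000000-0000-0000-0000-000000000000"},
    "Bearer <token>",
)
```

Sending the request would then be a matter of passing it to `urllib.request.urlopen` and handling the response and any HTTP errors. Keeping the build step separate makes the request easy to inspect before anything is sent.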
The following table shows the differences between Edge and the Cloud Platform.
| Part of process | Edge | Cloud Data Classification Platform |
|---|---|---|
| Availability | You have enabled data classification on Edge. Data classification is part of the profiling capability of an Edge site. If you have access to Edge, profiling and classification are available. | You have set up and enabled the Cloud Data Classification Platform in Collibra Console. |
| Sample data | Data classification via Edge classifies data on the Edge site. Sample data is not stored in Collibra cloud. | The Cloud Data Classification Platform requires sample data that needs to be stored in your Collibra environment. |
| Anonymization | Profiling and classification are performed via an Edge site in your environment. The data is anonymized before it is sent to Collibra Data Intelligence Cloud. | The Cloud Data Classification Platform uses profiling and sample data to classify. As a result, you cannot classify your data when it is anonymized. |
| Automatic or manual start of the data classification | Data classification is automatically triggered after the profiling process on an Edge site. | Data classification must be manually triggered from every table, schema or database. |
| Retraining | Data classification via Edge does not retrain the classification model. | The Cloud Data Classification Platform stores your classification selections, along with the associated sample data. This allows the classification model to be retrained to improve future classification predictions. |