Automatically classifying assets via the Unified Data Classification method
Automatic data classification is a feature that analyzes and predicts the content of registered data sources based on a subset of the data itself. In other words, automatic data classification suggests a data class for columns without human input.
When you start the automatic data classification process, it checks the data in the column or columns against the data classification rules in the data classes and makes classification suggestions, each with a confidence score. This score is an estimate based on the data samples that the data classification process collects, so it can deviate from the exact score.
You can then accept or reject the classification suggestions manually, or have them accepted or rejected automatically by defining classification thresholds.
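The threshold idea can be sketched as follows. This is an illustration only, not the product's API: the `Suggestion` type, the `triage` function, the threshold values (90 and 30), and the comparison rules are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """Hypothetical stand-in for one classification suggestion."""
    column: str
    data_class: str
    confidence: float  # assumed to be a percentage between 0 and 100


def triage(suggestion: Suggestion,
           accept_threshold: float = 90.0,
           reject_threshold: float = 30.0) -> str:
    """Return 'accept', 'reject', or 'review' for a suggestion.

    Assumed rules: scores at or above the accept threshold are accepted,
    scores below the reject threshold are rejected, and everything in
    between is left for a manual decision.
    """
    if suggestion.confidence >= accept_threshold:
        return "accept"
    if suggestion.confidence < reject_threshold:
        return "reject"
    return "review"
```

For example, with these assumed thresholds, `triage(Suggestion("email", "Email Address", 95.0))` returns `"accept"`, while a suggestion with a confidence of 60 is left for manual review.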
To suggest a data class, automatic data classification needs enough data. Columns with very little data may not have a data class suggested.
- The automatic data classification process needs at least 6 values that can be checked to classify a column.
Example:
For data class A, you define a regular expression and indicate that you don't want to consider empty values.
If you then classify a column with many null values and only five non-null values, the column won't get a data classification suggestion, even if the non-null values match data class A.
- The automatic data classification process extracts a maximum of 1,000 values from the data source.
The samples are temporarily added to the Edge site cache. They are not transferred to Collibra. If the Edge site cache already contains at least 100 samples for this data source, the automatic data classification process uses those samples.
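The limits described above can be sketched in a few lines. This is a simplified model, not the actual implementation: the function names, the `fetch` callable, and the treatment of blank strings as empty values are assumptions. Only the numbers 6, 1,000, and 100 come from the text.

```python
MIN_CHECKABLE_VALUES = 6    # minimum number of values that can be checked
MAX_SAMPLE_SIZE = 1000      # maximum number of values extracted from the source
MIN_CACHED_SAMPLES = 100    # cached samples are reused at or above this count


def checkable_values(values, ignore_empty=True):
    """Return the values a data class could check.

    Assumption: when empty values are ignored, None and blank strings
    do not count as checkable.
    """
    if not ignore_empty:
        return list(values)
    return [v for v in values if v is not None and str(v).strip() != ""]


def can_classify(values, ignore_empty=True):
    """True if the column has enough checkable values for a suggestion."""
    return len(checkable_values(values, ignore_empty)) >= MIN_CHECKABLE_VALUES


def choose_samples(cached, fetch):
    """Reuse cached samples when there are enough, else fetch new ones.

    `fetch` is a hypothetical callable that takes a limit and returns
    an iterable of values from the data source.
    """
    if len(cached) >= MIN_CACHED_SAMPLES:
        return list(cached)
    return list(fetch(MAX_SAMPLE_SIZE))[:MAX_SAMPLE_SIZE]
```

Under these assumptions, a column with 50 null values and five email addresses fails `can_classify`, which mirrors the example for data class A. Adding a sixth non-null value makes the column eligible.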