About automatically classifying assets
Automatic data classification analyzes and predicts the content of registered data sources using a subset of the data. This feature suggests data classes for columns without requiring your manual input.
Automatic classification process
When you start the automatic data classification process, it checks column data against the data classification rules in the data classes and provides classification suggestions, each with a confidence score. This score represents the system's certainty based on the analyzed data samples. Because the score is derived from samples, it can deviate from the exact score.
You can manage these suggestions in two ways:
- Manually: Review the suggestions, and accept or reject them yourself.
- Automatically: Define classification thresholds to let Collibra handle suggestions based on their confidence scores. For example, any suggestion with over 90% confidence can be accepted without human intervention.
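The threshold-based handling described above can be sketched as follows. This is a minimal illustration, not Collibra's implementation. The function name `resolve_suggestion`, the `accept_threshold` and `reject_threshold` parameters, and the idea that a suggestion between the two thresholds is left for manual review are all assumptions made for this example.

```python
from typing import Optional


def resolve_suggestion(
    confidence: float,
    accept_threshold: float = 90.0,            # hypothetical default, taken from the 90% example
    reject_threshold: Optional[float] = None,  # hypothetical: auto-reject below this value
) -> str:
    """Return 'accepted', 'rejected', or 'needs_review' for one suggestion.

    Hypothetical helper: the names and threshold semantics are assumptions.
    """
    if not 0.0 <= confidence <= 100.0:
        raise ValueError("confidence must be a percentage between 0 and 100")
    # "Over 90%" is read here as strictly greater than the accept threshold.
    if confidence > accept_threshold:
        return "accepted"
    if reject_threshold is not None and confidence < reject_threshold:
        return "rejected"
    # Anything in between stays open for a manual decision.
    return "needs_review"
```

In this sketch, a suggestion with 95% confidence is accepted, while one with exactly 90% confidence stays in manual review because the comparison is strict.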
Important considerations
- Automatic data classification looks only at structured data. Unstructured data is out of scope.
- To suggest a data class, automatic data classification requires enough data. Columns with very little data may not receive a data class suggestion.
- To classify a column, the process needs at least 6 values that can be checked.
Example: You define a regular expression for "Data class A" and set it to ignore empty values. If you classify a column that contains mostly null values and only five non-null values, the column does not receive a suggestion, even if all five values match the rule.
- The process extracts up to 1,000 values from the data source.
These samples are temporarily added to the Edge site cache. They are not transferred to Collibra.
If the Edge site cache already contains at least 100 samples for this data source, the process uses those samples.
- The method remembers rejected data class suggestions: if you reject a data class for an asset, that data class is not suggested for the asset again. Likewise, once a data classification has been accepted for a column, it is not automatically updated when you run the data classification process again.
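The minimum-values rule from the considerations above can be illustrated with a small sketch. It is not the product's actual algorithm. The function `match_ratio`, the constant `MIN_CHECKABLE_VALUES`, the `ignore_empty` flag, and the use of a simple match ratio as a stand-in for the confidence score are assumptions made for this example.

```python
import re
from typing import Iterable, Optional

MIN_CHECKABLE_VALUES = 6  # minimum number of checkable values stated above


def match_ratio(
    values: Iterable[Optional[str]],
    pattern: str,
    ignore_empty: bool = True,  # hypothetical flag mirroring "ignore empty values"
) -> Optional[float]:
    """Return the share of checkable values that match `pattern`.

    Returns None when fewer than MIN_CHECKABLE_VALUES values can be checked,
    which corresponds to "no suggestion" in this sketch.
    """
    regex = re.compile(pattern)
    if ignore_empty:
        # Null and empty values are skipped and do not count as checkable.
        checkable = [v for v in values if v is not None and v != ""]
    else:
        # Assumption: nulls are treated as empty strings and are checked.
        checkable = ["" if v is None else v for v in values]
    if len(checkable) < MIN_CHECKABLE_VALUES:
        return None
    matches = sum(1 for v in checkable if regex.fullmatch(v))
    return matches / len(checkable)


# Mostly null column with only five non-null values: too few to classify.
sparse_column = [None] * 50 + ["AB-123"] * 5
print(match_ratio(sparse_column, r"[A-Z]{2}-\d{3}"))

# One more non-null value reaches the minimum of six checkable values.
print(match_ratio(sparse_column + ["AB-456"], r"[A-Z]{2}-\d{3}"))
```

The first call returns `None` because only five values can be checked, even though all five match the pattern. The second call has six checkable values and returns a ratio.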
Start automatic classification via the Unified Data Classification method