Create a data class
Important: Unified Data Classification is in beta testing. Only activate this feature in your Test environments. Don't enable it in Production environments yet because it's not fully ready.
You can create data classes either from the asset pages where you can update the classification, or from the Data Classification page in the Stewardship application.
To update and delete data classes, especially the data classification rules, you must use the Data Classification page in the Stewardship application.
Before you begin
You have enabled the Unified Data Classification method.
Required permissions
You have a global role that has the Data Classes > Add global permission.
You have a global role that has the Data Classes > Update global permission.
You have a global role that has the Data Classes > Remove global permission.
Steps
- On the main menu, open the Stewardship application.
- Click the Data Classification tab.
- If the data class doesn't exist yet:
  - Click Add.
  - Type the name of the data class and press Enter.
  - Click Create.
- Select the data class that you want to configure.
  The data class parameters appear in a pane on the right-hand side.
- Optionally, add a description by clicking the Edit icon next to the Description field.
- Open the Details section.
- Complete the fields as required.
  Minimum confidence threshold: The confidence percentage that must be reached for the data class to be considered as a classification result. The confidence percentage is the percentage of values in the column that match the classification rule, for example, the regular expression.
    Enter a value between 0 and 100. The default value is 0.
    Example: If you enter 80 in this field, the automatic classification process returns the data class only if the confidence percentage is 80 percent or higher.
    Tip: Confidence scores of 0 are never taken into account.
  Include empty values: Indicates whether empty values are included in the confidence percentage calculation. The possible values are:
    - True: Empty values are taken into account by the classification process when calculating the confidence percentage of a matching data class.
    - False (default): Only the non-empty values are taken into account by the classification process when calculating the confidence percentage of a matching data class.
    Use this option to receive an accurate confidence score for all data in a column.
    Example: You have a column Z with 40 empty values and 60 phone numbers. You have a data class A with a regular expression to detect US phone numbers.
    - If you set this option to False and you classify column Z, data class A could be suggested with a confidence percentage of 100.
    - If you set this option to True and you classify column Z, data class A could be suggested, but with a confidence percentage of only 60.
    Important: Some regular expressions are constructed to allow a match with empty values. This means that, through the regular expression, empty values can be matched to the data class, which affects the confidence score.
    Example:
    This expression won't match empty values with the email data class:
    ^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$
    This expression will match empty values with the email data class:
    ^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$
  Examples: Some examples of values that match the classification rule for the data class.
- To change a value, click the Edit icon.
  To save the value, click the Save icon.
- Open the Classification rules section.
- Click Add new rule.
  A data class without a classification rule can be used only for manual classification. You need to add at least one classification rule to allow data classification based on the data class.
- From the Type list, select the type of classification rule that you want to add to the data class.
  Depending on your selection, extra fields appear.
- Complete the fields as required.
  Regular expression: This field appears if you select a classification rule of the type Regular expression.
    A regular expression, also referred to as regex or regexp, is a sequence of characters that specifies a match pattern in text. Multiple regular expression grammar variants exist. We use the Java variant.
    Example: A regular expression for an email address can be:
    ^[a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6}$
    Tip:
    - Multiple websites provide guidelines and examples of regular expressions, for example, Regexlib and RegexBuddy, or even ChatGPT.
    - You can also test your regular expression on various websites, for example, Regex101 (select the Java 8 option in the Flavor panel).
    The referenced websites serve only as examples. The use of ChatGPT or other generative AI products and services is at your own risk. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such products or services, and has no liability for such use.
    Important: The required format of the regular expression differs between the UI and the API. In the API, every backslash must be doubled (escaped). In the UI, this is not needed. For example, in the UI, use ^\+?\d{1,3}?\d{1,4}$, and in the API, use ^\\+?\\d{1,3}?\\d{1,4}$.
  Values: This field appears if you select a classification rule of the type List of values.
    Add the values that define a specific data class.
    Example: A data class for T-shirt sizes based on a list of values could be:
    S
    M
    L
    small
    medium
    large
    Important:
    - The number of values in a list is limited to 1,000. Uploading a file to add larger lists is planned, but is not yet available in this phase.
    - Add only one value per line.
    - The maximum number of characters in a single list value is 10,000.
    - Don’t add any leading or trailing blank characters in a value.
    - The values are not case-sensitive: the value “small” in the list also matches the values “Small” and “SMALL”.
  Description: A description of the classification rule.
- Click Save.
  The classification rule for the data class is configured. A new section appears. If you expand the section, the details are shown.
- If needed, click Add new rule to add another classification rule to the data class.
  The maximum number of rules in a data class is 25.
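The interaction between the Minimum confidence threshold and Include empty values settings in the Details section can be illustrated with a small calculation. The following Python sketch is not the product's implementation: the function names `confidence_percentage` and `is_suggested` and the simplified phone pattern are hypothetical, chosen only to reproduce the column Z example from the text.

```python
import re


def confidence_percentage(values, pattern, include_empty=False):
    """Percentage of the considered column values that fully match `pattern`."""
    regex = re.compile(pattern)
    # With include_empty=False, empty values are left out of the calculation.
    considered = values if include_empty else [v for v in values if v != ""]
    if not considered:
        return 0.0
    matches = sum(1 for v in considered if regex.fullmatch(v))
    return 100.0 * matches / len(considered)


def is_suggested(confidence, minimum_threshold):
    """Apply the minimum confidence threshold; a confidence of 0 never counts."""
    return confidence > 0 and confidence >= minimum_threshold


# Column Z from the example: 40 empty values and 60 phone numbers.
column_z = [""] * 40 + ["555-010-%04d" % i for i in range(60)]
# Hypothetical, simplified phone pattern used only for this illustration.
us_phone = r"\d{3}-\d{3}-\d{4}"

print(confidence_percentage(column_z, us_phone, include_empty=False))  # 100.0
print(confidence_percentage(column_z, us_phone, include_empty=True))   # 60.0
print(is_suggested(60.0, 80))                                          # False
```

With a threshold of 80, data class A would be returned for column Z only when empty values are excluded, which matches the behavior described for the two settings.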
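Two of the regular expression notes in the steps can be checked quickly: whether a pattern matches empty values, and how backslashes end up doubled for the API. The sketch below uses Python's `re` module as a stand-in for the Java variant, which behaves the same way for these simple patterns. It also assumes that the API request body is JSON, which is one common reason backslashes need escaping. The field name `regularExpression` is hypothetical and not taken from the API documentation.

```python
import json
import re

# The two email patterns from the text.
STRICT_EMAIL = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$"
LENIENT_EMAIL = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$"


def matches_empty(pattern):
    """Return True if the pattern accepts an empty value."""
    return re.fullmatch(pattern, "") is not None


print(matches_empty(STRICT_EMAIL))   # False
print(matches_empty(LENIENT_EMAIL))  # True

# The pattern as you would type it in the UI.
ui_pattern = r"^\+?\d{1,3}?\d{1,4}$"

# Assumption: the API takes a JSON body. Serializing to JSON doubles every
# backslash. The key "regularExpression" is a hypothetical field name.
api_payload = json.dumps({"regularExpression": ui_pattern})
print(api_payload)  # {"regularExpression": "^\\+?\\d{1,3}?\\d{1,4}$"}
```

Running `matches_empty` on a new pattern before saving it shows whether empty values will count as matches when Include empty values is set to True.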
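The constraints on a List of values rule (at most 1,000 values, one value per line, at most 10,000 characters per value, no leading or trailing blanks, case-insensitive matching) can be checked before you paste a list into the Values field. The following Python sketch is a hypothetical pre-check, not part of the product. The helper names `check_value_list` and `matches_value_list` are made up for this illustration.

```python
MAX_VALUES = 1000          # limit on the number of values in a list (from the text)
MAX_VALUE_LENGTH = 10000   # limit on characters in a single value (from the text)


def check_value_list(text):
    """Return a list of problems found in a one-value-per-line list."""
    values = text.splitlines()
    problems = []
    if len(values) > MAX_VALUES:
        problems.append(f"{len(values)} values exceed the limit of {MAX_VALUES}")
    for line_number, value in enumerate(values, start=1):
        if value != value.strip():
            problems.append(f"line {line_number}: leading or trailing blank characters")
        if len(value) > MAX_VALUE_LENGTH:
            problems.append(f"line {line_number}: longer than {MAX_VALUE_LENGTH} characters")
    return problems


def matches_value_list(candidate, values):
    """Case-insensitive membership test, mirroring the 'not case-sensitive' rule."""
    lowered = {value.casefold() for value in values}
    return candidate.casefold() in lowered


t_shirt_sizes = "S\nM\nL\nsmall\nmedium\nlarge"
print(check_value_list(t_shirt_sizes))                          # []
print(check_value_list("small \nmedium"))                       # one problem, for line 1
print(matches_value_list("SMALL", t_shirt_sizes.splitlines()))  # True
print(matches_value_list("XL", t_shirt_sizes.splitlines()))     # False
```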
What's next?
Import out-of-the-box data classes
Go to some examples