About data classes in the Unified Data Classification method

Data classes are the different groups you want to use to classify your data, for example, email, phone number, and web browser. You can create, update, and remove data classes. You can also import out-of-the-box data classes and update them.
Currently, you cannot merge data classes using the Unified Data Classification method.

A data class in the Unified Data Classification method (Beta) consists of the following elements:

Data class element Description
Name The name of the data class.
Enabled

Switch to indicate whether this data class needs to be taken into account during the classification process.
If a data class is not enabled, the automatic data classification doesn't consider this data class and the data class is also unavailable when you manually classify a column.
However, if a column is already classified with that data class, the classification is still valid.

This option is useful if the data class is not ready for use or is still being tested.

Description The description of the data class.
Details  
Minimum confidence threshold

The confidence percentage that must be reached for the data class to be considered as a classification result. The confidence percentage is the percentage of values in the column that match the classification rule, for example, the regular expression.

Enter a value between 0 and 100.
The default value is 0.

Example If you enter 80 in this field, the automatic classification process returns the data class only if the confidence percentage is 80 percent or higher.

Tip Confidence scores of 0 are never taken into account.
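
The threshold check described above can be sketched as follows. This is a minimal illustration and not Collibra's implementation: the function name is_returned is hypothetical, and the sketch assumes the confidence percentage has already been calculated.

```python
def is_returned(confidence: float, min_threshold: float = 0) -> bool:
    """Decide whether a data class is kept as a classification result.

    Hypothetical helper: both arguments are percentages between 0 and 100.
    """
    if not 0 <= min_threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    # A confidence of 0 is never taken into account, even when the
    # threshold is left at its default value of 0.
    return confidence > 0 and confidence >= min_threshold


print(is_returned(85, 80))    # True
print(is_returned(79.9, 80))  # False
print(is_returned(0, 0))      # False
```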

Include empty values

Indicates if you want to include empty values in the confidence percentage calculation.
The possible values are:

  • True: Empty values are taken into account when the classification process calculates the confidence percentage of a matching data class.
  • False (default): Only non-empty values are taken into account when the classification process calculates the confidence percentage of a matching data class.

Set this option to True if you want the confidence percentage to reflect all values in a column, including the empty ones.

Example 

You have a column Z with 40 empty values and 60 phone numbers. You have a data class A with a regular expression to detect US phone numbers.

  • If you set this option to False and you classify column Z, data class A could be suggested with a confidence percentage of 100.
  • If you set this option to True and you classify column Z, data class A could be suggested but with a confidence percentage of only 60.
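
The arithmetic behind the column Z example can be reproduced with a short sketch. It rests on several assumptions: the phone number pattern is a simplified stand-in for data class A, an empty value is modeled as an empty string, the function name confidence_percentage is hypothetical, and Python's re module is used here in place of the Java regular expression variant.

```python
import re

# Simplified, hypothetical pattern for US phone numbers such as 555-123-4567.
US_PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")


def confidence_percentage(values, pattern, include_empty=False):
    """Return the percentage of considered values that fully match the pattern."""
    # With include_empty=False, empty values are dropped before counting.
    considered = values if include_empty else [v for v in values if v != ""]
    if not considered:
        return 0.0
    matches = sum(1 for v in considered if pattern.fullmatch(v))
    return 100.0 * matches / len(considered)


# Column Z: 40 empty values and 60 phone numbers.
column_z = [""] * 40 + ["555-123-4567"] * 60

print(confidence_percentage(column_z, US_PHONE, include_empty=False))  # 100.0
print(confidence_percentage(column_z, US_PHONE, include_empty=True))   # 60.0
```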

Important Some regular expressions are constructed to allow a match with empty values. This means that, through the regular expression, empty values can be matched to the data class, which affects the confidence score.
Example:
This expression won't match empty values with the email data class:
^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$
This expression will match empty values with the email data class:
^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$
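
You can check this behavior before saving a rule. The sketch below tests both expressions against an empty value. It uses Python's re module as a stand-in for the Java variant; the two expressions only use syntax that both variants share, but treat the result as an indication and not as a guarantee for other expressions.

```python
import re

EMAIL_STRICT = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$"
EMAIL_ALLOWS_EMPTY = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$"

for name, pattern in (("strict", EMAIL_STRICT), ("allows empty", EMAIL_ALLOWS_EMPTY)):
    # fullmatch requires the whole value to match, like a data value check.
    matches_empty = re.fullmatch(pattern, "") is not None
    print(f"{name}: matches empty value = {matches_empty}")
```

The trailing * makes the whole group optional, so zero repetitions (an empty value) also satisfy the second expression.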

Examples
Some examples of values that match the classification rule for the data class.

Classification rules

A data classification rule is used by the data classification process to calculate the confidence score, which is a percentage that indicates the likelihood that the data class fits the data in an asset.

A data class can contain multiple data classification rules. Each rule is verified against the data, and the data class is assigned as soon as one of the rules applies.

Example  You have defined the email data class with a regular expression. However, the values “unknown”, “invalid”, and “missing” are also acceptable email values in your data source. You can add a list of values as a second rule on the email data class. For the full example, go to Example: Configuring a data class with two classification rules.
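
As a rough illustration of the two-rule email data class, the sketch below reports which rule accepts a given value. The RULES structure and the matching_rules function are hypothetical, Python's re module stands in for the Java variant, and the sketch deliberately does not model how the classification process combines rules into a confidence score.

```python
import re

# Rule 1: regular expression for email addresses.
EMAIL_REGEX = re.compile(r"^[a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6}$")
# Rule 2: list of values that are also acceptable in this data source.
ACCEPTED_PLACEHOLDERS = {"unknown", "invalid", "missing"}

# Hypothetical model of a data class with two classification rules.
RULES = {
    "regular expression": lambda v: EMAIL_REGEX.fullmatch(v) is not None,
    "list of values": lambda v: v.casefold() in ACCEPTED_PLACEHOLDERS,
}


def matching_rules(value):
    """Return the names of the rules that accept the value."""
    return [name for name, accepts in RULES.items() if accepts(value)]


for value in ["jane@example.com", "unknown", "n/a"]:
    print(value, "->", matching_rules(value))
# jane@example.com -> ['regular expression']
# unknown -> ['list of values']
# n/a -> []
```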

Type

The possible values are: Regular expression or List of values.
Depending on your selection, other fields appear.

Regular expression

This field appears if you select a classification rule of the type Regular expression.

A regular expression, also referred to as regex or regexp, is a sequence of characters that specifies a match pattern in text. Multiple regular expression grammar variants exist. We use the Java variant.

Example A regular expression for an email address can be ^[a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6}$

Tip 
  • Multiple websites provide guidelines and examples of regular expressions, for example, Regexlib and RegexBuddy, or even ChatGPT.
  • You can also test your regular expression on various websites, for example, Regex101 (select the Java 8 option in the Flavor panel).

The referenced websites serve only as examples. The use of ChatGPT or other generative AI products and services is at your own risk. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such products or services, and has no liability for such use.

Important 

The required format of the regular expression differs between the UI and the API. In the API, every backslash must be escaped by doubling it. In the UI, this is not needed. For example, in the UI, use ^\+?\d{1,3}?\d{1,4}$, and in the API, use ^\\+?\\d{1,3}?\\d{1,4}$.
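
If you build API requests in a script, a JSON library can apply the escaping for you. The sketch below assumes the API request body is JSON; the key regularExpression is a placeholder and not a documented field name.

```python
import json

# The pattern exactly as you would type it in the UI.
ui_regex = r"^\+?\d{1,3}?\d{1,4}$"

# "regularExpression" is a placeholder key, not a documented API field.
payload = json.dumps({"regularExpression": ui_regex})
print(payload)  # {"regularExpression": "^\\+?\\d{1,3}?\\d{1,4}$"}

# Parsing the JSON gives back the original single-backslash pattern.
restored = json.loads(payload)["regularExpression"]
print(restored == ui_regex)  # True
```

Letting the library serialize the string avoids doubling backslashes by hand, which is easy to get wrong in longer expressions.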

Values

This field appears if you select a classification rule of the type List of values.

Add the values that define a specific data class.

Example 

A data class for T-shirt sizes based on a list of values could be:

S

M

L

small

medium

large

Important 
  • The number of values in a list is limited to 1,000. Adding larger lists by uploading a file is planned but not yet available in this phase.
  • Add only one value per line.
  • The maximum number of characters in a single list value is 10,000.
  • Don’t add any leading or trailing blank characters in a value.
  • The values are not case-sensitive. For example, the value “small” in the list also matches the values “Small” and “SMALL”.
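
The constraints above can be checked before you paste a list into the field. The following is a minimal sketch with hypothetical function names; in particular, using casefold() for the case-insensitive comparison is an assumption about how the matching behaves, not a description of the actual implementation.

```python
MAX_VALUES = 1_000
MAX_VALUE_LENGTH = 10_000


def validate_values(values):
    """Raise ValueError if the list breaks one of the documented limits."""
    if len(values) > MAX_VALUES:
        raise ValueError(f"too many values: {len(values)} > {MAX_VALUES}")
    for value in values:
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError("value longer than 10,000 characters")
        if value != value.strip():
            raise ValueError(f"leading or trailing blank in {value!r}")


def matches_list(candidate, values):
    """Case-insensitive membership test (casefold is an assumption)."""
    return candidate.casefold() in {v.casefold() for v in values}


t_shirt_sizes = ["S", "M", "L", "small", "medium", "large"]
validate_values(t_shirt_sizes)  # passes silently

print(matches_list("SMALL", t_shirt_sizes))  # True
print(matches_list("XL", t_shirt_sizes))     # False
```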
Description

A description of the classification rule.

Tip 

The maximum number of rules in a data class is 25.

About out-of-the-box data classes

Out-of-the-box data classes are created by Collibra. You can decide if and which out-of-the-box data classes you want to use. This allows you to have only the data classes you are interested in and to reduce the risk of similar, overlapping data classes.

Example The out-of-the-box data classes include Credit card and Credit card: Visa. These two overlap because a Visa credit card is also a credit card. You can decide which data class to use depending on the granularity you need.

Once an out-of-the-box data class has been imported, it's treated as a regular data class, which means you can edit the data class and change its classification rules.

If you import out-of-the-box data classes again, we'll detect whether you have data classes with the same name. We'll inform you about this and indicate whether their rules are different. The following statuses are available:

Status Description
New This data class is not yet available in your environment.
You can import this data class without any risk of erasing existing data.
Exists (no changes)

A data class with the same name is already available in your environment and the definitions of its classification rules are the same.

This data class will not be imported, even if you select it for import. Data classes with this status are, by default, deselected for import.

Exists (changed)

A data class with the same name but with different classification rules is already available in your environment.

You can import this data class, but the current classification rules will be replaced by those in the out-of-the-box data class. The classification rule descriptions will also be updated.

Data classes with this status are, by default, deselected for import.

Important 

When we compare data classes to check if they were changed, we compare only the classification rules.

  • Global properties, such as the data class description, the minimum confidence threshold, and the examples, are not taken into account.
    If you import the out-of-the-box data class, these properties are not updated.

  • Classification rule descriptions are not taken into account.
    However, if you import the out-of-the-box data class, the classification rules, including the classification rule descriptions, are updated.
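
The comparison logic can be sketched as follows. This is an illustration under stated assumptions, not the actual import code: the dictionary shape of a data class and the function name import_status are hypothetical, and the sketch assumes that the order of the rules does not matter.

```python
def import_status(candidate, existing_by_name):
    """Return the import status of an out-of-the-box data class.

    Hypothetical data shape: a data class is a dict with "name" and "rules";
    each rule is a dict with "type", "definition", and "description".
    """
    existing = existing_by_name.get(candidate["name"])
    if existing is None:
        return "New"

    def rule_keys(data_class):
        # Only the rule type and definition are compared. Rule descriptions
        # and global properties are deliberately ignored.
        return {(r["type"], r["definition"]) for r in data_class["rules"]}

    if rule_keys(candidate) == rule_keys(existing):
        return "Exists (no changes)"
    return "Exists (changed)"


existing = {
    "Email": {
        "name": "Email",
        "description": "Customized description",
        "rules": [
            {"type": "Regular expression", "definition": r"^.+@.+$",
             "description": "Edited locally"},
        ],
    }
}
same_rules = {
    "name": "Email",
    "rules": [
        {"type": "Regular expression", "definition": r"^.+@.+$",
         "description": "Original description"},
    ],
}
other_rules = {
    "name": "Email",
    "rules": [
        {"type": "List of values", "definition": "unknown",
         "description": "Placeholder value"},
    ],
}

print(import_status(same_rules, existing))   # Exists (no changes)
print(import_status(other_rules, existing))  # Exists (changed)
print(import_status({"name": "Phone", "rules": []}, existing))  # New
```

In this sketch, the differing rule descriptions in same_rules do not change the status, which mirrors the behavior described above.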