About data classes

Updated: May 8, 2026

Data classes are the groups used to classify data, such as email, phone number, or web browser. They are used to identify data patterns.

Users with the required permissions, such as data stewards, can create, update, remove, and merge data classes. They can also import out-of-the-box data classes and update them.

The automatic data classification method uses the classification rules defined in a data class to check if an asset can be classified with the data class. A data class can include multiple rules, and the rules can be of different types. A data class is assigned to a column as soon as one of the rules applies to the column.

Important

By default, rules based on column name and data type are evaluated before rules based on samples, such as regular expressions for data and lists of values for data. Rules based on samples are evaluated in the order in which they appear in the data class.
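
The evaluation order described in this topic can be sketched as follows. This is a minimal illustration, not the product's implementation: the `Rule` class, the `classify` function, and the rule kind names are hypothetical, Python's `re.search` stands in for the Java regular expression engine, and the column's data type is assumed to be already mapped to one of the data type options (for example, `date`).

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule model, for illustration only."""
    kind: str     # "column_name", "data_type", "value_regex", or "value_list"
    spec: object  # regex string, set of data types, or set of lowercase values


def matches_column(rule, column_name, column_type):
    """Check a rule that looks at the column itself (name or data type)."""
    if rule.kind == "column_name":
        return re.search(rule.spec, column_name) is not None
    if rule.kind == "data_type":
        return column_type in rule.spec
    return False


def matches_sample(rule, value):
    """Check a rule that looks at a sampled value."""
    if rule.kind == "value_regex":
        return re.search(rule.spec, value) is not None
    if rule.kind == "value_list":
        return value.lower() in rule.spec  # list values are not case-sensitive
    return False


def classify(rules, column_name, column_type, samples):
    """Return (applied_directly, number_of_matching_samples)."""
    # Column name and data type rules are checked first, wherever they
    # appear in the data class. A match applies the data class at once.
    for rule in rules:
        if matches_column(rule, column_name, column_type):
            return True, 0
    # Sample-based rules keep their order of appearance in the data class.
    sample_rules = [r for r in rules if r.kind in ("value_regex", "value_list")]
    matched = 0
    for value in samples:
        # any() stops at the first matching rule, so the remaining rules
        # are skipped for that sample.
        if any(matches_sample(r, value) for r in sample_rules):
            matched += 1
    return False, matched


EMAIL = r"^[a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6}$"
rules = [
    Rule("value_regex", EMAIL),
    Rule("value_list", {"unknown", "invalid", "missing"}),
    Rule("column_name", r"^email"),
]
print(classify(rules, "email_address", "string", ["n/a"]))            # (True, 0)
print(classify(rules, "contact", "string", ["a@b.com", "Unknown", "n/a"]))  # (False, 2)
```

In the second call the column name does not match, so each sample is tested against the sample-based rules in order. Two of the three samples match a rule.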

A data class includes the following elements:

Data class element	Description
Name	The name of the data class.
Enabled	Indicates whether the data class is taken into account during the data classification process. If a data class isn't enabled, the automatic data classification process doesn't consider it, and it is unavailable when you manually classify a column. However, if a column is already classified with that data class, the classification remains valid. This option can be useful if the data class is not ready for use or is still in a testing phase.
Description	The description of the data class. The description can't exceed 10,000 characters.
Details
Minimum confidence threshold	The confidence percentage that must be reached for the data class to be considered as a possible classification result. The confidence percentage is the percentage of values in a column that match at least one of the classification rules in the data class, for example, the regular expression. Enter a value between 0 and 100. The default value is 0. Confidence scores of 0 are never taken into account. Example: If you enter 80 in this field, the automatic data classification process suggests this data class only if the confidence percentage reaches 80 percent or higher.
Include empty values	Indicates whether empty values are included in the confidence percentage calculation. The possible values are: Yes: If the value is set to true, empty values are taken into account by the data classification process when calculating the confidence percentage of a matching data class. No (default): If the value is set to false, only the non-empty values are taken into account by the data classification process when calculating the confidence percentage of a matching data class. This option can be used to receive an accurate confidence score for all data in a column. Example: You have a column Z with 40 empty values and 60 phone numbers. You have a data class A with a regular expression to detect US phone numbers. If you set this option to No (false) and you classify column Z, data class A could be suggested with a confidence percentage of 100. If you set this option to Yes (true) and you classify column Z, data class A could be suggested, but with a confidence percentage of only 60. Important: Some regular expressions are constructed to allow a match with empty values. This means that, through the regular expression, empty values can be matched to the data class, which affects the confidence score. Example: This expression won't match empty values with the email data class: `^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$`. This expression will match empty values with the email data class: `^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$`.
Column name filter	The Column name filter allows you to limit a data class to specific columns based on their name in the data source. The data class is considered only if the column's name in the data source matches one of the regular expressions in the filter. Add each regular expression in a separate field. To do so, enter one regular expression and click Save. A new filter field becomes available. You can add up to 25 regular expressions in the Column name filter.
Column type filter	The Column type filter allows you to limit a data class to specific column types based on their data type in the data source. The data class is considered only if the column's data type in the data source matches one of the specified data types. Using the Column type filter makes the classification of dates, times, and time stamps easier because the data class can be restricted to those data types. Example: If you select `date` in the Column type filter and you classify a column with data type BIGINT in the data source, the column won't be checked against this data class. The following list shows the available options in the Column type filter and their mapped SQL data types (java.sql.Types). boolean: BIT, BOOL, BOOLEAN; date: DATE, DATETIME, TIME, YEAR, smalldatetime, datetimeoffset; double: FLOAT, DOUBLE, DOUBLE PRECISION, DECIMAL, DEC, numeric, money, smallmoney, real; int: TINYINT, SMALLINT, MEDIUMINT, INT, INTEGER, BIGINT; string: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT, ENUM, SET, nchar, nvarchar, ntext, xml, character, cidr, inet, json, macaddr, uuid, clob; timestamp: TIMESTAMP.
Examples	Some examples of values that match the classification rule for the data class. Add one example per line.
Classification rules	A classification rule is used by the data classification process to calculate the confidence score, which is a percentage that indicates the likelihood that the data class fits the data in an asset. A data class can contain multiple classification rules of various types. Example: You have defined the email data class with a regular expression. However, the values “unknown”, “invalid”, and “missing” are also acceptable email values in your data source. You can add a list of values as a second rule on the email data class. For the full example, go to Example: Configuring a data class with two classification rules. The maximum number of rules in a data class is 25. Important: By default, rules based on column name and data type are evaluated before rules based on samples, such as regular expressions for data and lists of values for data. Rules based on samples are evaluated in the order in which they appear in the data class. If the name of the column asset matches the regular expression in a classification rule, the data class is applied, and no other rules are checked for that column. If the column's data type in the data source matches the data type in a classification rule, the data class is applied, and no other rules are checked for that column. If neither the column name nor the data type matches, sample data from the column is evaluated against the other rules in the data class. These rules are processed in the order in which they appear in the data class. When a rule matches a sample, the sample is considered a match, and the confidence score of the data class increases. At this point, the remaining rules are skipped for that sample. For that reason, add the rules that are more likely to produce a match before the others.
Description	A description of the classification rule.
Type	The type of classification rule. The possible values are Regular expression for column names, Data type, List of values for data, or Regular expression for data. Depending on your selection, other fields appear. Add a Regular expression for column names rule to check the name of a column in the data source. Unlike the Column name filter, which makes the name a prerequisite for considering the data class, a rule based on the name serves as a criterion for applying the data class. Add a Data type rule to check the data type of a column in the data source. Unlike the Column type filter, which makes the data type a prerequisite for considering the data class, a rule based on the data type serves as a criterion for applying the data class. For an example, go to Example | Importing data classes and starting automatic classification for a table. Add a Regular expression for data rule to validate a pattern, such as the format of email addresses. Add a List of values for data rule to check for specific, predefined options, such as T-shirt sizes.
Regular expression for column names	If you select Regular expression for column names, you need to complete the following fields. Regular expression: In this case, the regular expression refers to the possible column names. Example: Regular expressions for the column name of an email address could be `^email` or `^maddress`. A regular expression, also referred to as regex or regexp, is a sequence of characters that specifies a match pattern in text. Multiple regular expression grammar variants exist. We use the Java variant. Tip: Multiple websites provide guidelines and examples of regular expressions, for example, Regexlib and RegexBuddy. You can also test your regular expression on various websites, for example, Regex101 (select the Java 8 option in the Flavor sidebar). The referenced websites serve only as examples. The use of generative AI products and services is at your own risk. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such products or services, and has no liability for such use. Important: We check regular expressions for potential vulnerabilities before saving them. The required format of the regular expression differs between the UI and the API. In the API, each backslash must be doubled (escaped). In the UI, this is not needed. For example: In the UI, use `^\+?\d{1,3}?\d{1,4}$`, and in the API, use `^\\+?\\d{1,3}?\\d{1,4}$`. Description: A description of the classification rule.
Data type	If you select Data type, you need to complete the following fields. Data type: Select one or more data types. Using this type of rule makes the classification of dates, times, and time stamps easier because you can check based on data type. During the automatic data classification, the data type of the column in the data source is verified against the selected data types in this classification rule. Example: You create a rule with data type `date` and you classify a column with data type DATETIME in the data source. As a result, the column is classified with this data class. The following list shows the available options in the Data type field and their matching SQL data types. boolean: BIT, BOOL, BOOLEAN; date: DATE, DATETIME, TIME, YEAR, smalldatetime, datetimeoffset; double: FLOAT, DOUBLE, DOUBLE PRECISION, DECIMAL, DEC, numeric, money, smallmoney, real; int: TINYINT, SMALLINT, MEDIUMINT, INT, INTEGER, BIGINT; string: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT, ENUM, SET, nchar, nvarchar, ntext, xml, character, cidr, inet, json, macaddr, uuid, clob; timestamp: TIMESTAMP. Description: A description of the classification rule.
List of values for data	If you select List of values for data, you need to complete the following fields. Upload a file: Use this to upload a CSV file with possible data class values. The file must contain one value per line. The maximum file size is 100 MB. This method is mandatory if you want to add more than 1,000 values. Download file: If you download a file, the name of the file is listofvalues_<ID of the data class rule>. Values: Add the values that define a specific data class. Add only one value per line. The maximum number of characters in a single list value is 10,000. Don’t add any leading or trailing blank characters in a value. The values are not case-sensitive: the value “small” in the list also matches the values “Small” and “SMALL”. The maximum total number of values in one data class is 25,000. This number can be spread over multiple classification rules. Example: A data class for T-shirt sizes based on a list of values could be: `small` `medium` `large`. Description: A description of the classification rule.
Regular expression for data	If you select Regular expression for data, you need to complete the following fields. Regular expression: In this case, the regular expression refers to the values in the columns. Example: A regular expression for an email address can be `^[a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6}$`. A regular expression, also referred to as regex or regexp, is a sequence of characters that specifies a match pattern in text. Multiple regular expression grammar variants exist. We use the Java variant. Important: We check regular expressions for potential vulnerabilities before saving them. The required format of the regular expression differs between the UI and the API. In the API, each backslash must be doubled (escaped). In the UI, this is not needed. For example: In the UI, use `^\+?\d{1,3}?\d{1,4}$`, and in the API, use `^\\+?\\d{1,3}?\\d{1,4}$`. Tip: Multiple websites provide guidelines and examples of regular expressions, for example, Regexlib and RegexBuddy. You can also test your regular expression on various websites, for example, Regex101 (select the Java 8 option in the Flavor sidebar). The referenced websites serve only as examples. The use of generative AI products and services is at your own risk. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such products or services, and has no liability for such use. Description: A description of the classification rule.
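
The interaction between Include empty values, the Minimum confidence threshold, and regular expressions that match empty values can be reproduced with a short calculation. The sketch below is an illustration under stated assumptions, not the product's implementation: the function names are hypothetical, an empty value is assumed to be an empty string, the US phone number pattern is a simplified stand-in, and Python's `re` module replaces the Java regular expression engine.

```python
import re


def confidence_percentage(values, pattern, include_empty=False):
    """Percentage of counted values that match the pattern."""
    regex = re.compile(pattern)
    # With include_empty=False, empty values are left out of the calculation.
    counted = values if include_empty else [v for v in values if v != ""]
    if not counted:
        return 0.0
    matched = sum(1 for v in counted if regex.search(v))
    return 100.0 * matched / len(counted)


def is_suggested(confidence, minimum_threshold=0):
    """Scores of 0 are never taken into account, even with threshold 0."""
    return confidence > 0 and confidence >= minimum_threshold


# Column Z from the example: 40 empty values and 60 phone numbers.
PHONE = r"^\d{3}-\d{3}-\d{4}$"  # simplified, hypothetical US phone pattern
column_z = [""] * 40 + [f"202-555-{i:04d}" for i in range(60)]

print(confidence_percentage(column_z, PHONE, include_empty=False))  # 100.0
print(confidence_percentage(column_z, PHONE, include_empty=True))   # 60.0
print(is_suggested(60.0, minimum_threshold=80))                     # False

# An expression that ends in ")*$" also matches empty values.
EMAIL_STRICT = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$"
EMAIL_LOOSE = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$"
emails = [""] * 40 + [f"user{i}@example.com" for i in range(60)]

print(confidence_percentage(emails, EMAIL_STRICT, include_empty=True))  # 60.0
print(confidence_percentage(emails, EMAIL_LOOSE, include_empty=True))   # 100.0
```

With a minimum confidence threshold of 80, the phone data class would be suggested for column Z only when empty values are excluded.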

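The backslash doubling required in the API is the usual consequence of placing a regular expression inside a JSON string. The sketch below assumes that the API accepts a JSON request body; the `regularExpression` field name is hypothetical. Letting a JSON library serialize the pattern avoids escaping it by hand.

```python
import json

# The pattern exactly as you would type it in the UI.
ui_pattern = r"^\+?\d{1,3}?\d{1,4}$"

# json.dumps doubles every backslash when it writes the string.
# "regularExpression" is a hypothetical field name used for illustration.
payload = json.dumps({"regularExpression": ui_pattern})
print(payload)

# Parsing the payload returns the original single-backslash pattern.
assert json.loads(payload)["regularExpression"] == ui_pattern
```

Every backslash in the pattern is doubled in the serialized payload, not only the first one.
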