Create a data class

Important 

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.


You can create data classes in multiple locations:

  • The Data Classification page in the Stewardship application.
  • The asset pages where you can update the classification, such as Column asset pages.
  • Asset views where the Data Classification column has been added.
Tip 
  • If you want to use automatic data classification, create data classes on the Data Classification page in the Stewardship application. This allows immediate configuration of the classification rules.
  • To add classification rules to a data class, or to update and delete data classes, you must always go to the Data Classification page in the Stewardship application.

Prerequisites

  • You have a global role that has the Product Rights > Catalog global permission.
  • You have a global role that has the Data Stewardship Manager global permission.
  • You have a global role that has the Classification > Data Classes > Read global permission.
  • You have a global role that has the Classification > Data Classes > Add global permission.

For more information, go to Required permissions.

Steps

  1. On the main toolbar, click the Products icon, and then click Stewardship.
  2. Click the Data Classification tab.
  3. If the data class doesn't exist yet:
    1. Click Add.
    2. Type the name of the data class and press Enter.
    3. Click Create.
  4. Select the data class that you want to configure.
  5. The data class parameters appear in a pane on the right-hand side.
  6. Hover over the data class name and click Preview.
    The data class parameters appear in a pane on the right-hand side.
  7. Optionally, change the name by clicking the Name field, typing the name, and clicking the Save icon.
  8. Make sure the data class is enabled, unless you don't want the data classification process to use it yet.
  9. Optionally, add a description by clicking the Description field, typing the description, and clicking the Save icon.
  10. Alternatively, add a description by clicking the Edit icon next to the Description field.
  11. Open the Details section.
  12. Complete the fields as required.
    Data class elements and their descriptions:

    Minimum confidence threshold

    The confidence percentage that must be reached for the data class to be considered as a possible classification result. The confidence percentage refers to the percentage of values in a column that match at least one of the classification rules in a data class, for example, the regular expression.

    Enter a value between 0 and 100.
    The default value is 0.

    Example If you enter 80 in this field, the automatic data classification process suggests this data class only if the confidence percentage reaches 80 percent or higher.

    Tip Confidence scores of 0 are never taken into account.
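
The threshold check described above can be sketched in Python. This is an illustrative sketch only, not Collibra's implementation: the function names `confidence_percentage` and `is_suggested` and the postal code pattern are hypothetical, and the classification rules are reduced to a single regular expression.

```python
import re

def confidence_percentage(values, pattern):
    """Percentage of values that fully match the (hypothetical) rule pattern."""
    if not values:
        return 0.0
    matches = sum(1 for v in values if re.fullmatch(pattern, v))
    return 100.0 * matches / len(values)

def is_suggested(confidence, minimum_threshold=0):
    """A data class is a candidate only if the confidence is above 0
    and reaches the minimum confidence threshold."""
    return confidence > 0 and confidence >= minimum_threshold

# Hypothetical rule: five-digit postal codes. Four of the five values match.
values = ["12345", "90210", "abcde", "10001", "60614"]
confidence = confidence_percentage(values, r"\d{5}")

print(confidence)                    # 80.0
print(is_suggested(confidence, 80))  # True: 80 percent reaches the threshold
print(is_suggested(confidence, 90))  # False: below the threshold
print(is_suggested(0.0, 0))          # False: scores of 0 are never considered
```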

    Include empty values

    Indicates whether empty values are included in the confidence percentage calculation.
    The possible values are:

    • True (Yes): Empty values are taken into account by the data classification process when calculating the confidence percentage of a matching data class.
    • False (No), the default: Only the non-empty values are taken into account by the data classification process when calculating the confidence percentage of a matching data class.

    This option can be used to receive an accurate confidence score for all data in a column.

    Example 

    You have a column Z with 40 empty values and 60 phone numbers. You have a data class A with a regular expression to detect US phone numbers.

    • If you set this option to False and you classify column Z, data class A could be suggested with a confidence percentage of 100.
    • If you set this option to True and you classify column Z, data class A could be suggested, but with a confidence percentage of only 60.
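
The arithmetic behind this example can be reproduced with a short Python sketch. The `confidence` function and the simplified phone number pattern are hypothetical and only illustrate how the denominator changes. They do not represent Collibra's actual classification logic.

```python
import re

# Hypothetical, simplified US phone number pattern such as 555-123-4567.
US_PHONE = r"\d{3}-\d{3}-\d{4}"

def confidence(values, pattern, include_empty):
    """Confidence percentage with or without empty values in the denominator."""
    considered = values if include_empty else [v for v in values if v != ""]
    if not considered:
        return 0.0
    matched = sum(1 for v in considered if re.fullmatch(pattern, v))
    return 100.0 * matched / len(considered)

# Column Z: 40 empty values and 60 phone numbers.
column_z = [""] * 40 + ["555-123-4567"] * 60

print(confidence(column_z, US_PHONE, include_empty=False))  # 100.0
print(confidence(column_z, US_PHONE, include_empty=True))   # 60.0
```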

    Important Some regular expressions are constructed to allow a match with empty values. This means that, through the regular expression, empty values can be matched to the data class, which affects the confidence score.
    Example:
    This expression won't match empty values with the email data class:
    ^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$
    This expression will match empty values with the email data class:
    ^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$
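
To check whether one of your own expressions matches empty values, you can test it against an empty string before adding it to a data class. The following sketch uses Python's `re` module on the two expressions above. It assumes that these particular expressions behave the same way in the regular expression engine that the classification process uses, and the sample email address is made up.

```python
import re

STRICT = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})$"
LENIENT = r"^([a-zA-Z0-9._%\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6})*$"

for name, pattern in (("strict", STRICT), ("lenient", LENIENT)):
    matches_empty = re.fullmatch(pattern, "") is not None
    matches_email = re.fullmatch(pattern, "jane.doe@example.com") is not None
    print(name, matches_empty, matches_email)

# strict False True
# lenient True True
```

The trailing `*` makes the whole group optional, which is why the second expression also accepts an empty value.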

    Column name filter
    The Column name filter allows you to limit a data class to specific columns based on their name in the data source. The data class is considered only if the column's name in the data source matches one of the regular expressions in the filter.

    Add each regular expression in a separate field. To do so, enter one regular expression and click Save. A new filter field becomes available.

    Tip 
    • You don't need to worry about capitalization in your regular expression because regular expressions are not case sensitive.
    • You can add up to 25 regular expressions in the Column name filter.
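
As a rough mental model, the Column name filter behaves like the Python sketch below. The function name, the sample expressions, and the sample column names are hypothetical. The sketch also assumes that an expression must match the whole column name and that an empty filter restricts nothing, neither of which is stated on this page.

```python
import re

MAX_FILTER_EXPRESSIONS = 25  # limit stated in the tip above

def passes_column_name_filter(column_name, filter_expressions):
    """Return True if the data class should be considered for this column."""
    if len(filter_expressions) > MAX_FILTER_EXPRESSIONS:
        raise ValueError("The Column name filter accepts at most 25 regular expressions")
    if not filter_expressions:
        return True  # assumption: an empty filter does not restrict anything
    # Assumption: full match; matching is case insensitive, as the tip states.
    return any(
        re.fullmatch(expr, column_name, flags=re.IGNORECASE)
        for expr in filter_expressions
    )

filters = [r".*email.*", r"e_?mail_addr"]
print(passes_column_name_filter("CUSTOMER_EMAIL", filters))  # True
print(passes_column_name_filter("Email_Addr", filters))      # True
print(passes_column_name_filter("phone_number", filters))    # False
```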
    Column type filter
    The Column type filter allows you to limit a data class to specific column types based on their data type in the data source. The data class will be considered only if the column's data type in the data source matches one of the specified data types.

    Example If you select date in the Column type filter and you classify a column with data type BIGINT in the data source, the column won't be checked against this data class.

    Tip Using the column type filter makes the classification of dates, times, and time stamps easier because the data class can be restricted to those data types.

    The following table shows the available options in the Column type filter and their matching SQL data types.

    Option: Mapped SQL data types (java.sql.Types)
    • boolean: BIT, BOOL, BOOLEAN
    • date: DATE, DATETIME, TIME, YEAR, smalldatetime, datetimeoffset
    • double: FLOAT, DOUBLE, DOUBLE PRECISION, DECIMAL, DEC, numeric, money, smallmoney, real
    • int: TINYINT, SMALLINT, MEDIUMINT, INT, INTEGER, BIGINT
    • string: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT, ENUM, SET, nchar, nvarchar, ntext, xml, character, cidr, inet, json, macaddr, uuid, clob
    • timestamp: TIMESTAMP

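
The mapping above can be expressed as a lookup, which makes the filter behavior easy to check for a given column. This Python sketch is illustrative only: the function names are hypothetical, and it assumes that type names are compared without regard to case and that an empty filter restricts nothing.

```python
# Column type filter options and their mapped SQL data types, from the table above.
COLUMN_TYPE_OPTIONS = {
    "boolean": {"BIT", "BOOL", "BOOLEAN"},
    "date": {"DATE", "DATETIME", "TIME", "YEAR", "SMALLDATETIME", "DATETIMEOFFSET"},
    "double": {"FLOAT", "DOUBLE", "DOUBLE PRECISION", "DECIMAL", "DEC",
               "NUMERIC", "MONEY", "SMALLMONEY", "REAL"},
    "int": {"TINYINT", "SMALLINT", "MEDIUMINT", "INT", "INTEGER", "BIGINT"},
    "string": {"CHAR", "VARCHAR", "TINYTEXT", "TEXT", "MEDIUMTEXT", "LONGTEXT",
               "ENUM", "SET", "NCHAR", "NVARCHAR", "NTEXT", "XML", "CHARACTER",
               "CIDR", "INET", "JSON", "MACADDR", "UUID", "CLOB"},
    "timestamp": {"TIMESTAMP"},
}

def option_for(sql_type):
    """Return the filter option that a SQL data type maps to, or None."""
    normalized = sql_type.strip().upper()  # assumption: case-insensitive comparison
    for option, sql_types in COLUMN_TYPE_OPTIONS.items():
        if normalized in sql_types:
            return option
    return None

def passes_column_type_filter(sql_type, selected_options):
    """Return True if a column of this type is checked against the data class."""
    if not selected_options:
        return True  # assumption: an empty filter does not restrict anything
    return option_for(sql_type) in selected_options

print(option_for("BIGINT"))                             # int
print(passes_column_type_filter("BIGINT", {"date"}))    # False
print(passes_column_type_filter("datetime", {"date"}))  # True
```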
    Examples

    Some examples of values that match the classification rule for the data class.

    Add one example per line.

    To change a value, click the Edit icon.

    To save the value, click the Save icon.

  13. Open the Classification rules section.
  14. Click Add new rule.

    A data class without a classification rule can be used only for manual classification.
    To allow the automatic data classification process to pick up the data class, you need to add at least one classification rule.
    A data class can include multiple rules, and the rules can be of different types.

  15. From the Type list, select the type of classification rule that you want to add to the data class. The possible values are: Regular expression for column names, Data type, Regular expression for data, and List of values for data.
    Tip 
    • Add a regular expression for column names rule to check the name of a column in the data source.
      Unlike the Column name filter, which makes the name a prerequisite to consider the data class, a rule based on name serves as a criterion to apply the data class.
    • Add a data type rule to check the data type of a column in the data source.
      Unlike the Column type filter, which makes the data type mandatory to consider the data class, a rule based on data type serves as a criterion to apply the data class. For an example, go to Example: Importing data classes, and starting the automatic classification for a table.
    • Add a regular expression for data rule to validate a pattern, such as the format of email addresses.
    • Add a list of values for data rule to check for specific, predefined options, such as T-shirt sizes.

    Depending on your selection, extra fields appear.

  16. Complete the fields as required.
  17. Click Save.
    The classification rule for the data class is configured.
    A new section appears. If you expand the section, the details are shown.
  18. If needed, click Add new rule to add another classification rule to the data class.
    • You can combine regular expression for column names, regular expression for data, list of values for data, and data type rules in one data class.
    • The maximum number of rules in a data class is 25.
    • During the automatic data classification process, each rule is verified and the data class is assigned as soon as one of the rules applies.
    • Important 

      By default, rules based on column name and data type are evaluated before rules based on samples, such as regular expressions for data and lists of values for data. Rules based on samples are evaluated in the order in which they appear in the data class.
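
The default evaluation order can be pictured with the following Python sketch. It is a simplified model, not Collibra's implementation. The `Rule` class, the rule type labels, and the sample column are hypothetical, confidence thresholds are left out, and the sketch assumes that column name and data type rules keep their listed order among themselves.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical labels for rules that look at metadata rather than samples.
METADATA_TYPES = {"column_name_regex", "data_type"}
MAX_RULES = 25  # maximum number of rules in a data class

@dataclass
class Rule:
    rule_type: str
    applies: Callable[[dict], bool]

def evaluation_order(rules: List[Rule]) -> List[Rule]:
    """Metadata rules first; sample-based rules keep their listed order.
    sorted() is stable, so rules with the same key are not reordered."""
    return sorted(rules, key=lambda r: r.rule_type not in METADATA_TYPES)

def first_applicable_rule(rules: List[Rule], column: dict) -> Optional[Rule]:
    """Return the first rule that applies; the data class is assigned at that point."""
    if len(rules) > MAX_RULES:
        raise ValueError("A data class can contain at most 25 rules")
    for rule in evaluation_order(rules):
        if rule.applies(column):
            return rule
    return None

column = {"name": "shirt_size", "data_type": "VARCHAR", "samples": ["S", "M", "L"]}
rules = [
    Rule("list_of_values", lambda c: set(c["samples"]) <= {"S", "M", "L", "XL"}),
    Rule("data_regex", lambda c: all(re.fullmatch(r"[A-Z]{1,3}", s) for s in c["samples"])),
    Rule("column_name_regex", lambda c: re.search("size", c["name"], re.IGNORECASE) is not None),
]

print([r.rule_type for r in evaluation_order(rules)])
# ['column_name_regex', 'list_of_values', 'data_regex']
print(first_applicable_rule(rules, column).rule_type)  # column_name_regex
```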

What's next?

Go to some examples
Import out-of-the-box data classes
Merge data classes

Helpful resources

Contextualize your data with Unified Data Classification course in Collibra University