Data Classes
Data Classes are common checks that help to identify and tag column-level semantic traits of a rule, such as credit score, SSN, and ZIP code. With many out-of-the-box Data Class rules to choose from, as well as the ability to create your own custom Data Classes, save time and effort by reusing common data class checks when writing rules. With just a matter of clicks, you can quickly insert saved Data Classes from the Rule Workbench to run against your dataset, apply them to columns on the Profile page of a dataset, or add them to Data Categories in the Admin Console.
You can use the Category filter to see the data classes that an Admin has grouped together into data categories. For example, the following image shows Data Classes grouped by the Security Reference data category.
Adding a new Data Class
- On the sidebar menu, click Rules, then click Data Classes. The Templates page opens.
- In the upper right corner of the Templates page, click Add Data Class. The Add Data Class form opens.
- Enter the required information.
- Click Save.
- Your data class rule is available to add to a Data Category in the Admin Console, apply to a column on the Profile page of a dataset, or to use to create a dataset rule on the Rule Workbench.
Option | Description | Required |
---|---|---|
Name | The name of your data class rule. Enter a unique name to easily identify your Data Class rule. |
Yes |
Is Regex? | Select this option if the Value field contains a regular expression (regex). |
No |
Operator |
The symbol or keyword that your query uses to perform the specified operations on your data. Select an option from the dropdown menu. Note When Is Regex is selected, the operator is always rlike. |
Yes |
Value/Regex |
The query or regex that your Data Class applies to run against a dataset. Enter a query or expression for your Data Class to run against a dataset. Note Value displays when Is Regex is not selected, and Regex displays when Is Regex is selected. |
Yes |
Multi-line Input |
Because regex treats inputs as one continuous string, multi-line input informs the system to treat each line separately. Optionally enter a value or a comma delimited list of values, then click Validate Input to test them against your regex. Note Multi-line Input only displays when Is Regex is selected. |
No |
Examples 1, 2, and 3 | A bullet list of real-world values that your data class describes. Optionally enter a sample value that meets the criteria of your data class rule. For example, a credit score is a string of three numbers, such as 790. |
No |
Sensitivity Label |
Tags any findings from your rule as confidential data with a sensitivity badge on the Findings page. Admins can define Sensitivity labels on the Sensitive Labels page in the Admin Console. |
No |
Description | The description that helps identify your data class rule. It is best practice to base your description on what the Value or Regex fields query. |
No |
Column Name |
The name of the column being checked that a value must exactly equal to be classified as this data class. Tip You can leave this field blank to apply this data class to columns with different names or types. |
No |
Column Type |
The data type of the column being checked that a value must exactly equal to be classified as this data class. Tip You can leave this field blank to apply this data class to columns with different names or types. |
No |
Column Match % | The minimum percentage that a column being checked must exactly equal for it to be classified as this data class. The default value is 75%. |
Yes |
Allow Null |
Controls whether to include (selected) or exclude (not selected) NULL values from the query. This option is not selected by default. |
No |
Allow Empty |
Controls whether to include or exclude empty values from the query. This option is not selected by default. Select the checkbox option to include empty values in the query, or deselect it to exclude them. |
No |
Using a Data Class rule
You can use Data Class rules to identify the desired patterns for data. Unlike custom and Template rules, Data Classes return results for all values that do not match their exact expressions.
Example
When you run the rule Column A < 50
against a dataset with 5 rows where the values in Column A are 0, 1, 100, 200, and 300, custom and Template rules will return just 2 break records for the rows with the values of 0 and 1, whereas Data Class rules will return 3 break records for the rows with the values of 100, 200, and 300.
To use a data class rule to run against a dataset, follow these steps.
- Click the in the sidebar menu, then click Rule Builder. The Dataset Rules page opens.
- There are two ways to open the Rule Workbench.
- Click the Type dropdown menu and scroll down to the Data Class section, then select your Data Class rule. The Data Class rule populates in the query builder.
- Enter a name for your rule in the Rule Name input field.
- Select a primary column from the Primary Col dropdown menu.
- Optionally click to add a secondary dataset.
- Enter the name of your secondary dataset in the search bar when the Add Secondary Dataset dialog appears, then click Add Secondary Dataset.
- Click Validate to validate your expression.
- Optionally click Beautify to format your SQL query in a more readable layout.
- Click Save to apply your rule to your dataset.
Important Rule names cannot contain spaces.
Note Primary Column is required when you use Data Class rules.
Out-of-the-box Data Classes
Collibra Data Quality & Observability has 57 out-of-the-box (OOTB) Data Classes. OOTB Data Classes cannot be edited or deleted.
Name | Value | Description |
---|---|---|
AMEX |
|
Validates American Express credit card numbers. |
CELSIUS |
|
Checks for single temperature values in Celsius, and multiple temperature values within a specified range. |
CREDIT CARD |
|
Checks if the value does not match the credit card format. |
CREDIT SCORE |
|
Checks a range of credit scores between a poor 300 and an excellent 850. |
CURRENCY CD |
|
Checks ISO currency codes. |
CUSIP |
|
Checks CUSIP numbers. The length is 9 characters. Characters 1-3 are numbers. Characters 4-8 are letters or numbers. Characters 6-8 can also represent *, @, and #. Character 9 is a check digit. |
CUSIP ID |
|
A looser definition of CUSIP that matches to many international CUSIP numbers. The length is 9 characters. Characters 1-3 are numbers. Characters 4-8 are letters or numbers. Characters 6-8 can also represent *, @, and #. Character 9 is a check digit. |
DATE MM/DD/YY |
|
Checks if the date is in MM/DD/YY format. |
DATE MM/DD/YYYY |
|
Checks if the date is in MM/DD/YYYY format. |
DATE NO FMT |
|
Checks if the date is all numeric without formatting. |
DATE YYYY-MM-DD |
|
Checks if the date format is in YYYY-MM-DD. |
DEA NUMBER |
|
Checks the formatting of a Drug Enforcement Agency number. |
DISCOVER |
|
Validates Discover credit card numbers. |
EIN |
|
Checks if a value does not match the correct Employer Identification Number format. |
|
Checks if a value does not match an email format. |
|
FAHRENHEIT |
|
Checks for single temperature values in Fahrenheit, and multiple temperature values within a specified range. |
FDA NDC BILLING |
|
Checks if a value is not in Food and Drug Administration billing code format. |
FDA NDC PACKAGE |
|
Checks if a value is not in the Food and Drug Administration package code format. |
FDA NDC PRODUCT |
|
Checks if a value is not in the Food and Drug Administration product code format. |
GENDER |
|
Checks if a value does not match a gender format. |
GUID |
|
Checks if a value is not a GUID. |
ICD10 DIAGNOSIS |
|
Checks if a value is not in ICD10 diagnosis code format. |
ICD10 PROCEDURE |
|
Checks if a value is not in ICD10 procedure code format. |
ICD9 DIAGNOSIS |
|
Checks if a value is not in ICD9 diagnosis code format. |
ICD9 PROCEDURE |
|
Checks if a value is not in ICD9 procedure code format. |
INT PASSPORT |
|
Checks if a value does not match an international passport format. |
IP ADDRESS |
|
Checks if a value does not match an IP address format. |
ISIN |
|
Checks if a value does not match an International Securities Identification Number. |
LAT |
|
Validates that latitude coordinates are between -89 and +89 degrees, or -89.000001 and +899.000001 degrees. |
LNG |
|
Validates that longitude coordinates are between -179 and +179 degrees, or -179.000001 and +179.000001 degrees. |
MASTERCARD |
|
Validates Mastercard credit card numbers. |
MIC |
|
Market Identifier Code, such as NYSE. |
MONEY |
|
Money format including $ and €. |
NAME |
|
A first name, last name, or full name. |
NAME TITLE |
|
Checks if a value contains a title, such as Mr., Mrs. , and Dr. |
NHS NUMBER |
|
Checks the formatting of a National Health Service number. |
NPI NUMBER |
|
Checks the formatting of a National Provider Identifier number. |
OPCS4 |
|
Checks if a value is not in OPCS-4 format. |
PASSWORD MEDIUM |
|
Checks password values containing 1 number, 1 letter, and 1 uppercase letter. |
PERCENT |
|
Checks that a string is a valid percentage. |
PHONE |
|
Checks if a value does not match a phone number format. |
RIC |
|
Validates that a Reuters Instrument Code contains 1-4 uppercase letters.1-2 uppercase letters. For example, ABCD.EF. |
ROUTING NUM |
|
Validates that a string value matches a valid bank Routing Transit Number (RTN), such as 0134567890. |
SEDOL |
|
Validates that the alphanumeric value is a valid Stock Exchange Daily Official List (SEDOL) code, such as B123456. |
SHIRT SIZE |
|
Validates that a value contains a valid shirt size. |
SSN |
|
3 digits dash 2 digits dash 4 digits. |
STATE CD |
|
Checks if a value does not match a state code format. |
STATE NAME |
|
Checks if a value does not match a state name format. |
STREET ADDRESS |
|
Checks that a string value is in a valid US street address format, such as 123 Broadway. |
SUFFIX |
|
Validates that a value contains a valid suffix after surnames, such as Jr., Sr., and II. |
SYMBOL |
|
Validates that a stock symbol contains 1-4 uppercase letters, an optional colon :, period ., or hyphen -, and 1 uppercase letter. For example, ABCD:E. |
TIME HHMMSS |
|
Time in a HHMMSS format, such as 23:59:59 or 08:30:00. |
UK POSTAL CD |
|
Checks that a string value is in a valid UK postal code format. |
URL |
|
Checks that a string value is in a valid website URL format. |
VIN |
|
Checks if a value does not match a US Vehicle Identification Number format. |
VISA |
|
Validates Visa credit card numbers. |
ZIP CODE |
|
Must have 5 digits between 0-9 and can have an additional 4 digits. |