Data Classes

Data Classes are common checks that help to identify and tag column-level semantic traits of a rule, such as credit score, SSN, and ZIP code. With many out-of-the-box Data Class rules to choose from, as well as the ability to create your own custom Data Classes, save time and effort by reusing common data class checks when writing rules. With just a matter of clicks, you can quickly insert saved Data Classes from the Rule Workbench to run against your dataset, apply them to columns on the Profile page of a dataset, or add them to Data Categories in the Admin Console.

You can use the Category filter to see the data classes that an Admin has grouped together into data categories. For example, the following image shows Data Classes grouped by the Security Reference data category.

image of the data class rules page

Adding a new Data Class

  1. On the sidebar menu, click Rules wrench iconRules, then click Data Classes.
  2. The Templates page opens.
  3. In the upper right corner of the Templates page, click Add Data Class.
  4. The Add Data Class form opens.
  5. Enter the required information.
  6. Option Description Required
    Name The name of your data class rule. Enter a unique name to easily identify your Data Class rule.

    Yes

    Is Regex? Select this option if the Value field contains a regular expression (regex).

    No

    Operator

    The symbol or keyword that your query uses to perform the specified operations on your data.

    Select an option from the dropdown menu.

    Note When Is Regex is selected, the operator is always rlike.

    Yes

    Value/Regex

    The query or regex that your Data Class applies to run against a dataset.

    Enter a query or expression for your Data Class to run against a dataset.

    Note Value displays when Is Regex is not selected, and Regex displays when Is Regex is selected.

    Yes

    Multi-line Input

    Because regex treats inputs as one continuous string, multi-line input informs the system to treat each line separately.

    Optionally enter a value or a comma delimited list of values, then click Validate Input to test them against your regex.

    Note Multi-line Input only displays when Is Regex is selected.

    No

    Examples 1, 2, and 3 A bullet list of real-world values that your data class describes. Optionally enter a sample value that meets the criteria of your data class rule. For example, a credit score is a string of three numbers, such as 790.

    No

    Sensitivity Label

    Tags any findings from your rule as confidential data with a sensitivity badge on the Findings page. Admins can define Sensitivity labels on the Sensitive Labels page in the Admin Console.

    No

    Description The description that helps identify your data class rule. It is best practice to base your description on what the Value or Regex fields query.

    No

    Column Name

    The name of the column being checked that a value must exactly equal to be classified as this data class.

    Tip You can leave this field blank to apply this data class to columns with different names or types.

    No

    Column Type

    The data type of the column being checked that a value must exactly equal to be classified as this data class.

    Tip You can leave this field blank to apply this data class to columns with different names or types.

    No

    Column Match % The minimum percentage that a column being checked must exactly equal for it to be classified as this data class. The default value is 75%.

    Yes

    Allow Null

    Controls whether to include (selected) or exclude (not selected) NULL values from the query. This option is not selected by default.

    No

    Allow Empty

    Controls whether to include or exclude empty values from the query. This option is not selected by default.

    Select the checkbox option to include empty values in the query, or deselect it to exclude them.

    No

  7. Click Save.
  8. Your data class rule is available to add to a Data Category in the Admin Console, apply to a column on the Profile page of a dataset, or to use to create a dataset rule on the Rule Workbench.

Using a Data Class rule

You can use Data Class rules to identify the desired patterns for data. Unlike custom and Template rules, Data Classes return results for all values that do not match their exact expressions.

Example 
When you run the rule Column A < 50 against a dataset with 5 rows where the values in Column A are 0, 1, 100, 200, and 300, custom and Template rules will return just 2 break records for the rows with the values of 0 and 1, whereas Data Class rules will return 3 break records for the rows with the values of 100, 200, and 300.

To use a data class rule to run against a dataset, follow these steps.

  1. Click the Rules wrench icon in the sidebar menu, then click Rule Builder.
  2. The Dataset Rules page opens.
  3. There are two ways to open the Rule Workbench.
    1. From an existing rule on the Rules tab, click Actions, then click Edit.
      The Rule Workbench page opens with the preferences of your existing rule.
    2. Click Add Rule in the upper right corner to create a new rule.
      The Rule Workbench page opens.
  4. Click the Type dropdown menu and scroll down to the Data Class section, then select your Data Class rule.
  5. The Data Class rule populates in the query builder.
  6. Enter a name for your rule in the Rule Name input field.
  7. Important Rule names cannot contain spaces.

  8. Select a primary column from the Primary Col dropdown menu.
  9. Note Primary Column is required when you use Data Class rules.

  10. Optionally click to add a secondary dataset.
    1. Enter the name of your secondary dataset in the search bar when the Add Secondary Dataset dialog appears, then click Add Secondary Dataset.
  11. Click validate rule icon Validate to validate your expression.
  12. Optionally click prettify sql icon Beautify to format your SQL query in a more readable layout.
  13. Click Save to apply your rule to your dataset.

Out-of-the-box Data Classes

Collibra Data Quality & Observability has 57 out-of-the-box (OOTB) Data Classes. OOTB Data Classes cannot be edited or deleted.

Name Value Description
AMEX

^(?<amex>3[47][0-9]{13})$

Validates American Express credit card numbers.
CELSIUS

^(([0-9°c\\s]+)(?:-[0-9°]+c))|([0-9°\\s]+c)$

Checks for single temperature values in Celsius, and multiple temperature values within a specified range.
CREDIT CARD

^((4\\d{3})|(5[1-5]\\d{2})|(6011))-?\\d{4}-?\\d{4}-?\\d{4}|3[4,7]\\d{13}$

Checks if the value does not match the credit card format.
CREDIT SCORE

300 and 850

Checks a range of credit scores between a poor 300 and an excellent 850.
CURRENCY CD
('AED','AFN','ALL','AMD','ANG','AOA','ARS','AUD','AWG','AZN','BAM',
'BBD','BDT','BGN','BHD','BIF','BMD','BND','BOB','BRL','BSD',
'BTN','BWP','BYN','BZD','CAD','CDF','CHF','CLP','CNY','COP',
'CRC','CUC','CUP','CVE','CZK','DJF','DKK','DOP','DZD','EGP',
'ERN','ETB','EUR','FJD','FKP','GBP','GBp','GEL','GGP','GHS',
'GIP','GMD','GNF','GTQ','GYD','HKD','HNL','HRK','HTG','HUF',
'IDR','ILS','IMP','INR','IQD','IRR','ISK','JEP','JMD','JOD',
'JPY','KES','KGS','KHR','KMF','KPW','KRW','KWD','KYD','KZT',
'LAK','LBP','LKR','LRD','LSL','LYD','MAD','MDL','MGA','MKD',
'MMK','MNT','MOP','MRU','MUR','MVR','MWK','MXN','MYR','MZN',
'NAD','NGN','NIO','NOK','NPR','NZD','OMR','PAB','PEN','PGK',
'PHP','PKR','PLN','PYG','QAR','RON','RSD','RUB','RWF','SAR',
'SBD','SCR','SDG','SEK','SGD','SHP','SLL','SOS','SPL*','SRD',
'STN','SVC','SYP','SZL','THB','TJS','TMT','TND','TOP','TRY',
'TTD','TVD','TWD','TZS','UAH','UGX','USD','UYU','UZS','VEF',
'VND','VUV','WST','XAF','XCD','XDR','XOF','XPF','YER','ZAR',
'ZMW','ZWD')
Checks ISO currency codes.
CUSIP

^[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]$

Checks CUSIP numbers. The length is 9 characters.

Characters 1-3 are numbers.

Characters 4-8 are letters or numbers.

Characters 6-8 can also represent *, @, and #.

Character 9 is a check digit.

CUSIP ID

^[a-zA-Z0-9]{5}[a-zA-Z0-9*@#]{3}[0-9]$

A looser definition of CUSIP that matches to many international CUSIP numbers. The length is 9 characters.

Characters 1-3 are numbers.

Characters 4-8 are letters or numbers.

Characters 6-8 can also represent *, @, and #.

Character 9 is a check digit.

DATE MM/DD/YY

^\\d{1,2}/\\d{2}/\\d{2}$

Checks if the date is in MM/DD/YY format.
DATE MM/DD/YYYY

^\\d{1,2}/\\d{2}/\\d{4}$

Checks if the date is in MM/DD/YYYY format.
DATE NO FMT

^[0-9]{8}$

Checks if the date is all numeric without formatting.
DATE YYYY-MM-DD

^\\d{4}-\\d{2}-\\d{2}

Checks if the date format is in YYYY-MM-DD.
DEA NUMBER

^([ABFGMPR][\\w9]\\d{7})([-]\\d{3}\\w{2})?$

Checks the formatting of a Drug Enforcement Agency number.
DISCOVER

^(?<discover>5[1-5][0-9]{14})$

Validates Discover credit card numbers.
EIN

^[1-9]\\d?-\\d{7}$

Checks if a value does not match the correct Employer Identification Number format.
EMAIL

'%_@__%.__%'

Checks if a value does not match an email format.

FAHRENHEIT

^(([0-9°f\\s]+)(?:-[0-9°]+f))|([0-9°\\s]+f)$

Checks for single temperature values in Fahrenheit, and multiple temperature values within a specified range.
FDA NDC BILLING

^(0\\d{4}-?\\d{4}-?\\d{2}|\\d{5}-?0\\d{3}-?\\d{2}|\\d{5}-?\\d{4}-?0\\d{1})$

Checks if a value is not in Food and Drug Administration billing code format.
FDA NDC PACKAGE

^(\\d{4}-\\d{4}-\\d{2}|\\d{5}-\\d{3}-\\d{2}|\\d{5}-\\d{4}-\\d{1})$

Checks if a value is not in the Food and Drug Administration package code format.
FDA NDC PRODUCT

^(\\d{4}-\\d{4}|\\d{5}-\\d{3}|\\d{5}-\\d{4})$

Checks if a value is not in the Food and Drug Administration product code format.
GENDER

^(?:m|M|male|Male|MALE|f|F|female|Female|FEMALE)$

Checks if a value does not match a gender format.
GUID

^[{]?[0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}[}]?$

Checks if a value is not a GUID.
ICD10 DIAGNOSIS

^([A-TV-Z][0-9][A-TV-Z0-9])([.]{0,1})[A-TV-Z0-9]{0,4}$

Checks if a value is not in ICD10 diagnosis code format.
ICD10 PROCEDURE

^[0-9BCDFGHX][0-9A-HJ-NP-Z]{2}[0-9A-HJ-NP-Z]{0,4}$

Checks if a value is not in ICD10 procedure code format.
ICD9 DIAGNOSIS

^(([V]\\d{2}[.]?\\d{0,2})|([E][089]\\d{2}[.]?\\d?)|(\\d{3}[.]?\\d{0,2}))$

Checks if a value is not in ICD9 diagnosis code format.
ICD9 PROCEDURE

^(\\d{2}[.]?\\d{1,2})$

Checks if a value is not in ICD9 procedure code format.
INT PASSPORT

^[A-Z0-9<]{9}[0-9]{1}[A-Z]{3}[0-9]{7}[A-Z]{1}[0-9]{7}[A-Z0-9<]{14}[0-9]{2}$

Checks if a value does not match an international passport format.
IP ADDRESS

^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$

Checks if a value does not match an IP address format.
ISIN

^\\b([A-Z]{2})((?![A-Z]{10}\\b)[A-Z0-9]{10})\\b

Checks if a value does not match an International Securities Identification Number.
LAT

^(\\+|-)?(?:90(?:(?:\\.0{1,6})?)|(?:[0-9]|[1-8][0-9])(?:(?:\\.[0-9]{1,6})?))$

Validates that latitude coordinates are between -89 and +89 degrees, or -89.000001 and +899.000001 degrees.
LNG

^(\\+|-)?(?:180(?:(?:\\.0{1,6})?)|(?:[0-9]|[1-9][0-9]|1[0-7][0-9])(?:(?:\\.[0-9]{1,6})?))$

Validates that longitude coordinates are between -179 and +179 degrees, or -179.000001 and +179.000001 degrees.
MASTERCARD

^(?<mastercard>6(?:011|5[0-9]{2})[0-9]{12})$

Validates Mastercard credit card numbers.
MIC

^[A-Z]{4}$

Market Identifier Code, such as NYSE.
MONEY

^\\p{Sc}[0-9]

Money format including $ and €.
NAME

^\\p{L}+[\\p{L}\\p{Z}\\p{P}]{0,}

A first name, last name, or full name.
NAME TITLE

^(Mr|Ms|Dr|Sr|Miss|Mr.|Ms.|Mrs|Mrs.|Dr.|Sr.|Rev|Honorable)\\.?$

Checks if a value contains a title, such as Mr., Mrs. , and Dr.
NHS NUMBER

^(\\d{3}\\s?\\d{3}\\s?\\d{4})$

Checks the formatting of a National Health Service number.
NPI NUMBER

^(\\d{10})$

Checks the formatting of a National Provider Identifier number.
OPCS4

^(([A-Z]\\d{2})(\\.-|\\.?\\d)?)$

Checks if a value is not in OPCS-4 format.
PASSWORD MEDIUM

^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,20}$

Checks password values containing 1 number, 1 letter, and 1 uppercase letter.
PERCENT

^\\d+(?:\\.\\d+)?%$

Checks that a string is a valid percentage.
PHONE

^(\\+\\d{1,2}\\s)?\\(?\\d{3}\\)?[\\s.-]\\d{3}[\\s.-]\\d{4}$

Checks if a value does not match a phone number format.
RIC

[A-Z]{1,4}\\.[A-Z]{1,2}

Validates that a Reuters Instrument Code contains 1-4 uppercase letters.1-2 uppercase letters. For example, ABCD.EF.
ROUTING NUM

^((0[0-9])|(1[0-2])|(2[1-9])|(3[0-2])|(6[1-9])|(7[0-2])|80)([0-9]{7})$

Validates that a string value matches a valid bank Routing Transit Number (RTN), such as 0134567890.
SEDOL

^(?<Sedol>[B-Db-dF-Hf-hJ-Nj-nP-Tp-tV-Xv-xYyZz\\d]{6}\\d)$

Validates that the alphanumeric value is a valid Stock Exchange Daily Official List (SEDOL) code, such as B123456.
SHIRT SIZE

('L','M','S','XS','XL','XXL','2XL','3XL','l','m','s','xs','xl','xxl','Small','Medium','Large','X-Large')

Validates that a value contains a valid shirt size.
SSN

^(?!(000|666|9))\\d{3}-(?!00)\\d{2}-(?!0000)\\d{4}$

3 digits dash 2 digits dash 4 digits.
STATE CD
('AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','GU','HI','ID','IL',
'IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MP','MT',
'NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','PR','RI',
'SC','SD','TN','TX','UT','VT','VA','VI','WA','WV','WI','WY','AS','DC')
Checks if a value does not match a state code format.
STATE NAME
('Alabama','Arkansas','Arizona','Alaska','California','South Dakota',
'Colorado','Connecticut','Delaware','Florida','Georgia','Guam','Hawaii',
'Idaho','Illinois','Indiana','Kansas','Kentucky','Louisiana','Maryland',
'Massachusetts','Michigan', 'Missouri', 'Minnesota','Mississippi',
'Montana','Oregon','Nebraska','Northern Mariana Islands','Maine',
'Nevada','New Hampshire','New Jersey','New Mexico','New York',
'North Carolina','North Dakota','Ohio','Oklahoma','Pennsylvania',
'Puerto Rico','Iowa','Rhode Island','South Carolina','Tennessee',
'Texas','Utah','Vermont','Virgin Islands','Virginia','Washington',
'West Virginia','Wisconsin','Wyoming','American Samoa',
'District of Columbia')
Checks if a value does not match a state name format.
STREET ADDRESS

^\\d{1,3}.?\\d{0,3}\\s[a-zA-Z]{2,30}\\s[a-zA-Z]{2,15}

Checks that a string value is in a valid US street address format, such as 123 Broadway.
SUFFIX

('Jr','Jr.', 'Sr', 'Sr.', 'II', 'III', 'IV')

Validates that a value contains a valid suffix after surnames, such as Jr., Sr., and II.
SYMBOL

^([A-Z]{1,4})(:|\\.|-){0,1}([A-Z]{1,1})$

Validates that a stock symbol contains 1-4 uppercase letters, an optional colon :, period ., or hyphen -, and 1 uppercase letter. For example, ABCD:E.
TIME HHMMSS

^(\\d\\d:\\d\\d:\\d\\d)

Time in a HHMMSS format, such as 23:59:59 or 08:30:00.
UK POSTAL CD

^[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}$

Checks that a string value is in a valid UK postal code format.
URL

(http|https|www.|ftp)(www.)?[a-zA-Z0-9@:%._\\+~#?&//=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%._\\+~#?&//=]*)

Checks that a string value is in a valid website URL format.
VIN

^(?<wmi>[A-HJ-NPR-Z\\d]{3})(?<vds>[A-HJ-NPR-Z\\d]{5})(?<check>[\\dX])(?<vis>(?<year>[A-HJ-NPR-Z\\d])(?<plant>[A-HJ-NPR-Z\\d])(?<seq>[A-HJ-NPR-Z\\d]{6}))$

Checks if a value does not match a US Vehicle Identification Number format.
VISA

^(?<visa>4[0-9]{12}(?:[0-9]{3})?)$

Validates Visa credit card numbers.
ZIP CODE

^[0-9]{5}(?:-[0-9]{4})?$

Must have 5 digits between 0-9 and can have an additional 4 digits.