When and how to use the Character tokenizer

Important: Tokenizer settings in Collibra Console are being retired. To modify or remove existing configurations, contact Collibra Support.

The Character tokenizer splits the search text into tokens (individual terms) when it encounters a character that is not in a defined set of allowed characters. This defined set contains the characters that you add to the parameter map when editing the tokenizer settings, in addition to the following set of characters: 0-9a-zA-Z

When editing the tokenizer settings, consider using the Character tokenizer if you know which characters should be allowed in a search. Add those characters to the allowedCharacters parameter. The characters that you add, together with the set 0-9a-zA-Z, form the defined set. For example, if you add -' to the allowedCharacters parameter, the defined set is: 0-9a-zA-Z-'

Note: You don't need to add 0-9a-zA-Z to the allowedCharacters parameter because it is part of the default set of allowed characters for the Character tokenizer.
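
To make the composition of the defined set concrete, the following Python sketch combines the default characters with a value for allowedCharacters. It is an illustration only, not Collibra's implementation: the names DEFAULT_ALLOWED and defined_set are hypothetical, and the sketch assumes that each character in the parameter value is treated literally.

```python
import string

# Default characters that are always allowed (0-9a-zA-Z, as described above).
DEFAULT_ALLOWED = set(string.digits + string.ascii_letters)


def defined_set(allowed_characters):
    """Hypothetical helper: return the defaults plus each character
    from the allowedCharacters value, treated as a literal character."""
    return DEFAULT_ALLOWED | set(allowed_characters)


# Adding -' to allowedCharacters extends the defaults by two characters.
chars = defined_set("-'")
print("-" in chars, "'" in chars, "_" in chars)  # True True False
```

Adding a character that is already in the default set, such as a, would leave the defined set unchanged, which is why the defaults don't need to be repeated in the parameter.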

[Image: Tokenizer settings]

Example 

Suppose that the allowedCharacters parameter contains an asterisk (*). If you search for salesforce task under*review, the Character tokenizer treats * as an allowed character and generates the following tokens: salesforce, task, and under*review.

The Character tokenizer doesn't split the search text when it encounters * because it is an allowed character.
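
The splitting rule in this example can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual tokenizer: the function name character_tokenize is hypothetical, and the sketch ignores any further processing (such as lowercasing or filtering) that a real search pipeline may apply to tokens.

```python
import string

# Default allowed characters: 0-9a-zA-Z.
DEFAULT_ALLOWED = string.digits + string.ascii_letters


def character_tokenize(text, allowed_characters=""):
    """Hypothetical model: split text at every character that is not
    in the defined set (defaults plus allowed_characters)."""
    allowed = set(DEFAULT_ALLOWED + allowed_characters)
    tokens, current = [], []
    for ch in text:
        if ch in allowed:
            current.append(ch)               # extend the current token
        elif current:
            tokens.append("".join(current))  # disallowed character ends the token
            current = []
    if current:
        tokens.append("".join(current))      # flush the final token
    return tokens


# With * allowed, under*review stays together.
print(character_tokenize("salesforce task under*review", "*"))
# ['salesforce', 'task', 'under*review']

# Without * in allowedCharacters, the text is also split at *.
print(character_tokenize("salesforce task under*review"))
# ['salesforce', 'task', 'under', 'review']
```

In this model the space is never part of a token because it is not in the defined set, so it always acts as a split point.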

Example 

Suppose that an asset named salesforce_task exists in Collibra.