When and how to use the Character tokenizer

The Character tokenizer splits the search text into tokens (individual terms) whenever it encounters a character that is not in a defined set of allowed characters. The defined set consists of the default characters 0-9a-zA-Z plus any characters that you add to the parameter map when editing the tokenizer settings.

When editing the tokenizer settings, consider using the Character tokenizer if you know which characters should be allowed during a search. Add those characters to the allowedCharacters parameter. For example, if you add -' to the allowedCharacters parameter, the defined set becomes: 0-9a-zA-Z-'

Note: You do not need to add 0-9a-zA-Z to the allowedCharacters parameter because these characters are part of the default set of allowed characters for the Character tokenizer.
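The way the defined set is assembled can be illustrated with a minimal Python sketch. It is not Collibra's implementation: the function name defined_set is hypothetical, and it assumes that 0-9a-zA-Z means exactly the ASCII digits and letters.

```python
import string

# Assumption: the default set 0-9a-zA-Z is the ASCII digits and letters.
DEFAULT_ALLOWED = string.digits + string.ascii_letters


def defined_set(allowed_characters: str = "") -> set:
    """Hypothetical helper: default characters plus the ones you add."""
    return set(DEFAULT_ALLOWED) | set(allowed_characters)


chars = defined_set("-'")
print("-" in chars, "'" in chars, "_" in chars)  # True True False
```

Because the default characters are always included, passing only -' is enough to obtain the set 0-9a-zA-Z-'.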

Tokenizer settings

Example 

Suppose that the allowedCharacters parameter contains an asterisk (*). If you search for salesforce task under*review, then the Character tokenizer will treat * as an allowed character and generate the following tokens:

  • salesforce
  • task
  • under*review

The Character tokenizer does not split the search text when it encounters * because it is an allowed character.
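The splitting rule in this example can be approximated with a short Python sketch. The function character_tokenize and its allowed_characters argument are hypothetical stand-ins for the tokenizer and its allowedCharacters parameter. The real tokenizer may also apply steps such as lowercasing that are not modeled here.

```python
import re


def character_tokenize(text: str, allowed_characters: str = "") -> list:
    """Hypothetical sketch: split text wherever a character is not allowed."""
    # Default set plus the user-supplied characters, escaped so that
    # characters such as * or - are taken literally inside the class.
    char_class = "0-9a-zA-Z" + re.escape(allowed_characters)
    # Every maximal run of allowed characters becomes one token.
    return re.findall(f"[{char_class}]+", text)


print(character_tokenize("salesforce task under*review", "*"))
# ['salesforce', 'task', 'under*review']
print(character_tokenize("salesforce task under*review"))
# ['salesforce', 'task', 'under', 'review']
```

The second call shows the contrast: without * in the allowed characters, the same search text is split into four tokens.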

Example 

Suppose that an asset named salesforce_task exists in Collibra.

  • Behavior with the Standard tokenizer: If the Standard tokenizer is used and you search for salesforce task (without the underscore), salesforce_task (with the underscore) is not shown in the search results. The Standard tokenizer does not split on _ because it is one of the allowed characters in the Standard tokenizer pattern (a-zA-Z0-9_). As a result, salesforce_task is treated as a single term rather than as salesforce and task.
  • Behavior with the Character tokenizer: If you want salesforce_task to be shown in the search results, use the Character tokenizer and ensure that _ is not added to the allowedCharacters parameter. The tokenizer then splits the text when it encounters _. As a result, salesforce_task is treated as the two tokens salesforce and task, which match the search text.
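The difference between the two behaviors can be sketched in Python as follows. This is a simplification under stated assumptions: the Standard tokenizer is modeled only by the pattern a-zA-Z0-9_ quoted above, the helper names tokenize and matches are hypothetical, and a match is taken to mean that every token of the search text also appears among the tokens of the asset name. Actual search ranking involves more than this.

```python
import re

STANDARD_LIKE = "a-zA-Z0-9_"     # pattern quoted above for the Standard tokenizer
CHARACTER_DEFAULT = "0-9a-zA-Z"  # Character tokenizer, _ not in allowedCharacters

ASSET_NAME = "salesforce_task"
SEARCH_TEXT = "salesforce task"


def tokenize(text: str, token_chars: str) -> list:
    """Hypothetical helper: each run of token characters is one token."""
    return re.findall(f"[{token_chars}]+", text)


def matches(token_chars: str) -> bool:
    """Simplified match: all search tokens occur among the asset-name tokens."""
    name_tokens = set(tokenize(ASSET_NAME, token_chars))
    return all(t in name_tokens for t in tokenize(SEARCH_TEXT, token_chars))


print(tokenize(ASSET_NAME, STANDARD_LIKE), matches(STANDARD_LIKE))
# ['salesforce_task'] False
print(tokenize(ASSET_NAME, CHARACTER_DEFAULT), matches(CHARACTER_DEFAULT))
# ['salesforce', 'task'] True
```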