Customizing the search index

Before you customize the search index, it is important to understand how indexing works.


How search indexing works

All textual content in Collibra is stored in a search index to enable fast text searches. To populate the search index, the textual content is split into individual terms called tokens.

Each logical entity in Collibra (such as an asset name, community name, domain name, text attribute, or comment) is stored in its own index document in the search index. An index document records how many times each token occurs in the entity's content. For example:

  • An asset named Data Governance Policy would have an index document with the tokens data, governance, and policy.
  • The comment Data governance is important would also have an index document with the tokens data, governance, and important.
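
The idea of tokens and index documents can be sketched in a few lines of Python. This is a minimal illustration, not Collibra's actual implementation: the helper names tokenize and build_index_document are hypothetical, and the sketch assumes that tokens are simply lowercased words. It keeps every word, whereas a real analyzer may drop or normalize some words.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index_document(text):
    """Count how many times each token occurs in one entity's content."""
    return Counter(tokenize(text))


asset_document = build_index_document("Data Governance Policy")
print(sorted(asset_document.items()))
# prints [('data', 1), ('governance', 1), ('policy', 1)]
```

A token that occurs more than once in the same entity gets a higher count in that entity's index document.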

When you search for text (for example, data governance), the engine splits your text into tokens (data and governance), looks for these tokens in the search index, and calculates a score for each matched index document.
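
The lookup step described above might look roughly like this. It is a sketch under stated assumptions: the toy index, the entity labels, the Finance Team community, and the function find_matching_documents are all hypothetical, and scoring is left out so that only the matching step is shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Toy content: one entry per entity. "community" is a made-up entity
# that shares no token with the example search text.
ENTITIES = {
    "asset": "Data Governance Policy",
    "comment": "Data governance is important",
    "domain": "Governance Framework",
    "community": "Finance Team",
}

# One index document (token counts) per entity.
SEARCH_INDEX = {name: Counter(tokenize(text)) for name, text in ENTITIES.items()}


def find_matching_documents(query, index):
    """Return, for each matched index document, the query tokens it contains."""
    query_tokens = set(tokenize(query))
    matches = {}
    for name, token_counts in index.items():
        found = sorted(query_tokens.intersection(token_counts))
        if found:
            matches[name] = found
    return matches


matches = find_matching_documents("data governance", SEARCH_INDEX)
for name, found in matches.items():
    print(name, found)
```

Here the asset and the comment match both tokens, the domain matches only governance, and the community is not returned at all.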

An index document’s score reflects how relevant the document is to the search tokens. This score is calculated using an algorithm that considers factors such as how often the search tokens appear in the document (term frequency), how unique those tokens are across all documents (inverse document frequency), and the length of the field being searched (shorter fields are often seen as more relevant). Documents that contain more of the search tokens, or rarer tokens, score higher.

The following two factors illustrate how the score behaves:

  • Frequency: How many times the tokens appear in the document. For example, if data governance appears multiple times in a document, that document scores higher than a similar document in which it appears only once.
  • Match size: How much of the document matches your search text. For example, a short document containing only data governance will likely score higher than a long document where data governance is mentioned once.
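
The two factors can be illustrated with a deliberately simplified score. This is not the formula the search engine uses: the function simple_score is hypothetical, and dividing by the square root of the document length is just one common way to favor shorter documents, assumed here for illustration.

```python
import math
import re


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simple_score(query, text):
    """Count query-token occurrences, then normalize by document length."""
    document_tokens = tokenize(text)
    if not document_tokens:
        return 0.0
    query_tokens = set(tokenize(query))
    occurrences = sum(1 for token in document_tokens if token in query_tokens)
    # Assumed normalization: longer documents dilute the same number of matches.
    return occurrences / math.sqrt(len(document_tokens))


query = "data governance"

# Frequency: same length, but the first text repeats the search tokens.
repeated = simple_score(query, "data governance policy and data governance roles")
single = simple_score(query, "data governance policy and access control roles")

# Match size: both texts contain the search tokens once, but one is much longer.
short = simple_score(query, "data governance")
long = simple_score(query, "this long policy text mentions data governance only once")

print(repeated > single, short > long)
```

Both comparisons come out in favor of the first text: repeating the tokens raises the score, and a document that consists of little more than the search text scores higher than a long one that mentions it once.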

The engine returns index documents with the highest scores first, so the most relevant search results appear at the top. For example:

  • An asset named Data Governance Policy will have a high score because both tokens match and the name is short.
  • The comment Data governance is important will have a medium score because both tokens match, but the document is longer.
  • A domain named Governance Framework will have a lower score because only one of the two tokens matches.
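
The ranking in this example can be reproduced with a small TF-IDF style sketch. The formula below is an assumption chosen for illustration (term frequency times a smoothed inverse document frequency, divided by the square root of the document length) and is not the engine's actual scoring formula. The names score and rank are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


ENTITIES = {
    "asset": "Data Governance Policy",
    "comment": "Data governance is important",
    "domain": "Governance Framework",
}
INDEX = {name: Counter(tokenize(text)) for name, text in ENTITIES.items()}


def inverse_document_frequency(token, index):
    """Give rarer tokens a higher weight (smoothed, illustrative formula)."""
    containing = sum(1 for counts in index.values() if token in counts)
    if containing == 0:
        return 0.0
    return math.log(1 + len(index) / containing)


def score(query, counts, index):
    """Sum tf * idf over the query tokens, normalized by document length."""
    length = sum(counts.values())
    if length == 0:
        return 0.0
    total = sum(
        counts[token] * inverse_document_frequency(token, index)
        for token in set(tokenize(query))
    )
    return total / math.sqrt(length)


def rank(query, index):
    """Return (name, score) pairs for matched documents, best score first."""
    scored = [(name, score(query, counts, index)) for name, counts in index.items()]
    matched = [(name, value) for name, value in scored if value > 0]
    return sorted(matched, key=lambda pair: pair[1], reverse=True)


ranking = rank("data governance", INDEX)
print([name for name, _ in ranking])
# prints ['asset', 'comment', 'domain']
```

With these three entities, the asset ranks first, the longer comment second, and the domain that matches only one token last, which mirrors the order described above.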

This process ensures that the most relevant information appears first in your search results.
