About data similarity (Beta)
Data similarity is a feature that proposes similar assets. This can be useful for data consumers to find other relevant assets.
Currently, this feature is available in Data Marketplace to show similar Table assets.
Tip
This feature is powered by Collibra AI. To calculate the data similarity between tables, Collibra AI compares the tables and their columns, and calculates a similarity score. This similarity score indicates how similar assets are.
- Data similarity is a cloud-only Edge feature that must be enabled. It is not certified for FedRAMP.
- This feature is in Beta testing.
How is the table similarity score calculated?
The table similarity score indicates the similarity between two tables and ranges from 0 (not similar) to 100 (very similar). The score combines three metrics:
- Table name similarity: This metric compares the table names.
- Column name similarity: This metric compares all column names. It checks how many columns have a similarly named column, and how similar the names are.
- Column content similarity: This metric compares the data in all columns. It checks how many columns have similar content.
Once the metric scores are known, the following formula is used: total_similarity = (0.10 * table_name_similarity + 0.20 * column_names_similarity + 0.70 * column_content_similarity) * 100
Note We compare only tables and columns that have been profiled. For more information, go to Enable and calculate data similarity.
Data similarity in Data Marketplace
If the data similarity feature is enabled in your environment, the Similar Data (Beta) tab is shown in Table asset previews. The tab is shown only if tables with a similarity score higher than 50% exist for the table. Up to five Table assets are shown.
FAQ on data similarity
-
Which data is used for similarity comparisons?
Currently, Collibra AI compares only profiled tables and columns. -
Where can you see similarity?
Currently, this feature is available only in Data Marketplace for Table asset comparison. Similar Table assets are shown in the Similar Data (Beta) tab. -
How does data similarity work?
During data profiling, we use privacy enhancing technologies such as irreversible hashes to summarize your table and column data on Edge. Collibra AI only uses a subset of the actual data, maximum 10,000 rows, to create these hashes These summaries are stored in Collibra. When you open a Table preview in Data Marketplace, the summaries are used to find similar data assets.
If you want to know more, reach out to your Customer Success Manager. -
Does data similarity use Generative AI or Large Language Models (LLMs)?
No, data similarity does not use generative AI or LLMs.