About synchronizing schemas

Important 

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

Use the following options to see the documentation in the latest UI or in the previous, classic UI:

In this topic

About the synchronization process

Synchronizing schemas is the process of updating the metadata of a registered data source in Collibra Platform.

After you have registered a data source via Edge, Data Catalog connects to your Edge site to create a list of schemas from the registered database. You can see the list on the Configuration tab page of the Database asset page.
On this Configuration tab page, you configure and start the synchronization of schemas in the database.
You can then synchronize a schema manually or automatically at fixed intervals:

Tip Before starting the synchronization, we recommend to click the Refresh List icon to get the latest schema information from the data source.

What happens during the synchronization?

  • The Edge site connects to your data source and ingests schemas, tables, columns, and foreign keys according to the defined synchronization rules.
    Edge detects whether there are changes since the last synchronization of a schema and resolves the possible conflicts in the following way.
    Note The scalable ingestion flow has been rolled out for commercial customers. It enhances the synchronization for JDBC data sources via Edge and processes data much faster, especially for large schemas. The scalable ingestion flow doesn't use Import, so you won't see any Import jobs in the page.

    Change in data source

    Change in data source

    Result in Collibra for commercial customers

    Result in Collibra for Collibra Platform for Government
    Required action
    A table, column or foreign key has been added to the schema.Collibra creates the assets.Collibra creates the assets.No action is required of you.
    A table, column or foreign key has been removed from the schema.

    The existing asset gets the Missing from source status.
    If it concerns a table, also the related Column assets get the Missing from source status.

    This process is called soft delete.

    Note If all the content has been removed from the schema, go to "a schema is empty”.

    The existing asset gets the Missing from source status.
    If it concerns a table, also the related Column assets get the Missing from source status.

    This process is called soft delete.

    Note If all the content has been removed from the schema, we don't update any existing assets in Collibra. Go to "a schema is empty” for more information.

    If needed, you can manually delete the assets.
    A schema is empty.
    • If the schema is empty during the first sync, Collibra creates only a Schema asset.
    • If the schema previously contained tables and is completely empty during a resync, the assets get the status Missing from source.
    • If the schema is empty during the first sync, Collibra creates only a Schema asset.
    • If the schema previously contained tables and is completely empty during a resync, Collibra doesn't update any existing assets.
      Example:
      • You synchronized a schema with tables and columns.
        As a result, Table and Column assets are created in Collibra.
      • You then remove all tables and columns from the data source and you synchronize the schema again in Collibra.
        Because the schema is empty, the schema is not synchronized and the status of the existing assets in Collibra isn't updated to Missing from source.
    If needed, you can manually delete the assets.
    A schema has been removed.

    The schema gets the Missing from source status. Also the related Table and Column assets get the Missing from source status.

    This process is called soft delete.

    If you have refreshed the schema list before the synchronization, the schema gets the Missing from source status. Also the related Table and Column assets get the Missing from source status.

    This process is called soft delete.

    If needed, you can manually delete the Schema asset and all related assets.
    A column or foreign key has been renamed.
    • Collibra creates an asset with the new name.
    • The existing asset get the Missing from source status.
      This process is called soft delete.
    • Collibra creates an asset with the new name.
    • The existing asset get the Missing from source status.
      This process is called soft delete.
    If needed, you can apply any manual changes you made to the original asset, to the new asset. And then remove the assets that are no longer applicable.
    A table has been renamed.
    • Collibra creates a Table asset with the new name. Collibra also creates new Column assets for the new Table asset.
    • The existing Table and related Column assets get the Missing from source status.
      This process is called soft delete.
    • Collibra creates a Table asset with the new name. Collibra also creates new Column assets for the new Table asset.
    • The existing Table and related Column assets get the Missing from source status.
      This process is called soft delete.
    If needed, you can apply any manual changes you made to the original assets, to the new assets. And then remove the assets that are no longer applicable.

    A schema has been renamed.

    • Collibra creates a Schema asset with the new name. Collibra also creates new Table and Column assets for the new Schema asset.
    • The existing schema and related assets get the Missing from source status.
      This process is called soft delete.
    • Collibra creates a Schema asset with the new name. Collibra also creates new Table and Column assets for the new Schema asset.
    • If you have refreshed the schema list before the synchronization, the existing schema and related assets get the Missing from source status.
      This process is called soft delete.
    If needed, you can apply any manual changes you made to the original assets, to the new assets. And then delete the assets that are no longer applicable.
    Important 
    You can't use the include table or exclude table rules to synchronize in batches or to synchronize only a selected list of tables. When you register a table and later exclude it using these rules during resynchronization, the related assets get the Missing from Source status. Example: During the first synchronization, you include table A and table B, so assets are created for them. During the next synchronization, you include only table C. As a result, assets for table C are created, and the assets for table A and table B show the Missing from Source status.
    Note 
    • If you rename a database in the data source, the Edge synchronization process will consider it a new database. We don’t detect the renaming of a database at this moment.
    • Schema, Table, Column or Foreign Key assets with the Missing from source status don't block the synchronization process.
    • In the asset diagram, assets with the Missing from source status are shown by default. If you don't want to see these assets, apply a filter to the diagram view to only display assets with valid statuses.

    Change in Collibra

    Change in Collibra

    Result in Collibra for commercial customers

    Result in Collibra for Collibra Platform for Government and CPSH customers
    In Collibra, you update asset characteristics that are controlled by the metadata synchronization.To increase performance, Collibra doesn't update the asset characteristics during a resync unless changes are detected at the data source.
    However, Collibra can update these characteristics, for example, after a backup has been restored. This makes sure that the synchronization results stay in sync with the data source.
    Collibra updates the asset characteristics with the data source values during a resync.
    In Collibra, you update asset characteristics that are not controlled by the metadata synchronization.

    Collibra doesn't update the asset characteristics that are not controlled by the metadata synchronization during a resync.

    Collibra doesn't update the asset characteristics that are not controlled by the metadata synchronization during a resync.
    In Collibra, you removed all assets including the Schema asset.Collibra recreates all assets during a resync.Collibra recreates all assets during a resync.
    In Collibra, you removed some assets for a Schema asset To increase performance, Collibra doesn't recreate the assets during a resync unless changes are detected at the data source. Collibra recreates the assets during a resync.
    In Collibra, you changed the scope of tables to be included in the synchronization via include tables or exclude tables rules. If a table is no longer in the synchronization scope during a resync, its registered assets get the Missing from Source status. If a table is no longer in the synchronization scope during a resync, its registered assets get the Missing from Source status.
    In Collibra, you used asset views to move registered assets to a domain other than the domain specified in the synchronization rule.Assets are moved back to the specified target domain during the next sync only if changes to the assets are detected in the data source.Assets are moved back to the specified target domain during the next sync.
  • Collibra creates assets in the selected target domains.
    • The created assets get a unique full name (fully qualifying name) based on naming conventions.
      You can view the full name of an asset by editing the asset.

      Warning Don't edit the full name of an asset because the name is needed to synchronize or refresh data sources. Changing the full name may cause unexpected results and break the synchronization or refresh process.

    • The status of a created asset depends on its asset type. The first status in the asset type's assignment is applied to the new asset.
    • If, in the synchronization rule, you have indicated you want to include source tags, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute. For more information, go to About source tags.

      Note Currently, the JDBC registration process can synchronize source tags only from Snowflake.

  • The synchronization jobs of all schemas run in parallel. You can see the synchronization status in the Activities list. A report is created:
    • during the synchronization, to show the progress of the synchronization job.
    • after the synchronization, to show the synchronization logs for each synchronized schema.

    You can also follow up on the synchronization jobs via the database synchronization report (beta).

What happens after the synchronization?

  • You see a check symbol () next to the schema name.
    If the synchronization of a schema fails or the schema is no longer available in the source, an exclamation mark () is shown instead.

  • The synchronized data becomes available. To see which data is added, go to Metadata synchronization results.

Tip If you no longer want to synchronize a schema, and delete the associated assets, go to Remove a synchronized schema.

About synchronization rules

A synchronization rule determines which tables of a schema you synchronize in Data Catalog.

  • You can add up to 10 synchronization rules. The order of the rules is important.
    For all information, go to Configure the synchronization of a data source.
  • The Schema asset and the Foreign Key assets are always ingested in the domain defined in the first rule.
  • Only schemas that have at least one synchronization rule can be synchronized.
    If a schema has a synchronization rule, you see a table icon () next to the schema name.
  • Synchronization rules can be added, edited, and copied to other schemas in the same data source.
Important 
You can't use the include table or exclude table rules to synchronize in batches or to synchronize only a selected list of tables. When you register a table and later exclude it using these rules during resynchronization, the related assets get the Missing from Source status. Example: During the first synchronization, you include table A and table B, so assets are created for them. During the next synchronization, you include only table C. As a result, assets for table C are created, and the assets for table A and table B show the Missing from Source status.

The following table shows fields of synchronization rules:

Rule field Description
Include Tables

A comma-separated list of the names of the tables you want to synchronize.

  • In the list, add a space after each comma. For example, CUSTOMERS, ORDER, SKU.
  • Don’t add new lines to the list. Only add comma-separated values.
  • This field is not case-sensitive.
  • You can use * as a wildcard. For example, SKU*.
  • The default value is *, which means all tables are taken into account.
  • If the name of a table contains a special character, like . + * \ ? ^ $ ( ) [ ] { } | then add a \ before the special character for it to be correctly evaluated. For example, *SKU\+*.
  • The Include Tables field is processed before the field.
Example 
  • Out of all tables in a schema, you only want to synchronize the table with name "CUSTOMERS" and the tables with a name that starts with "ORDER".
    To do this:
    In the Include Tables field, enter: CUSTOMERS, ORDER*.
  • Out of all tables in a schema, you only want to synchronize the tables with a name that contains "SKU".
    To do this:
    In the Include Tables field, enter: *SKU*.
  • Out of all tables in a schema, you only want to include the tables with a name that contains "SKU+".
    To do this:
    In the Include Tables field, enter: *SKU\+*.
Exclude Tables

A comma-separated list of the names of the tables you do not want to synchronize.

  • In the list, add a space after each comma. For example, CUSTOMERS, ORDER, SKU.
  • Don’t add new lines to the list. Only add comma-separated values.
  • This field is not case-sensitive.
  • You can use * as a wildcard.
  • If the name of a table contains a special character, like . + * \ ? ^ $ ( ) [ ] { } | then add a \ before the special character for it to be correctly evaluated. For example, *SKU\+*.
  • The Include Tables field is processed before the field.

You can use exclude to do the following:

  • Synchronize all tables in a schema except the ones defined in the Exclude Tables field.
  • Synchronize only tables as defined in the Include Tables field, with the exception of tables that are listed in the Exclude Tables field.
Example 
  • Out of all tables in a schema, you do not want to synchronize a table with the name "ADDRESS" and tables with a name that ends with "PHONE".
    To do this:
    In the Include Tables field, enter: * and in the Exclude Tables field, enter: ADDRESS, *PHONE.
  • Out of all tables in a schema, you only want to exclude the table with name "example$table".
    To do this:
    In the Include Tables field, enter: * and in the Exclude Tables field, enter: example\$table.
  • Out of all tables in a schema, you want to synchronize the tables with a name that starts with "SKU", but exclude the tables with a name that contains "bkp".
    To do this:
    In the Include Tables field, enter: SKU* and in the Exclude Tables field, enter: *bkp*.
    From the following list, only "SKU_1" and "SKU_2" will be synchronized.
    SKU_1 , SKU_2, SKU_bkp_1, SKU_bkp_2, New, bkp, bkp_SKU
Target Domain

The Physical Data Dictionary domain in which the schema is synchronized.

The default value is Schema domain. The default value is Automatically Created for Schema. This means the metadata is placed in a domain located in the same community as the domain of your Database asset. If that domain doesn't exist yet, Data Catalog creates the domain using the following naming convention: [edge_connection_name] > [database_name] > [schema_name], for example Snowflake Connection > CERTIFICATION > CUSTOMERS.

You can select any other Physical Data Dictionary domain for which you have a resource role with the Configure External System resource permission. It is advised, however, to have a domain per schema.

Note Use this method to move assets from one domain to another. Don't move assets through asset views.

Options

Additional options to specify which type of tables you want to synchronize.

Exclude Database Views

A checkbox to exclude database views from the synchronization process. If selected, no assets of the type Database view are created.

Tip You can also use Include Tables and Exclude Tables to include or exclude specific database views.

Include Source Tags

If you select this option, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute.

Note Currently, the JDBC registration process can synchronize source tags only from Snowflake.

Include Data Usage Statistics (beta)

If you select the Include Data Usage Statistics option, the data usage statistics of the assets in the data source are ingested, a popularity score is calculated, and the information is made available for Table, View, and Column assets. The information includes:

  • The number of queries.
  • The number of users.
  • A popularity score that indicates how often an asset is used in queries on the source system compared to other assets with ingested data usage statistics in Data Catalog.

For more information, go to About data usage statistics and popularity score.

Note  The Data Usage Statistics setting must be enabled for this field to be available.

Note Currently, the JDBC registration process can synchronize the data usage statistics and calculate a popularity score only for Snowflake.

About source tags

Tags created and assigned in the data source can be registered and synchronized in Data Catalog.
To do this, select the Include Source Tags checkbox when you define the synchronization rule for a schema. As a result, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute.

Note Currently, the JDBC registration process can synchronize source tags only from Snowflake.

  • The naming convention for source tags synchronized from Snowflake is:
    • <source_tag_name>=<source_tag_value>, for example: cost_center=sales.
    • <source_tag_name>, if no values are assigned to the tag, for example PII.
  • We apply the same inheritance for source tags as the data source.
    Example 
    • If a tag was assigned to a schema in Snowflake, the tag will be registered for the related Schema, Table, Column and View assets in Data Catalog.
    • If a tag was assigned to an account in Snowflake, the tag will be registered for the related Schema, Table, Column and View assets in Data Catalog.

    For information on tags in Snowflake, go to the Snowflake documentation.

  • Don't change the source tags in Data Catalog, as the changes are not pushed to the data source. I If you make changes to the tags in the data source and synchronize the data source again, your updates will be overwritten with the information from the data source.

By default, we read the Snowflake source tags from the <database_name>.INFORMATION_SCHEMA.tag_ references. This is possible with the minimum required permissions for the metadata scan.
To increase the performance of the Snowflake metadata synchronization, we can also read the Snowflake source tags from the SNOWFLAKE.ACCOUNT_USAGE schema. This can be configured in the Edge JDBC Ingestion capability as an other property with name tags-strategy and value SINGLE_CALL. Note that this method requires the SELECT permission on the SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES table.

About data usage statistics and popularity score (beta)

Note Currently, the JDBC registration process can synchronize the data usage statistics and calculate a popularity score only for Snowflake.

Data usage statistics and popularity scores help you understand how often the data in a data source is used. This can indicate that the data is valuable and trustworthy.

To synchronize the usage statistics of data in a data source and calculate a popularity score, the Data Usage Statistics setting must be enabled and you must select the Include Data Usage Statistics option when defining the synchronization rules for a schema. As a result, the following data usage information becomes available for Table, View, and Column assets:

  • Number of distinct users: The number of people who used the asset in a query on the source system within a specific time frame. During the synchronization, this information is gathered from the data source for the past 30 days.
  • Number of queries: The number of times the asset has been part of a query on the source system within a specific time frame. During the synchronization, this information is gathered from the data source for the past 30 days.
  • Popularity score: This score is calculated by Collibra and indicates how often an asset is used in queries on the source system compared to all other assets for which the data usage statistics have been ingested in Data Catalog.
    The possible score values are Very high, High, , Low, and None. The value is available in the Popularity Score attribute of an asset.

View data usage statistics and popularity score (beta)

The Popularity Score field is shown in the At a glance sidebar for Table, View, and Column assets. You can also add this attribute to table asset views and in the asset pages.

When you click the Popularity Score value, a dialog box appears with the following sections:

  • Data popularity: Shows the popularity score, the number of queries, and the number of distinct users of the data.

    • The popularity score reflects the latest calculated score for the asset, which refreshes daily. For details, go to Popularity score calculation (beta).
    • Total queries: The total number of queries from the last time the 30-day data was gathered.
    • Total users: The total number of distinct users of the data from the last time the 30-day data was gathered.
  • Query History: Shows the number of queries within a specific time frame. You can change this time frame.

Note The popularity score of an asset is recalculated once a day, at 2 AM server time, and takes the data usage statistics of the last 30 days of the last gathered information into account. This means the popularity can be high even though no results appear for the defined query history.

Popularity score calculation (beta)

The popularity score is calculated as follows:

  1. For the ingested assets, Collibra calculates a numeric score based on the number of queries and number of distinct users.
  2. The assets are ordered by the calculated numeric score and split into 5 groups.
  3. Based on the assigned group, the Popularity Score value of an asset is set.
    The possible values are:
    1. If an asset has not been queried, the Popularity Score value is None.
    2. If the asset is in the lowest 25% of assets, <=P25, the Popularity Score value is Low.
    3. If the asset is in the second group, >P25 and <=P50, the Popularity Score value is Medium.
    4. If the asset is in the third group, >P50 and <=P75, the Popularity Score value is High.
    5. If the asset is in the highest 25% of assets, >P75, the Popularity Score value is Very High.

Important The popularity score of an asset is recalculated once a day, at 2 AM server time, and takes the data usage statistics of the last 30 days of the last gathered information into account.