About synchronizing schemas

About the synchronization process

Synchronizing schemas is the process of updating the metadata of a registered data source in Collibra Platform.

After you have registered a data source via Edge, Data Catalog connects to your Edge site to create a list of schemas from the registered database. You can see the list on the Configuration tab page of the Database asset page.
On this Configuration tab page, you configure and start the synchronization of schemas in the database.
You can then synchronize a schema manually or automatically at fixed intervals:

Tip Before starting the synchronization, we recommend to click the Refresh List icon to get the latest schema information from the data source.

What happens during synchronization

What happens after synchronization

Tip If you no longer want to synchronize a schema, and delete the associated assets, go to Remove a synchronized schema.

About synchronization rules

A synchronization rule determines which tables of a schema you synchronize in Data Catalog.

Important 
You can't use the include table or exclude table rules to synchronize in batches or to synchronize only a selected list of tables. When you register a table and later exclude it using these rules during resynchronization, the related assets get the Missing from Source status. Example: During the first synchronization, you include table A and table B, so assets are created for them. During the next synchronization, you include only table C. As a result, assets for table C are created, and the assets for table A and table B show the Missing from Source status.

The following table shows fields of synchronization rules:

Rule field Description
Include Tables

A comma-separated list of the names of the tables you want to synchronize.

  • In the list, add a space after each comma. For example, CUSTOMERS, ORDER, SKU.
  • Don’t add new lines to the list. Only add comma-separated values.
  • This field is not case-sensitive.
  • You can use * as a wildcard. For example, SKU*.
  • The default value is *, which means all tables are taken into account.
  • If the name of a table contains a special character, like . + * \ ? ^ $ ( ) [ ] { } | then add a \ before the special character for it to be correctly evaluated. For example, *SKU\+*.
  • The Include Tables field is processed before the field.
Example 
  • Out of all tables in a schema, you only want to synchronize the table with name "CUSTOMERS" and the tables with a name that starts with "ORDER".
    To do this:
    In the Include Tables field, enter: CUSTOMERS, ORDER*.
  • Out of all tables in a schema, you only want to synchronize the tables with a name that contains "SKU".
    To do this:
    In the Include Tables field, enter: *SKU*.
  • Out of all tables in a schema, you only want to include the tables with a name that contains "SKU+".
    To do this:
    In the Include Tables field, enter: *SKU\+*.
Exclude Tables

A comma-separated list of the names of the tables you do not want to synchronize.

  • In the list, add a space after each comma. For example, CUSTOMERS, ORDER, SKU.
  • Don’t add new lines to the list. Only add comma-separated values.
  • This field is not case-sensitive.
  • You can use * as a wildcard.
  • If the name of a table contains a special character, like . + * \ ? ^ $ ( ) [ ] { } | then add a \ before the special character for it to be correctly evaluated. For example, *SKU\+*.
  • The Include Tables field is processed before the field.

You can use exclude to do the following:

  • Synchronize all tables in a schema except the ones defined in the Exclude Tables field.
  • Synchronize only tables as defined in the Include Tables field, with the exception of tables that are listed in the Exclude Tables field.
Example 
  • Out of all tables in a schema, you do not want to synchronize a table with the name "ADDRESS" and tables with a name that ends with "PHONE".
    To do this:
    In the Include Tables field, enter: * and in the Exclude Tables field, enter: ADDRESS, *PHONE.
  • Out of all tables in a schema, you only want to exclude the table with name "example$table".
    To do this:
    In the Include Tables field, enter: * and in the Exclude Tables field, enter: example\$table.
  • Out of all tables in a schema, you want to synchronize the tables with a name that starts with "SKU", but exclude the tables with a name that contains "bkp".
    To do this:
    In the Include Tables field, enter: SKU* and in the Exclude Tables field, enter: *bkp*.
    From the following list, only "SKU_1" and "SKU_2" will be synchronized.
    SKU_1 , SKU_2, SKU_bkp_1, SKU_bkp_2, New, bkp, bkp_SKU
Target Domain

The Physical Data Dictionary domain in which the schema is synchronized.

The default value is Automatically Created for Schema. This means the metadata is placed in a domain located in the same community as the domain of your Database asset. If that domain doesn't exist yet, Data Catalog creates the domain using the following naming convention: [edge_connection_name] > [database_name] > [schema_name], for example Snowflake Connection > CERTIFICATION > CUSTOMERS.

You can select any other Physical Data Dictionary domain for which you have a resource role with the Configure External System resource permission. It is advised, however, to have a domain per schema.

Note Use this method to move assets from one domain to another. Don't move assets through asset views.

Options

Additional options to specify which type of tables you want to synchronize.

Exclude Database Views

A checkbox to exclude database views from the synchronization process. If selected, no assets of the type Database view are created.

Tip You can also use Include Tables and Exclude Tables to include or exclude specific database views.

Include Source Tags

If you select this option, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute.

Note Currently, the JDBC registration process can synchronize source tags only from Snowflake via this method.
For information on synchronizing the source tags from Databricks Unity Catalog, go to Add the Databricks Unity Catalog capability.

Include Data Usage Statistics

If you select the Include Data Usage Statistics option, the data usage statistics of the assets in the data source are ingested, a popularity score is calculated, and the information is made available for Table, View, and Column assets. The information includes:

  • The number of queries.
  • The number of users.
  • A popularity score that indicates how often an asset is used in queries on the source system compared to other assets with ingested data usage statistics in Data Catalog.

For more information, go to About data usage statistics and popularity score.

Important 
  • Currently, the JDBC registration process can synchronize the data usage statistics and calculate a popularity score only for Snowflake.
  • This feature is available only in the latest UI for Commercial Cloud customers and when the scalable ingestion flow is enabled.
  • The popularity score of an asset is recalculated twice a day and takes the data usage statistics of the last 30 days of the last gathered information into account.

About source tags

Tags created and assigned in the data source can be registered and synchronized in Data Catalog. To allow this, select the Include Source Tags checkbox when you define the synchronization rule for a schema. This ensures that the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute.

Note Currently, the JDBC registration process can synchronize source tags only from Snowflake via this method.
For information on synchronizing the source tags from Databricks Unity Catalog, go to Add the Databricks Unity Catalog capability.

By default, we read the Snowflake source tags from the <database_name>.INFORMATION_SCHEMA.tag_ references. This is possible with the minimum required permissions for the metadata scan.
To increase the performance of the Snowflake metadata synchronization, we can also read the Snowflake source tags from the SNOWFLAKE.ACCOUNT_USAGE schema. This can be configured in the Edge JDBC Ingestion capability as an other property with name tags-strategy and value SINGLE_CALL. Note that this method requires the SELECT permission on the SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES table.

Related topics

About data usage statistics and popularity score