About synchronizing schemas
Synchronizing schemas is the process of updating the metadata of a registered data source in Collibra Data Intelligence Platform.
You can synchronize a schema manually or automatically at fixed intervals:
- Synchronize manually if you want to test the synchronization of your data source or if you want to synchronize immediately.
- Synchronize automatically if the content of the data source changes regularly.
About the synchronization process
After you have registered a data source via Edge, Data Catalog connects to your Edge site to create a list of schemas from the registered database. You can see the list on the Configuration tab page of the Database asset page. On this Configuration tab page, you configure and start the synchronization of schemas in the database.
What happens during the sync?
- The Edge site connects to your data source and ingests schemas, tables, columns, and foreign keys according to the defined synchronization rules.
- Edge detects whether there have been changes since the last synchronization of a schema and resolves possible conflicts in the following way.
Tip Before starting the synchronization, we recommend clicking the Refresh List icon to get the latest schema information from the data source.
Note The scalable ingestion flow is not enabled for all commercial customers at the same time.*

Change in data source | Result in Collibra for commercial customers | Result in Collibra for Collibra Cloud for Government | Required action |
---|---|---|---|
A table, column or foreign key has been added to the schema. | Collibra creates the assets. | Collibra creates the assets. | No action is required of you. |
A table, column or foreign key has been removed from the schema. | The existing asset gets the Missing from source status. If it concerns a table, the related Column assets also get the Missing from source status. This process is called soft delete. Note: If all the content has been removed from the schema, go to "A schema is empty". | The existing asset gets the Missing from source status. If it concerns a table, the related Column assets also get the Missing from source status. This process is called soft delete. Note: If all the content has been removed from the schema, we don't update any existing assets in Collibra. Go to "A schema is empty" for more information. | If needed, you can manually delete the assets. |
A schema is empty. | If the schema is empty during the first sync, Collibra creates only a Schema asset. If the schema previously contained tables and is completely empty during a resync, the assets get the Missing from source status. | If the schema is empty during the first sync, Collibra creates only a Schema asset. If the schema previously contained tables and is completely empty during a resync, Collibra doesn't update any existing assets. Example: You synchronized a schema with tables and columns, so Table and Column assets were created in Collibra. You then remove all tables and columns from the data source and synchronize the schema again in Collibra. Because the schema is empty, the schema is not synchronized and the status of the existing assets in Collibra isn't updated to Missing from source. | If needed, you can manually delete the assets. |
A schema has been removed. | The schema gets the Missing from source status. The related Table and Column assets also get the Missing from source status. This process is called soft delete. | If you have refreshed the schema list before the synchronization, the schema gets the Missing from source status. The related Table and Column assets also get the Missing from source status. This process is called soft delete. | If needed, you can manually delete the Schema asset and all related assets. |
A column or foreign key has been renamed. | Collibra creates an asset with the new name. The existing asset gets the Missing from source status. This process is called soft delete. | Collibra creates an asset with the new name. The existing asset gets the Missing from source status. This process is called soft delete. | If needed, you can apply any manual changes you made to the original asset to the new asset, and then remove the assets that are no longer applicable. |
A table has been renamed. | Collibra creates a Table asset with the new name. Collibra also creates new Column assets for the new Table asset. The existing Table and related Column assets get the Missing from source status. This process is called soft delete. | Collibra creates a Table asset with the new name. Collibra also creates new Column assets for the new Table asset. The existing Table and related Column assets get the Missing from source status. This process is called soft delete. | If needed, you can apply any manual changes you made to the original assets to the new assets, and then remove the assets that are no longer applicable. |
A schema has been renamed. | Collibra creates a Schema asset with the new name. Collibra also creates new Table and Column assets for the new Schema asset. The existing schema and related assets get the Missing from source status. This process is called soft delete. | Collibra creates a Schema asset with the new name. Collibra also creates new Table and Column assets for the new Schema asset. If you have refreshed the schema list before the synchronization, the existing schema and related assets get the Missing from source status. This process is called soft delete. | If needed, you can apply any manual changes you made to the original assets to the new assets, and then delete the assets that are no longer applicable. |

Important You can't use the include table or exclude table rules to synchronize in batches or to synchronize only a selected list of tables. When you register a table and later exclude it using these rules during resynchronization, the related assets get the Missing from Source status.
Example: During the first synchronization, you include table A and table B, so assets are created for them. During the next synchronization, you include only table C. As a result, assets for table C are created, and the assets for table A and table B show the Missing from Source status.
Note
- If you rename a database in the data source, the Edge synchronization process considers it a new database. We currently don't detect the renaming of a database.
- Schema, Table, Column or Foreign Key assets with the Missing from source status don't block the synchronization process.
- In the asset diagram, assets with the Missing from source status are shown by default. If you don't want to see these assets, apply a filter to the diagram view to only display assets with valid statuses.
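The soft-delete behavior described above can be pictured as a comparison between the asset names already registered and the object names found in the data source. The following is a minimal sketch under stated assumptions, not Collibra's implementation: the `reconcile` function, the `SyncResult` type, the `update_when_empty` flag, and the dotted names are all hypothetical and exist only for illustration. Only tables and columns are modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SyncResult:
    """Outcome of one resync (hypothetical type, for illustration only)."""
    created: set[str] = field(default_factory=set)
    missing_from_source: set[str] = field(default_factory=set)


def reconcile(existing: set[str], source: set[str],
              update_when_empty: bool = True) -> SyncResult:
    """Compare registered asset names with the names found in the source.

    update_when_empty is a hypothetical switch: True mimics the flow in which
    an emptied schema marks its assets as missing, False mimics the flow in
    which an empty schema leaves the existing assets untouched.
    """
    if not source and not update_when_empty:
        # Empty schema on resync: nothing is created or soft-deleted.
        return SyncResult()
    return SyncResult(
        created=source - existing,              # new or renamed objects
        missing_from_source=existing - source,  # removed or renamed objects
    )


# A table rename shows up as new assets plus soft-deleted old ones.
existing = {"sales.orders", "sales.orders.id", "sales.customers"}
source = {"sales.orders_v2", "sales.orders_v2.id", "sales.customers"}
result = reconcile(existing, source)
print(sorted(result.created))
print(sorted(result.missing_from_source))
```

Because a rename cannot be told apart from a removal followed by an addition in this model, the old table and its columns end up in `missing_from_source` while the renamed ones end up in `created`, which matches the rename rows of the table.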
Note The scalable ingestion flow is not enabled for all commercial customers at the same time.*

Change in Collibra | Result in Collibra for commercial customers | Result in Collibra for Collibra Cloud for Government and CPSH customers |
---|---|---|
In Collibra, you update asset characteristics that are controlled by the metadata synchronization. | To increase performance, Collibra doesn't update the asset characteristics during a resync unless changes are detected at the data source. However, Collibra can update these characteristics, for example, after a backup has been restored. This makes sure that the synchronization results stay in sync with the data source. | Collibra updates the asset characteristics with the data source values during a resync. |
In Collibra, you update asset characteristics that are not controlled by the metadata synchronization. | Collibra doesn't update the asset characteristics that are not controlled by the metadata synchronization during a resync. | Collibra doesn't update the asset characteristics that are not controlled by the metadata synchronization during a resync. |
In Collibra, you removed all assets including the Schema asset. | Collibra recreates all assets during a resync. | Collibra recreates all assets during a resync. |
In Collibra, you removed some assets for a Schema asset. | To increase performance, Collibra doesn't recreate the assets during a resync unless changes are detected at the data source. | Collibra recreates the assets during a resync. |
In Collibra, you changed the scope of tables to be included in the synchronization via include tables or exclude tables rules. | If a table is no longer in the synchronization scope during a resync, its registered assets get the Missing from Source status. | If a table is no longer in the synchronization scope during a resync, its registered assets get the Missing from Source status. |

* The scalable ingestion flow improves the synchronization for JDBC data sources via Edge. It processes data much faster, especially for large schemas. For detailed information, go to the scalable ingestion flow release note.
The scalable ingestion flow doesn't use Import, which means you won't see any Import jobs on the Activities page.
- Collibra creates assets in the selected target domains.
- The created assets get a unique full name (fully qualifying name) based on naming conventions.
You can view the full name of an asset by editing the asset.
Warning Don't edit the full name of assets because the name is needed to synchronize or refresh data sources. Changing the full name may cause unexpected results and break the synchronization or refresh process.
- The status of a created asset depends on its asset type. The first status in the asset type's assignment is applied to the new asset.
- If, in the synchronization rule, you have indicated you want to include source tags, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute. For more information, go to About source tags.
Note Currently, the JDBC registration process can synchronize source tags only from Snowflake.
- The synchronization jobs of all schemas run in parallel. You can see the synchronization status in the Activities list. A report is created:
- during the synchronization, to show the progress of the synchronization job.
- after the synchronization, to show the synchronization logs for each synchronized schema.
You can also follow up on the synchronization jobs via the database synchronization report (beta).
What happens after the sync?
- You see a check symbol next to the schema name.
If the synchronization of a schema fails or the schema is no longer available in the source, an exclamation mark is shown instead.
- The synchronized data becomes available. To see which data was added, go to Metadata synchronization results.
Tip If you no longer want to synchronize a schema and want to delete the associated assets, go to Remove a synchronized schema.
About synchronization rules
A synchronization rule determines which tables of a schema you synchronize in Data Catalog.
- You can add up to 10 synchronization rules. The order of the rules is important.
For more information, go to Configure the synchronization of a data source.
- The Schema asset and the Foreign Key assets are always ingested in the domain defined in the first rule.
- Only schemas that have at least one synchronization rule can be synchronized.
If a schema has a synchronization rule, you see a table icon next to the schema name.
- Synchronization rules can be added, edited, and copied to other schemas in the same data source.
Example: During the first synchronization, you include table A and table B, so assets are created for them. During the next synchronization, you include only table C. As a result, assets for table C are created, and the assets for table A and table B show the Missing from Source status.
The following table shows the fields of a synchronization rule:
Rule field | Description |
---|---|
Include Tables | A comma-separated list of the names of the tables you want to synchronize. |
Exclude Tables | A comma-separated list of the names of the tables you do not want to synchronize. |
Target Domain | The Physical Data Dictionary domain in which the schema is synchronized. You can select any other Physical Data Dictionary domain for which you have a resource role with the Configure External System resource permission. However, we advise having one domain per schema. |
Options | Additional options to specify which type of tables you want to synchronize. |
Exclude Database Views | A checkbox to exclude database views from the synchronization process. If selected, no assets of the type Database View are created. Tip: You can also use Include Tables and Exclude Tables to include or exclude specific database views. |
Include Source Tags | If you select this option, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute. Note: Currently, the JDBC registration process can synchronize source tags only from Snowflake. |
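To make the effect of the Include Tables and Exclude Tables fields concrete, here is a minimal sketch of how a single rule might narrow the set of tables in scope. It rests on assumptions that the text above does not confirm: names are matched exactly, and an empty include list means "all tables". The helper names `parse_names` and `tables_in_scope` are hypothetical.

```python
from __future__ import annotations


def parse_names(csv: str) -> set[str]:
    """Split a comma-separated list of table names, ignoring blanks."""
    return {name.strip() for name in csv.split(",") if name.strip()}


def tables_in_scope(source_tables: set[str],
                    include: str = "", exclude: str = "") -> set[str]:
    """Return the tables a single rule would synchronize.

    Assumptions (not from the product documentation): exact-name matching,
    and an empty include list selects every table.
    """
    included = parse_names(include)
    excluded = parse_names(exclude)
    return {
        table for table in source_tables
        if (not included or table in included) and table not in excluded
    }


tables = {"A", "B", "C"}
first_sync = tables_in_scope(tables, include="A, B")
second_sync = tables_in_scope(tables, include="C")

# Tables registered earlier but no longer in scope would get the
# Missing from Source status, as in the example with tables A, B and C.
print(sorted(first_sync - second_sync))
```

The set difference at the end mirrors why these rules are unsuitable for synchronizing in batches: narrowing the scope in a later run leaves the earlier tables out of scope.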
About source tags
Tags created and assigned in the data source can be registered and synchronized in Data Catalog.
To do this, select the Include Source Tags checkbox when you define the synchronization rule for a schema. As a result, the tags defined on the assets in the data source are registered and available from the Schema, Table, Database View, and Column assets in the Source Tags attribute.
Note Currently, the JDBC registration process can synchronize source tags only from Snowflake.
- The naming convention for source tags synchronized from Snowflake is:
  - <source_tag_name>=<source_tag_value>, for example: cost_center=sales.
  - <source_tag_name>, if no values are assigned to the tag, for example PII.
- We apply the same inheritance for source tags as the data source. Example:
  - If a tag was assigned to a schema in Snowflake, the tag will be registered for the related Schema, Table, Column and View assets in Data Catalog.
  - If a tag was assigned to an account in Snowflake, the tag will be registered for the related Schema, Table, Column and View assets in Data Catalog.
For information on tags in Snowflake, go to the Snowflake documentation.
- Don't change the source tags in Data Catalog, as the changes are not pushed to the data source. If you change the tags in Data Catalog and then synchronize the data source again, your updates will be overwritten with the information from the data source.
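The naming convention and inheritance described in the list above can be sketched as follows. This is an illustrative model only: the function names `source_tag_label` and `effective_tags` and the tuple-based asset paths are hypothetical, and the inheritance shown is a simplification that unions the tags of every ancestor, whereas Snowflake's own rules are more detailed.

```python
from __future__ import annotations

from typing import Optional


def source_tag_label(name: str, value: Optional[str] = None) -> str:
    """Format a tag as <name>=<value>, or just <name> when no value is set."""
    if not value:
        return name
    return f"{name}={value}"


def effective_tags(asset_path: tuple[str, ...],
                   assigned: dict[tuple[str, ...], set[str]]) -> set[str]:
    """Collect tags assigned to the asset and to each of its ancestors.

    asset_path is a hypothetical (schema, table, column) tuple.
    """
    tags: set[str] = set()
    for depth in range(1, len(asset_path) + 1):
        tags |= assigned.get(asset_path[:depth], set())
    return tags


assigned = {
    ("sales",): {source_tag_label("cost_center", "sales")},
    ("sales", "orders", "email"): {source_tag_label("PII")},
}
print(sorted(effective_tags(("sales", "orders", "email"), assigned)))
```

In this model, a tag set on the schema is visible on its tables and columns, while a tag set on a single column stays on that column.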
By default, we read the Snowflake source tags from the <database_name>.INFORMATION_SCHEMA.tag_references. This is possible with the minimum required permissions for the metadata scan.
To increase the performance of the Snowflake metadata synchronization, we can also read the Snowflake source tags from the SNOWFLAKE.ACCOUNT_USAGE schema. This can be configured in the Edge JDBC Ingestion capability as an other property with name tags-strategy and value SINGLE_CALL. Note that this method requires the SELECT permission on the SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES table.