About synchronizing Amazon S3

Synchronizing Amazon S3 is the process of ingesting metadata from a selected Amazon S3 repository and making that metadata available in Collibra Data Intelligence Cloud.

When you synchronize Amazon S3, the content of your Amazon S3 repository is analyzed and represented in Collibra by means of assets and their characteristics.
Technically, the synchronization happens in several steps:

Collibra creates crawlers in AWS Glue, based on the crawlers defined in Collibra.
If AWS Glue contains databases with metadata from a previous synchronization, the databases are deleted.
Each AWS Glue crawler crawls a location in Amazon S3 based on its include path. For each domain assigned to one or more crawlers, AWS Glue creates a database with the crawling results.
Collibra ingests those databases and creates assets, attributes and relations as required to match the metadata.
The AWS Glue crawlers are deleted.
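
The order of these steps can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: it does not call AWS Glue or Collibra, and the names Crawler, GlueState and synchronize are hypothetical helpers invented for this illustration, not part of either product's API. Matching S3 objects by path prefix is also a simplification of what a real crawler does.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Crawler:
    """Hypothetical crawler definition as configured in Collibra."""
    name: str
    include_path: str  # S3 location to crawl
    domain: str        # Collibra domain the crawler is assigned to


@dataclass
class GlueState:
    """Hypothetical stand-in for AWS Glue: crawlers plus result databases."""
    crawlers: dict = field(default_factory=dict)
    databases: dict = field(default_factory=dict)


def synchronize(defined_crawlers, glue, s3_objects):
    """Run the five steps against in-memory state and return the ingested metadata."""
    # Step 1: create Glue crawlers from the crawlers defined in Collibra.
    for crawler in defined_crawlers:
        glue.crawlers[crawler.name] = crawler

    # Step 2: delete databases left over from a previous synchronization.
    glue.databases.clear()

    # Step 3: each crawler crawls its include path; results are grouped
    # into one database per domain.
    for crawler in glue.crawlers.values():
        found = [key for key in s3_objects if key.startswith(crawler.include_path)]
        glue.databases.setdefault(crawler.domain, []).extend(found)

    # Step 4: Collibra ingests the databases (here: copy the metadata out).
    ingested = {domain: sorted(keys) for domain, keys in glue.databases.items()}

    # Step 5: the Glue crawlers are deleted. This sketch leaves the databases
    # in place, so step 2 has something to clean up on the next run.
    glue.crawlers.clear()
    return ingested
```

Calling synchronize twice with the same GlueState object shows why step 2 exists: the second run starts by clearing the databases that the first run left behind.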

Starting the synchronization

You can synchronize manually, or you can automate synchronization by adding a schedule that is defined with a cron expression.

Only one S3 File System can be synchronized at a time. If a synchronization job is in progress and a synchronization of another S3 File System is triggered, manually or automatically, the new job is queued.

If a synchronization job is still running and a new synchronization of the same S3 File System is triggered, manually or automatically, the running synchronization continues and the new request is ignored.
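
The two rules above can be modeled as a tiny scheduler. This is a minimal sketch, not Collibra's implementation: the class SyncScheduler and its methods are hypothetical. It also makes one assumption that the text does not cover: a request for a file system that is already waiting in the queue is ignored rather than queued a second time.

```python
from collections import deque


class SyncScheduler:
    """Hypothetical model: one running job, a FIFO queue for the rest."""

    def __init__(self):
        self.running = None   # name of the S3 File System being synchronized
        self.queue = deque()  # file systems waiting for their turn

    def trigger(self, file_system):
        """Handle a manual or scheduled request and return what happened to it."""
        if self.running is None:
            self.running = file_system
            return "started"
        if file_system == self.running:
            # The same file system is already synchronizing: ignore the request.
            return "ignored"
        if file_system in self.queue:
            # Assumption, not stated in the text: no duplicate queue entries.
            return "ignored"
        self.queue.append(file_system)
        return "queued"

    def finish(self):
        """Mark the running job as done and start the next queued one, if any."""
        self.running = self.queue.popleft() if self.queue else None
        return self.running
```

With this model, triggering "fs-a" twice in a row yields "started" and then "ignored", while triggering "fs-b" during the run of "fs-a" yields "queued".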

Synchronization results

After synchronization, the resulting assets are in the domain that was specified in the crawler.

Warning Do not move the assets to another domain. Doing so may lead to errors during future synchronizations. This is a known limitation.

By default, the assets are shown in a plain list, but you can enable a multi-path hierarchy to show them in a tree structure. For the best result, we recommend that you use the following relations:

S3 Bucket contains Directory
Directory contains Directory
Directory contains File
Directory contains File Group
File contains Table
File Group contains Table
Table contains Column
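
To see how these relations produce a tree, the sketch below nests a flat list of assets according to the relations listed above. It is an illustration only: the function render_tree, the tuple layout (name, type, parent name) and the requirement that asset names are unique are assumptions made for this example and do not reflect how Collibra stores or renders hierarchies.

```python
# Parent asset type -> allowed child asset types, taken from the relations above.
CONTAINS = {
    "S3 Bucket": ["Directory"],
    "Directory": ["Directory", "File", "File Group"],
    "File": ["Table"],
    "File Group": ["Table"],
    "Table": ["Column"],
}


def render_tree(assets, parent=None, depth=0):
    """Return indented lines for assets given as (name, type, parent_name) tuples."""
    types = {name: asset_type for name, asset_type, _ in assets}
    lines = []
    for name, asset_type, parent_name in assets:
        if parent_name != parent:
            continue
        # Skip children whose type is not allowed under the parent's type.
        if parent is not None and asset_type not in CONTAINS.get(types[parent], []):
            continue
        lines.append("  " * depth + f"{name} ({asset_type})")
        lines.extend(render_tree(assets, name, depth + 1))
    return lines
```

For example, a bucket that contains a directory, which contains a file, which contains a table with one column, is rendered as five lines, each indented two spaces further than its parent.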

The following image shows the resulting hierarchical table.

Naming convention

Synchronizing Amazon S3 relies on a naming convention to match assets during the synchronization process. We strongly recommend that you do not change the full name of S3 File System assets.

Warning Editing the full name of S3 File System assets may lead to errors during the synchronization process.