About synchronizing Amazon S3
Synchronizing Amazon S3 is the process of ingesting metadata from a selected Amazon S3 repository and making the data available in Collibra Data Intelligence Cloud.
When you synchronize Amazon S3, the content of your Amazon S3 repository is analyzed and represented in Collibra by means of assets and their characteristics.
Technically, the synchronization happens in several steps:
- Collibra creates crawlers in AWS Glue, based on the crawlers defined in Collibra.
- If AWS Glue contains databases with metadata from a previous synchronization, the databases are deleted.
- Each AWS Glue crawler crawls a location in Amazon S3 based on its include path. For each domain assigned to one or more crawlers, AWS Glue creates a database with the crawling results.
- Collibra ingests those databases and creates assets, attributes and relations as required to match the metadata.
- The AWS Glue crawlers are deleted.
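The order of the steps above can be illustrated with a small in-memory model. This is only a sketch: `Crawler`, `FakeGlue` and `synchronize` are hypothetical names made up for this example, no AWS or Collibra API is called, and "ingesting" is reduced to copying the crawl results.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Hypothetical crawler definition: a target domain and an S3 include path."""
    domain: str
    include_path: str


@dataclass
class FakeGlue:
    """In-memory stand-in for AWS Glue; it only stores crawlers and databases."""
    crawlers: list = field(default_factory=list)
    databases: dict = field(default_factory=dict)


def synchronize(defined_crawlers, glue, s3_objects):
    """Walk through the five steps and return the ingested metadata per domain."""
    # Step 1: create crawlers in Glue from the crawlers defined in Collibra.
    glue.crawlers = list(defined_crawlers)
    # Step 2: delete databases left over from a previous synchronization.
    glue.databases.clear()
    # Step 3: each crawler crawls its include path; one database per domain.
    for crawler in glue.crawlers:
        found = [key for key in s3_objects if key.startswith(crawler.include_path)]
        glue.databases.setdefault(crawler.domain, []).extend(found)
    # Step 4: ingest the databases (here: copy them; Collibra would create assets).
    ingested = {domain: sorted(keys) for domain, keys in glue.databases.items()}
    # Step 5: delete the crawlers.
    glue.crawlers.clear()
    return ingested
```

In this model, two crawlers assigned to the same domain end up in a single database, and a database left over from an earlier run disappears before the new crawl starts.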
Starting the synchronization
You can start the synchronization manually, or you can automate it by adding a synchronization schedule that is defined by a cron expression.
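As a rough illustration of what a cron expression encodes, the sketch below labels the fields of a standard five-field expression. The five-field layout is an assumption: this page does not state which cron dialect Collibra accepts (some dialects add a seconds field), so check the product documentation before using a concrete expression. `describe_cron` is a hypothetical helper, not part of any Collibra API.

```python
# Field order of classic five-field cron syntax (an assumption for this sketch).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into labeled fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# "0 2 * * *" means: minute 0, hour 2, every day of every month.
schedule = describe_cron("0 2 * * *")
```

Under that assumption, `0 2 * * *` would run the synchronization once a day at 02:00.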
You can only synchronize one S3 File System at a time. If a synchronization job is in progress and a synchronization of another S3 File System is triggered, manually or automatically, the new job is queued.
If a synchronization job is still running and a new synchronization of the same S3 File System is triggered, manually or automatically, the running synchronization continues and the new request is ignored.
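The two rules can be modeled with a small scheduler. This is a sketch of one reading of the text, not Collibra's implementation: it assumes that queuing applies to a different S3 File System and that ignoring applies to the one currently being synchronized. `SyncScheduler` and its methods are hypothetical names.

```python
from collections import deque


class SyncScheduler:
    """Toy model: one running job, a queue for others, duplicates ignored."""

    def __init__(self):
        self.running = None   # S3 File System currently being synchronized
        self.queue = deque()  # requests waiting for the running job to finish

    def trigger(self, file_system):
        """Handle a manual or scheduled trigger and report what happened."""
        if self.running is None:
            self.running = file_system
            return "started"
        if file_system == self.running:
            # Same S3 File System is already synchronizing: drop the request.
            return "ignored"
        # Another S3 File System: wait for the running job to finish.
        # (How repeated requests for a queued file system are handled is not
        # described in the text, so this sketch simply appends them.)
        self.queue.append(file_system)
        return "queued"

    def finish(self):
        """Mark the running job as done and start the next queued one, if any."""
        self.running = self.queue.popleft() if self.queue else None
        return self.running
```

For example, triggering `"fs-a"`, then `"fs-a"` again, then `"fs-b"` yields `"started"`, `"ignored"` and `"queued"`; calling `finish()` then starts `"fs-b"`.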
Synchronization results
After synchronization, the resulting assets are in the domain that was specified in the crawler.
Warning: Do not move the assets to another domain. Doing so may lead to errors during future synchronizations. This is a known limitation.
By default, the assets are shown in a plain list, but you can enable a multi-path hierarchy to show them in a tree structure. For the best result, we recommend that you use the following relations:
- S3 Bucket contains Directory
- Directory contains Directory
- Directory contains File
- Directory contains File Group
- File contains Table
- File Group contains Table
- Table contains Column
The following image shows the resulting hierarchical table.
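As a text-only illustration of how the "contains" relations listed above form a tree, the sketch below renders a set of parent/child pairs as an indented outline. The asset names are invented for the example and `render_tree` is a hypothetical helper; Collibra builds the hierarchy itself once the multi-path hierarchy is enabled.

```python
from collections import defaultdict


def render_tree(relations, root):
    """Return indented lines for a tree built from (parent, child) pairs."""
    children = defaultdict(list)
    for parent, child in relations:
        children[parent].append(child)

    lines = []

    def walk(node, depth):
        # Indent each node by two spaces per level below the root.
        lines.append("  " * depth + node)
        for child in children.get(node, []):
            walk(child, depth + 1)

    walk(root, 0)
    return lines


# Invented example: S3 Bucket > Directory > Directory > File > Table > Column.
relations = [
    ("my-bucket", "reports"),
    ("reports", "2023"),
    ("2023", "sales.csv"),
    ("sales.csv", "sales"),
    ("sales", "amount"),
]
print("\n".join(render_tree(relations, "my-bucket")))
```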
Naming convention
Synchronizing Amazon S3 relies on a naming convention to match assets during the synchronization process. We strongly recommend that you do not change the full name of the S3 File System assets.
Warning: Editing the full name of the S3 File System assets may lead to errors during the synchronization process.