Synchronizing Amazon S3

When you synchronize Amazon S3, the content of your Amazon S3 repository is analyzed and represented as assets and their characteristics.
You can synchronize manually, or you can automate synchronization by adding a schedule defined with a cron expression.
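
As an illustration of what a schedule could look like, the sketch below labels the fields of a cron expression. It assumes Quartz-style syntax (six fields starting with seconds, plus an optional year field); this page does not state the format, so check which syntax your environment expects. The helper name describe_cron is hypothetical.

```python
# Hypothetical helper: labels the fields of a Quartz-style cron expression.
# Assumption: the schedule uses Quartz syntax (seconds first, optional year).
QUARTZ_FIELDS = ("second", "minute", "hour", "day_of_month",
                 "month", "day_of_week", "year")


def describe_cron(expression: str) -> dict:
    """Map each whitespace-separated field to its Quartz field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    # zip stops at the shorter sequence, so a missing year is simply omitted.
    return dict(zip(QUARTZ_FIELDS, parts))


# Under the Quartz assumption, this expression means "every day at 02:00".
nightly = describe_cron("0 0 2 * * ?")
```

The helper only splits and names the fields; it does not validate the values or compute run times.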

  • You can synchronize only one S3 File System at a time. If a synchronization job is in progress and a job for another S3 File System is triggered, manually or automatically, the second job is queued.
  • If a synchronization job is still running and a new synchronization of the same S3 File System is triggered, manually or automatically, the running synchronization continues and the new request is ignored.
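
The two rules above can be modeled with a small in-memory sketch. This is not Collibra's implementation: the class and method names are hypothetical, and the sketch assumes that queuing applies to a different S3 File System, while a repeated request for the running one is dropped.

```python
from collections import deque


class SyncScheduler:
    """Toy model: one running job, other file systems queue, duplicates are ignored."""

    def __init__(self):
        self.running = None    # name of the file system being synchronized
        self.queue = deque()   # file systems waiting for their turn

    def trigger(self, file_system: str) -> str:
        if self.running is None:
            self.running = file_system
            return "started"
        if file_system == self.running:
            # Same file system is already synchronizing: the request is ignored.
            return "ignored"
        # A different file system waits. The page does not say what happens
        # to a request for a file system that is already queued; this sketch
        # simply queues it again.
        self.queue.append(file_system)
        return "queued"

    def finish(self):
        """Mark the running job as done and start the next queued one, if any."""
        self.running = self.queue.popleft() if self.queue else None
        return self.running
```

For example, triggering "fs-a" twice and then "fs-b" yields "started", "ignored" and "queued" in that order, and finish() then starts "fs-b".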

Technically, the synchronization happens in several steps:

  • If you did not complete the Glue database configuration parameter in the capability:
    1. Collibra creates crawlers in AWS Glue, based on the crawlers defined in Collibra.
    2. If AWS Glue contains databases with metadata from a previous synchronization, the databases are deleted.
    3. Each AWS Glue crawler crawls a location in Amazon S3 based on its include path. For each domain assigned to one or more crawlers, AWS Glue creates a database with the crawling results.
    4. Collibra ingests those databases and creates assets, attributes and relations as required to match the metadata.
      The resulting assets are in the domain that was specified in the crawler.
    5. The AWS Glue crawlers are deleted.
  • If you completed the Glue database configuration parameter in the capability:

    Collibra ingests the databases defined in the Glue database configuration parameter and creates assets, attributes and relations as required to match the metadata.
    The resulting assets are added to the domain specified in the parameter.

    The Glue database is never deleted, even if the "Delete Glue database left after previous synchronization of the file system" parameter is selected in the capability.
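
The two paths above can be sketched as a small in-memory model. It makes no AWS calls: the glue dictionary stands in for AWS Glue, and the function name, the data shapes, and the mapping of database name to domain are hypothetical choices made for illustration only.

```python
def synchronize(glue, crawlers, glue_database_param=None):
    """Model the synchronization steps and return a log of ingest actions.

    glue: {"crawlers": set of names, "databases": {domain: [include paths]}}
    crawlers: list of {"name", "include_path", "domain"} dicts
    glue_database_param: optional {database name: target domain} mapping
    """
    log = []
    if glue_database_param:
        # Parameter completed: ingest the listed databases as they are.
        # Nothing is crawled and the Glue databases are never deleted.
        for database, domain in glue_database_param.items():
            log.append(f"ingest {database} into {domain}")
        return log

    # Step 1: create a Glue crawler for each crawler defined in Collibra.
    for crawler in crawlers:
        glue["crawlers"].add(crawler["name"])
    # Step 2: delete databases left over from a previous synchronization.
    glue["databases"].clear()
    # Step 3: crawl each include path; results are grouped per domain.
    for crawler in crawlers:
        glue["databases"].setdefault(crawler["domain"], []).append(
            crawler["include_path"])
    # Step 4: ingest each database into the domain set on the crawler.
    for domain in glue["databases"]:
        log.append(f"ingest crawl results into {domain}")
    # Step 5: delete the Glue crawlers.
    glue["crawlers"].clear()
    return log
```

Two crawlers assigned to the same domain produce a single database entry in this model, matching the "for each domain assigned to one or more crawlers" wording in step 3.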

Warning: Do not move the assets to another domain. Doing so may lead to errors during future synchronizations. This is a known limitation.

Naming convention

Synchronizing Amazon S3 relies on a naming convention to match assets during the synchronization process. We highly recommend that you do not change the full name of S3 File System assets.

Warning: Editing the full name of S3 File System assets may lead to errors during the synchronization process.