Synchronize Amazon S3
When you synchronize Amazon S3, the content of your Amazon S3 repository is analyzed and represented as assets and their characteristics. You can synchronize manually or automate the process by adding a synchronization schedule.
- You can only synchronize one S3 File System at a time. If a synchronization job is in progress and a synchronization of another S3 File System is triggered (manually or automatically), the new job is queued.
- If a synchronization job is still running and a new synchronization of the same S3 File System is triggered (manually or automatically), the running synchronization continues and the new synchronization request is ignored.
- You can only create one synchronization schedule.
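The queueing rules above can be illustrated with a small toy model. This is only a sketch of the described behavior, not Collibra's implementation: the class name `SyncCoordinator` and its methods are hypothetical, and the handling of a repeated request for a file system that is already queued is an assumption, because the rules above do not cover that case.

```python
from collections import deque


class SyncCoordinator:
    """Toy model of the synchronization rules (hypothetical, illustration only)."""

    def __init__(self):
        self.running = None   # S3 File System currently being synchronized
        self.queue = deque()  # S3 File Systems waiting for their turn

    def trigger(self, file_system):
        """Return 'started', 'queued', or 'ignored' for a new request."""
        if self.running is None:
            self.running = file_system
            return "started"
        if file_system == self.running:
            # Same file system is still running: the new request is ignored.
            return "ignored"
        # A different file system has to wait for the running job.
        # Assumption: a file system that is already queued is queued again.
        self.queue.append(file_system)
        return "queued"

    def finish(self):
        """Mark the running job as done and start the next queued one, if any."""
        self.running = self.queue.popleft() if self.queue else None
        return self.running
```

For example, triggering `"fs-a"` twice and then `"fs-b"` yields `started`, `ignored`, and `queued`, in that order.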
Technically, the synchronization happens in several steps:
- If you did not complete the Glue database configuration parameter in the capability:
- Collibra creates crawlers in AWS Glue, based on the crawlers defined in Collibra.
- If AWS Glue contains databases with metadata from a previous synchronization, the databases are deleted.
- Each AWS Glue crawler crawls a location in Amazon S3 based on its include path. For each domain assigned to one or more crawlers, AWS Glue creates a database with the crawling results.
- Collibra ingests those databases and creates assets, attributes, and relations as required to match the metadata. The resulting assets are in the domain that was specified in the crawler.
- The AWS Glue crawlers are deleted.
- If you did complete the Glue database configuration parameter in the capability:
- Collibra ingests the databases defined in the Glue database configuration parameter and creates assets, attributes, and relations as required to match the metadata. The resulting assets are added to the domain specified in the parameter. The Glue database is never deleted, even if the Delete Glue database left after previous synchronization of the file system parameter is selected in the capability.
Warning Do not move the assets to another domain. Doing so may lead to errors during future synchronizations. This is a known limitation.
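As a rough illustration of the order of operations described above, the following sketch returns the steps of one synchronization run as plain strings. It does not call AWS Glue or Collibra. The function name `plan_synchronization`, its parameters, and the shape of the crawler data (pairs of include path and domain) are assumptions made for this example.

```python
def plan_synchronization(crawlers, glue_database=None, stale_databases=()):
    """Return the ordered steps of one synchronization run (illustration only).

    crawlers: list of (include_path, domain) pairs (hypothetical shape).
    glue_database: value of the Glue database configuration parameter, or None.
    stale_databases: AWS Glue databases left over from a previous synchronization.
    """
    if glue_database is not None:
        # Parameter completed: no crawlers are created and the
        # Glue database is never deleted.
        return [f"ingest database {glue_database}"]

    steps = [f"create crawler for {path}" for path, _ in crawlers]
    steps += [f"delete stale database {name}" for name in stale_databases]
    steps += [f"crawl {path}" for path, _ in crawlers]
    # AWS Glue creates one database per domain that has at least one crawler.
    domains = sorted({domain for _, domain in crawlers})
    steps += [f"ingest database for domain {domain}" for domain in domains]
    steps += [f"delete crawler for {path}" for path, _ in crawlers]
    return steps
```

Two crawlers assigned to the same domain produce two crawl steps but only one ingest step, which mirrors the one-database-per-domain behavior.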
Naming convention
Synchronizing Amazon S3 relies on a naming convention to match assets during the synchronization process. We highly recommend that you do not change the full name of the S3 File System asset.
Warning Editing the full name of an S3 File System asset may lead to errors during the synchronization process.
Prerequisites
In your Collibra environment
- You have registered an Amazon S3 file system.
- You have connected an S3 File System asset to Amazon S3.
- You have created one or more crawlers. If you completed the Glue database configuration parameter in the capability, you don't need to create crawlers.
- You have a global role with the View Edge connections and capabilities global permission, for example, Edge integration engineer.
- You have a global role with the Catalog global permission, for example, Catalog Author.
- You have a resource role with the Configure external system resource permission on the community or domain that contains the S3 File System, for example, Owner.
- You have a role with the following resource permissions on the S3 community you created when you registered an Amazon S3 file system:
- Asset: add
- Attribute: add
- Domain: add
- Attachment: add
In your AWS environment
- You have a programmatic AWS user and IAM role with the required permissions.
Steps
To synchronize manually:
- Open an S3 File System asset page.
- In the tab bar, click Configuration.
- In the Crawlers section, click Synchronize.
A notification indicates synchronization has started.
The synchronization job appears in the Activities list as a bulk synchronization.
The Synchronization Schedule section displays the time of the last synchronization.
Note In case of a partial synchronization caused by a temporary communication issue, the status of the assets that cannot be synchronized is set to Missing from source. During the next fully successful synchronization, the assets are removed or their previous status is restored, depending on their actual status in the source system.
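The status handling in the note can be sketched as two small functions. This is a simplified model under stated assumptions, not Collibra's logic: assets are represented as a dictionary of name to status, the function names are hypothetical, and the statuses `Candidate` and `Accepted` in the usage note are only example values.

```python
MISSING = "Missing from source"


def mark_partial(assets, unreachable):
    """After a partial synchronization, flag the assets that could not be synchronized.

    Returns the updated statuses and the previous statuses of the flagged assets.
    """
    previous = {name: assets[name] for name in unreachable if name in assets}
    updated = {
        name: (MISSING if name in previous else status)
        for name, status in assets.items()
    }
    return updated, previous


def reconcile(assets, previous, in_source):
    """After the next fully successful synchronization, remove or restore flagged assets."""
    result = {}
    for name, status in assets.items():
        if status != MISSING:
            result[name] = status
        elif name in in_source:
            # Still present in the source: restore the previous status.
            result[name] = previous.get(name, status)
        # Otherwise the asset no longer exists in the source and is removed.
    return result
```

With `{"a.csv": "Candidate", "b.csv": "Accepted"}` and `b.csv` unreachable, `mark_partial` sets `b.csv` to Missing from source. `reconcile` then either restores `Accepted` or drops `b.csv`, depending on whether it is still in the source.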
To add a synchronization schedule:
- Open an S3 File System asset page.
- In the tab bar, click Configuration.
- In the Synchronization Schedule section, click Add Schedule.
- Enter the required information.
| Field | Description |
| --- | --- |
| Repeat | The interval when you want to synchronize automatically. The possible values are: Daily, Weekly, Monthly, and Cron expression. |
| Cron | The Quartz Cron expression that determines when the synchronization takes place. This field is only visible if you select Cron expression in the Repeat field. |
| Every | The day on which you want to synchronize, for example, Sunday. This field is only visible if you select Weekly in the Repeat field. |
| Every first | The day of the month on which you want to synchronize, for example, Tuesday. This field is only visible if you select Monthly in the Repeat field. |
| At | The time at which you want to synchronize automatically, for example, 14:00. You can only schedule on the hour. For example, you can add a synchronization schedule at 8:00, but not at 8:45. This field is only visible if you select Daily, Weekly, or Monthly in the Repeat field. |
| Time zone | The time zone for the schedule. |

- Click Save.
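If you choose Cron expression in the Repeat field, the following sketch shows how the Daily, Weekly, and Monthly options could be written as Quartz Cron expressions. The helper `quartz_cron` is hypothetical and only builds strings in the standard Quartz field order (seconds, minutes, hours, day of month, month, day of week). Whether Collibra generates these exact expressions internally is not stated here, so treat them as equivalents under that assumption.

```python
# Quartz day-of-week abbreviations.
DAYS = {
    "Sunday": "SUN", "Monday": "MON", "Tuesday": "TUE", "Wednesday": "WED",
    "Thursday": "THU", "Friday": "FRI", "Saturday": "SAT",
}


def quartz_cron(repeat, hour, day=None):
    """Build an on-the-hour Quartz Cron expression (hypothetical helper)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if repeat == "Daily":
        return f"0 0 {hour} * * ?"
    if repeat == "Weekly":
        # '?' leaves the day of month unspecified when a weekday is given.
        return f"0 0 {hour} ? * {DAYS[day]}"
    if repeat == "Monthly":
        # '#1' selects the first occurrence of that weekday in the month.
        return f"0 0 {hour} ? * {DAYS[day]}#1"
    raise ValueError(f"unsupported repeat value: {repeat}")


print(quartz_cron("Daily", 14))
print(quartz_cron("Weekly", 14, "Sunday"))
print(quartz_cron("Monthly", 8, "Tuesday"))
```

The seconds and minutes fields are fixed at 0 to match the on-the-hour restriction of the At field.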
After the synchronization:
- You can view a summary of the results from the Activities list.
- You can view integrated Amazon S3 data.