Add the S3 synchronization capability
After you have enabled the settings to integrate S3 and you have an S3 connection, you need to add the S3 synchronization capability to the connection.
Before you begin
Required permissions
- You have a global role that has the System administration global permission.
- You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.
Steps
- Open an Edge site.
  - On the main menu, click [icon], and then click Settings.
    The Collibra settings page opens.
  - In the tab pane, click Edge.
    The Sites tab opens and shows a table with an overview of the Edge sites.
  - In the table, click the name of the Edge site whose status is Healthy.
    The Edge site page opens.
- In the Capabilities section, click Add capability.
  The Add capability page is shown.
- Enter the required information.
For each field: the field name, a description, and whether the field is required (Yes or No).
Capability
This section contains general information about the capability.
Name
The name of the Edge capability.
Yes
Description
The description of the Edge capability.
No
Capability template
The capability template. The value that you select in this field determines which sections appear on the page.
Select the following Edge capability: S3 synchronization
Yes
S3 service account
This section contains information about how to connect to Amazon S3.
AWS Connection
The AWS connection to be used.
Yes
IAM role
The IAM role to be used by the AWS Glue crawlers.
Yes
Encryption options
Select the type of encryption used to store the IAM role.
Default: To be encrypted by Edge management server.
Yes
Delete Glue database left after previous synchronization of the file system
Select the checkbox if you want the capability to delete the Glue databases created by previous runs of the capability before it starts the synchronization.
If you deselect this checkbox, the Glue databases created by previous runs are not removed. This can be useful for troubleshooting.
By default, this checkbox is selected.
No
Save input metadata
Select the checkbox if you want to save the input metadata extracted from the data source in ZIP files. The files can be useful for troubleshooting.
Select this option only at the request of Collibra Support. The Collibra Support team can provide the location of the saved ZIP files after the S3 synchronization.
By default, this checkbox is not selected.
No
Finalization Strategy
Define what happens to an asset in Collibra if the asset has been deleted from the S3 data source after an initial synchronization.
The possible values are:
- Change Status (default): The status of the asset in Collibra is updated to "Missing from source".
- Remove Resources: The asset is removed from Collibra.
- Ignore: Nothing changes for the asset in Collibra.
Yes
Custom parameter
Define additional parameters for the synchronization.
No
Advanced Configuration
This section contains configuration options that can help when investigating issues with the capability.
Important: Complete the fields Logging configuration, Memory (MiB), and JVM arguments only at the request of, or together with, Collibra Support.
No
Glue database configuration
Text in JSON format to define the Glue database names, regions, and domain IDs that you want to integrate.
Tip: Use this parameter if the current S3 synchronization crawler configuration doesn't meet your needs. With this parameter, you can integrate an AWS Glue database for which you defined crawlers in AWS Glue itself. This allows you to use all crawler options from the AWS Glue Console.
Important: If you use this parameter, any crawlers you create in Collibra are not taken into account during the S3 synchronization. However, you still need to create a dummy crawler in Collibra to start the synchronization. A dummy crawler is a crawler with an invalid include path, such as s3://dummy.
In a future release, we'll remove the need for a dummy crawler.
- The text must be in JSON format and can contain a block per database that you want to integrate.
  You can use any JSON validator to verify the format. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such JSON validators, and has no liability for such use.
- In a block, you can specify the Glue database name, region, and domain ID that must be ingested. The format is:
  "glueDbName": "the name of the AWS Glue database"
  "glueDbRegion": "the region of the AWS Glue database"
  "dgcDomainId": "the domain ID in Collibra where assets of the AWS Glue database must be added"
  If you don't add the domain ID, the assets are added in the same domain as the S3 File System asset.
Example
[
  {
    "glueDbName": "integrations-auto-1",
    "glueDbRegion": "eu-west-1",
    "dgcDomainId": "a3fe0607-65af-43d6-bc2c-7c3adae6e162"
  },
  {
    "glueDbName": "integrations-auto-2",
    "glueDbRegion": "eu-west-1"
  }
]
In this example:
- Assets from the AWS Glue database "integrations-auto-1" will be ingested into the domain with ID "a3fe0607-65af-43d6-bc2c-7c3adae6e162".
- Assets from the AWS Glue database "integrations-auto-2" will be ingested into the same domain as the S3 File System asset.
No
Logging
Define whether you want to create debug logs for the synchronization. If the value is True, the debug logs appear in the Collibra logs.
Yes
- Click Create.
The capability is added to the Edge site.
The fields become read-only.
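If you prefer not to submit the Glue database configuration text to an online JSON validator, you can check it locally before you paste it into the field. The following Python sketch is only an illustration and not a Collibra tool. The function name check_glue_db_config is hypothetical, and the sketch assumes that glueDbName and glueDbRegion are expected in every block, that dgcDomainId is the only optional key, and that all values are non-empty strings.

```python
import json

# Keys described for the Glue database configuration field.
ALLOWED_KEYS = {"glueDbName", "glueDbRegion", "dgcDomainId"}
# Assumption: only dgcDomainId may be left out of a block.
REQUIRED_KEYS = {"glueDbName", "glueDbRegion"}

# The example configuration from this page.
EXAMPLE = """
[
  {
    "glueDbName": "integrations-auto-1",
    "glueDbRegion": "eu-west-1",
    "dgcDomainId": "a3fe0607-65af-43d6-bc2c-7c3adae6e162"
  },
  {
    "glueDbName": "integrations-auto-2",
    "glueDbRegion": "eu-west-1"
  }
]
"""


def check_glue_db_config(text):
    """Return a list of problems found in the text (empty if none)."""
    try:
        blocks = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(blocks, list):
        return ["top level must be a JSON array of blocks"]

    problems = []
    for i, block in enumerate(blocks):
        if not isinstance(block, dict):
            problems.append(f"block {i}: must be a JSON object")
            continue
        keys = set(block)
        for key in sorted(REQUIRED_KEYS - keys):
            problems.append(f"block {i}: missing {key}")
        for key in sorted(keys - ALLOWED_KEYS):
            problems.append(f"block {i}: unexpected key {key}")
        for key in sorted(keys & ALLOWED_KEYS):
            value = block[key]
            if not isinstance(value, str) or not value:
                problems.append(f"block {i}: {key} must be a non-empty string")
    return problems


if __name__ == "__main__":
    found = check_glue_db_config(EXAMPLE)
    print("OK" if not found else "\n".join(found))
```

A check like this also catches curly quotation marks, which can slip in when the text is copied from a formatted document, because the JSON parser accepts only straight double quotes around keys and values.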
What's next?
The Edge preparations are complete. You can now continue with the setup steps to integrate an Amazon S3 file system via Edge.