Add the S3 synchronization capability

Important

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Platform! You can learn more about this latest UI in the UI overview.

Before you begin

You either created and installed an Edge site or were granted a Collibra Cloud site.
You have prepared Edge for the S3 integration.
You have created an S3 connection.

Required permissions

You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.

Steps

Open a site.
1. On the main toolbar, click → Settings.
  The Settings page opens.
2. In the tab pane, click Edge.
  The Sites tab opens and shows a table with an overview of your sites.
3. In the table, click the name of the site whose status is Healthy.
  The site page opens.
In the Capabilities section, click Add capability.
  The Add capability page opens.
Select the S3 synchronization capability template.

Enter the required information.

Field	Description	Required
Capability	This section contains general information about the capability.
Name	The name of the capability.	Yes
Description	The description of the capability.	No
Capability template	The capability template. The value that you select in this field determines which sections appear on the page. Select the following capability: `S3 synchronization`	Yes
S3 service account	This section contains information about how to connect to Amazon S3.
AWS Connection	The AWS connection to be used.	Yes
IAM role	The IAM role to be used by the AWS Glue crawlers.	Yes
Delete Glue database left after previous synchronization of the file system	Select the checkbox if you want the capability to delete the Glue databases created by previous runs of the capability, before the capability starts the synchronization. If you deselect this checkbox, the Glue databases created by previous runs are not removed. This can be useful for troubleshooting. By default, this checkbox is selected.	No
Save input metadata	Select the checkbox if you want to save the input metadata extracted from the data source in ZIP files. The files can be useful for troubleshooting. Select this option only on request of Collibra Support. The Collibra Support team can provide the location of the saved ZIP files after the S3 synchronization. By default, this checkbox is not selected.	No
Finalization Strategy	Defines what happens in Collibra if an asset has been deleted from the S3 data source after an initial synchronization. The possible values are: Change Status (default): the status of the asset in Collibra is updated to "Missing from source"; Remove Resources: the asset is removed from Collibra; Ignore: nothing changes for the asset in Collibra.	Yes
Logging parameter	You can use this field to customize the debug logging. Important: Only complete this field at the request of, or together with, Collibra Support.	No
Custom parameter	Use this field to ingest File Group assets as File assets. Enter the following values: Name: file-group-as-file; Type: Text; Encryption: Not encrypted (Plain text); Value: true.	No
Glue database configuration	Text in JSON format that defines the Glue database names, regions, and domain IDs that you want to integrate. Tip: Use this parameter if the current S3 synchronization crawler configuration doesn't meet your needs. With this parameter, you can integrate an AWS Glue database for which you defined crawlers in AWS Glue itself. This allows you to use all crawler options from the AWS Glue Console. If you use this parameter, you don't need to create crawlers in Collibra. Important: If you use this parameter, any crawlers you create in Collibra are not taken into account during the S3 synchronization. However, you still need to create a dummy crawler in Collibra to start the synchronization. A dummy crawler is a crawler with an invalid include path, such as `s3://dummy`. In a future release, we'll remove the need for a dummy crawler. The text must be in JSON format and can contain one block per database that you want to integrate. You can use any JSON validator to verify the format. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such JSON validators, and has no liability for such use. In a block, you can specify the Glue database name, region, and domain ID that must be ingested. The format is: `"glueDbName": "the name of the AWS Glue database"`, `"glueDbRegion": "the region of the AWS Glue database"`, `"dgcDomainId": "the domain ID in Collibra where assets of the AWS Glue database must be added"`. If you don't add the domain ID, the assets are added in the same domain as the S3 File System asset. Example: `[ { "glueDbName": "integrations-auto-1", "glueDbRegion": "eu-west-1", "dgcDomainId": "a3fe0607-65af-43d6-bc2c-7c3adae6e162" }, { "glueDbName": "integrations-auto-2", "glueDbRegion": "eu-west-1" } ]` In this example, assets from the AWS Glue database "integrations-auto-1" are ingested into the domain with ID "a3fe0607-65af-43d6-bc2c-7c3adae6e162", and assets from the AWS Glue database "integrations-auto-2" are ingested into the same domain as the S3 File System asset.	No
Advanced Configuration: Logging configuration, Memory, JVM arguments	These configuration options help when investigating issues with the capability. Important: Only complete the fields Save Input Metadata, Logging configuration, Memory (MiB), and JVM arguments at the request of, or together with, Collibra Support. Only use Log level if your data source is a commercial JDBC offering. For more information, go to the Collibra Marketplace.	No
Debug	This setting is not valid for this integration and should be set to false. It is an option to automatically send Edge infrastructure log files to Collibra Platform. By default, this option is set to false. Note: We highly recommend sending Edge infrastructure log files to Collibra Platform only when you have issues with Edge. If you set this option to true, it automatically reverts to false after 24 hours.	No
Log level	An option to determine the verbosity of the log files. The default value is `No logging`.	No

Click Create.
The capability is added to the Edge or Collibra Cloud site.
The fields become read-only.
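
If you prefer to check the text for the Glue database configuration field locally instead of submitting it to an online JSON validator, a small script can do it. The following Python sketch is not part of Collibra; the function name `validate_glue_config` and its rules are assumptions derived from the field description above: the text is a JSON array of blocks, each block has `glueDbName` and `glueDbRegion`, and `dgcDomainId` is optional. Treating the name and region as mandatory, rejecting other keys, and expecting the domain ID to look like a UUID (as in the documented example) are also assumptions.

```python
import json
import uuid

# Assumption: name and region are mandatory, the domain ID is optional.
REQUIRED_KEYS = ("glueDbName", "glueDbRegion")
OPTIONAL_KEYS = ("dgcDomainId",)

# The example from the field description.
EXAMPLE = """
[
  {
    "glueDbName": "integrations-auto-1",
    "glueDbRegion": "eu-west-1",
    "dgcDomainId": "a3fe0607-65af-43d6-bc2c-7c3adae6e162"
  },
  {
    "glueDbName": "integrations-auto-2",
    "glueDbRegion": "eu-west-1"
  }
]
"""


def validate_glue_config(text):
    """Return a list of problems found in the text; an empty list means none were found."""
    try:
        blocks = json.loads(text)
    except json.JSONDecodeError as err:
        # Typical cause: curly quotes pasted from a document instead of straight quotes.
        return [f"not valid JSON: {err}"]

    if not isinstance(blocks, list):
        return ["top level must be a JSON array of database blocks"]

    problems = []
    for index, block in enumerate(blocks):
        if not isinstance(block, dict):
            problems.append(f"block {index}: must be a JSON object")
            continue
        for key in REQUIRED_KEYS:
            value = block.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"block {index}: '{key}' must be a non-empty string")
        for key in block:
            if key not in REQUIRED_KEYS + OPTIONAL_KEYS:
                problems.append(f"block {index}: unexpected key '{key}'")
        if "dgcDomainId" in block:
            try:
                # Assumption: domain IDs are UUIDs, as in the documented example.
                uuid.UUID(str(block["dgcDomainId"]))
            except ValueError:
                problems.append(f"block {index}: 'dgcDomainId' does not look like a UUID")
    return problems


if __name__ == "__main__":
    found = validate_glue_config(EXAMPLE)
    for problem in found:
        print(problem)
    if not found:
        print("no problems found")
```

The script only checks the shape of the text. It cannot confirm that a Glue database, region, or domain actually exists, so a clean result does not guarantee a successful synchronization.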

What's next?

The Edge preparations are complete. You can now continue with the setup steps to integrate an Amazon S3 file system via Edge.