Add the S3 synchronization capability

Important 

In Collibra 2024.05, we launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.

Before you begin

Required permissions

You have a global role that has the Manage connections and capabilities global permission, for example, Edge integration engineer.

Steps

  1. Open an Edge site.
    1. On the main toolbar, click the Products icon, and then click Settings (cogwheel icon).
      The Collibra settings page opens.
    2. In the tab pane, click Edge.
      The Sites tab opens and shows a table with an overview of the Edge sites.
    3. In the table, click the name of the Edge site whose status is Healthy.
      The Edge site page opens.
  2. In the Capabilities section, click Add capability.
    The Add capability page is shown.
  3. Select the S3 synchronization capability template.
  4. Enter the required information.
    Field | Description | Required

    Capability

    This section contains general information about the capability.

    Name

    The name of the Edge capability.

    Yes

    Description

    The description of the Edge capability.

    No

    Capability template

    The capability template. The value that you select in this field determines which sections appear on the page.

    Select the following Edge capability:

    S3 synchronization

    Yes

    S3 service account

    This section contains information about how to connect to Amazon S3.
    AWS Connection
    The AWS connection to be used.

    Yes

    IAM role
    The IAM role to be used by the AWS Glue crawlers.

    Yes

    Delete Glue database left after previous synchronization of the file system

    Select the checkbox if you want the capability to delete the Glue databases created by previous runs of the capability, before the capability starts the synchronization.
    If you deselect this checkbox, the Glue databases created by previous runs are not removed. This can be useful for troubleshooting.

    By default, this checkbox is selected.

    No

    Save input metadata

    Select the checkbox if you want to save the input metadata extracted from the data source in ZIP files. The files can be useful for troubleshooting.
    Select this option only at the request of Collibra Support. The Collibra Support team can provide the location of the saved ZIP files after the S3 synchronization.

    By default, this checkbox is not selected.

    No

    Finalization Strategy

    Define what you want to do if an asset has been deleted from the S3 data source after an initial synchronization.
    The possible values are:

    • Change Status (default): If an asset has been deleted from the S3 data source after an initial synchronization, we update the status of the asset in Collibra to "Missing from source".
    • Remove Resources: If an asset has been deleted from the S3 data source after an initial synchronization, we remove the asset from Collibra.
    • Ignore: If an asset has been deleted from the S3 data source after an initial synchronization, we don't change anything for the asset in Collibra.
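The three strategies can be compared with a small sketch. This is only an illustration of the behavior described above, not Collibra's implementation; the names `FinalizationStrategy` and `finalize`, and the representation of assets as a name-to-status dictionary, are hypothetical.

```python
from enum import Enum


class FinalizationStrategy(Enum):
    # Values mirror the option labels in the Finalization Strategy field.
    CHANGE_STATUS = "Change Status"
    REMOVE_RESOURCES = "Remove Resources"
    IGNORE = "Ignore"


def finalize(assets, still_in_source,
             strategy=FinalizationStrategy.CHANGE_STATUS):
    """Return the asset statuses after a synchronization (hypothetical model).

    assets: dict mapping asset name to its current status in Collibra.
    still_in_source: set of asset names that still exist in the S3 data source.
    """
    result = {}
    for name, status in assets.items():
        if name in still_in_source:
            # The asset still exists in the source: nothing to finalize.
            result[name] = status
        elif strategy is FinalizationStrategy.CHANGE_STATUS:
            result[name] = "Missing from source"
        elif strategy is FinalizationStrategy.IGNORE:
            result[name] = status
        # REMOVE_RESOURCES: the asset is left out of the result.
    return result


if __name__ == "__main__":
    before = {"orders.csv": "Candidate", "old.csv": "Candidate"}
    for strategy in FinalizationStrategy:
        print(strategy.value, finalize(before, {"orders.csv"}, strategy))
```

In this model, only the asset that is no longer in the source (`old.csv`) is affected by the chosen strategy.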

    Yes

    Logging parameter

    You can use this field to customize the debug logging.

    Important Complete this field only at the request of, or together with, Collibra Support.

    No

    Custom parameter

    Use this field to ingest File Group assets as File assets. Add a parameter with the following settings:

    • Name: file-group-as-file
    • Type: Text
    • Encryption: Not encrypted (Plain text). Depending on the UI, this setting can appear as Value Type with the option Plaintext.
    • Value: true

    No

    Glue database configuration

    Glue database configuration

    Text in JSON format to define the Glue database names, regions, and domain IDs that you want to integrate.

    Tip  Use this parameter if the current S3 synchronization crawler configuration doesn’t meet your needs. With this parameter, you can integrate an AWS Glue database for which you defined crawlers in AWS Glue itself. This allows you to use all crawler options from the AWS Glue Console. If you use this parameter, you don't need to create crawlers in Collibra.

    Important  If you use this parameter, crawlers that you create in Collibra are not taken into account during the S3 synchronization. However, you still need to create a dummy crawler in Collibra to start the synchronization. A dummy crawler is a crawler with an invalid include path, such as s3://dummy.
    In a future release, we'll remove the need for a dummy crawler.

    • The text must be in JSON format and can contain a block per database that you want to integrate.
      You can use any JSON validator to verify the format. Collibra is not responsible for the privacy, confidentiality, or protection of the data you submit to such JSON validators, and has no liability for such use.
    • In a block, you can specify the Glue database name, region, and domain ID that must be ingested. The format is:
      • "glueDbName": "the name of the AWS Glue database"
      • "glueDbRegion": "the region of the AWS Glue database"
      • "dgcDomainId": "the domain ID in Collibra where assets of the AWS Glue database must be added"
        If you don't add the domain ID, the assets are added in the same domain as the S3 File System asset.

    Example 
    [
    	{
    		"glueDbName": "integrations-auto-1",
    		"glueDbRegion": "eu-west-1",
    		"dgcDomainId": "a3fe0607-65af-43d6-bc2c-7c3adae6e162"
    	},
    	{
    		"glueDbName": "integrations-auto-2",
    		"glueDbRegion": "eu-west-1"
    	}
    ]

    In this example:

    • Assets from the AWS Glue database "integrations-auto-1" will be ingested into the domain with ID "a3fe0607-65af-43d6-bc2c-7c3adae6e162".
    • Assets from the AWS Glue database "integrations-auto-2" will be ingested into the same domain as the S3 File System asset.
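If you prefer not to paste the configuration into an online JSON validator, you can check it locally. The following is a minimal sketch using only the Python standard library. The function name `check_glue_config` is hypothetical, and the rules it applies are assumptions drawn from the description above: "glueDbName" and "glueDbRegion" are treated as required, "dgcDomainId" as optional, and the domain ID is expected to look like a UUID, as in the example. Collibra may apply different or additional checks.

```python
import json
import uuid

ALLOWED_KEYS = {"glueDbName", "glueDbRegion", "dgcDomainId"}
REQUIRED_KEYS = ("glueDbName", "glueDbRegion")  # assumption, see lead-in


def check_glue_config(text):
    """Return a list of problems found in the configuration text."""
    try:
        blocks = json.loads(text)
    except json.JSONDecodeError as err:
        # Typical cause: curly quotes or a trailing comma.
        return [f"not valid JSON: {err}"]
    if not isinstance(blocks, list):
        return ["top level must be a JSON array of database blocks"]

    problems = []
    for i, block in enumerate(blocks):
        if not isinstance(block, dict):
            problems.append(f"block {i}: must be a JSON object")
            continue
        for key in REQUIRED_KEYS:
            value = block.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"block {i}: missing or empty {key}")
        unknown = sorted(set(block) - ALLOWED_KEYS)
        if unknown:
            problems.append(f"block {i}: unexpected keys {unknown}")
        domain_id = block.get("dgcDomainId")
        if domain_id is not None:
            try:
                if not isinstance(domain_id, str):
                    raise ValueError("not a string")
                uuid.UUID(domain_id)
            except ValueError:
                problems.append(f"block {i}: dgcDomainId is not a valid UUID")
    return problems


if __name__ == "__main__":
    config = """
    [
      {"glueDbName": "integrations-auto-1",
       "glueDbRegion": "eu-west-1",
       "dgcDomainId": "a3fe0607-65af-43d6-bc2c-7c3adae6e162"},
      {"glueDbName": "integrations-auto-2",
       "glueDbRegion": "eu-west-1"}
    ]
    """
    print(check_glue_config(config) or "no problems found")
```

An empty list means that the text parsed as JSON and matched the assumed structure. It does not confirm that the database, region, or domain exists.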


    No

    Advanced Configuration
    • Logging configuration
    • Memory
    • JVM arguments

    These configuration options help when investigating issues with the capability.

    Important Complete the fields Save Input Metadata, Logging configuration, Memory (MiB), and JVM arguments only at the request of, or together with, Collibra Support.

    No

    Debug

    This setting is not valid for this integration. It should be set to false.

    An option to automatically send Edge infrastructure log files to Collibra Data Intelligence Platform. By default, this option is set to false.

    Note We highly recommend sending Edge infrastructure log files to Collibra Data Intelligence Platform only when you have issues with Edge. If you set this option to true, it automatically reverts to false after 24 hours.

    No

    Log level

    An option to determine the verbosity of the log files. The default value is No logging.

    No

  5. Click Create.
    The capability is added to the Edge site.
    The fields become read-only.

What's next?

The Edge preparations are complete. You can now continue with the setup steps to integrate an Amazon S3 file system via Edge.