Steps overview: Integrate AWS Glue via Edge

The integration steps vary slightly depending on how you choose to connect to your data source.

Integrate via Shared Storage connection

# Step Description
1 Review the preflight checks.

The preflight checks cover key considerations for a successful integration, including the required Edge, technical lineage, and data source-specific permissions, and network requirements.

2 Set up Fluentd.

Use the following steps to configure Fluentd to receive and store OpenLineage events.

3 Set up the OpenLineage Apache Spark integration and prepare the data source files for shared storage.

Install and configure the OpenLineage integration.

4 Create a Shared Storage connection.

A Shared Storage connection allows you to grant your capabilities access to files from a shared folder.

Important: Shared Storage connections are not supported for Collibra Cloud sites.

5 Add the Technical Lineage for AWS Glue - OpenLineage capability for Shared Storage connections.

Add the technical lineage capability to your Edge or Collibra Cloud site. The capability allows Collibra Data Lineage to retrieve lineage information from your data source.

6 Synchronize the technical lineage.

You can synchronize your technical lineage manually or automatically by adding a synchronization schedule.
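
Before you synchronize, it can help to sanity-check the event files that Fluentd wrote to the shared folder. The following Python sketch assumes one JSON-encoded OpenLineage event per line, which depends on how the Fluentd output is formatted and is not confirmed by this guide. The function names are hypothetical. It reports lines that are not valid JSON or that lack the top-level fields that the OpenLineage specification requires for a run event.

```python
import json

# Top-level fields that the OpenLineage specification requires on a run event.
REQUIRED_FIELDS = ("eventTime", "run", "job", "producer", "schemaURL")


def missing_fields(event):
    """Return the required top-level fields that are absent from an event."""
    return [name for name in REQUIRED_FIELDS if name not in event]


def check_event_lines(lines):
    """Return (line_number, problem) tuples for lines that are not usable events.

    Assumption: each non-empty line holds exactly one JSON object.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(event, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = missing_fields(event)
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems
```

To check a file, pass an open file handle, for example `check_event_lines(open(path, encoding="utf-8"))`, where `path` points to a file in your shared folder. An empty result means that every line parsed and carried the required fields. It does not guarantee that the synchronization will succeed.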

Integrate via Cloud Storage connection

For Cloud Storage connections, this guide describes the Fluentd-based workflow. The OpenLineage Spark agent can also be configured to emit events directly to a cloud storage location. For information about other transport options, go to the Apache Spark section in the OpenLineage documentation.
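
As a rough illustration of the Fluentd-based workflow, the following Python sketch assembles the Spark properties that register the OpenLineage listener and send events over the HTTP transport. The property names come from the OpenLineage Spark integration. The URL, endpoint, and namespace values, and both helper functions, are hypothetical placeholders. Joining the properties into a single `--conf` job parameter is a commonly used approach for AWS Glue jobs, not something this guide prescribes.

```python
def openlineage_spark_conf(fluentd_url, endpoint, namespace):
    """Return Spark properties that send OpenLineage events over HTTP.

    fluentd_url, endpoint, and namespace are placeholders; use the values
    from your own Fluentd setup.
    """
    return {
        # Register the OpenLineage listener with Spark.
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Send events with the HTTP transport to the Fluentd input.
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": fluentd_url,
        "spark.openlineage.transport.endpoint": endpoint,
        # Logical namespace under which the jobs are reported.
        "spark.openlineage.namespace": namespace,
    }


def to_glue_conf_value(conf):
    """Join the properties into one value for a Glue job's --conf parameter."""
    pairs = [f"{key}={value}" for key, value in conf.items()]
    return " --conf ".join(pairs)


if __name__ == "__main__":
    conf = openlineage_spark_conf(
        fluentd_url="http://fluentd.example.internal:9880",  # hypothetical host
        endpoint="/openlineage.events",  # hypothetical path
        namespace="glue-jobs",  # hypothetical namespace
    )
    print(to_glue_conf_value(conf))
```

If you emit events directly to cloud storage instead, the transport properties differ. Refer to the OpenLineage documentation for the properties of each transport type.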

# Step Description
1 Review the preflight checks.

The preflight checks cover key considerations for a successful integration, including the required Edge, technical lineage, and data source-specific permissions, and network requirements.

2 Set up Fluentd.

Use the following steps to configure Fluentd to receive and store OpenLineage events.

3 Set up the OpenLineage Apache Spark integration and prepare the data source files for cloud storage.

Install and configure the OpenLineage integration.

4 If you use an AWS connection to access your data source files, prepare S3 for the connection.

Depending on your security requirements and where your Edge site is hosted, you can choose EC2 or IAM authentication types. Both authentication types require you to configure S3 permissions before creating the AWS connection.

5 Create a Cloud Storage connection.

For guidance on how to create a connection between your cloud-based storage system and Edge or Collibra Cloud site, go to the appropriate topic:

  • AWS Glue: Create an AWS connection to an Edge or Collibra Cloud site
  • AWS Glue: Create an Azure Data Lake Storage connection to an Edge or Collibra Cloud site
  • AWS Glue: Create a Google Cloud Platform connection to an Edge or Collibra Cloud site

6 Add the Technical Lineage for AWS Glue - OpenLineage (Cloud) capability for Cloud Storage connections.

Add the technical lineage capability to your Edge or Collibra Cloud site. The capability allows Collibra Data Lineage to retrieve lineage information from your data source.

7 Synchronize the technical lineage.

You can synchronize your technical lineage manually or automatically by adding a synchronization schedule.

What's next

After you synchronize the technical lineage, you can view the ingestion report. The report shows the impact of the technical lineage synchronization on the assets in Collibra.

Helpful resources