Steps overview: Integrate Apache Airflow via Edge

The integration steps vary slightly depending on how you choose to connect to your data source.

Integrate via Shared Storage connection

# Step Description
1 Review the preflight checks.

The preflight checks list key considerations that help ensure a successful integration, including the required Edge, technical lineage, and data source-specific permissions, and the network requirements.
2 Set up OpenLineage Airflow integration.

Install and configure the OpenLineage integration.

3 Set up Fluentd and prepare the data source files for shared storage.

After you configure your software to emit OpenLineage messages, use Fluentd to collect these messages. Fluentd is the data collector that the OpenLineage community prefers.

4 Create a Shared Storage connection.

A Shared Storage connection allows you to grant your capabilities access to files from a shared folder.

Important: Shared Storage connections are not supported for Collibra Cloud sites.

5 Add the Technical Lineage for Airflow - OpenLineage capability for Shared Storage connections.

Add the technical lineage capability to your Edge or Collibra Cloud site. The capability allows the lineage harvester to retrieve data from your data source.

6 Synchronize the technical lineage.

You can synchronize your technical lineage manually or automatically by adding a synchronization schedule.
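As a minimal sketch of the kind of configuration that step 2 involves, the Python snippet below builds the environment variables that point the Airflow OpenLineage provider at an HTTP collector such as Fluentd. It assumes Airflow 2.7 or later with the apache-airflow-providers-openlineage package. The URL, port, and namespace are hypothetical placeholders, and `build_openlineage_env` is an illustrative helper, not part of any library. The dedicated setup topic remains the authority on the exact values to use.

```python
import json
import shlex


def build_openlineage_env(collector_url, namespace, endpoint="api/v1/lineage"):
    """Return environment variables for the Airflow OpenLineage provider.

    collector_url and namespace are placeholders supplied by the caller;
    endpoint is the path appended to the URL by the HTTP transport.
    """
    transport = {"type": "http", "url": collector_url, "endpoint": endpoint}
    return {
        # JSON-encoded transport settings read by the [openlineage] config section.
        "AIRFLOW__OPENLINEAGE__TRANSPORT": json.dumps(transport),
        # Namespace attached to the jobs in emitted events.
        "AIRFLOW__OPENLINEAGE__NAMESPACE": namespace,
    }


if __name__ == "__main__":
    # Hypothetical collector address and namespace; replace with your own.
    env = build_openlineage_env("http://localhost:9880", "my_airflow_instance")
    for name, value in env.items():
        # Print shell export lines that can be added to the Airflow environment.
        print(f"export {name}={shlex.quote(value)}")
```

Printing export lines instead of setting the variables in-process keeps the sketch independent of how your Airflow deployment loads its environment.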

Integrate via Cloud Storage connection

# Step Description
1 Review the preflight checks.

The preflight checks list key considerations that help ensure a successful integration, including the required Edge, technical lineage, and data source-specific permissions, and the network requirements.
2 Set up OpenLineage Airflow integration.

Install and configure the OpenLineage integration.

3 Set up Fluentd and prepare the data source files for cloud storage.

After you configure your software to emit OpenLineage messages, use Fluentd to collect these messages. Fluentd is the data collector that the OpenLineage community prefers.

4 Create a Cloud Storage connection.

For guidance on how to create a connection between your cloud-based storage system and your Edge or Collibra Cloud site, go to the appropriate topic:

  • Airflow: Create an AWS connection to an Edge or Collibra Cloud site
  • Airflow: Create an Azure Data Lake Storage connection to an Edge or Collibra Cloud site
  • Airflow: Create a Google Cloud Platform connection to an Edge or Collibra Cloud site

5 Add the Technical Lineage for Airflow - OpenLineage (Cloud) capability for Cloud Storage connections.

Add the technical lineage capability to your Edge or Collibra Cloud site. The capability allows the lineage harvester to retrieve data from your data source.

6 Synchronize your technical lineage.

You can synchronize your technical lineage manually or automatically by adding a synchronization schedule.
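Before you upload or share the files that Fluentd collects in step 3, it can help to confirm that they contain well-formed OpenLineage events. The sketch below is one possible check, assuming that Fluentd is configured to write one JSON event per line. That output format is an assumption, not something this overview specifies. The fields checked are a subset of what the OpenLineage run event specification requires, and `find_invalid_events` is a hypothetical helper name.

```python
import json

# Fields that an OpenLineage run event is expected to carry (subset of the spec).
REQUIRED_PATHS = (
    ("eventTime",),
    ("run", "runId"),
    ("job", "namespace"),
    ("job", "name"),
)


def find_invalid_events(lines):
    """Return (line_number, reason) pairs for lines that fail the basic checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # Ignore blank lines.
        try:
            event = json.loads(text)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(event, dict):
            problems.append((number, "not a JSON object"))
            continue
        for path in REQUIRED_PATHS:
            node = event
            for key in path:
                # Walk the nested keys; stop descending if a level is not an object.
                node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                problems.append((number, "missing " + ".".join(path)))
                break  # Report only the first missing field per line.
    return problems


if __name__ == "__main__":
    # Illustrative sample lines, not real harvested data.
    sample = [
        '{"eventType": "COMPLETE", "eventTime": "2024-01-01T00:00:00Z", '
        '"run": {"runId": "r1"}, "job": {"namespace": "ns", "name": "dag.task"}}',
        '{"eventTime": "2024-01-01T00:00:00Z", "run": {}, '
        '"job": {"namespace": "ns", "name": "dag.task"}}',
        "not json",
    ]
    for number, reason in find_invalid_events(sample):
        print(f"line {number}: {reason}")
```

To check a real file, pass an open file handle instead of the sample list, because the function accepts any iterable of lines.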

What's next

After you synchronize the technical lineage, you can view the ingestion report. The report shows the impact of the technical lineage synchronization on the assets in Collibra.

Helpful resources