Steps overview: Integrate Apache Airflow via Edge
The integration steps vary slightly depending on how you choose to connect to your data source.
Integrate via Shared Storage connection
| # | Step | Description |
|---|---|---|
| 1 | Review the preflight checks. | Key considerations to help ensure a successful integration, such as the required Edge, technical lineage, and data source-specific permissions, and the network requirements. |
| 2 | Set up OpenLineage Airflow integration. | Install and configure the OpenLineage integration. |
| 3 | Set up Fluentd and prepare the data source files for shared storage. | After you configure your software to emit OpenLineage messages, use Fluentd to collect these messages. Fluentd is the data collector that the OpenLineage community prefers. |
| 4 | Create a Shared Storage connection. | A Shared Storage connection gives your capabilities access to files in a shared folder. Important: Shared Storage connections are not supported for Collibra Cloud sites. |
| 5 | Add the Technical Lineage for Airflow - OpenLineage capability for Shared Storage connections. | Add the technical lineage capability to your Edge or Collibra Cloud site. The capability allows the lineage harvester to retrieve data from your data source. |
| 6 | Synchronize the technical lineage. | You can synchronize your technical lineage manually or automatically by adding a synchronization schedule. |
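
To illustrate steps 2 and 3, the following Python sketch builds the environment variables that point the Airflow OpenLineage provider at a Fluentd HTTP input. It assumes that you use the `apache-airflow-providers-openlineage` package, which reads its transport settings from `AIRFLOW__OPENLINEAGE__TRANSPORT`. The host name, the port `9880` (the default of the Fluentd `in_http` plugin), the `api/v1/lineage` endpoint, and the namespace `airflow-prod` are placeholder assumptions, not values prescribed by this integration. Use the values from your own Fluentd setup.

```python
import json
import shlex


def openlineage_env(fluentd_url, namespace, endpoint="api/v1/lineage"):
    """Return Airflow environment variables for an OpenLineage HTTP transport.

    fluentd_url, namespace, and endpoint are illustrative placeholders;
    replace them with the values of your own Fluentd HTTP input.
    """
    transport = {"type": "http", "url": fluentd_url, "endpoint": endpoint}
    return {
        "AIRFLOW__OPENLINEAGE__TRANSPORT": json.dumps(transport),
        "AIRFLOW__OPENLINEAGE__NAMESPACE": namespace,
    }


if __name__ == "__main__":
    # Hypothetical host name; 9880 is the default port of Fluentd's in_http plugin.
    env = openlineage_env("http://fluentd.example.internal:9880", "airflow-prod")
    for name, value in env.items():
        # Print shell export lines that you can add to the Airflow environment.
        print(f"export {name}={shlex.quote(value)}")
```

Running the script prints `export` lines that you can add to the environment of your Airflow scheduler and workers. The transport is built as a dictionary and serialized once, so the JSON stays well formed when you change the URL or endpoint.
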
Integrate via Cloud Storage connection
| # | Step | Description |
|---|---|---|
| 1 | Review the preflight checks. | Key considerations to help ensure a successful integration, such as the required Edge, technical lineage, and data source-specific permissions, and the network requirements. |
| 2 | Set up OpenLineage Airflow integration. | Install and configure the OpenLineage integration. |
| 3 | Set up Fluentd and prepare the data source files for cloud storage. | After you configure your software to emit OpenLineage messages, use Fluentd to collect these messages. Fluentd is the data collector that the OpenLineage community prefers. |
| 4 | Create a Cloud Storage connection. | For guidance on creating a connection between your cloud-based storage system and your Edge or Collibra Cloud site, go to the topic for your storage system. |
| 5 | Add the Technical Lineage for Airflow - OpenLineage (Cloud) capability for Cloud Storage connections. | Add the technical lineage capability to your Edge or Collibra Cloud site. The capability allows the lineage harvester to retrieve data from your data source. |
| 6 | Synchronize your technical lineage. | You can synchronize your technical lineage manually or automatically by adding a synchronization schedule. |
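
Before you create the storage connection in step 4, it can help to check that the files Fluentd collected in step 3 contain well-formed OpenLineage events. The following Python sketch is one possible check, not part of the integration itself. It assumes that Fluentd writes one JSON event per line, which depends on your Fluentd output configuration. The required fields are those of an OpenLineage run event (`eventTime`, `producer`, `schemaURL`, `run.runId`, `job.namespace`, and `job.name`). The function names are hypothetical.

```python
import json

# Top-level fields that the OpenLineage specification requires for a run event.
REQUIRED_TOP_LEVEL = ("eventTime", "producer", "schemaURL", "run", "job")


def missing_fields(event):
    """Return the required OpenLineage fields that are absent from one event."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in event]
    run = event.get("run")
    if isinstance(run, dict) and "runId" not in run:
        missing.append("run.runId")
    job = event.get("job")
    if isinstance(job, dict):
        for key in ("namespace", "name"):
            if key not in job:
                missing.append(f"job.{key}")
    return missing


def check_lines(lines):
    """Map 1-based line numbers to the problems found on those lines.

    Assumes one JSON event per line; blank lines are skipped.
    """
    problems = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            problems[number] = ["not valid JSON"]
            continue
        if not isinstance(event, dict):
            problems[number] = ["not a JSON object"]
            continue
        missing = missing_fields(event)
        if missing:
            problems[number] = missing
    return problems
```

For example, `check_lines(open(path, encoding="utf-8"))` returns an empty dictionary when every non-empty line in the file at `path` is a complete event. The check covers only the presence of required fields, not whether the lineage harvester can process the content.
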
After you synchronize the technical lineage, you can view the ingestion report, which shows the impact of the synchronization on the assets in Collibra.
Helpful resources
- Airflow integration preflight checks
- Architecture for creating technical lineage for Airflow
- Edge harvester network requirements
- Connect to a Collibra Data Lineage service instance via OAuth authentication
- Connect to a proxy server
- Airflow: Supported transformation details
- Automatic stitching for technical lineage
- Technical lineage admin options
- Delete a technical lineage