Airflow: Supported transformation details
Collibra Data Lineage uses the OpenLineage standard to create technical lineage for OpenLineage, Apache Airflow, and AWS Glue. Although Apache Airflow and AWS Glue appear as separate options in the interface, they rely on the same underlying OpenLineage framework. OpenLineage defines a standardized format for emitting metadata events that describe job execution and dataset interactions. Collibra Data Lineage processes these events to create technical lineage.
Apache Airflow and AWS Glue are listed separately because their setup steps have been validated and documented with dedicated setup instructions.
Collibra Data Lineage supports data sources listed in OpenLineage Naming Specification (Version 1.41.0) in the OpenLineage documentation. To ensure proper parsing and asset stitching, your OpenLineage events must follow the namespace format defined in that specification.
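As an illustration of the namespace format, the following sketch builds a minimal OpenLineage run event as a plain Python dictionary. It assumes a PostgreSQL source, for which the naming specification describes a namespace of the form postgres://{host}:{port} and a dataset name of the form {database}.{schema}.{table}. The host, database, job names, producer URI, and schema version are hypothetical placeholders; check the specification for the exact format that applies to your own data source.

```python
import json
import uuid
from datetime import datetime, timezone


def postgres_dataset(host, port, database, schema, table):
    """Return a dataset reference that follows the PostgreSQL naming convention."""
    return {
        "namespace": f"postgres://{host}:{port}",
        "name": f"{database}.{schema}.{table}",
    }


def build_run_event(job_namespace, job_name, inputs, outputs):
    """Assemble a minimal run event with the given input and output datasets."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Placeholder producer URI (assumption, not a real producer).
        "producer": "https://example.com/hypothetical-producer",
        # Schema version shown here is an assumption; use the one your producer emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": inputs,
        "outputs": outputs,
    }


# Hypothetical source and target tables.
orders = postgres_dataset("db.example.com", 5432, "sales", "public", "orders")
summary = postgres_dataset("db.example.com", 5432, "sales", "reporting", "order_summary")

event = build_run_event("example_airflow", "daily_sales.load_summary", [orders], [summary])
print(json.dumps(event, indent=2))
```

In practice the OpenLineage integration for Airflow or Spark emits these events for you; the sketch only shows which fields carry the namespace and name values that must match the specification.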
Function scope
Collibra Data Lineage provides visibility into how data flows at both table and column levels:
- Table-level lineage
Shows the inputs and outputs for each job.
Tip: To view table-level lineage for jobs, switch to the Objects view. This information is not available in the Attributes view.
- Column-level lineage
Supported as described in Column Level Lineage Dataset Facet in the OpenLineage documentation. The level of support varies across integrations:
| Integration | Column-level lineage support |
| --- | --- |
| Apache Airflow | Supported for specific classes. For details, see Supported classes in the Airflow documentation. |
| AWS Glue | Supported for Spark SQL DataFrames only, because the OpenLineage Spark plugin cannot extract data lineage from AWS Glue Spark Jobs that use AWS Glue DynamicFrames. For details, see Data lineage in Amazon DataZone in the AWS documentation or Quickstart with AWS Glue in the OpenLineage documentation. |
| Other data sources | Depends on how the lineage files are created and which facets are populated. |
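To make the column-level case more concrete, the following sketch reads the columnLineage facet from the output datasets of an event and lists the resulting column-to-column edges. The sample event, its dataset names, and the helper name column_lineage_edges are hypothetical. The sketch only shows where the information sits in an event and is not a description of how Collibra Data Lineage processes it internally.

```python
def column_lineage_edges(event):
    """Return (source, target) pairs, each a (namespace, dataset, column) tuple."""
    edges = []
    for output in event.get("outputs", []):
        facet = output.get("facets", {}).get("columnLineage")
        if facet is None:
            continue  # this output carries table-level lineage only
        for target_column, details in facet.get("fields", {}).items():
            for source in details.get("inputFields", []):
                edges.append((
                    (source["namespace"], source["name"], source["field"]),
                    (output["namespace"], output["name"], target_column),
                ))
    return edges


# Hypothetical event fragment: one output column derived from one input column.
sample_event = {
    "inputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "sales.public.orders"}
    ],
    "outputs": [
        {
            "namespace": "postgres://db.example.com:5432",
            "name": "sales.reporting.order_summary",
            "facets": {
                "columnLineage": {
                    "fields": {
                        "total_amount": {
                            "inputFields": [
                                {
                                    "namespace": "postgres://db.example.com:5432",
                                    "name": "sales.public.orders",
                                    "field": "amount",
                                }
                            ]
                        }
                    }
                }
            },
        }
    ],
}

for source, target in column_lineage_edges(sample_event):
    print(".".join(source[1:]), "->", ".".join(target[1:]))
```

If an output dataset has no columnLineage facet, the function returns no edges for it, which corresponds to the case where only table-level lineage is available.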
Lineage extraction mechanism
Collibra Data Lineage creates lineage from OpenLineage events captured during job execution.
- Event capture
For Airflow, Collibra Data Lineage uses the OpenLineage Airflow integration.
For AWS Glue, Collibra Data Lineage uses the OpenLineage Spark integration.
- Advanced SQL analysis
Collibra Data Lineage parses and analyzes the SQL statements as part of the SQL Job Facet.
When OpenLineage files contain SQL statements that need to be analyzed for lineage extraction, Collibra Data Lineage parses and analyzes the SQL statements instead of using the OpenLineage SQL Parser. This is because Collibra Data Lineage supports more SQL dialects and advanced SQL features.
- Stitching
Collibra Data Lineage uses the namespace and name attributes in the events to stitch technical lineage to existing data assets in Data Catalog.
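The sketch below shows where these pieces of information appear in an OpenLineage event: the SQL text in the SQL Job Facet (the sql facet of the job, with a query field) and the namespace and name pairs of the input and output datasets. The helper names and the sample event are hypothetical, and the code does not reproduce the parsing or stitching logic of Collibra Data Lineage. It can be useful for checking that your events contain the values you expect before you ingest them.

```python
def sql_statement(event):
    """Return the SQL text from the job's sql facet, or None if it is absent."""
    return event.get("job", {}).get("facets", {}).get("sql", {}).get("query")


def dataset_keys(event):
    """Return the set of (namespace, name) pairs for all input and output datasets."""
    datasets = event.get("inputs", []) + event.get("outputs", [])
    return {(ds["namespace"], ds["name"]) for ds in datasets}


# Hypothetical event fragment with a SQL Job Facet and one input and one output.
sample_event = {
    "job": {
        "namespace": "example_airflow",
        "name": "daily_sales.load_summary",
        "facets": {
            "sql": {
                "query": "INSERT INTO reporting.order_summary "
                         "SELECT customer_id, SUM(amount) FROM public.orders "
                         "GROUP BY customer_id"
            }
        },
    },
    "inputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "sales.public.orders"}
    ],
    "outputs": [
        {"namespace": "postgres://db.example.com:5432", "name": "sales.reporting.order_summary"}
    ],
}

print(sql_statement(sample_event))
for namespace, name in sorted(dataset_keys(sample_event)):
    print(namespace, name)
```

A dataset whose namespace or name does not follow the naming specification would still appear in this output, so comparing the printed pairs against the specification is one way to spot stitching problems early.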