Supported transformation details
Collibra Data Lineage supports the most commonly used transformations in the following sources:
- Apache Airflow (via OpenLineage), AWS Glue (via OpenLineage), and OpenLineage
- Azure Data Factory
- Databricks Unity Catalog
- dbt
- Google Dataplex
- IBM DataStage
- Informatica PowerCenter
- Informatica Intelligent Cloud Services
- Snowflake
- SQL Server Integration Services
OpenLineage, Apache Airflow (via OpenLineage), and AWS Glue (via OpenLineage)
You can create technical lineage for OpenLineage on Edge. Collibra Data Lineage creates technical lineage for Airflow by using the OpenLineage Airflow integration and AWS Glue by using the OpenLineage Spark integration.
Collibra Data Lineage supports table-level lineage for jobs, which shows the inputs and outputs for each job.
Collibra Data Lineage also supports column-level lineage, as described in Column Level Lineage Dataset Facet in the OpenLineage documentation. The level of support varies across integrations. Additionally, Collibra Data Lineage parses and analyzes the SQL statements as part of the SQL Job Facet.
- Apache Airflow: Supports column-level lineage for specific classes. For details, see Supported classes in the Airflow documentation.
- AWS Glue: Supports column-level lineage for Spark SQL DataFrames only, because the OpenLineage Spark plugin cannot extract data lineage from AWS Glue Spark Jobs that use AWS Glue DynamicFrames. For details, see Data lineage in Amazon DataZone in the AWS documentation or Quickstart with AWS Glue in the OpenLineage documentation.
- OpenLineage: Support depends on how the lineage files are created.
When OpenLineage files contain SQL statements that need to be analyzed for lineage extraction, Collibra Data Lineage parses and analyzes the SQL statements instead of using the OpenLineage SQL Parser. This is because Collibra Data Lineage supports more SQL dialects and advanced SQL features.
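To make the difference between table-level and column-level lineage concrete, the following minimal sketch reads both from one OpenLineage run event. The event is a hand-written, hypothetical example (the job, dataset, and column names are assumptions), and the helper functions `table_edges` and `column_edges` are illustrative only. They do not represent how Collibra Data Lineage processes OpenLineage files.

```python
from typing import Iterator, Tuple

# Hypothetical OpenLineage run event, trimmed to the parts relevant for lineage.
NS = "postgres://example-db"
event = {
    "eventType": "COMPLETE",
    "job": {"namespace": "example", "name": "load_orders"},
    "inputs": [{"namespace": NS, "name": "public.raw_orders"}],
    "outputs": [
        {
            "namespace": NS,
            "name": "public.orders",
            "facets": {
                # Column Level Lineage Dataset Facet: output column -> input fields.
                "columnLineage": {
                    "fields": {
                        "order_id": {
                            "inputFields": [
                                {"namespace": NS, "name": "public.raw_orders", "field": "id"}
                            ]
                        },
                        "total": {
                            "inputFields": [
                                {"namespace": NS, "name": "public.raw_orders", "field": "price"},
                                {"namespace": NS, "name": "public.raw_orders", "field": "quantity"},
                            ]
                        },
                    }
                }
            },
        }
    ],
}


def table_edges(ev: dict) -> Iterator[Tuple[str, str]]:
    """Yield (input dataset, output dataset) pairs: table-level lineage of the job."""
    for src in ev.get("inputs", []):
        for dst in ev.get("outputs", []):
            yield src["name"], dst["name"]


def column_edges(ev: dict) -> Iterator[Tuple[str, str]]:
    """Yield (source column, target column) pairs from the columnLineage facet."""
    for dst in ev.get("outputs", []):
        facet = dst.get("facets", {}).get("columnLineage", {})
        for target_col, spec in facet.get("fields", {}).items():
            for src in spec.get("inputFields", []):
                yield f'{src["name"]}.{src["field"]}', f'{dst["name"]}.{target_col}'


for edge in table_edges(event):
    print("table :", edge)
for edge in sorted(column_edges(event)):
    print("column:", edge)
```

An event without a `columnLineage` facet still yields table-level edges but no column-level edges in this sketch, which mirrors why the level of support depends on how the lineage files are created.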
Azure Data Factory
Collibra Data Lineage supports the most commonly used transformations and data sources in Azure Data Factory.
Supported transformations
The following table shows a non-exhaustive list of supported and unsupported transformations.
Supported data sources
The following table shows a non-exhaustive list of supported sources with the corresponding dataset and linked service types.
Collibra Data Lineage supports all data format types that are supported in Azure Data Factory, including binary, Excel, delimited text, JSON, and Parquet.
Data sources | Dataset type | Linked service type |
---|---|---|
Amazon Redshift | AmazonRedshiftTable | AmazonRedshift |
Azure Blob storage | AzureBlob | AzureBlobStorage |
Azure Data Lake Storage Gen2 | AzureBlobFSFile | AzureBlobFS |
Azure Data Lake Store | AzureDataLakeStoreFile | AzureDataLakeStore |
Azure Databricks Delta Lake | AzureDatabricksDeltaLake | AzureDatabricksDeltaLake |
Azure SQL Managed Instance | AzureSqlMITable | AzureSqlMI |
Azure SQL Server database | AzureSqlTable | AzureSqlDatabase |
Azure Synapse Analytics | AzureSqlDWTable | AzureSqlDW |
DB2 data source | Db2Table | Db2 |
Google Cloud Storage | GoogleCloudStorageLocation | GoogleCloudStorage |
Microsoft Access | MicrosoftAccessTable | MicrosoftAccess |
Microsoft Azure Cosmos Database | CosmosDbSqlApiCollection | CosmosDb |
Open Database Connectivity (ODBC) | OdbcTable | Odbc |
On-premises Oracle database | OracleTable | Oracle |
REST | RestResource | RestService |
Salesforce | SalesforceObject | Salesforce |
Salesforce Marketing Cloud | SalesforceMarketingCloudObject | SalesforceMarketingCloud |
Salesforce Service Cloud | SalesforceServiceCloudObject | SalesforceServiceCloud |
SAP Business Warehouse (open hub) | SapOpenHubTable | SapBW |
SFTP server | SftpLocation | Sftp |
Snowflake | SnowflakeTable | Snowflake |
SQL Server | SqlServerTable | SqlServer |
Supported activity types
A Data Factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. There are three groupings of activities: data movement activities, data transformation activities, and control activities. For a complete list of Azure Data Factory activity types and descriptions, see Microsoft's documentation on pipelines and activities.
Collibra Data Lineage currently supports the following activity types:
Activity type | Activity group |
---|---|
Copy | Data movement |
Data Flow | Data transformation |
Execute Pipeline | Control |
For Each | Control |
Get Metadata | Control |
If Condition | Control |
Lookup | Control |
Set Variable | Control |
Web Activity | Control |
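As a rough pre-check before you create technical lineage, you could scan an exported pipeline definition for activity types that are not in the table above. The following sketch is an illustration under assumptions: the `type` strings (for example `ExecuteDataFlow` for Data Flow) are assumed to be the values used in Azure Data Factory pipeline JSON and are not taken from the Collibra documentation, and the pipeline itself is a made-up example.

```python
import json

# Assumed mapping from the "type" value in pipeline JSON to the activity group.
# The keys are assumptions about how the activity types in the table above
# appear in Azure Data Factory pipeline definitions.
SUPPORTED_ACTIVITY_TYPES = {
    "Copy": "Data movement",
    "ExecuteDataFlow": "Data transformation",
    "ExecutePipeline": "Control",
    "ForEach": "Control",
    "GetMetadata": "Control",
    "IfCondition": "Control",
    "Lookup": "Control",
    "SetVariable": "Control",
    "WebActivity": "Control",
}

# Hypothetical pipeline definition, reduced to names and types.
pipeline_json = """
{
  "name": "ExamplePipeline",
  "properties": {
    "activities": [
      {"name": "CopyOrders", "type": "Copy"},
      {"name": "WaitABit", "type": "Wait"},
      {"name": "LoopFiles", "type": "ForEach",
       "typeProperties": {
         "activities": [{"name": "CleanFile", "type": "ExecuteDataFlow"}]
       }}
    ]
  }
}
"""


def classify(activities, supported=SUPPORTED_ACTIVITY_TYPES):
    """Split activities into (listed, not_listed) lists of (name, type).

    Activities nested under typeProperties.activities (as in ForEach)
    are visited recursively.
    """
    listed, not_listed = [], []
    for act in activities:
        bucket = listed if act["type"] in supported else not_listed
        bucket.append((act["name"], act["type"]))
        nested = act.get("typeProperties", {}).get("activities", [])
        sub_listed, sub_not_listed = classify(nested, supported)
        listed.extend(sub_listed)
        not_listed.extend(sub_not_listed)
    return listed, not_listed


pipeline = json.loads(pipeline_json)
listed, not_listed = classify(pipeline["properties"]["activities"])
print("In the supported table:", listed)
print("Not in the supported table:", not_listed)
```

The sketch only walks the nested activities of a For Each activity. Other containers, such as the branches of an If Condition, would need additional handling.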
Databricks Unity Catalog
- Collibra Data Lineage retrieves lineage information from the lineage system tables that build on the Unity Catalog's data lineage feature, and visualizes lineage down to the column level. Specifically, Collibra Data Lineage ingests lineage for Databases, Schemas, Tables, and Columns, but does not ingest any other assets such as Notebooks or Workflows. So, while Collibra Data Lineage retrieves lineage information from notebooks, Collibra Data Lineage does not ingest or include the notebook assets in the technical lineage.
  Note: Currently, Databricks system tables don't include DLT (Delta Live Tables) column lineage. However, Collibra Data Lineage captures lineage for Databricks Streaming Tables and Materialized Views at both table and column levels.
- Collibra Data Lineage retrieves lineage information from the lineage system tables and does not parse the language used to develop notebooks and jobs in Databricks to generate technical lineage. Therefore, you can use any supported language in Databricks. For examples of how Unity Catalog captures and presents data lineage, go to Capture and view data lineage with Unity Catalog in the Databricks documentation.
- Collibra Data Lineage extracts column lineage from the system.access.column_lineage table in Databricks Unity Catalog. Since the system.access.column_lineage table records lineage over time, Collibra Data Lineage ingests cumulative lineage for a given time frame rather than just the latest version.
- Collibra Data Lineage for Databricks Unity Catalog extracts SQL source code from Databricks Unity Catalog and includes the source code in the technical lineage viewer. To extract source code, ensure that the system.query.history system table is enabled. SQL source code is captured and becomes accessible only once the system.query.history table is enabled.
- Collibra Data Lineage for Databricks Unity Catalog supports external delta tables referenced by external paths.
Example: If the following SQL is used in Databricks Unity Catalog, lineage is created in Collibra.
CREATE OR REPLACE TABLE table_from_direct_delta_query AS (SELECT * FROM delta.`s3://kktesting/testfolder`)
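The difference between cumulative lineage for a time frame and the latest version only, as described for the system.access.column_lineage table above, can be illustrated with a small simulation. This sketch uses an in-memory SQLite table as a stand-in for the Databricks system table. The column names follow the Databricks system table, but the rows, the time frame, and the helper functions `cumulative_edges` and `latest_edges` are hypothetical. The queries that Collibra Data Lineage actually runs are not documented here.

```python
import sqlite3

# In-memory stand-in for system.access.column_lineage (assumption: only the
# columns needed for this illustration are modeled).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE column_lineage (
        source_table_full_name TEXT,
        source_column_name     TEXT,
        target_table_full_name TEXT,
        target_column_name     TEXT,
        event_time             TEXT
    )
    """
)
rows = [
    # First run: total is derived from price.
    ("main.raw.orders", "id", "main.sales.orders", "order_id", "2024-01-05T10:00:00"),
    ("main.raw.orders", "price", "main.sales.orders", "total", "2024-01-05T10:00:00"),
    # Later run: total is now derived from amount instead of price.
    ("main.raw.orders", "id", "main.sales.orders", "order_id", "2024-02-01T10:00:00"),
    ("main.raw.orders", "amount", "main.sales.orders", "total", "2024-02-01T10:00:00"),
]
conn.executemany("INSERT INTO column_lineage VALUES (?, ?, ?, ?, ?)", rows)


def cumulative_edges(db, start, end):
    """All distinct column-level edges recorded in the half-open window [start, end)."""
    query = """
        SELECT DISTINCT source_table_full_name, source_column_name,
                        target_table_full_name, target_column_name
        FROM column_lineage
        WHERE event_time >= ? AND event_time < ?
        ORDER BY source_column_name
    """
    return db.execute(query, (start, end)).fetchall()


def latest_edges(db):
    """Only the edges recorded by the most recent event."""
    query = """
        SELECT source_table_full_name, source_column_name,
               target_table_full_name, target_column_name
        FROM column_lineage
        WHERE event_time = (SELECT MAX(event_time) FROM column_lineage)
        ORDER BY source_column_name
    """
    return db.execute(query).fetchall()


cumulative = cumulative_edges(conn, "2024-01-01T00:00:00", "2024-03-01T00:00:00")
latest = latest_edges(conn)
print("cumulative source columns:", [row[1] for row in cumulative])
print("latest source columns:    ", [row[1] for row in latest])
```

In this simulation, the cumulative result still contains the price to total edge from the earlier run, even though the latest run no longer uses it. Columns that are no longer used can appear in cumulative lineage in the same way.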
dbt
Collibra Data Lineage supports the following adapters in dbt:
- Azure Synapse
- Databricks
- Google BigQuery
- Greenplum
- Hive
- IBM Db2
- Microsoft SQL Server
- MySQL
- Oracle
- Postgres
- Redshift
- Snowflake
- Spark
- Teradata
dbt Cloud
Collibra Data Lineage supports materialization, and tables and views are treated like tables by default. You can customize the setting in one of the following ways so that the tables and views are treated like views:
- If you use technical lineage via Edge, specify the materializedMapping property in the Source Configuration field in the Technical Lineage for dbt Cloud capability.
- If you use the lineage harvester, specify the materializedMapping property in the <source ID> configuration file.
Google Dataplex
- Collibra Data Lineage visualizes lineage for Google Dataplex down to table level. To view the technical lineage for Google Dataplex, ensure that you select Objects in the toolbar of your technical lineage graph.
- Collibra Data Lineage ingests lineage from BigQuery and other Google Cloud services supported by the data lineage feature in Dataplex. However, only the lineage for Column, Table, and File assets is processed and included in the technical lineage for Dataplex.
- Technical lineage for Google Dataplex can start from GCS or BigQuery and end in BigQuery.
- You can choose to create table-level lineage or column-level lineage for Google Dataplex when you synchronize the Technical Lineage for Google Dataplex capability. Stitching works for the column-level lineage, regardless of whether you integrated Google Dataplex Catalog or registered Google BigQuery databases by using the BigQuery JDBC connector.
- Transformations are ingested by calling the GCP Process and subsequently the GCP Jobs. Therefore, the Service Account user that is defined in the Edge connection requires, at a minimum, the bigquery.jobs.get permission, and optionally the bigquery.admin role, which lets the capability ingest the details of all the jobs in the project.
Differences between technical lineage for Google Dataplex and Google BigQuery
You can create technical lineage for Google BigQuery by using a JDBC connection or for Google Dataplex by using a Google Cloud Platform (GCP) connection. Consider the following differences to determine which data source and connection type to use.
Feature | Support in technical lineage for Google Dataplex | Support in technical lineage for Google BigQuery |
---|---|---|
SQL transformation code | Yes, when creating column-level lineage | Yes |
Executed SQL in stored procedures | Yes | No |
Ingest lineage from... | BigQuery and other Google Cloud services supported by the data lineage feature in Dataplex | BigQuery |
IBM DataStage
IBM DataStage uses jobs with stages instead of transformations. IBM DataStage has three job types: parallel jobs, sequence jobs, and server jobs. For a list of all job stages per job type in IBM DataStage, see the IBM documentation.
Technical lineage for DataStage supports the following parameters and expressions:
- Runtime parameters in parameter set files. To include the runtime parameters, ensure that you export DataStage files with executables. For more information about exporting DataStage files, go to Prepare an external directory folder for the lineage harvester if you use the lineage harvester, or Create a technical lineage via Edge for DataStage.
- Parameter sets. To include parameters, export the parameter sets as part of your environment file. For more information about exporting DataStage files, go to Prepare an external directory folder for the lineage harvester if you use the lineage harvester, or Create a technical lineage via Edge for DataStage.
- Expression format. The analysis result displays the DATASTAGE_EXPRESSION message when a complex format with advanced functions is parsed.
Informatica PowerCenter transformations
The following table shows a non-exhaustive list of supported and unsupported transformations in Informatica PowerCenter.
Supported transformations | Unsupported transformations |
---|---|
|
|
|
Informatica Intelligent Cloud Services
The following table shows a non-exhaustive list of supported taskflows and unsupported tasks in Informatica Intelligent Cloud Services.
Supported taskflows | Unsupported tasks |
---|---|
|
|
The following table shows a non-exhaustive list of supported and unsupported transformations and constructions in Informatica Intelligent Cloud Services, specifically in the Cloud Data Integration service.
Supported transformations | Unsupported transformations, functions, and constructions |
---|---|
|
|
Snowflake
You can create technical lineage for Snowflake by using SQL Snowflake ingestion mode or SQL-API Snowflake ingestion mode. Collibra Data Lineage supports different queries and transformations for each ingestion method. For more information about the ingestion methods, go to Technical lineage for Snowflake ingestion methods.
SQL Snowflake ingestion mode
With the SQL Snowflake ingestion mode, Collibra Data Lineage does not support the following non-exhaustive list of transformations:
- Snowpark
SQL-API Snowflake ingestion mode
With the SQL-API Snowflake ingestion mode, Collibra Data Lineage supports the Data Manipulation Language (DML) statements from the following sources. The table also shows a non-exhaustive list of unsupported queries and transformations.
Supported transformations | Unsupported queries and transformations |
---|---|
|
|
Note
|
SQL Server Integration Services (SSIS)
Collibra Data Lineage supports the following non-exhaustive list of transformations and component types in SQL Server Integration Services:
Supported transformations | Supported component types |
---|---|
|
|
- Collibra Data Lineage supports SQL, but cannot parse other languages or scripts, for example SHELL and BAT scripts.
- SQL statements from Excel are not supported.
- Collibra Data Lineage does not create lineage for disabled executables.