Create a technical lineage via the lineage harvester

This topic describes the general steps on how to use the lineage harvester to create a technical lineage.

Select a data source, to show the relevant integration steps.

Currently, information is shown for:

Requirements and permissions

Collibra Data Intelligence Platform.
You have purchased Collibra Data Lineage.
A global role with the following global permissions:
- Catalog, for example Catalog Author
- Data Stewardship Manager
- Manage all resources
- System administration
- Technical lineage
A resource role with the following resource permissions on the community level in which you created the domain:
- Asset: add
- Attribute: add
- Domain: add
- Attachment: add

Important Amazon RedshiftAzure SQL serverAzure Synapse AnalyticsGreenplumHiveIBM Db2PostgreSQLMicrosoft SQL ServerMySQLNetezzaSAP HANATeradata requirements:

Ensure that you meet the Azure Data Factory-specific permissions described in Set up Azure Data Factory.

You need read access on information_schema. Only views that you own are processed.

You need read access on the SYS schema.

If you are using the lineage harvester, you need read access on information_schema:

bigquery.datasets.get
bigquery.tables.get
bigquery.tables.list
bigquery.jobs.create
bigquery.routines.get
bigquery.routines.list

If you are using Edge, you also need:

resourcemanager.projects.get
bigquery.readsessions.create
bigquery.readsessions.getData

SELECT, at table level. Grant this to every table for which you want to create a technical lineage.
Read access to the SYS schema or the tables in the schema.

Ensure that your service account token has the Read-Only permission.

Ensure that you have the permission to copy the target/ directory, which is generated by running the dbt compile command, to a local folder.

You need Monitoring role permissions.

To create technical lineage from calculated views in an SAP HANA Classic on-premises data source, you also need the following permissions:

SELECT on the following views:
- _SYS_REPO.ACTIVE_OBJECT
- _SYS_REPO.ACTIVE_OBJECTCROSSREF
- SYS.OBJECT_DEPENDENCIES
The CATALOG READ system privilege

A role with the LOGIN option.

SELECT WITH GRANT OPTION, at Table level.

CONNECT ON DATABASE

You need read access on the SYS schema and the View Definition Permission in your SQL Server.

You need read access on definition_schema.

GRANT SELECT, at table level. Grant this to every table for which you want to create a technical lineage.
The role of the user that you specify in the username property in lineage harvester configuration file must be the owner of the views in PostgreSQL.

You need read access on the DBC.

You need read access to the following dictionary views:

all_tab_cols
all_col_comments
all_objects
ALL_DB_LINKS
all_mviews
all_source
all_synonyms
all_views

Your user role must have privileges to export assets.
You must have read permission on all assets that you want to export.

You have at least a Matillion Enterprise license.
You have generated the Matillion certificate. For more information, go to Recreating self-signed SSL certificates on a Matillion ETL instance.
You have added the Matillion certificate to a Java truststore. For more information about adding a certificate to a Java truststore, go to Add a Certificate to a Truststore Using Keytool.
If you encounter the javax.net.ssl.SSLHandshakeException: General SSLEngine problem error message, go to Data Source Name Failed exception with Tableau & technical lineage in Collibra Support Portal for a solution.

The following permissions are required, regardless of the ingestion mode: SQL or SQL-API.

Ensure that the Snowflake user has the appropriate allowed host list. For details, go to Allowing Hostnames in Snowflake documentation.
You need a role that can access the Snowflake shared read-only database. To access the shared database, the account administrator must grant the IMPORTED PRIVILEGES privilege on the shared database to the user that runs the lineage harvester.
If the default role in Snowflake does not have the IMPORTED PRIVILEGES privilege, you can use the customConnectionProperties property in the lineage harvester configuration file to assign the appropriate role to the user. For example:
"customConnectionProperties": "role=METADATA"

The source code files must be in the same directory as your JSON files. For complete information, go to Working with custom technical lineage.
To stitch the data objects of your data sources with Data Catalog assets, you need to register your data sources in Data Catalog. When you then prepare the Data Catalog physical data layer, ensure that you use a structure that matches the structure of ingested assets in Data Catalog.
Determine whether you want to use the single-file or batch definition option.
If you choose the single-file definition option, determine whether you want to create a simple or advanced custom technical lineage.

Before you start the Power BI integration process, you have to perform a number of tasks in Power BI and Microsoft Azure. These tasks, which are performed outside of Collibra, are needed to enable the lineage harvester to reach your Power BI application and collect its metadata. For complete information, go to Set up Power BI.

Collibra Data Lineage supports:

Power BI on the Microsoft Power Platform.
Power BI on Fabric.

The configuration requirements and the integration are the same, regardless of your setup.

Before you start the Tableau integration process, you have to perform a number of tasks in Tableau. For complete information, go to the following topics:

You need the following roles, with user access to the server from which you want to ingest:

A system-level role that is at least a System user role.
An item-level role that is at least a Content Manager role.

We recommend that you use SQL Server 2019 Reporting Services or newer. We can't guarantee that older versions will work.

Before you start the Looker integration process, you need to set up Looker.

You need the following Admin API permissions:
1. The first call we make to MicroStrategy is to authenticate. We connect to:
  <MSTR URL>:<Port>/MicroStrategyLibrary/api-docs/ and use GET api/auth/login.
  For complete information, see the MicroStrategy documentation.
  If this API call can be made successfully, you can ingest the metadata.
2. The same connection:
  <MSTR URL>:<Port>/MicroStrategyLibrary/api-docs/, but with GET api/model/tables/<tableId>.
  For complete information, see the MicroStrategy documentation.
  This endpoint is needed to create lineage and stitching.
You need permissions to access the library server.
The lineage harvester uses port 443. If the port is not open, you also need permissions to access the repository.
You have to configure the MicroStrategy Modeling Service. For complete information, see the MicroStrategy documentation.

Warning

Collibra Data Lineage uses the API 4.0 endpoints GET /queries/<query_id> and GET /running_queries. Due to a security update by Looker, the behavior of these endpoints has changed. Therefore, you must now:

Select the "Disallow Numeric Query IDs" option in Looker.
Ensure that your Looker user has the Admin role.

For complete information, see the Looker Query ID API Patch Notice.

Steps

Note Only Basic Authentication is supported. NTLM authentication, for example, is not.

Optionally, connect to a proxy server.
Ensure that you meet the Azure Data Factory prerequisites.
Ensure that you have the correct Tableau versions and permissions, as described in the Set up Tableau topics.
Complete the tasks in Power BI and Microsoft Azure, as described in the Set up Power BI topics.
If you are a MicroStrategy on-premises customer, ensure that you have enabled Collibra to access your MicroStrategy data, as described in Set up MicroStrategy.
Ensure that you have API3 credentials for authorization and access control. For complete information, go to Set up Looker.
Prepare the Data Catalog physical data layer.
Prepare an external directory folder for the lineage harvester.
Prepare a domain for BI asset ingestion.
Optionally, assign the attribute type State to the global assignment of the Power BI Workspace asset type. For complete information, go to Power BI workspaces.
Download and install the lineage harvester.
Prepare the lineage harvester configuration file.
Note The project name in the configuration file must be the same as the full name of the Database asset.
If necessary, prepare a <source ID> configuration file.
Manually refresh your Power BI datasets.
Important The first time you integrate Power BI, you need to make sure that the data in your Power BI datasets is up-to-date. Carry out this step only if this is the first time you're integrating Power BI in Data Catalog. After that, Microsoft automatically refreshes the datasets every 90 days. For complete information, see:
- The Microsoft documentation.
- The Microsoft Power BI Blog.
Run the lineage harvester.

You can define your Custom technical lineage via batch or single-file definition.

Batch definition
Single-file definition

Create a local folder.
Create the following:
- A single metadata file.
  Name the file metadata.json.
- Optionally, one or more assets files. These are required if you want to achieve stitching.
  Name the files assets<something unique>.json.
- One or more lineage files.
  Name the files lineage<something unique>.json.
- Optionally, a folder for your source code files. This folder must be put in the same local folder.
For guidance on creating these files, go to custom technical lineage JSON file examples.

Prepare a lineage harvester configuration file for custom technical lineage.

Requirements and restrictions

In the configuration file, you must use UTF-8 or ISO-8859-1 characters, with the exception of SQL files, which can only be UTF-8 encoded.
Comments in the lineage harvester configuration file are not supported.
Technical lineage supports the username and password authentication method for the custom technical lineage.

Format

{
	"general" : {
		"catalog" : {
			"url" : "",
			"username" : "",
		},
		"useCollibraSystemName" : false|true
	},
	"sources" : [ {
		"type" : "ExternalDirectory",
		"id" : "",
		"dirType" : "custom-lineage",
		"collibraSystemName" : "",
		"path" : "",
		"deleteRawMetadataAfterProcessing": false|true
	} ]
}

Properties	Description
general	Describes the connection between Collibra Data Lineage and Data Catalog.
catalog	Contains information that is necessary to connect to Data Catalog.
url	The public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in Collibra.
useCollibraSystemName	Indicates whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name. Specify one of the following values: `false` The lineage harvester does not stitch the system or server name of your data source to the System asset in Data Catalog. This is the default value. `true` The lineage harvester reads the system or server names that you specify for the system data object in your assets or lineage JSON files and stitches the names to the System assets in Data Catalog. Only specify this value if you have multiple databases with the same name.
sources	Contains the information needed to retrieve a custom lineage. Use this section to specify the location of the JSON files that define the custom technical lineage. If you want to create the technical lineage for multiple data sources, create a `sources` section for each data source.
type	The kind of data source. The value must be `ExternalDirectory`.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `MyCustomLineage`.
dirType	The type of external directory. The value must be `custom-lineage`.
collibraSystemName	The lineage harvester ignores this property for custom technical lineage. To use the system or server name of your data source to match the System asset in Data Catalog, specify the system data object in your assets or lineage JSON files.
path	The full path to the folder of the custom technical lineage JSON files, for example `C:\path\to\custom-lineage\dir`.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.

Example

{
	"general" : {
		"catalog" : {
			"url" : "https://companydomain.collibra.com",
			"username" : "my-Collibra-username",
		},
		"useCollibraSystemName" : false
	},
	"sources" : [{
           "id": "MyCustomLineage",
			"type": "ExternalDirectory",
			"dirType": "custom-lineage",
			"path”: “/path/to/custom-lineage/dir/",
			"collibraSystemName": "MySystemName"
			}
			]
}

Run the lineage harvester.
Before you begin

If you use a proxy server, connect to the proxy server. For more information, go to Connecting to a proxy server.

Requirements and permissions
- Collibra Data Intelligence Platform.
- You have purchased Collibra Data Lineage.
- A global role with the following global permissions:
  - Catalog, for example Catalog Author
  - Data Stewardship Manager
  - Manage all resources
  - System administration
  - Technical lineage
- A resource role with the following resource permissions on the community level in which you created the domain:
  - Asset: add
  - Attribute: add
  - Domain: add
  - Attachment: add
Steps
1. Start the lineage harvester by entering the full-sync command.
  - To process data from all data sources in the configuration file:
    For windows:
    .\bin\lineage-harvester.bat full-sync
    For other operating systems:
    ./bin/lineage-harvester full-sync
  - To process data from specific data sources in the configuration file:
    For windows:
    .\bin\lineage-harvester.bat full-sync -s "ID of the data source"
    For other operating systems:
    ./bin/lineage-harvester full-sync -s "ID of the data source"
  Tip For more information and command options, go to Lineage harvesting app command options and arguments.
  Note
  If you have Snowflake data sources in your lineage harvester configuration file, set the JAVA_OPTS environment variable first. For example, to process data from all data sources including the Snowflake data sources, take the following steps:
  On Windows
  Enter one of the following commands:
  If you use OpenJDK 16:
  set JAVA_OPTS="-Djdk.module.illegalAccess=permit"
  If you use OpenJDK 17 or higher:
  set JAVA_OPTS="--add-opens=java.base/java.nio=ALL-UNNAMED"
  In the same command line, enter the following command:
  .\bin\lineage-harvester.bat full-sync
  Note The set command is specific to the Windows Command Shell. The command is different if you are using PowerShell.
  On Linux
  Enter the following command:
  If you use OpenJDK 16:
  JAVA_OPTS="-Djdk.module.illegalAccess=permit" ./bin/lineage-harvester full-sync
  If you use OpenJDK 17 or higher:
  JAVA_OPTS="--add-opens=java.base/java.nio=ALL-UNNAMED" ./bin/lineage-harvester full-sync
2. When prompted, enter the passwords to connect to Collibra and your data sources. Do one of the following:
  - Enter the passwords in the console.
    The passwords are encrypted and stored in /config/pwd.conf.
  - Provide the passwords via command line.
    The passwords are stored locally and not in your lineage harvester folder.
3. If you are creating technical lineage for dbt Cloud and prompted to enter your API token, enter the token value for the service token that you specified for the tokenName property in the lineage harvester configuration file for dbt Cloud.

Create a local folder.
Create a JSON file in the local folder and name the JSON file lineage.json.
For guidance on creating the lineage.json, go to custom technical lineage JSON file examples.
Note The JSON file must be named as lineage.json; otherwise, the process fails. You can have other types of files in this folder.
If you want to create an advanced custom technical lineage, store all of the source code files that you want to reference in the JSON file in the same local folder. For more information about the simple and advanced custom technical lineage, go to Custom technical lineage.

Prepare a lineage harvester configuration file for custom technical lineage.

Requirements and restrictions

In the configuration file, you must use UTF-8 or ISO-8859-1 characters, with the exception of SQL files, which can only be UTF-8 encoded.
Comments in the lineage harvester configuration file are not supported.
Technical lineage supports the username and password authentication method for the custom technical lineage.

Format

{
	"general" : {
		"catalog" : {
			"url" : "",
			"username" : "",
		},
		"useCollibraSystemName" : false|true
	},
	"sources" : [ {
		"type" : "ExternalDirectory",
		"id" : "",
		"dirType" : "custom-lineage",
		"collibraSystemName" : "",
		"path" : "",
		"deleteRawMetadataAfterProcessing": false|true
	} ]
}

Properties	Description
general	Describes the connection between Collibra Data Lineage and Data Catalog.
catalog	Contains information that is necessary to connect to Data Catalog.
url	The public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in Collibra.
useCollibraSystemName	Indicates whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name. Specify one of the following values: `false` The lineage harvester does not stitch the system or server name of your data source to the System asset in Data Catalog. This is the default value. `true` The lineage harvester reads the system or server names that you specify for the system data object in the `tree` and `lineage` sections in your custom technical lineage JSON file and stitches the names to the System assets in Data Catalog. Only specify this value if you have multiple databases with the same name.
sources	Contains the information needed to retrieve a custom lineage. Use this section to specify the location of the JSON file that defines the custom technical lineage. If you want to create the technical lineage for multiple data sources, create a `sources` section for each data source.
type	The kind of data source. The value must be `ExternalDirectory`.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `MyCustomLineage`.
dirType	The type of external directory. The value must be `custom-lineage`.
collibraSystemName	The lineage harvester ignores this property for custom technical lineage. To use the system or server name of your data source to match the System asset in Data Catalog, specify the system data object in the `tree` and `lineage` sections of your custom technical lineage JSON file.
path	The full path to the folder of the custom technical lineage JSON file, for example `C:\path\to\custom-lineage\dir`. There must be only one JSON file that defines the lineage, and the JSON file must be named lineage.json. You can, however, add other files in the harvested directory and subdirectories and refer to those files from within the JSON file.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.

Example

{
	"general" : {
		"catalog" : {
			"url" : "https://companydomain.collibra.com",
			"username" : "my-Collibra-username",
		},
		"useCollibraSystemName" : false
	},
	"sources" : [{
           "id": "MyCustomLineage",
			"type": "ExternalDirectory",
			"dirType": "custom-lineage",
			"path”: “/path/to/custom-lineage/dir/",
			"collibraSystemName": "MySystemName"
			}
			]
}

Run the lineage harvester.
Before you begin

If you use a proxy server, connect to the proxy server. For more information, go to Connecting to a proxy server.

Requirements and permissions
- Collibra Data Intelligence Platform.
- You have purchased Collibra Data Lineage.
- A global role with the following global permissions:
  - Catalog, for example Catalog Author
  - Data Stewardship Manager
  - Manage all resources
  - System administration
  - Technical lineage
- A resource role with the following resource permissions on the community level in which you created the domain:
  - Asset: add
  - Attribute: add
  - Domain: add
  - Attachment: add
Steps
1. Start the lineage harvester by entering the full-sync command.
  - To process data from all data sources in the configuration file:
    For windows:
    .\bin\lineage-harvester.bat full-sync
    For other operating systems:
    ./bin/lineage-harvester full-sync
  - To process data from specific data sources in the configuration file:
    For windows:
    .\bin\lineage-harvester.bat full-sync -s "ID of the data source"
    For other operating systems:
    ./bin/lineage-harvester full-sync -s "ID of the data source"
  Tip For more information and command options, go to Lineage harvesting app command options and arguments.
  Note
  If you have Snowflake data sources in your lineage harvester configuration file, set the JAVA_OPTS environment variable first. For example, to process data from all data sources including the Snowflake data sources, take the following steps:
  On Windows
  Enter one of the following commands:
  If you use OpenJDK 16:
  set JAVA_OPTS="-Djdk.module.illegalAccess=permit"
  If you use OpenJDK 17 or higher:
  set JAVA_OPTS="--add-opens=java.base/java.nio=ALL-UNNAMED"
  In the same command line, enter the following command:
  .\bin\lineage-harvester.bat full-sync
  Note The set command is specific to the Windows Command Shell. The command is different if you are using PowerShell.
  On Linux
  Enter the following command:
  If you use OpenJDK 16:
  JAVA_OPTS="-Djdk.module.illegalAccess=permit" ./bin/lineage-harvester full-sync
  If you use OpenJDK 17 or higher:
  JAVA_OPTS="--add-opens=java.base/java.nio=ALL-UNNAMED" ./bin/lineage-harvester full-sync
2. When prompted, enter the passwords to connect to Collibra and your data sources. Do one of the following:
  - Enter the passwords in the console.
    The passwords are encrypted and stored in /config/pwd.conf.
  - Provide the passwords via command line.
    The passwords are stored locally and not in your lineage harvester folder.
3. If you are creating technical lineage for dbt Cloud and prompted to enter your API token, enter the token value for the service token that you specified for the tokenName property in the lineage harvester configuration file for dbt Cloud.

What's next?

You can check the progress of the ingestion in Activities. The results field indicates how many relations were imported into Data Catalog.

After the metadata is ingested in Data Catalog, you can go to the domain that you specified in your lineage harvester configuration file and view the newly created assets. These assets are automatically stitched to existing assets in Data Catalog.

Warning We strongly recommend that you not edit the full names of any BI assets. Doing so will likely lead to errors during the synchronization process.

Warning We highly recommend that you do not move the ingested assets to a different domain. If you do, the assets will be deleted and recreated in the initial BI Catalog domain when you synchronize. As a consequence, any manually added data of those assets is lost.

Create a technical lineage via the lineage harvester

Requirements and permissions

Steps

Requirements and restrictions

Format

Example

Before you begin

Requirements and permissions

Steps

Requirements and restrictions

Format

Example

Before you begin

Requirements and permissions

Steps

What's next?