Prepare the lineage harvester configuration file
Before you can visualize the technical lineage or ingest a BI source, you have to create a configuration file for the (meta)data sources that you want to process. The lineage harvester uses this configuration file to extract data from the (meta)data sources for which you want to create a technical lineage or that you want to ingest.
- Technical lineage only supports a limited list of (meta)data sources.
- You can only use UTF-8 or ISO-8859-1 characters in all lineage harvester files.
- Each data source has an ID property. The ID string must be unique and human readable. The ID can be anything and is only used to identify the batch of metadata that is processed on the Collibra Data Lineage server.
- The lineage harvester connects to different servers based on your geographical location and cloud provider. Make sure you have the correct system requirements before you run the lineage harvester. If your location or cloud provider changes, the lineage harvester rescans all your data sources.
- Technical lineage supports authentication by means of username and password, for all data sources, except for external directories. Google BigQuery data sources can also be authenticated via a service account key file. For more information, see the Google BigQuery documentation.
- The lineage harvester does not support proxy server authentication, but you can manually connect to a proxy server via command line. For more information, see technical lineage general troubleshooting.
- If you upgrade to lineage harvester 1.3.0 or newer, you have to follow an upgrade procedure.
Tip If you want to ingest and create a technical lineage for Looker or Power BI, we highly advise you to read the dedicated sections.
Prerequisites
- You have prepared the physical data layer in Data Catalog.
- You have a global role that has the System administration global permission.
- You have a global role that has the Manage all resources global permission.
- You have a global role with the Technical lineage global permission.
- You have downloaded the lineage harvester and you have the necessary system requirements to run it.
- You have installed Java Runtime Environment version 8 or newer.
- You have added Firewall rules so that the lineage harvester can connect to:
- All Collibra Data Lineage servers within your geographical location:
- 18.198.89.106 (techlin-aws-eu)
- 54.242.194.190 (techlin-aws-us)
- 15.222.200.199 (techlin-aws-ca)
- 35.205.146.124 (techlin-gcp-eu)
- 34.73.33.120 (techlin-gcp-us)
- 35.197.182.41 (techlin-gcp-au)
- 34.152.20.240 (techlin-gcp-ca)
- The host names of all databases in the lineage harvester configuration file.
- If you want to use a previously loaded data source, you have downloaded the SQL files of the data source to the lineage harvester.
- If you want to use an external directory, you have prepared a folder with data objects from the external directory.
- You have the necessary permissions to all database objects that the lineage harvester accesses.
Tip
Some data sources require specific permissions. Depending on the data source type, you need one or more of the following:
- Read access on the SYS schema.
- For SQL Server: read access on the SYS schema and the View Definition permission.
- Read access on information_schema.
- Read access on information_schema; only views that you own are processed.
- For Snowflake: a role that can access the snowflake shared read-only database. To access the shared database, the account administrator must grant IMPORTED PRIVILEGES on the shared database to the user that runs the lineage harvester.
- Read access on the DBC database.
- Read access to the following dictionary views:
  - all_tab_cols
  - all_col_comments
  - all_objects
  - all_db_links
  - all_mviews
  - all_source
  - all_synonyms
  - all_views
- Admin permission on all objects that you want to harvest.
- For Matillion: you have added the Matillion certificate to a Java truststore, and you have at least a Matillion Enterprise license.
Steps
- Run the following command to start the lineage harvester:
  - Windows: .\bin\lineage-harvester.bat
  - Other operating systems: chmod +x bin/lineage-harvester, then bin/lineage-harvester
  An empty configuration file is created in the lineage harvester config folder.
- Open the configuration file and enter the values for each property.
Tip You can use the configuration file generator to create an example configuration file with the properties of your choosing. You can easily copy this example to your configuration file and replace the values of the properties to match your data source information.
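For orientation, a minimal configuration file might look like the following sketch. All values are placeholders that you replace with your own information; the exact properties for each data source type are described below.

```json
{
  "general": {
    "catalog": {
      "url": "https://companydomain.collibra.com",
      "username": "my-collibra-user"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_first_data_source",
      "type": "SqlDirectory",
      "path": "/path/to/sql/files",
      "mask": "*",
      "recursive": false,
      "dialect": "oracle",
      "database": "my-database",
      "schema": "my-schema"
    }
  ]
}
```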
Properties
general: This section describes the connection between Collibra Data Lineage and Data Catalog.
catalog: This section contains the information that is necessary to connect to Data Catalog.
Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.
url: The URL of your Collibra environment.
Note You can only enter the public URL of your Collibra environment. Other URLs are not accepted.
username: The username that you use to sign in to Collibra.
useCollibraSystemName: Indicates whether you want to use the system or server name of a data source to match the System asset that you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.
By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.
- If you keep the useCollibraSystemName property set to false, the lineage harvester ignores the collibraSystemName property in the rest of the configuration file.
- If you set the useCollibraSystemName property to true, the lineage harvester reads the value of the collibraSystemName property in all sections of the configuration file and in the following files:
  - The Informatica <source ID> configuration file.
  - The IBM DataStage or SQL Server Integration Services connection definition configuration files.
  - The Informatica Intelligent Cloud Services <source ID> configuration file.
  - The Power BI <source ID> configuration file.
  - The Looker <source ID> configuration file.
  - The JSON files with a predefined lineage.
Important For Informatica and Informatica Intelligent Cloud Services, you must prepare a <source ID> configuration file regardless of whether the useCollibraSystemName property in your lineage harvester configuration file is set to true or false.
Note For SQL data sources, if the useCollibraSystemName property is:
- false: system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name.
- true: system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.
Warning Unless you have multiple databases with the same name, we highly recommend that you keep the default value.
sources This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.
Note You can add multiple data sources to the same configuration file.
<SQL directory properties>
This configuration section contains the required information for one individual SQL directory with connection type "Folder".
id: The unique ID of the data source. For example, my_first_data_source.
type: The kind of data source. In this case, the value has to be SqlDirectory.
path: The full path to the SQL directory.
mask: The pattern of the file names in the directory. By default, this is *.
recursive: Indication of the files you want to harvest:
- false (default): Only harvest the files directly under the folder in the SQL directory path.
- true: Harvest all files under the folder in the SQL directory path and its subdirectories.
dialect: The dialect of the database. You can enter one of the following values:
- azure, for an Azure SQL Server data source.
- bigquery, for a Google BigQuery data source.
- db2, for an IBM DB2 data source.
- hana, for a SAP Hana data source.
- hana-cviews, for SAP Hana data calculation views.
- hive, for a HiveQL data source.
- greenplum, for a Greenplum data source.
- mssql, for a Microsoft SQL Server data source.
- mysql, for a MySQL data source.
- netezza, for a Netezza data source.
- oracle, for an Oracle data source.
- postgres, for a PostgreSQL data source.
- redshift, for an Amazon Redshift data source.
- snowflake, for a Snowflake data source.
- spark, for a Spark SQL data source.
- sybase, for a Sybase data source.
- teradata, for a Teradata data source.
database: The name of your database, which is the full name of your Database asset.
Note You have to use the same database name as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
Important Teradata and MySQL data sources do not have schemas. As a result, Teradata and MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:
- For Teradata:
  - The database name is the name that you enter in the collibraSystemName property.
  - The schema name is the name that you enter in the database property.
- For MySQL:
  - The database name is the name that you enter in the database property.
collibraSystemName: The name of the data source's system or server. This is also the full name of your System asset in Data Catalog.
You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
schema: The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset.
Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
verbose: Indication whether you want to enable verbose logging.
By default, this is set to true. If you don't want to use verbose logging, set it to false.
<External directories>
This configuration section contains the required information to connect to the following data sources:
- Informatica PowerCenter
- SQL Server Integration Services (SSIS)
- IBM InfoSphere DataStage
Note Make sure that you have prepared a local folder with the Informatica objects, SSIS files, or DataStage files for which you want to create a technical lineage.
collibraSystemName: The name of the data source's system or server. If the useCollibraSystemName property is set to true, you must prepare a configuration file to provide the system information.
id: The unique ID of your data source. For example, my_informatica.
type: The kind of data source. In this case, the value has to be ExternalDirectory.
dirType: The type of external directory. The value has to be one of the following:
- infa, for an Informatica PowerCenter data source.
- ssis, for a SQL Server Integration Services data source.
- datastage, for an IBM InfoSphere DataStage data source.
path: The full path to the folder where you stored the data source.
mask: The pattern of the file names in the directory. By default, this is *.
recursive: Indication whether you want to use recursive queries.
By default, this is set to false. If you want to use recursive queries, set it to true.
<Informatica Intelligent Cloud Services Data Integration>
This configuration section contains the required information to enable the lineage harvester to collect and process Data Integration objects.
Tip Make sure you have READ permission on all data objects that you want to harvest.
type: The kind of data source. In this case, the value has to be IICS.
id: The unique ID that is used to identify the data source on the Collibra Data Lineage server. For example, my_data_integration.
collibraSystemName: The name of the Informatica server or system.
Important You must prepare a <source ID> configuration file to provide this system information. This is true regardless of whether the useCollibraSystemName property is set to true or false.
loginURL: The URL of the Informatica Intelligent Cloud Services environment sign-in page. For example, https://dm-us.informaticaintelligentcloud.com.
username: The username you use to sign in to Informatica Intelligent Cloud Services.
objects: The objects that you want to export. Each object requires a path and a type, for example:
"objects": [ { "path": "Sales", "type": "Project" }, { "path": "Finance/Task_Flows", "type": "Folder" }, { "path": "Common/Task_Flows/tf_CalendarDimension", "type": "Taskflow" } ]
The following section provides information to identify and access Data Integration objects.
Tip For more information about the objects that you can export and the required information, see the Informatica documentation.
path: The full path to the object.
type: The type of the object. For example, Taskflow.
The IICS scanner's starting point is a Taskflow; therefore, the only meaningful types to export are Taskflow, Project, and Folder.
Note The types are not case sensitive.
paramFiles: The full path to the directory in which your parameter files are stored.
This is an optional parameter that allows you to harvest parameter files in Informatica Intelligent Cloud Services data sources.
Important The hierarchy of the files in the directory must be an exact match of the hierarchy of the files in your file system.
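Putting these properties together, an Informatica Intelligent Cloud Services entry in the sources section might look like this sketch. All values are placeholders that you replace with your own information.

```json
{
  "id": "my_data_integration",
  "type": "IICS",
  "collibraSystemName": "my-informatica-system",
  "loginURL": "https://dm-us.informaticaintelligentcloud.com",
  "username": "my-iics-user",
  "objects": [
    { "path": "Sales", "type": "Project" }
  ],
  "paramFiles": "/path/to/parameter/files"
}
```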
<Matillion>
This section contains the required information for Matillion.
Tip When you create a new project in Matillion, you define in which group you want to create the project, the project name and the environment name. This information is needed to enable the lineage harvester to access Matillion and scan your metadata.
Important Currently, you can only create a technical lineage for Snowflake and Redshift projects in Matillion.
id: The unique ID that is used to identify the data source on the Collibra Data Lineage server. For example, my_matillion_data_integration.
type: The kind of data source. In this case, the value has to be Matillion.
url: The URL of your Matillion environment. For example, https://<domain name> or https://<IP address>.
groupName: The name of your group in Matillion.
projectName: The name of your project in Matillion.
You can only add the name of one project. If you want to create a technical lineage for other projects within the same group, create a new section in the lineage harvester configuration file.
environmentName: The name of your environment in Matillion.
You can only add the name of one environment. If you want to create a technical lineage for other environments within the same project, create a new section in the lineage harvester configuration file.
dialect: The dialect of the database. You can enter one of the following values:
- redshift, for an Amazon Redshift data source.
- snowflake, for a Snowflake data source.
username: The username that you use to sign in to Matillion.
startTimestamp: The timestamp of tasks in Matillion. You can use this parameter to limit the amount of metadata that the lineage harvester scans.
If the startTimestamp field remains empty or is deleted from the configuration file, all accessible tasks are scanned.
collibraSystemName: The name of the Matillion system or server.
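Based on these properties, a Matillion entry might look like this sketch. All values are placeholders; startTimestamp is omitted here, so all accessible tasks are scanned.

```json
{
  "id": "my_matillion_data_integration",
  "type": "Matillion",
  "url": "https://matillion.example.com",
  "groupName": "my-group",
  "projectName": "my-project",
  "environmentName": "my-environment",
  "dialect": "snowflake",
  "username": "my-matillion-user",
  "collibraSystemName": "my-matillion-system"
}
```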
<Custom lineage>
This section contains the required information to connect to a custom lineage. You create a custom lineage by adding connection properties to a JSON file containing a predefined technical lineage.
Make sure that you have prepared a local folder with the JSON file that contains the predefined technical lineage.
Note In the local folder that you need to create, you can only have one JSON file. You can, however, add other files in the harvested directory and subdirectories and refer to those files from within the JSON file.
id: The unique ID of your custom technical lineage. For example, MyCustomLineage.
type: The kind of data source. In this case, the value has to be ExternalDirectory.
dirType: The type of external directory. In this case, the value is custom-lineage.
path: The full path to the folder where you stored the data source or JSON file.
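A custom lineage entry might look like this sketch, with a placeholder path:

```json
{
  "id": "MyCustomLineage",
  "type": "ExternalDirectory",
  "dirType": "custom-lineage",
  "path": "/path/to/custom-lineage/folder"
}
```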
<database properties>
This configuration section contains the required information for one individual data source with connection type "JDBC".
id: The unique ID of your data source. For example, my_second_data_source.
type: The kind of data source. In this case, the value has to be Database.
username: The username that you use to sign in to your data source.
dialect: The dialect of the database. You can enter one of the following values:
- azure, for an Azure SQL Server data source.
- db2, for an IBM DB2 data source.
- hana, for a SAP Hana data source.
- hana-cviews, for SAP Hana data calculation views.
- greenplum, for a Greenplum data source.
- mssql, for a Microsoft SQL Server data source.
- mysql, for a MySQL data source.
- netezza, for a Netezza data source.
- oracle, for an Oracle data source.
- postgres, for a PostgreSQL data source.
- redshift, for an Amazon Redshift data source.
- spark, for a Spark SQL data source.
- sybase, for a Sybase data source.
- teradata, for a Teradata data source.
If you want to use a Spark SQL data source, make sure that you have an AWS host.
databaseNames: The names or IDs of your databases.
Enter the database names of your data source between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate the names with a comma. For example, ["MyFirstDatabase", "MySecondDatabase"].
Note You have to use the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog.
Important Teradata and MySQL data sources do not have schemas. As a result, Teradata and MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:
- For Teradata:
  - The database name is the name that you enter in the collibraSystemName property.
  - The schema name is the name that you enter in the databaseNames property.
- For MySQL:
  - The database name is the name that you enter in the databaseNames property.
connectAsServiceName: The option to determine whether your Oracle database uses an Oracle service name or SID.
- true: Connect to an Oracle database that uses an Oracle service name. Enter the service name in the databaseNames property.
- false: Connect to an Oracle database that uses an SID. Enter the SID in the databaseNames property.
Note This property is only valid for Oracle databases. It is ignored for all other databases.
hostname: The name of your database host.
collibraSystemName: The name of the data source's system or server. This is also the full name of your System asset in Data Catalog.
You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
If the useCollibraSystemName property is:
- false (default): system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name.
- true: system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName field is used as the default system or server name.
port: The port number.
customConnectionProperties: An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.
Note You can currently only use this property for the following data sources:
- HiveQL
- IBM DB2
- Netezza
- PostgreSQL
- Redshift
- SAP Hana
- Snowflake
- Spark SQL
- Sybase
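A JDBC database entry might look like this sketch; the hostname, port, and all names are placeholders:

```json
{
  "id": "my_second_data_source",
  "type": "Database",
  "username": "my-database-user",
  "dialect": "oracle",
  "databaseNames": ["MyFirstDatabase", "MySecondDatabase"],
  "hostname": "db.example.com",
  "port": 1521,
  "collibraSystemName": "my-oracle-system"
}
```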
<Google BigQuery database>
This configuration section contains the required information for a Google BigQuery database.
id: The unique ID of your data source. For example, my_third_data_source.
type: The kind of data source. In this case, the value has to be DatabaseBigQuery.
projectIDs: The IDs of your Google BigQuery projects. You can add multiple projects. For example, [ "first-project", "second-project", "third-project" ].
Note You have to use the same project ID as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
region: The location of your BigQuery data. This is the region that you specified when you created a data set.
You can only add one location as value. However, you can create separate BigQuery entries per location in the configuration file. As a result, you create a complete technical lineage with Google BigQuery data from different locations.
Note This property is optional.
auth: The path to a JSON file that contains authentication information.
Tip For more information about setting up the authentication, see the Google BigQuery user guide.
collibraSystemName: The name of the Google BigQuery system. This is also the full name of your System asset in Data Catalog.
You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
<Snowflake database>
This configuration section contains the required information for a Snowflake database.
id: The unique ID of your data source. For example, my_fourth_data_source.
type: The kind of data source. In this case, the value has to be DatabaseSnowflake.
username: The username that you use to sign in to your data source.
hostname: The URL that you use to access the Snowflake web console. For example, <AccountName>.snowflakecomputing.com.
collibraSystemName: The name of the Snowflake system. This is also the full name of your System asset in Data Catalog.
You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
databaseNames: The names of your databases.
Enter the database names of your data source between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate the names with a comma. For example, ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"].
Note You have to use the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog.
warehouse: The name of your virtual warehouse.
Note This property is optional.
customConnectionProperties: An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.
Example If you get an OCSP scan error, you can turn OCSP checking off by using the following value: insecureMode=true.
<SQL files in the lineage harvester output folder>
This configuration section contains the required information for SQL files of a data source that were previously downloaded by the lineage harvester and are stored in the lineage harvester output folder.
type: The kind of data source. In this case, the value has to be LoadedSource.
id: The unique ID of the data source that you uploaded to the lineage harvester folder. For example, my_loaded_snowflake_source.
zipFile: The full path to the ZIP file that was created in the lineage harvester folder.
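A loaded source entry might look like this sketch, with a placeholder path:

```json
{
  "id": "my_loaded_snowflake_source",
  "type": "LoadedSource",
  "zipFile": "/path/to/output/my_loaded_snowflake_source.zip"
}
```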
<Power BI>
This configuration section contains the required information for Power BI integration.
Note You have to purchase the Power BI connector and lineage feature. Then you need to add the Power BI connection properties to both the lineage harvester configuration file and the Power BI harvester configuration file to ingest Power BI metadata into Data Catalog.
type: The kind of data source. In this case, the value has to be ExistingLineage.
id: The unique ID of the Power BI metadata you harvested via the Power BI harvester.
You must use the same ID as the value of the sourceID property in the Power BI configuration file.
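A Power BI entry is short, because the metadata itself is collected by the Power BI harvester. For illustration, assuming the source ID used in the Power BI harvester configuration file was my_powerbi_source:

```json
{
  "id": "my_powerbi_source",
  "type": "ExistingLineage"
}
```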
<Looker>
This configuration section contains the required information for Looker integration.
collibraSystemName: The name of the Looker system or server. If the useCollibraSystemName property is set to true, you must prepare a configuration file to provide the system information.
id: The unique ID of your Looker metadata. For example, my_looker.
Tip This value can be anything as long as it is unique and human readable. The ID identifies the batch of Looker metadata on the Collibra Data Lineage server.
type: The kind of data source. In this case, the value has to be Looker.
lookerUrl: The URL to your Looker API.
Tip There are two ways to find the Looker API URL:
- In the API Host URL field in the Looker Admin menu. If this field is empty, you can use the default Looker API URL, which you can find in the interactive API documentation.
- In the interactive API documentation URL. It is the part of the URL before /api-docs/.
clientId: The username you use to access the Looker API.
domainId: The unique ID of the domain in Collibra Data Intelligence Cloud in which you want to ingest the Looker assets.
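A Looker entry might look like this sketch; the URL, client ID, and domain ID are placeholders:

```json
{
  "id": "my_looker",
  "type": "Looker",
  "lookerUrl": "https://mycompany.api.looker.com",
  "clientId": "my-looker-client-id",
  "domainId": "my-collibra-domain-id",
  "collibraSystemName": "my-looker-system"
}
```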
- Save the configuration file.
- Start the lineage harvester again and do one of the following:
  - To process data from all data sources in the configuration file, run the following command:
    - Windows: .\bin\lineage-harvester.bat full-sync
    - Other operating systems: ./bin/lineage-harvester full-sync
  - To process data from specific data sources in the configuration file, run the following command:
    - Windows: .\bin\lineage-harvester.bat full-sync -s "ID of the data source"
    - Other operating systems: ./bin/lineage-harvester full-sync -s "ID of the data source"
  The lineage harvester sends the data source information to a Collibra Data Lineage server using the Collibra REST API, where it is parsed and analyzed. As a result, the technical lineage is created and shown in Data Catalog.
- When prompted, enter the passwords to connect to Collibra and your data sources. Do one of the following:
  - Enter the passwords in the console. The passwords are encrypted and stored in /config/pwd.conf.
  - Provide the passwords via command line. The passwords are stored locally and not in your lineage harvester folder.
Tip If the lineage harvester log shows an error message or the harvesting process fails, you can use the technical lineage troubleshooting guide to fix your issue.
What's next?
If you prepared the physical data layer and have the required permissions, you can go to the asset page of a Table, Column, Power BI Column, or Looker Look asset from the data source that you added in the configuration file and visualize the technical lineage. The technical lineage shows the data source information of data sources that have been successfully analyzed and processed.
The lineage harvester can also use scheduled jobs to synchronize the data sources at fixed times.
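For example, on Linux you could schedule a nightly run with a cron entry along these lines; the installation and log paths are placeholders:

```shell
# Hypothetical crontab entry: run a full-sync every night at 02:00
0 2 * * * cd /opt/lineage-harvester && ./bin/lineage-harvester full-sync >> /var/log/lineage-harvester.log 2>&1
```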
Tip You can check the progress of the technical lineage creation in Activities. The Results field indicates how many relations were imported into Data Catalog. Go to the status page to see the log files of the SQL analysis.