Prepare the lineage harvester configuration file

Before you can visualize the technical lineage, you must create a configuration file for the (meta)data sources that you want to process. The lineage harvester uses this configuration file to extract data from the (meta)data sources that you want to ingest or for which you want to create a technical lineage.

If you use multiple lineage harvesters on different servers, you can create a separate configuration file for the lineage harvester on each server and configure different data sources in each configuration file. Use multiple configuration files only for testing purposes.

Note

Technical lineage supports a limited list of (meta)data sources.
In all lineage harvester files, you must use UTF-8 or ISO-8859-1 characters, with the exception of SQL files, which can only be UTF-8 encoded.
Each data source has an ID property. The ID string must be unique and human readable. Apart from that, you can choose any value: the ID is only used to identify the batch of metadata that is processed on the Collibra Data Lineage service.
The lineage harvester connects to different Collibra Data Lineage service instances based on your geographical location and cloud provider. Make sure that you meet the system requirements before you run the lineage harvester. If your location or cloud provider changes, the lineage harvester rescans all your data sources.
Technical lineage supports the following means of authentication:
- For all data sources, except for external directories: username and password.
- Google BigQuery data sources: username and password or a service account key file. For more information, see the Google BigQuery documentation.
- Snowflake: username and password or key pair authentication.
- No other authentication methods are supported.
Comments in the lineage harvester configuration file are not supported.
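
As an illustration of the ID rule above, a `sources` array might give each data source its own descriptive ID. This is only a sketch: the IDs are hypothetical, the file is assumed to be JSON, and each section is abbreviated to its `id` and `type` properties. A real section also needs the remaining properties for its source type.

```json
{
  "sources": [
    { "id": "sales_sql_scripts", "type": "SqlDirectory" },
    { "id": "finance_warehouse", "type": "Database" }
  ]
}
```

Because comments are not supported in the configuration file, keep any explanatory notes about a data source outside the file.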

Tip

For information on ingesting metadata from BI tools and creating a technical lineage via the lineage harvester, see the dedicated section for each BI tool.

For information on creating custom technical lineage by using the lineage harvester, go to Working with custom technical lineage.

Before you begin

Set up the latest lineage harvester.
Prepare the Data Catalog physical data layer for technical lineage.
If you want to use a previously loaded data source, download the SQL files of the data source to the lineage harvester.
If you want to use an external directory, prepare a folder with data objects from the external directory.

Requirements and permissions

A global role with the following global permissions:
- Catalog, for example Catalog Author
- Data Stewardship Manager
- Manage all resources
- System administration
- Technical lineage
A resource role with the following resource permissions on the community level in which you created the BI Data Catalog domain:
- Asset: add
- Attribute: add
- Domain: add
- Attachment: add

Necessary permissions to all database objects that the lineage harvester accesses.

Tip

Some data sources require specific permissions.

Ensure that you meet the Azure Data Factory prerequisites.

You need read access on the SYS schema.

You need read access on the SYS schema and the View Definition Permission in your SQL Server.

You need read access on information_schema:

bigquery.datasets.get
bigquery.tables.get
bigquery.tables.list
bigquery.jobs.create
bigquery.routines.get
bigquery.routines.list

GRANT SELECT, at table level. Grant this to every table for which you want to create a technical lineage.

You need read access on information_schema. Only views that you own are processed.

SELECT, at table level. Grant this to every table for which you want to create a technical lineage.

The role of the user that you specify in the username property in lineage harvester configuration file must be the owner of the views in PostgreSQL.

A role with the LOGIN option.

SELECT WITH GRANT OPTION, at Table level.

CONNECT ON DATABASE

Note The following permissions are the same, regardless of the ingestion mode: SQL or SQL-API.

You need a role that can access the Snowflake shared read-only database. To access the shared database, the account administrator must grant the IMPORTED PRIVILEGES privilege on the shared database to the user that runs the lineage harvester.

Tip If the default role in Snowflake does not have the IMPORTED PRIVILEGES privilege, you can use the customConnectionProperties property in the lineage harvester configuration file to assign the appropriate role to the user. For example:
"customConnectionProperties": "role=METADATA"

You need read access on the DBC.

You need read access to the following dictionary views:

all_tab_cols
all_col_comments
all_objects
ALL_DB_LINKS
all_mviews
all_source
all_synonyms
all_views

You need read access on definition_schema.

Your user role must have privileges to export assets.
You must have read permission on all assets that you want to export.

You have added the Matillion certificate to a Java truststore.
You have at least a Matillion Enterprise license.

Steps

Start the lineage harvester to create an empty lineage harvester configuration file by entering the following command:
- Windows: .\bin\lineage-harvester.bat
- For other operating systems: chmod +x bin/lineage-harvester and then bin/lineage-harvester
An empty configuration file is created in the config folder.
Optional: If you use more than one lineage harvester on different servers, repeat Step 1 to create an empty lineage harvester configuration file for each lineage harvester on each server.
Note Use multiple configuration files for lineage harvesters on different servers only for testing purposes.

Open the configuration file and enter the values for each property. If you create multiple configuration files, ensure that each configuration file contains configurations of unique data sources. If the same data source is configured in more than one configuration file, only one of those configurations takes effect.

Tip Use the configuration file generator to create an example configuration file with the properties of your choosing. You can copy this example to your configuration file and replace the values of the properties to match your data source information.
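
As a rough orientation, a filled-in configuration file might look like the following sketch. It assumes that the file is JSON, that `catalog` and `useCollibraSystemName` are nested under `general`, and that each data source is one object in the `sources` array, as the property table below suggests. All values (URL, usernames, path, IDs, database, schema and host names, port) are hypothetical placeholders, and optional properties such as `collibraSystemName` are omitted.

```json
{
  "general": {
    "catalog": {
      "url": "https://example.collibra.com",
      "username": "lineage-admin"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_first_data_source",
      "type": "SqlDirectory",
      "path": "/path/to/sql/files",
      "mask": "*",
      "recursive": false,
      "dialect": "postgres",
      "database": "MyFirstDatabase",
      "schema": "public",
      "verbose": true,
      "deleteRawMetadataAfterProcessing": false
    },
    {
      "id": "my_second_data_source",
      "type": "Database",
      "username": "db-reader",
      "dialect": "mssql",
      "databaseNames": ["MySecondDatabase"],
      "hostname": "db.example.com",
      "port": 1433,
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

The two sources in this sketch have different IDs, in line with the requirement that each ID is unique. If you split data sources across several configuration files, the same data source should not appear in more than one of them.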

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether you want to use the system or server name of a data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name. Specify one of the following values: `false` (default): The lineage harvester ignores the `collibraSystemName` property in the rest of the configuration file. `true`: The lineage harvester reads the value of the `collibraSystemName` property in all sections of the configuration file. Only specify this value when you have multiple databases with the same name. The lineage harvester also reads the `collibraSystemName` property in the following files: the Informatica <source ID> configuration file, the IBM DataStage or SQL Server Integration Services connection definition configuration files, and the Informatica Intelligent Cloud Services <source ID> configuration file. Important For Informatica and Informatica Intelligent Cloud Services, you must prepare a <source ID> configuration file regardless of whether the `useCollibraSystemName` property in your lineage harvester configuration files is set to `true` or `false`. Note For SQL data sources, if this property is `false`, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name. If this property is `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog, and the value of the `collibraSystemName` property is used as the default system or server name. For Matillion, this property indicates whether you intend to use a Matillion <source ID> configuration file to specify the system name of a data source. This is useful if you have multiple databases with the same name, or if you want to group a number of databases under one system. If you set this property to `true`, you must prepare a Matillion <source ID> configuration file.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. Note You can add multiple data sources to the same configuration file.
<SQL directory properties>	This configuration section contains the required information of one individual SQL directory with connection type "Folder".
id	The unique ID of the data source. For example, `my_first_data_source`.
type	The kind of data source. In this case, the value has to be `SqlDirectory`.
path	The full path to the folder where you added SQL files, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indicates which files you want to harvest: `false` (default): Only harvest the files directly under the folder in the SQL directory path. `true`: Harvest all files under the folder in the SQL directory path and its subdirectories.
dialect	The dialect of the database. You can enter one of the following values: `azure`, for an Azure SQL Server data source; `bigquery`, for a Google BigQuery data source; `db2`, for an IBM DB2 data source; `hana`, for a SAP HANA data source; `hana-cviews`, for a SAP HANA data source; `hive`, for a HiveQL data source; `greenplum`, for a Greenplum data source; `mssql`, for a Microsoft SQL Server data source; `mysql`, for a MySQL data source; `netezza`, for a Netezza data source; `oracle`, for an Oracle data source; `postgres`, for a PostgreSQL data source; `redshift`, for an Amazon Redshift data source; `snowflake`, for a Snowflake data source; `spark`, for a Spark SQL data source; `sybase`, for a Sybase data source; `teradata`, for a Teradata data source. Important The `hana-cviews` dialect is supported for SAP HANA (on-premises). It is not supported for SAP HANA Cloud.
database	The name of your database, which is the full name of your Database asset. Note You have to use the same database name as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive. The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the `database` and `schema` properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the `database` and `schema` properties in the configuration file for stitching. For more information, go to Prepare the SQL directory and Automatic stitching for technical lineage. Important HiveQL, MySQL and Teradata data sources don't have schemas. Therefore, HiveQL, MySQL and Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: For HiveQL and Teradata: The database name is the name that you enter for the `collibraSystemName` property. The schema name is the name that you enter for the `database` property. For MySQL: The database name is the name that you enter for the `database` property.
collibraSystemName	The name of the data source's system or server. This is also the full name of your System asset in Data Catalog. Specify this property with the same name as the full name of the System asset that you created when you prepared the physical data layer in Data Catalog or registered the data source. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
schema	The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset. Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
verbose	Indicates whether you want to enable verbose logging. By default, this is set to `True`. If you don't want to use verbose logging, set it to `False`.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If it is set to `false`, the raw source metadata is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.
<External directories>	This configuration section contains the required information to connect to the following data sources: Informatica PowerCenter, SQL Server Integration Services (SSIS) and IBM InfoSphere DataStage. Note Make sure that you have prepared a local folder with the Informatica objects, SSIS files or DataStage files for which you want to create a technical lineage.
collibraSystemName	The name of the data source's system or server. If the `useCollibraSystemName` property is set to `true`, you must prepare a configuration file to provide the system information.
id	The unique ID of your data source. For example, my_informatica.
type	The kind of data source. In this case, the value has to be ExternalDirectory.
dirType	The type of external directory. The value has to be one of the following: `infa`, for an Informatica PowerCenter data source. `ssis`, for a SQL Server Integration Services data source. `datastage`, for an IBM InfoSphere DataStage data source.
path	The full path to the folder where you stored the data source, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indicates whether you want to use recursive queries. By default, this is set to `False`. If you want to use recursive queries, set it to `True`.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If it is set to `false`, the raw source metadata is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.
<Azure Data Factory>	This configuration section contains the required information for an Azure Data Factory data source.
id	The unique ID that identifies the data source on a Collibra Data Lineage service instance, for example, my_adf.
type	The type of data source. The value must be AzureDataFactory.
collibraSystemName	The system or server name of the data source. This property is optional. Use this property with the `useCollibraSystemName` property to override the default Collibra System asset name for this data source. Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
tenantDomain	The directory ID of the Azure Data Factory instance.
loginFlow	This section contains the login application information.
applicationId	The application ID of the Azure Data Factory instance.
type	The identity of the application. The value has to be ServicePrincipal.
resourceGroupName	The name of the resource group with the Reader role for the Azure Data Factory instance.
subscriptionId	The subscription ID of the resource group.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If it is set to `false`, the raw source metadata is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.
<Informatica Intelligent Cloud Services Data Integration>	This configuration section contains the required information to enable the lineage harvester to collect and process Data Integration objects. To avoid errors that might occur when the lineage harvester ingests metadata from a single large source, you can split the source: move some of the projects to a separate source that has its own ID and its own Informatica Intelligent Cloud Services <source ID> configuration file. Example: "sources" : [ { "type" : "IICS", "id" : "iics_source-1", "collibraSystemName" : "iics-development", "loginUrl" : "https://dm-us.informaticaintelligentcloud.com", "username" : "login-iics", "objects" : [ { "path" : "Default/Sales", "type" : "Project" }, { "path" : "My Project/Statistics", "type" : "Project" } ] }, { "type" : "IICS", "id" : "iics_source-2", "collibraSystemName" : "iics-development", "loginUrl" : "https://dm-us.informaticaintelligentcloud.com", "username" : "login-iics", "objects" : [ { "path" : "Finance/Task_Flows", "type" : "Folder" }, { "path" : "Common/Task_Flows/tf_CalendarDimension", "type" : "Taskflow" } ] } ] Tip Make sure you have READ permission on all data objects that you want to harvest.
type	The kind of data source. In this case, the value has to be `IICS`.
id	The unique ID that is used to identify the data source on the Collibra Data Lineage service. For example, `my_data_integration`.
collibraSystemName	The name of the Informatica server or system. Important You must prepare a <source ID> configuration file to provide this system information. This is true regardless of whether the `useCollibraSystemName` property is set to true or false.
loginUrl	The URL of the Informatica Intelligent Cloud Services environment sign-in page. For example: `https://dm-us.informaticaintelligentcloud.com`.
username	The username you use to sign in to Informatica Intelligent Cloud Services.
objects	The objects that you want to export. Each object requires a path and a type, for example: "objects": [ { "path" : "Sales", "type" : "Project" }, { "path" : "Finance/Task_Flows", "type" : "Folder" }, { "path" : "Common/Task_Flows/tf_CalendarDimension", "type" : "Taskflow" } ] The following section provides information to identify and access Data Integration objects. Tip For more information about the objects that you can export and the required information, see the Informatica documentation.
path	The full path to the object, for example, `Finance/Task_Flows`.
type	The type of the object, for example, Taskflow. The starting point of the IICS scanner is a Taskflow. Therefore, the only meaningful types to export are Taskflow, Project and Folder. Note The types are not case-sensitive.
paramFiles	The full path to the directory in which your parameter files are stored. This is an optional parameter that allows you to harvest parameter files in Informatica Intelligent Cloud Services data sources. Important The hierarchy of the files in the directory must be an exact match of the hierarchy of the files in your file system. To set this up: 1. Create a directory for your parameter files. For this example, let's name the directory my-parameter-files. 2. In your lineage harvester configuration file, set the value of the `paramFiles` property to the full path to your parameter files directory, for example `/full/path/<my-parameter-files>/`. 3. Copy your parameter files to your parameter files directory. Be sure to preserve the full path of each parameter file. For example, for parameter file /root/child/child2/paramfile.txt, run the following commands: `cd /full/path/<my-parameter-files>/` `mkdir -p root/child/child2/` `cp /root/child/child2/paramfile.txt root/child/child2/`
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If it is set to `false`, the raw source metadata is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.
<Matillion>	This section contains the required information for Matillion. Tip When you create a new project in Matillion, you define in which group you want to create the project, the project name and the environment name. This information is needed to enable the lineage harvester to access Matillion and scan your metadata. Important Currently, you can only create a technical lineage for Snowflake and Redshift projects in Matillion.
id	The unique ID that is used to identify the data source on the Collibra Data Lineage service instance. For example, `my_matillion_data_integration`.
type	The kind of data source. In this case, the value has to be `Matillion`.
url	The URL of your Matillion environment. For example, `https://<domain name>` or `https://<IP address>`.
groupName	The name of your group in Matillion.
projectName	The name of your project in Matillion. You can only add the name of one project. If you want to create a technical lineage for other projects within the same group, create a new section in the lineage harvester configuration file.
environmentName	The name of your environment in Matillion. You can only add the name of one environment. If you want to create a technical lineage for other environments within the same project, create a new section in the lineage harvester configuration file.
dialect	The dialect of the database. You can enter one of the following values: `redshift`, for an Amazon Redshift data source. `snowflake`, for a Snowflake data source.
startTimestamp	The timestamp of tasks in Matillion. You can use this parameter to limit the amount of metadata that the lineage harvester scans. Specify this property with a UNIX timestamp in milliseconds. If this property remains empty or is deleted from the configuration file, all accessible tasks are scanned. Matillion provides seven days of history by default and automatically removes entries older than seven days.
collibraSystemName	Regardless of the value set for the `useCollibraSystemName` property, the following is true: You must include this property in your configuration file. You can leave this property empty. Any value that you give is ignored. If the `useCollibraSystemName` property is set to `true`, you must prepare a Matillion <source ID> configuration file. In that case, the `collibraSystemName` property in the <source ID> configuration file is taken into account. Note This is a legacy property that will be deprecated in a future release.
auth	The section contains the authentication details for signing in to Matillion.
type	The authentication method you want to use to sign in to Matillion. The value must be either: `Basic`, for username and password authentication. `Token`, for token-based authentication. Important These values are case-sensitive.
username	The username that you use to sign in to Matillion. Important This property is only required if you are using the username and password authentication method. If you are using token-based authentication, do not include this property.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If it is set to `false`, the raw source metadata is stored in the Collibra infrastructure. Note Setting this property to `true` can negatively impact performance.
<database properties>	This configuration section contains the required information of one individual data source with connection type "JDBC".
id	The unique ID of your data source. For example, `my_second_data_source`.
type	The kind of data source. In this case, the value has to be `Database`.
username	The username that you use to sign in to your data source.
dialect	The dialect of the database. You can enter one of the following values: `azure`, for an Azure SQL Server data source; `db2`, for an IBM DB2 data source; `hana`, for a SAP HANA data source; `hana-cviews`, for a SAP HANA data source; `hive`, for a HiveQL data source; `greenplum`, for a Greenplum data source; `mssql`, for a Microsoft SQL Server data source; `mysql`, for a MySQL data source; `netezza`, for a Netezza data source; `oracle`, for an Oracle data source; `postgres`, for a PostgreSQL data source; `redshift`, for an Amazon Redshift data source; `spark`, for a Spark SQL data source; `sybase`, for a Sybase data source; `teradata`, for a Teradata data source. Important The `hana-cviews` dialect is supported for SAP HANA (on-premises). It is not supported for SAP HANA Cloud. If you want to use a Spark SQL data source, make sure that you have an AWS host.
databaseNames	The names or IDs of your databases. Enter the database names of your data source between double quotes (") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, `["MyFirstDatabase", "MySecondDatabase"]`. Note Ensure that you use the same database names as the full names of the Database assets. The names are case-sensitive. Important HiveQL, MySQL and Teradata data sources don't have schemas. Therefore, HiveQL, MySQL and Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: For HiveQL and Teradata: The database name is the name that you enter for the `collibraSystemName` property. The schema name is the name that you enter for the `database` property. For MySQL: The database name is the name that you enter for the `database` property. Workaround for Oracle When ingesting an Oracle data source, the value of the `databaseNames` property in your configuration file must be either the Oracle SID or service name, depending on whether you set the `connectAsServiceName` property to `true` or `false`. This means that the database in the technical lineage will have the name of the Oracle SID or service name. However, if the Database asset in Data Catalog reflects the true name of the database, stitching will break. To resolve this issue and preserve stitching, you need to rename the Database asset in Data Catalog to match the value you put for the `databaseNames` property. This is a known issue that we will fix in a future version of Collibra. Tip To avoid this workaround, you can use the `"type": "DatabaseOracle"` and related properties in your configuration file. That allows you to specify the Oracle database name and preserve stitching in cases where the database name is not the same as the SID or service name.
externalDbName	This property can be considered a means of database mapping, to help preserve stitching. It is relevant only for HiveQL, MySQL and Teradata data sources, specifically because they are schema-less data sources. You can add the key/value pair to the configuration file, as follows: `"externalDbName": "CDATA"` Example: Let's say you ingest a HiveQL data source via Edge. Note that Edge gives the name "CDATA" for the database. The full path to a column is something like: `Hive_123` (system) > `CDATA` (database) > `Hive_ABC` (schema) > `Table` > `Column` Now, because HiveQL is database-less, the value that you give for the `database` property in your configuration file is used as the schema name in the technical lineage, and the value you give for `collibraSystemName` is used as the database name. But if `useCollibraSystemName` is set to `true`, then the value of `collibraSystemName` is also used as the system name. In that case, in the full path to the column, the system name and the database name are the same: `Hive_123` (system) > `Hive_123` (database) > `Hive_ABC` (schema) > `Table` > `Column` Notice the mismatch between the database names. The `externalDbName` property tells the lineage harvester to use the value that you specify here for the database name in the technical lineage, specifically "CDATA". This ensures that the full paths match and stitching is preserved.
hostname	The name of your database host.
collibraSystemName	The name of the data source's system or server. This is also the full name of your System asset in Data Catalog. Specify this property with the same name as the full name of the System asset that you created when you prepared the physical data layer in Data Catalog or registered the data source. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. If the `useCollibraSystemName` property is: `false` (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` field is used as the default system or server name.
port	The port number.
customConnectionProperties	An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. If set to `true`, the raw source metadata is deleted after processing. If set to `false` (default), it is stored in the Collibra infrastructure. Note: Setting this property to `true` can negatively impact performance.
<Oracle>	This configuration section contains the required information for an Oracle database. Tip We recommend the `"type": "DatabaseOracle"` configuration described in this section, because it allows you to specify the Oracle database name and preserve stitching in cases where the database name is not the same as the SID or service name. You can, however, still use the legacy `"type": "Database"` configuration to ingest Oracle databases.
id	The unique ID of your Oracle database. For example, `my_oracle_db`.
type	The kind of data source. In this case, the value has to be `DatabaseOracle`.
hostname	The name of your database host.
username	The username that you use to sign in to your Oracle database.
port	The port number.
sids	One or more system identifiers (SIDs). An SID is a unique name for an Oracle database instance on a specific host. You can use this property in conjunction with the `databaseNames` property, to preserve stitching. Important: You must specify either one or more SIDs via this property, or one or more service names via the `serviceNames` property. You cannot include both properties in the configuration file. Example 1: You include the `sids` property, but not the `databaseNames` property: `{ "id": "oracle1", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "sids": ["sid1", "sid2"] }` Result: The database names in the technical lineage will be "sid1" and "sid2". If these don't match your Database assets in Collibra, stitching won't work. Example 2: You include the `sids` property and the `databaseNames` property: `{ "id": "oracle2", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "sids": ["sid1", "sid2"], "databaseNames": ["db1", "db2"] }` Result: The SID "sid1" corresponds to the Database asset name "db1" in Collibra, so stitching is preserved. The same is true for SID "sid2" and Database asset name "db2".
serviceNames	One or more service names. A service name is the TNS alias that you give when you remotely connect to your database. You can use this property in conjunction with the `databaseNames` property, to preserve stitching. Important: You must specify either one or more service names via this property, or one or more SIDs via the `sids` property. You cannot include both properties in the configuration file. Example 1: You include the `serviceNames` property, but not the `databaseNames` property: `{ "id": "oracle3", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "serviceNames": ["sn1", "sn2"] }` Result: The database names in the technical lineage will be "sn1" and "sn2". If these don't match your Database assets in Collibra, stitching won't work. Example 2: You include the `serviceNames` property and the `databaseNames` property: `{ "id": "oracle4", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "serviceNames": ["sn1", "sn2"], "databaseNames": ["db1", "db2"] }` Result: The service name "sn1" corresponds to the Database asset name "db1" in Collibra, so stitching is preserved. The same is true for service name "sn2" and Database asset name "db2".
databaseNames	The names of one or more Oracle databases. You can use this optional property in conjunction with the `sids` or `serviceNames` property, to preserve stitching. The values that you specify have to match the names of your Database assets in Collibra. Enter each Oracle database name between double quotes ("") and put the list between square brackets ([]). If you want to include more than one database, separate the names with commas. For example, `["MyFirstDatabase", "MySecondDatabase"]`. If you use this property, the database names that you specify have to correspond to the databases that you specify in the `sids` or `serviceNames` property. If you don't use this property, the database names in the technical lineage will be the values that you specify for the `sids` or `serviceNames` property. Tip: For examples of how to configure this property, see the `sids` or `serviceNames` property descriptions and examples.
collibraSystemName	The name of the data source's system or server. This is also the full name of your System asset in Data Catalog. Specify this property with the same name as the full name of the System asset that you created when you prepared the physical data layer in Data Catalog or registered the data source. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. If the `useCollibraSystemName` property is `false` (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name. If the property is `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` field is used as the default system or server name.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. If set to `true`, the raw source metadata is deleted after processing. If set to `false` (default), it is stored in the Collibra infrastructure. Note: Setting this property to `true` can negatively impact performance.
<Google BigQuery database>	This configuration section contains the required information for a Google BigQuery database.
id	The unique ID of your data source. For example, `my_third_data_source`.
type	The kind of data source. In this case, the value has to be `DatabaseBigQuery`.
projectIDs	The IDs of your Google BigQuery projects. You can add multiple projects. For example, `[ "first-project", "second-project", "third-project" ]`. Note: You have to use the same project ID as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
region	The location of your BigQuery data. This is the region that you specified when you created the dataset. You can specify only one location as the value. However, you can create a separate BigQuery entry per location in the configuration file to build a complete technical lineage with Google BigQuery data from different locations. Note: This property is optional.
auth	The path to a JSON file that contains authentication information. Tip: For more information about setting up the authentication, see the Google BigQuery user guide.
collibraSystemName	The name of the Google BigQuery system. This is also the full name of your System asset in Data Catalog. Specify this property with the same name as the full name of the System asset that you created when you prepared the physical data layer in Data Catalog or registered the data source. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. If set to `true`, the raw source metadata is deleted after processing. If set to `false` (default), it is stored in the Collibra infrastructure. Note: Setting this property to `true` can negatively impact performance.
sources	This section contains the Snowflake connection properties. If you want to create the technical lineage for multiple data sources, create a `sources` section for each data source.
id	The unique ID that identifies the data source on a Collibra Data Lineage service instance, for example, `my_snowflake_2`.
type	The type of data source. The value must be `DatabaseSnowflake`.
mode	The Snowflake ingestion method that Collibra Data Lineage uses to ingest metadata from Snowflake data sources. Specify one of the following values: `SQL`: The SQL Snowflake ingestion mode. Collibra Data Lineage creates a column-level technical lineage based on SQL statements. This is the default value. `SQL-API`: The SQL-API Snowflake ingestion mode. Collibra Data Lineage creates a column-level technical lineage based on Snowflake schemas and the access history. For more information, go to Technical lineage for Snowflake ingestion methods.
collibraSystemName	The system or server name of the data source. This property is optional. Use this property with the `useCollibraSystemName` property to override the default Collibra System asset name for this data source. Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
auth	This section contains the authentication details for connecting to the Snowflake database. Note: The `username` and `auth` properties are mutually exclusive.
type	The authentication method. Specify one of the following values. The values are case-sensitive. `Basic`: The username and password authentication method. Specify the `auth.username` property if you use this authentication method. `KeyPair`: The key pair authentication method. Specify the `auth.pathToPrivateKey` and `auth.usePassword` properties if you use this authentication method.
username	The user name that you use to connect to the Snowflake database.
pathToPrivateKey	The path to your private key file. This property is required if you use the key pair authentication method. Ensure that the private key matches the public key; otherwise, an error occurs indicating that the JWT token is invalid. For more information about the error, go to Snowflake JDBC driver error at login: net.snowflake.client.jdbc.SnowflakeSQLException: JWT token is invalid in Collibra Support Portal.
usePassword	Indicates whether the private key file requires a password. This property is required if you use the key pair authentication method. Specify one of the following values: `true`: The password is required. `false`: The password is not required. This is the default value.
username	The username that you use to sign in to your Snowflake data source. Note: This property is deprecated. Use the `auth` property instead. The `username` property and the `auth` property are mutually exclusive.
hostname	The URL that you use to access the Snowflake web console. When you enter the URL, do not include `https://` or the trailing slash (/). For example, specify `<accountName>.snowflakecomputing.com`.
databaseNames	The names of your databases. Specify this property with the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog. Enter each database name between double quotes ("") and put the list between square brackets ([]). If you want to include more than one database, separate the names with commas, for example, `["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"]`.
warehouse	The name of your virtual warehouse. This property is optional.
customConnectionProperties	An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file. Example: If you get an OCSP scan error, you can turn off OCSP checking by using the following value: `insecureMode=true`.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. If set to `true`, the raw source metadata is deleted after processing. If set to `false` (default), it is stored in the Collibra infrastructure. Note: Setting this property to `true` can negatively impact performance.
<SQL files in the lineage harvester output folder>	This configuration section contains the required information for SQL files of a data source that were previously downloaded by the lineage harvester and are stored in the lineage harvester output folder.
type	The kind of data source. In this case, the value has to be `LoadedSource`.
id	The unique ID of the data source that you uploaded to the lineage harvester folder. For example, `my_loaded_snowflake_source`.
zipFile	The full path to the ZIP file that was created in the lineage harvester folder.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. If set to `true`, the raw source metadata is deleted after processing. If set to `false` (default), it is stored in the Collibra infrastructure. Note: Setting this property to `true` can negatively impact performance.

Save the configuration file.

Example

{
	"general": {
		"catalog": {
			"url": "https://<organization>.collibra.com",
			"username": "<your-collibra-username>"
		},
		"useCollibraSystemName": false
	},
	"sources": [
		{
			"type": "AzureDataFactory",
			"id": "adf_source",
			"collibraSystemName": "this_is_ignored",
			"tenantDomain": "my_server.onmicrosoft.com",
			"loginFlow": {
				"type": "ServicePrincipal",
				"applicationId": "00000000-1111-2222-3333-444444444444"
			},
			"subscriptionId": "99999999-8888-7777-6666-555555555555",
			"resourceGroupName": "adf_read_only_resource_group"
		}
	]
}
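
As a further illustration, the Snowflake properties described in the table could be combined into a single `sources` entry along the following lines. This is a minimal sketch, not a definitive configuration: the values in angle brackets are placeholders, the `general` section is omitted for brevity, and including `username` inside the `auth` section together with the `KeyPair` properties is an assumption. Only property names that appear in the table are used.

```json
{
	"sources": [
		{
			"id": "my_snowflake_2",
			"type": "DatabaseSnowflake",
			"mode": "SQL",
			"auth": {
				"type": "KeyPair",
				"username": "<your-snowflake-username>",
				"pathToPrivateKey": "<path-to-private-key-file>",
				"usePassword": true
			},
			"hostname": "<accountName>.snowflakecomputing.com",
			"databaseNames": ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"],
			"warehouse": "<your-virtual-warehouse>",
			"deleteRawMetadataAfterProcessing": false
		}
	]
}
```

Because the deprecated top-level `username` property and the `auth` section are mutually exclusive, this sketch uses only the `auth` section. For username and password authentication, the `auth` section would instead contain `"type": "Basic"` and the `username` property.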

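The Google BigQuery and `LoadedSource` properties from the table can be sketched in the same way. The fragment below is illustrative only: the values in angle brackets are placeholders that you replace with your own values, the project IDs and source IDs are the sample values used in the property descriptions, and the `general` section is omitted for brevity.

```json
{
	"sources": [
		{
			"id": "my_third_data_source",
			"type": "DatabaseBigQuery",
			"projectIDs": ["first-project", "second-project"],
			"region": "<your-bigquery-location>",
			"auth": "<path-to-authentication-json-file>",
			"collibraSystemName": "<full-name-of-your-system-asset>",
			"deleteRawMetadataAfterProcessing": false
		},
		{
			"id": "my_loaded_snowflake_source",
			"type": "LoadedSource",
			"zipFile": "<full-path-to-zip-file>"
		}
	]
}
```

Each entry has its own unique `id`. Because `region` accepts only one location, BigQuery data in a second location would need a separate `DatabaseBigQuery` entry with a different `id`.
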
What's next

Run the lineage harvester. If you encounter errors that are related to the lineage harvester configuration file when you run the lineage harvester, you can use the technical lineage troubleshooting guide or the Collibra Support Portal to fix the errors.