Prepare the lineage harvester configuration file

Warning The lineage harvester is now deprecated and will officially reach its end-of-life on July 31, 2026. To ensure a smooth transition, we encourage you to begin creating technical lineage via Edge, if you haven't already.

Before you can visualize the technical lineage, you have to create a configuration file for the (meta)data sources that you want to process. The lineage harvester uses this configuration file to extract data from the (meta)data sources for which you want to create a technical lineage or that you want to ingest.

If you use multiple lineage harvesters on different servers, you can create a separate configuration file for the lineage harvester on each server and configure different data sources in each configuration file.

Note

Technical lineage supports a limited list of (meta)data sources.
In all lineage harvester files, you must use UTF-8 or ISO-8859-1 characters, with the exception of SQL files, which can only be UTF-8 encoded.
Each data source has an ID property. The ID string must be unique and human readable. The ID can be anything and is only used to identify the batch of metadata that is processed on the Collibra Data Lineage service.
The lineage harvester connects to different Collibra Data Lineage service instances based on your geographical location and cloud provider. Make sure you have the correct system requirements before you run the lineage harvester. If your location or cloud provider changes, the lineage harvester rescans all your data sources.
Comments in the lineage harvester configuration file are not supported.
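
As a rough sketch of the overall shape described in the notes above, a configuration file with two data sources might be laid out as follows. The two `id` values are hypothetical and only illustrate that each ID must be unique and human readable. The placeholder values in angle brackets must be replaced, and each source needs further properties that depend on its type. Because comments are not supported in the configuration file, the sketch contains none.

```json
{
  "general": {
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "finance_sql_scripts",
      "type": "SqlDirectory"
    },
    {
      "id": "warehouse_jdbc",
      "type": "Database"
    }
  ]
}
```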

Before you begin

Download and install the lineage harvester.

Tip You can use the configuration file generator to create an example configuration file to accommodate the data sources you specify in the generator. You can then copy the example code to your configuration file and replace the values of the properties to suit your needs.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_adf.
type	The type of data source. The value must be AzureDataFactory.
collibraSystemName (Deprecated)	This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the `useCollibraSystemName` property in the source id file.
tenantDomain	The directory ID of the Azure Data Factory instance.
loginFlow	This section contains the login application information.
applicationId	The application ID of the Azure Data Factory instance.
type	The identity of the application. The value has to be ServicePrincipal.
resourceGroupName	The name of the resource group with the Reader role for the Azure Data Factory instance.
subscriptionId	The subscription ID of the resource group.
factories	The Azure Data Factory factories that the lineage harvester collects and processes. Specify this property with an array of Azure Data Factory factory names. This property is optional. The following rules apply when you specify this property: Enter the factory names in square brackets ([ ]), enclose each factory name in double quotes (" "), and separate them by a comma, for example, ["MyFirstFactory", "MySecondFactory"]. The factory name is not case-sensitive. For example, the MyFactory and myfactory factories are considered the same by Azure Data Factory and the lineage harvester. If you do not specify any factory name, the lineage harvester collects and processes all factories that have datasets and pipelines in them.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.

Save the configuration file.
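
Putting the Azure Data Factory properties together, a configuration file could look roughly like the following sketch. The nesting of `applicationId` and `type` under `loginFlow` is inferred from the order of the rows in the table, not stated outright, so treat it as an assumption. All values in angle brackets are placeholders, and the factory names are taken from the example in the `factories` description.

```json
{
  "general": {
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_adf",
      "type": "AzureDataFactory",
      "tenantDomain": "<directory-id>",
      "loginFlow": {
        "applicationId": "<application-id>",
        "type": "ServicePrincipal"
      },
      "resourceGroupName": "<resource-group-name>",
      "subscriptionId": "<subscription-id>",
      "factories": ["MyFirstFactory", "MySecondFactory"],
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

Leaving out the optional `factories` property would, according to the table, make the lineage harvester process all factories that contain datasets and pipelines.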

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the configuration file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name. Note For SQL data sources, if this property is: `false`, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` property is used as the default system or server name.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder". Note You can add multiple data sources to the same configuration file.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.
type	The kind of data source. In this case, the value has to be `SqlDirectory`.
path	The full path to the folder where you added SQL files, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indication of the files you want to harvest: `false` (default): Only harvest the files directly under the folder in the SQL directory path. `true`: Harvest all files under the folder in the SQL directory path and its subdirectories.
dialect	The dialect of the database: `redshift`, `azure`, `bigquery`, `greenplum`, `hive`, `db2`, `oracle`, `postgres`, `mssql`, `mysql`, `netezza`, `snowflake`, `sybase`, `spark`, `teradata`. `hana`, for an SAP HANA data source. `hana-cviews`, for getting lineage from calculated views in an SAP HANA Classic on-premises data source. `hana-cviews-v2`, for getting lineage from calculated views in an SAP HANA Cloud/Advanced data source. Important To get technical lineage including calculated views, you must harvest SAP HANA by specifying two data sources in the lineage harvester configuration file. In one data source, specify the `hana` dialect, and in the other, specify the `hana-cviews` or `hana-cviews-v2` dialect. The value that you enter for this property has to match the dialect of the SQL files in your directory.
database	The name of your database, which is the name of your Database asset. Note You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive. The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the `database` and `schema` properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the `database` and `schema` properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage. Important HiveQL data sources don't have schemas. Therefore, HiveQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: The database name is the name that you enter for the `collibraSystemName` property. The schema name is the name that you enter for the `database` property. Important MySQL data sources don't have schemas. Therefore, MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: The database name is the name that you enter for the `database` property. Important Teradata data sources don't have schemas. Therefore, Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: The database name is the name that you enter for the `collibraSystemName` property. The schema name is the name that you enter for the `database` property.
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. Specify this property with the same name as the name of the System asset that you created when you registered the data source.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
schema	The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset. Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
verbose	Indication whether you want to enable verbose logging. By default this is set to `True`. If you don't want to use verbose logging, set it to `False`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
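
To illustrate how the SQL directory properties and `dependentSourceIds` fit together, the following sketch configures two SQL directories. The source IDs `warehouse_ddl` and `warehouse_etl`, the folder paths, and the `*.sql` mask are hypothetical. The sketch assumes that `dependentSourceIds` is placed in the section of the dependent data source and lists the IDs of the independent sources, which is how the description of the property reads.

```json
{
  "general": {
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "warehouse_ddl",
      "type": "SqlDirectory",
      "path": "/opt/sql/ddl",
      "mask": "*",
      "recursive": false,
      "dialect": "snowflake",
      "database": "<database-asset-name>",
      "schema": "<schema-asset-name>",
      "collibraSystemName": "<system-asset-name>"
    },
    {
      "id": "warehouse_etl",
      "type": "SqlDirectory",
      "path": "/opt/sql/etl",
      "mask": "*.sql",
      "recursive": true,
      "dialect": "snowflake",
      "database": "<database-asset-name>",
      "schema": "<schema-asset-name>",
      "collibraSystemName": "<system-asset-name>",
      "dependentSourceIds": ["warehouse_ddl"]
    }
  ]
}
```

In this layout, the table definitions harvested from `warehouse_ddl` would be shared with `warehouse_etl`, so that SQL statements in the second folder can be analyzed against them.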

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the configuration file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name. Note For SQL data sources, if this property is: `false`, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` property is used as the default system or server name.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This section contains the required information of one individual data source with connection type "JDBC". Note You can add multiple data sources to the same configuration file.
id	The unique ID of the data source. For example, `my_first_data_source`.
type	The kind of data source. In this case, the value has to be `Database`.
username	The username that you use to sign in to your data source.
dialect	The dialect of the database: `redshift`, `azure`, `bigquery`, `greenplum`, `hive`, `db2`, `oracle`, `postgres`, `mssql`, `mysql`, `netezza`, `snowflake`, `sybase`, `spark`, `teradata`. `hana`, for an SAP HANA data source. `hana-cviews`, for getting lineage from calculated views in an SAP HANA Classic on-premises data source. `hana-cviews-v2`, for getting lineage from calculated views in an SAP HANA Cloud/Advanced data source. Important To get technical lineage including calculated views, you must harvest SAP HANA by specifying two data sources in the lineage harvester configuration file. In one data source, specify the `hana` dialect, and in the other, specify the `hana-cviews` or `hana-cviews-v2` dialect. The value that you enter for this property has to match the dialect of the SQL files in your directory.
databaseNames	The names or IDs of your databases. Enter the database names of your data source between double quotes (") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, `["MyFirstDatabase", "MySecondDatabase"]`. Note Ensure that you use the same database names as the names of the Database assets. The names are case-sensitive. Important HiveQL, Spark SQL, and Teradata are database-less data sources. Therefore, HiveQL, Spark SQL, and Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: The database name is the name that you enter for the `externalDbName` property. The schema name is the name that you enter for the `databaseNames` property. If you do not specify a value for the `externalDbName` property, Collibra Data Lineage uses the value of the `collibraSystemName` property as the database name. For details, see the `externalDbName` property below.
externalDbName	This property can be considered a means of database mapping, to help preserve stitching. Note This property is relevant only for HiveQL, Spark SQL, and Teradata data sources, specifically because they are database-less data sources. You can add the key/value pair to the configuration file, as follows: `"externalDbName": "<dbname>"`, where `<dbname>` is one of the following values: `CData`, which CData drivers return as a placeholder. Use this value if you did not create a custom database name by using the `CustomizedDefaultCatalogName` property when you registered your data source. The custom database name that you specified for the `CustomizedDefaultCatalogName` property when you registered your data source. For more information about the `CustomizedDefaultCatalogName` connection property, go to Customizing the database name for database-less data sources.

Example

Let's say you ingest a HiveQL data source via Edge. If you do not customize the database name, Edge gives the name "CData" for the database. The full path to a column is something like:

`Hive_123` (system) > `CData` (database) > `Hive_ABC` (schema) > `Table` > `Column`

Now, because HiveQL is database-less, the value that you give for the `databaseNames` property in your configuration file is used as the schema name in the technical lineage, and the value you give for `collibraSystemName` is used as the database name. But if `useCollibraSystemName` is set to `true`, then the value of `collibraSystemName` is also used as the system name. In that case, in the full path to the column, the system name and the database name are the same:

`Hive_123` (system) > `Hive_123` (database) > `Hive_ABC` (schema) > `Table` > `Column`

Notice the mismatch between the database names. The `externalDbName` property tells the lineage harvester to use the value that you specify here for the database name in the technical lineage, specifically "CData". This ensures that the full paths match and stitching is preserved.

Code examples

The following examples show the full paths when you set the `useCollibraSystemName` and `externalDbName` properties differently.

Code example 1: The `useCollibraSystemName` property is set to `false` and the `externalDbName` property is specified

In the following example, the `useCollibraSystemName` property is set to `false`, and the `externalDbName` property is set as `CData`. The data objects from the HiveQL data source are stitched to the assets in Data Catalog with the following path: `CData` (database) > `Hive_ABC` (schema) > `Table` > `Column`

{
  "general": {
    ⁞
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "database_source",
      "type": "Database",
      "username": "MyUsername",
      "dialect": "hive",
      "externalDbName": "CData",
      "databaseNames": ["Hive_ABC"],
      "collibraSystemName": "Hive_123",
      ⁞
    }
  ]
}

Code example 2: The `useCollibraSystemName` property is set to `true` and the `externalDbName` property is specified

In the following example, the `useCollibraSystemName` property is set to `true`, and the `externalDbName` property is set as `CData`. The data objects from the HiveQL data source are stitched to the assets in Data Catalog with the following path: `Hive_123` (system) > `CData` (database) > `Hive_ABC` (schema) > `Table` > `Column`

{
  "general": {
    ⁞
    "useCollibraSystemName": true
  },
  "sources": [
    {
      "id": "database_source",
      "type": "Database",
      "username": "MyUsername",
      "dialect": "hive",
      "externalDbName": "CData",
      "databaseNames": ["Hive_ABC"],
      "collibraSystemName": "Hive_123",
      ⁞
    }
  ]
}

Code example 3: The `useCollibraSystemName` property is set to `true` and the `externalDbName` property is not specified

In the following example, the `useCollibraSystemName` property is set to `true`, but the `externalDbName` property is not specified. The data objects from the HiveQL data source are not stitched to the assets in Data Catalog because of the following mismatch: In Data Catalog, the full path is `Hive_123` (system) > `CData` (database) > `Hive_ABC` (schema) > `Table` > `Column`. However, the lineage harvester has `Hive_123` (system) > `Hive_123` (database) > `Hive_ABC` (schema) > `Table` > `Column`.

{
  "general": {
    ⁞
    "useCollibraSystemName": true
  },
  "sources": [
    {
      "id": "database_source",
      "type": "Database",
      "username": "MyUsername",
      "dialect": "hive",
      "externalDbName": "",
      "databaseNames": ["Hive_ABC"],
      "collibraSystemName": "Hive_123",
      ⁞
    }
  ]
}

hostname	The name of your database host.
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. Specify this property with the same name as the name of the System asset that you created when you registered the data source. If the `useCollibraSystemName` property is: `false` (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` field is used as the default system or server name.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
port	The port number.
customConnectionProperties	An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.

Save the configuration file.
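
For a JDBC data source, the properties in the table could be combined along the lines of the following sketch. The `oracle` dialect is picked arbitrarily from the supported list, and every value in angle brackets is a placeholder. The table does not say whether `port` expects a quoted string or a bare number, so the placeholder form shown here is an assumption. The optional `customConnectionProperties` property is left out because, according to the table, it is only needed in very specific situations.

```json
{
  "general": {
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_first_data_source",
      "type": "Database",
      "username": "<your-data-source-username>",
      "dialect": "oracle",
      "databaseNames": ["MyFirstDatabase", "MySecondDatabase"],
      "hostname": "<your-database-host>",
      "port": "<port-number>",
      "collibraSystemName": "<system-asset-name>",
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```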

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Code example

In the following example, the path of /opt/Collibra/techlin/dbt-core-files/ contains a target/ directory, which includes the compiled SQL files and manifest.json file:

ls /opt/Collibra/techlin/dbt-core-files/target
compiled  manifest.json

{
  "general" : {
    "catalog" : {
      "url" : "https://<organization>.collibra.com",
      "username" : "<your-collibra-username>"
    },
    "useCollibraSystemName" : false
  },
  "sources" : [
    {
      "id" : "my_dbt_source",
      "collibraSystemName" : "",
      "type" : "ExternalDirectory",
      "dirType" : "dbt",
      "path" : "/opt/Collibra/techlin/dbt-core-files/",
      "recursive" : true,
      "deleteRawMetadataAfterProcessing" : false
    }
  ]
}

Properties	Description	Required
general	This section describes the connection between Collibra Data Lineage and Data Catalog.	Yes
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.	Yes for US government customers
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.	Yes for US government customers
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.	Yes for US government customers
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.	Yes
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.	Yes
username	The username that you use to sign in to Collibra.	Yes
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the configuration file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.	No
sources	This configuration section contains the required information for the dbt Core data source. Note Make sure that you have prepared a local folder with the SQL files and Manifest JSON file for which you want to create a technical lineage.	Yes
collibraSystemName	The system or server name of the data source. Use this property with the `useCollibraSystemName` property in the configuration file to override the default Collibra System asset name for this data source. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.	No
id	The unique ID that is used to identify the data source on the Collibra Data Lineage service instance. For example, my_dbt.	Yes
type	The kind of data source. In this case, the value has to be ExternalDirectory.	Yes
dirType	The type of external directory. The value has to be dbt.	Yes
path	The full path to the external directory that you created, for example, `/opt/dbt/my-project/` or `/opt/Collibra/techlin/dbt-core-files`. Ensure that the target/ directory is in the external directory.	Yes
mask	The pattern of the file names in the directory. By default, this is `*`, which sends the SQL and JSON files to the Collibra Data Lineage service instance.	No
recursive	Indicates whether you want to use recursive queries. By default, this is set to `false`. For dbt Core, you must set the value to `true`.	Yes
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.	No
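
For US government customers, the `techlin` properties described in the table could be added to the code example above roughly as follows. This is a sketch, not a verified configuration: it assumes that the `techlin` section is nested in the `general` section, as the order of the properties in the table suggests, and `<your-user-key>` is a placeholder for the key that you receive from Collibra.

```json
{
  "general": {
    "techlin": {
      "url": "https://techlin-gov.collibra.com",
      "userKey": "<your-user-key>"
    },
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_dbt_source",
      "type": "ExternalDirectory",
      "dirType": "dbt",
      "path": "/opt/Collibra/techlin/dbt-core-files/",
      "mask": "*",
      "recursive": true,
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

The `mask` property is shown with its default value only to illustrate where it goes. You can omit it.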

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the configuration file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name. Note For SQL data sources, if this property is: `false`, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` property is used as the default system or server name.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder". Note You can add multiple data sources to the same configuration file.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.
type	The kind of data source. In this case, the value has to be `SqlDirectory`.
path	The full path to the folder where you added SQL files, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indication of the files you want to harvest: `false` (default): Only harvest the files directly under the folder in the SQL directory path. `true`: Harvest all files under the folder in the SQL directory path and subdirectories.
dialect	The dialect of the database. For example, bigquery. The value that you enter for this property has to match the dialect of the SQL files in your directory.
database	The name of your database, which is the name of your Database asset. Note You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive. The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the `database` and `schema` properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the `database` and `schema` properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage. Important MySQL data sources don't have schemas. Therefore, MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names: The database name is the name that you enter for the `database` property.
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. Specify this property with the same name as the name of the System asset that you created when you registered the data source.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
schema	The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset. Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
verbose	Indicates whether you want to enable verbose logging. By default, this is set to `true`. If you don't want to use verbose logging, set it to `false`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
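
As an illustration of how the properties in this table fit together, the following sketch shows a configuration file with two SQL directory data sources, where the second source depends on the table definitions of the first. The source IDs `ddl_models` and `etl_sql`, the paths, and the database and schema names are hypothetical placeholders, not values from a real environment. Because comments are not supported in the configuration file, the example contains none.

```json
{
  "general": {
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "ddl_models",
      "type": "SqlDirectory",
      "path": "/path/to/ddl/dir",
      "mask": "*",
      "recursive": false,
      "dialect": "bigquery",
      "database": "<database asset name>",
      "schema": "<schema asset name>",
      "verbose": true
    },
    {
      "id": "etl_sql",
      "type": "SqlDirectory",
      "path": "/path/to/etl/dir",
      "mask": "*",
      "recursive": true,
      "dialect": "bigquery",
      "database": "<database asset name>",
      "schema": "<schema asset name>",
      "verbose": true,
      "dependentSourceIds": ["ddl_models"],
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

In this sketch, `etl_sql` is the dependent data source and `ddl_models` is the independent data source that provides the table definitions. The limitation on lowercase column names described for `dependentSourceIds` still applies.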

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the Data Catalog physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name.
sources	This configuration section contains the required information for a Google BigQuery database.
id	The unique ID of your data source. For example, `my_third_data_source`.
type	The kind of data source. In this case, the value has to be `DatabaseBigQuery`.
projectIDs	The IDs of your Google BigQuery projects. You can add multiple projects. For example, `[ "first-project", "second-project", "third-project" ]`. Note You have to use the same project ID as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
region	The location of your BigQuery data. This is the region that you specified when you created a dataset. If the region that you specify here doesn't match the region you specified when you created a dataset, then: The metadata of that dataset will not be harvested. Metadata of the datasets in the region you specify here will be harvested. If you don't specify a region, the region defaults to US, meaning that metadata (and lineage) will be harvested only for datasets located in the US region. You can only add one location as value. However, you can create separate BigQuery entries per location in the configuration file. As a result, you create a complete technical lineage with Google BigQuery data from different locations. Note This property is optional.
auth	The path to a JSON file that contains authentication information. Tip For more information about setting up the authentication, see the Google BigQuery user guide.
collibraSystemName	The name of the Google BigQuery system. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. Specify this property with the same name as the name of the System asset that you created when you registered the data source.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
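
The following sketch shows what a single Google BigQuery entry in the `sources` section might look like when it is built from the properties in this table. The `id` and `projectIDs` values are the examples from the table. The `region` value `EU`, the path of the authentication file, and the system name are illustrative assumptions that you have to replace with your own values.

```json
{
  "id": "my_third_data_source",
  "type": "DatabaseBigQuery",
  "projectIDs": ["first-project", "second-project", "third-project"],
  "region": "EU",
  "auth": "/path/to/bigquery-auth.json",
  "collibraSystemName": "<system asset name>",
  "deleteRawMetadataAfterProcessing": false
}
```

Because `region` accepts only one location, harvesting datasets from a second location would require another entry like this one, with its own unique `id` and a different `region` value.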

Save the configuration file.

For complete information on creating custom technical lineage by using the lineage harvester, go to Working with custom technical lineage.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	The lineage harvester ignores this property for custom technical lineage. To use the system or server name of your data source to match the System asset in Data Catalog, specify the system data object in: The `tree` and `lineage` sections of your custom technical lineage JSON file, if you use the single-file definition method. Your assets or lineage JSON files, if you use the batch definition method.
sources	Contains the required information to retrieve a custom lineage. Use this property to locate the JSON file that defines the custom technical lineage. If you want to create the technical lineage for multiple data sources, create a `sources` section for each data source.
type	The kind of data source. The value must be `ExternalDirectory`.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `MyCustomLineage`.
dirType	The type of external directory. The value is `custom-lineage`.
collibraSystemName	The lineage harvester ignores this property for custom technical lineage. To use the system or server name of your data source to match the System asset in Data Catalog, specify the system data object in: The `tree` and `lineage` sections of your custom technical lineage JSON file, if you use the single-file definition method. Your assets or lineage JSON files, if you use the batch definition method.
path	The full path to the folder of the custom technical lineage JSON file, for example `C:\path\to\custom-lineage\dir`. If you are using the single-file definition method, there can be only one JSON file that defines the lineage, and the JSON file must be named lineage.json. You can, however, add other files in the harvested directory and subdirectories and refer to those files from within the JSON file.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
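
A minimal sketch of a custom technical lineage entry in the `sources` section, based only on the properties in this table, could look as follows. The path is a hypothetical placeholder. The `collibraSystemName` property is left out because the lineage harvester ignores it for custom technical lineage.

```json
{
  "id": "MyCustomLineage",
  "type": "ExternalDirectory",
  "dirType": "custom-lineage",
  "path": "/path/to/custom-lineage/dir",
  "deleteRawMetadataAfterProcessing": false
}
```

With the single-file definition method, the folder in `path` is expected to contain exactly one lineage definition file, named lineage.json.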

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.
sources	This configuration section contains the required information to connect to IBM InfoSphere DataStage. Note Make sure that you have prepared a local folder with the DataStage files for which you want to create a technical lineage.
collibraSystemName (Deprecated)	This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the `useCollibraSystemName` property in the <source ID> file.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_datastage.
type	The kind of data source. In this case, the value has to be ExternalDirectory.
dirType	The type of external directory. The value has to be `datastage`.
path	The full path to the folder where you stored the data source, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indicates whether you want to use recursive queries. By default, this is set to `false`. If you want to use recursive queries, set it to `true`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether or not the raw metadata should be deleted from Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
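
Put together, an IBM InfoSphere DataStage entry in the `sources` section might look like the following sketch. The path is a hypothetical placeholder, and `mask` and `recursive` are shown with their default values only to illustrate where they go. The deprecated `collibraSystemName` property is omitted.

```json
{
  "id": "my_datastage",
  "type": "ExternalDirectory",
  "dirType": "datastage",
  "path": "/path/to/datastage/dir",
  "mask": "*",
  "recursive": false,
  "deleteRawMetadataAfterProcessing": false
}
```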

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description	Required?
general	This section describes the connection between Collibra Data Lineage and Data Catalog.	Yes
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.	Yes
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.	Yes
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.	Yes
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.	Yes
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.	Yes
username	The username that you use to sign in to Collibra.	Yes
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.	No
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder". Note You can add multiple data sources to the same configuration file.	Yes
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.	Yes
type	The kind of data source. The value must be `dbt`.	Yes
collibraSystemName (Deprecated)	This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the `useCollibraSystemName` property in the source id file.	No
tokenName	The name of the service token. It can be any unique, meaningful name. When you run the lineage harvester, you are prompted for a token. Enter the token value of the service token. To get a service token and its token value: Generate a service token, and make sure that you set the Read-Only permissions, which Collibra Data Lineage needs to work properly. Copy the token value when you save the service token. For details, go to Generating service account tokens in the dbt documentation.	Yes
adminUrl	The dbt Cloud Administrative API that Collibra Data Lineage uses to download job artifacts. The default value is `https://cloud.getdbt.com/api/v2`. This property is used if you do not specify the `environmentIds` property. If you specify both the `adminUrl` and `environmentIds` properties, the `environmentIds` property takes precedence.	No
environmentIds	The IDs of the environments that Collibra Data Lineage uses to download job artifacts. Specify this property with an array of environment IDs, for example `[123456, 987654]`. This property is required if you do not specify the `adminUrl` property. If you specify both the `adminUrl` and `environmentIds` properties, the `environmentIds` property takes precedence.	No
metadataUrl	The dbt Cloud Discovery API. The default value is `https://metadata.cloud.getdbt.com/graphql`. For details, go to Query the Discovery API in dbt documentation.	No
dependentSourceIds	Optional property that specifies data source dependencies for sharing database models. Sharing database models lets you provide table-definition details from an independent data source to a data source that depends on those details. This avoids analysis errors and produces a complete lineage that includes the lineage from the SQL statements in dependent data sources. If Database2 depends on Database1, include the `dependentSourceIds` property in the configuration of Database2 and specify the source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]`. If Database2 depends on more than one data source, specify each independent source: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]`. Important If a dependent data source contains lowercase column names, this feature works only for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects, an analysis error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.	No
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.	No
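
The properties in the table above might be combined as in the following sketch of a configuration file for a dbt Cloud source. It is an illustration under assumptions, not a definitive template: the nesting of `catalog` and `useCollibraSystemName` under `general` is inferred from the order of the table rows, the `techlin` section is omitted because it applies only to US government customers, and the source ID `dbt_cloud_example` and all values in angle brackets are hypothetical placeholders. Because the configuration file does not support comments, these assumptions are listed here instead of in the fragment.

```json
{
  "general": {
    "catalog": {
      "url": "https://<your-collibra-environment>",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "dbt_cloud_example",
      "type": "dbt",
      "tokenName": "<name of your service token>",
      "environmentIds": [123456, 987654],
      "metadataUrl": "https://metadata.cloud.getdbt.com/graphql",
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

The sketch specifies `environmentIds` and leaves out `adminUrl`, because `environmentIds` takes precedence when both are present.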

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.
sources	This configuration section contains the information that the lineage harvester needs to collect and process Data Integration objects. To avoid errors that can occur when the lineage harvester ingests metadata from a single large source, you can split a large data source into several Informatica Intelligent Cloud Services sources, each with its own <source ID> configuration file, and move some of the projects to each source. Example `"sources" : [ { "type" : "IICS", "id" : "iics_source-1", "loginUrl" : "https://dm-us.informaticaintelligentcloud.com", "username" : "login-iics", "objects" : [ { "path" : "Default/Sales", "type" : "Project" }, { "path" : "My Project/Statistics", "type" : "Project" } ] }, { "type" : "IICS", "id" : "iics_source-2", "loginUrl" : "https://dm-us.informaticaintelligentcloud.com", "username" : "login-iics", "objects" : [ { "path" : "Finance/Task_Flows", "type" : "Folder" }, { "path" : "Common/Task_Flows/tf_CalendarDimension", "type" : "Taskflow" } ] } ]` Tip Make sure you have READ permission on all data objects that you want to harvest.
type	The kind of data source. In this case, the value has to be `IICS`.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_data_integration`.
collibraSystemName (Deprecated)	This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the `useCollibraSystemName` property in the source id file.
loginUrl	The URL of the Informatica Intelligent Cloud Services environment sign-in page. For example: `https://dm-us.informaticaintelligentcloud.com`.
username	The username you use to sign in to Informatica Intelligent Cloud Services.
objects	The objects that you want to retrieve. Each object requires a path and a type. Example The following example retrieves Project, Folder, Taskflow, and Workflow objects: `[ { "path" : "Sales", "type" : "Project" }, { "path" : "Finance/Task_Flows", "type" : "Folder" }, { "path" : "Common/Task_Flows/tf_CalendarDimension", "type" : "Taskflow" }, { "path" : "Common/Linear_Task_Flows/wf_StateProvinceDimension", "type" : "Workflow" } ]` Tip For more information about the objects that you can export and the required information, see the Informatica documentation.
path	The full path to the object, for example, `C:\path\to\object-dir`.
type	The type of the object, for example, Taskflow. The starting point of the IICS scanner is a Taskflow or a Linear Taskflow (Workflow). Therefore, the only meaningful types to retrieve are Taskflow, Workflow, Project, and Folder. The types are not case-sensitive.
paramFiles	The full path to the directory in which your parameter files are stored. This optional property allows you to harvest parameter files in Informatica Intelligent Cloud Services data sources. Important The hierarchy of the files in the directory must exactly match the hierarchy of the files in your file system. To set this up: (1) Create a directory for your parameter files. For this example, let's name the directory my-parameter-files. (2) In your lineage harvester configuration file, set the `paramFiles` property to the full path of your parameter files directory, for example `/full/path/<my-parameter-files>/`. (3) Copy your parameter files to your parameter files directory, preserving the full path of each file. For example, for the parameter file `/root/child/child2/paramfile.txt`, run the following commands: `cd /full/path/<my-parameter-files>/` `mkdir -p root/child/child2/` `cp /root/child/child2/paramfile.txt root/child/child2/`
dependentSourceIds	Optional property that specifies data source dependencies for sharing database models. Sharing database models lets you provide table-definition details from an independent data source to a data source that depends on those details. This avoids analysis errors and produces a complete lineage that includes the lineage from the SQL statements in dependent data sources. If Database2 depends on Database1, include the `dependentSourceIds` property in the configuration of Database2 and specify the source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]`. If Database2 depends on more than one data source, specify each independent source: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]`. Important If a dependent data source contains lowercase column names, this feature works only for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects, an analysis error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
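
To show where the optional `paramFiles` property might sit relative to the other properties in the table above, here is a minimal sketch of a single Informatica Intelligent Cloud Services source. Placing `paramFiles` at the same level as `objects` is an assumption based on the order of the table rows, and the username in angle brackets is a hypothetical placeholder. The configuration file does not support comments, so these assumptions are stated here.

```json
{
  "sources": [
    {
      "type": "IICS",
      "id": "my_data_integration",
      "loginUrl": "https://dm-us.informaticaintelligentcloud.com",
      "username": "<your-iics-username>",
      "objects": [
        { "path": "Common/Task_Flows/tf_CalendarDimension", "type": "Taskflow" }
      ],
      "paramFiles": "/full/path/<my-parameter-files>/"
    }
  ]
}
```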

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.
sources	This configuration section contains the required information to connect to Informatica PowerCenter. Note Make sure that you have prepared a local folder with the Informatica objects for which you want to create a technical lineage.
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Use this property with the `useCollibraSystemName` property in the lineage harvester configuration file to override the default Collibra System asset name for this data source. Specify this property with the same name as the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. The following rules apply when you specify the `collibraSystemName` properties in this file and the source ID file: If you specify the `collibraSystemName` property for a database or connection in the source ID file, the value in the source ID file overrides the value of this property for that database or connection. For any databases or connections that do not have a Collibra system name specified in the source ID file, the value of this property is used as a global value.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_informatica`.
type	The kind of data source. In this case, the value has to be `ExternalDirectory`.
dirType	The type of external directory. The value must be `powercenter`.
path	The full path to the folder where you stored the data source, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indicates whether the lineage harvester also collects files in subfolders. Specify one of the following values: `false` The lineage harvester collects only the files in the folder specified by the `path` property. Files in subfolders of that folder are not collected. This is the default value. `true` The lineage harvester collects files in the folder specified by the `path` property and also files in its subfolders. Use this value if the folder specified by the `path` property contains subfolders.
dependentSourceIds	Optional property that specifies data source dependencies for sharing database models. Sharing database models lets you provide table-definition details from an independent data source to a data source that depends on those details. This avoids analysis errors and produces a complete lineage that includes the lineage from the SQL statements in dependent data sources. If Database2 depends on Database1, include the `dependentSourceIds` property in the configuration of Database2 and specify the source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]`. If Database2 depends on more than one data source, specify each independent source: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]`. Important If a dependent data source contains lowercase column names, this feature works only for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects, an analysis error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
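
The Informatica PowerCenter properties in the table above could be combined as in the following sketch. It assumes that `recursive` is written as a JSON boolean and that all properties sit directly in the source section; the System asset name in angle brackets is a hypothetical placeholder. Note that a Windows path needs its backslashes doubled to be a valid JSON string. The configuration file does not support comments, so these assumptions are stated here.

```json
{
  "sources": [
    {
      "id": "my_informatica",
      "type": "ExternalDirectory",
      "dirType": "powercenter",
      "path": "C:\\path\\to\\config\\dir",
      "mask": "*",
      "recursive": true,
      "collibraSystemName": "<name of your System asset>"
    }
  ]
}
```

The `collibraSystemName` value in this sketch is read only if `useCollibraSystemName` is set to `true` in the `general` section.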

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection information between the lineage harvester and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This applies only to Collibra Platform for Government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This applies only to Collibra Platform for Government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This applies only to Collibra Platform for Government customers.
catalog	This section contains information that is necessary to connect to Data Catalog.
url	The URL of your Collibra Platform environment. Note Enter the public URL of your Collibra Platform environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog. Collibra Data Lineage uses the system names to match the structure of databases in Looker to assets in Data Catalog. This is useful when you have multiple databases with the same name. By default, the `useCollibraSystemName` property is set to `false`. If you want to use it, set it to `true`. Important If you set this property to `true`, the lineage harvester reads the value of the `collibraSystemName` property in your Looker <source-ID> configuration file. If you set the `useCollibraSystemName` property to `false`, the lineage harvester ignores the `collibraSystemName` property in the Looker <source-ID> configuration file.
sources	This section contains the Looker connection properties.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. Warning In the `sources` section of your lineage harvester configuration file, you can only specify one `id` property per Looker instance. If you have multiple `id` properties for a single Looker instance, ingestion will fail. If you have multiple `id` properties in the configuration file, it means you intend to ingest from multiple unique Looker instances. Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.
type	The kind of data source. In this case, the value has to be `Looker`.
lookerUrl	The URL to your Looker API. Tip There are two ways to find the Looker API URL: In the API Host URL field in the Looker Admin menu. If this field is empty, you can use the default Looker API URL which you can find in the interactive API documentation. In the interactive API documentation URL. It is the part of the URL before `/api-docs/`. Note Looker 3.1 APIs are deprecated; however, the API3 credentials for authorization and access control remain valid.
clientId	The username you use to access the Looker API.
domainId	The unique ID of the domain in Collibra Platform in which you want to ingest the Looker assets. This is the default domain. If you want to ingest the contents of specific Looker Folders into specific domains in Collibra, you specify the domain reference IDs in the filters section of the Looker <source ID> configuration file.
pagingLimit	Optional property for customizing the Looker API pagination settings. The default value of "50" is sufficient in most cases; however, you can decrease it to help mitigate node limit errors, or increase it to speed up API calls. Note The paging limit option is known to cause issues when used with Looker Core instances. If you experience issues, for example a `Received RST_STREAM: Protocol error`, we recommend disabling pagination by setting the value to "0". Example `"pagingLimit": 10`
concurrencyLevel	This optional property is intended to help if you experience HTTP 401 Unauthorized errors caused by too many concurrent HTTP calls that use the same token. It specifies the internal sizing, meaning the number of tasks that can be executed at the same time. The default value is "15", meaning as many as 15 HTTP requests can take place in parallel. Consider reducing the value if you experience HTTP 401 Unauthorized errors. Setting the value to "1" effectively disables concurrency, so that HTTP requests run one after another instead of in parallel. Example `"concurrencyLevel": 5`
connectionTimeoutSeconds	This optional property is intended to help avoid timeout errors when the lineage harvester attempts to connect to your Looker instance. The default value is "30", meaning a timeout error is thrown if a connection is not established within 30 seconds. If timeout errors persist, try adding this property to your lineage harvester configuration file and setting the value to `60` or `90`. Example `"connectionTimeoutSeconds": 60`
dependentSourceIds	Optional property that specifies data source dependencies for sharing database models. Sharing database models lets you provide table-definition details from an independent data source to a data source that depends on those details. This avoids analysis errors and produces a complete lineage that includes the lineage from the SQL statements in dependent data sources. If Database2 depends on Database1, include the `dependentSourceIds` property in the configuration of Database2 and specify the source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]`. If Database2 depends on more than one data source, specify each independent source: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]`. Important If a dependent data source contains lowercase column names, this feature works only for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects, an analysis error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from the specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. Use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
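
As a sketch of how the Looker properties in the table above might fit together, the following source section sets the three optional tuning properties to the default values that the table states. The source ID `my_looker` and the values in angle brackets are hypothetical placeholders, and writing the tuning values as JSON numbers follows the `"pagingLimit": 10` example in the table. The configuration file does not support comments, so these assumptions are stated here.

```json
{
  "sources": [
    {
      "id": "my_looker",
      "type": "Looker",
      "lookerUrl": "https://<your-looker-api-host>",
      "clientId": "<your-looker-api-client-id>",
      "domainId": "<ID of the default domain>",
      "pagingLimit": 50,
      "concurrencyLevel": 15,
      "connectionTimeoutSeconds": 30
    }
  ]
}
```

Because only one `id` is allowed per Looker instance, a second entry in `sources` would describe a different Looker instance.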

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.
sources	This section contains the required information for Matillion. Tip When you create a new project in Matillion, you define in which group you want to create the project, the project name and the environment name. This information is needed to enable the lineage harvester to access Matillion and scan your metadata. Important Currently, you can only create a technical lineage for Snowflake and Redshift projects in Matillion.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_matillion_data_integration`.
type	The kind of data source. In this case, the value has to be `Matillion`.
url	The URL of your Matillion environment. For example, `https://<domain name>` or `https://<IP address>`.
groupName	The name of your group in Matillion.
projectName	The name of your project in Matillion. You can only add the name of one project. If you want to create a technical lineage for other projects within the same group, create a new section in the lineage harvester configuration file.
environmentName	The name of your environment in Matillion. You can only add the name of one environment. If you want to create a technical lineage for other environments within the same project, create a new section in the lineage harvester configuration file.
dialect	The dialect of the database. Enter one of the following values: `azure`, for an Azure SQL Server data source. `bigquery`, for a Google BigQuery data source. `db2`, for an IBM DB2 data source. `hana`, for an SAP HANA data source. `hana-cviews`, for getting lineage from calculated views in an SAP HANA Classic on-premises data source. `hana-cviews-v2`, for getting lineage from calculated views in an SAP HANA Cloud/Advanced data source. Important To get technical lineage including calculated views, you must harvest SAP HANA by specifying two data sources in the lineage harvester configuration file. In one data source, specify the `hana` dialect, and in the other, specify the `hana-cviews` or `hana-cviews-v2` dialect. `hive`, for a HiveQL data source. `greenplum`, for a Greenplum data source. `mssql`, for a Microsoft SQL Server data source. `mysql`, for a MySQL data source. `netezza`, for a Netezza data source. `oracle`, for an Oracle data source. `postgres`, for a PostgreSQL data source. `redshift`, for an Amazon Redshift data source. `snowflake`, for a Snowflake data source. `spark`, for a Spark SQL data source. `sybase`, for a Sybase data source. `teradata`, for a Teradata data source.
startTimestamp	The timestamp of tasks in Matillion. You can use this parameter to limit the amount of metadata that the lineage harvester scans. Specify this property with a UNIX timestamp in milliseconds. If this property remains empty or is deleted from the configuration file, all accessible tasks are scanned. Matillion provides seven days of history by default and automatically removes entries older than seven days.
httpTimeout	Sets the HTTP timeout duration in seconds. You can enter a value in the range of 0 to 3600. The default value is 15.
collibraSystemName (Deprecated)	This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the `useCollibraSystemName` property in the source ID file.
auth	This section contains the authentication details for signing in to Matillion.
type	The authentication method you want to use to sign in to Matillion. The value must be either: `Basic`, for username and password authentication. `Token`, for token-based authentication. Important These values are case-sensitive.
username	The username that you use to sign in to Matillion. Important This property is only required if you are using the username and password authentication method. If you are using token-based authentication, do not include this property.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
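
The `startTimestamp` property in the table above expects a UNIX timestamp in milliseconds. The following Python sketch is not part of the lineage harvester. It shows one way to compute such a value for a point seven days before a reference date, which matches the default Matillion history window. The function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# UNIX epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to a UNIX timestamp in milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    # Integer division by a 1 ms timedelta avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def start_timestamp_days_ago(days: int, now: datetime) -> int:
    """Return the millisecond timestamp for `days` days before `now`."""
    return to_unix_millis(now - timedelta(days=days))


# Fixed reference date so that the result is reproducible.
reference = datetime(2024, 1, 8, tzinfo=timezone.utc)
print(start_timestamp_days_ago(7, reference))  # 1704067200000
```

In practice you would pass `datetime.now(timezone.utc)` as the reference date and copy the printed number into the `startTimestamp` property.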

Save the configuration file.
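
A quick check of the Matillion values described in the table above can catch mistakes before the lineage harvester runs. The following Python sketch is only an illustration. It assumes that the `type` and `username` properties are nested in an `auth` object inside the source section and that `httpTimeout` is an integer; verify both assumptions against your own configuration file. The function name `check_matillion_source` is hypothetical.

```python
def check_matillion_source(source: dict) -> list:
    """Return a list of problems found in one Matillion source section."""
    problems = []
    auth = source.get("auth", {})
    auth_type = auth.get("type")

    # The values are case-sensitive: only "Basic" and "Token" are allowed.
    if auth_type not in ("Basic", "Token"):
        problems.append(f"auth.type must be 'Basic' or 'Token', got {auth_type!r}")

    # The username is only used for username and password authentication.
    if auth_type == "Token" and "username" in auth:
        problems.append("auth.username must not be included with Token authentication")
    if auth_type == "Basic" and not auth.get("username"):
        problems.append("auth.username is required with Basic authentication")

    # httpTimeout is in seconds, from 0 to 3600, with a default of 15.
    timeout = source.get("httpTimeout", 15)
    if isinstance(timeout, bool) or not isinstance(timeout, int) or not 0 <= timeout <= 3600:
        problems.append(f"httpTimeout must be an integer from 0 to 3600, got {timeout!r}")

    return problems


example = {"auth": {"type": "token", "username": "harvester"}, "httpTimeout": 4000}
for problem in check_matillion_source(example):
    print(problem)
```

The example reports two problems: the lowercase `token` value and the out-of-range timeout.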

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection information between the lineage harvester and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This applies only for Collibra Platform for Government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This applies only for Collibra Platform for Government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This applies only for Collibra Platform for Government customers.
catalog	This section contains information that is necessary to connect to Data Catalog.
url	The URL of your Collibra Platform environment. Note You can only enter the public URL of your Collibra DGC environment. Other URLs will not be accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog during automatic stitching. This is useful when you have multiple databases with the same name. By default, the `useCollibraSystemName` property is set to `false`. If you want to use it, set it to `true`. Important If you set this property to `true`, the lineage harvester reads the value of the `collibraSystemName` property in your MicroStrategy <source ID> configuration file. If you set the `useCollibraSystemName` property to `false`, the lineage harvester ignores the `collibraSystemName` property in the <source-ID> configuration file.
sources	This section contains the MicroStrategy connection properties.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_microstrategy`. Warning In the `sources` section of your lineage harvester configuration file, you can only specify one `id` property per MicroStrategy Intelligence Server. If you have multiple `id` properties for a single MicroStrategy Intelligence Server, ingestion will fail. If you have multiple `id` properties in the configuration file, it means you intend to ingest from multiple unique MicroStrategy Intelligence Servers. Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.
type	The kind of data source. In this case, the value has to be `MSTR_V2`.
url	The URL of your MicroStrategy account.
username	The username that you use to sign in to MicroStrategy.
microStrategyLibraryUrl	If you are using a custom URL to connect to the MicroStrategy Library Server, use this property to specify the custom library URL. Important You only need to specify the URL if both of the following are true: You are connecting to a proxy server. You are not using the default, hardcoded URL to the MicroStrategy Library Server. Example If the URL to your MicroStrategy Library is https://collibra.microstrategy.com/MicroStrategyLibrary/api, you don't need to use this property, as that is the default, hardcoded URL. However, if the URL is something like https://collibra.microstrategy.com/MicroStrategyLibraryProd/api, then use this property and configure it as follows: `"microStrategyLibraryUrl": "MicroStrategyLibraryProd"`
maxParallelRequests	This optional property allows you to specify the internal sizing, meaning the number of tasks that can be executed at the same time. The default value is "1", which means that HTTP requests are run one at a time instead of in parallel. A value of "5", for example, means that as many as 5 HTTP requests can take place in parallel. A lower value reduces the chances of experiencing HTTP 401 Unauthorized errors.
requestTimeoutMs	This optional property allows you to specify the maximum time, in milliseconds (ms), that the MicroStrategy Intelligence Server will wait for a request from the lineage harvester, before closing the connection. Tip A "connection timeout" refers to the amount of time that the lineage harvester will wait for a response from MicroStrategy. A "request timeout" is the converse of a connection timeout. The default value is "30000", or 30 seconds. A higher value reduces the chances of experiencing HTTP 408 Request Timeout errors.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
appUrlSuffix	This optional property ensures that the correct URL to data objects in MicroStrategy is included on the asset pages of corresponding MicroStrategy assets. The required value depends on which platform you run MicroStrategy: For J2EE, use: `"appUrlSuffix": "MicroStrategy/servlet/mstrWeb"` For .NET, use: `"appUrlSuffix": "MicroStrategy/asp/Main.aspx"`
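
The `microStrategyLibraryUrl` row above explains that the property is only needed when the Library URL differs from the default. The following Python sketch illustrates that decision. It assumes that the property value is the first path segment of the Library API URL, as in the `MicroStrategyLibraryProd` example, and it does not check whether you are connecting to a proxy server. The function name `library_url_setting` is hypothetical.

```python
from urllib.parse import urlparse

# Path segment of the default, hardcoded MicroStrategy Library URL.
DEFAULT_LIBRARY_SEGMENT = "MicroStrategyLibrary"


def library_url_setting(library_api_url: str) -> dict:
    """Return the property to add to the source section, or an empty dict."""
    segments = [part for part in urlparse(library_api_url).path.split("/") if part]
    if not segments:
        raise ValueError("the URL has no path segment to inspect")
    first_segment = segments[0]
    if first_segment == DEFAULT_LIBRARY_SEGMENT:
        # The default URL is in use, so the property is not needed.
        return {}
    return {"microStrategyLibraryUrl": first_segment}


print(library_url_setting("https://collibra.microstrategy.com/MicroStrategyLibrary/api"))
print(library_url_setting("https://collibra.microstrategy.com/MicroStrategyLibraryProd/api"))
```

The first call returns an empty dictionary and the second returns the `microStrategyLibraryUrl` entry with the value `MicroStrategyLibraryProd`.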

Save the configuration file.
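
Because each `id` must be unique and `dependentSourceIds` refers to other source IDs, a small consistency check on the saved file can be useful. The following Python sketch assumes that the configuration is loaded as a dictionary with a top-level `sources` list. It also treats a dependency on an ID that is not defined in the same file as worth reporting, which is an assumption and may not apply if you spread data sources over several configuration files. The function name `check_source_ids` is hypothetical.

```python
from collections import Counter


def check_source_ids(config: dict) -> list:
    """Report duplicate source IDs and dependencies that cannot be resolved."""
    sources = config.get("sources", [])
    ids = [source.get("id") for source in sources]

    # Every id may occur only once in the sources section.
    problems = [
        f"duplicate source id: {source_id}"
        for source_id, count in Counter(ids).items()
        if count > 1
    ]

    known = set(ids)
    for source in sources:
        for dependency in source.get("dependentSourceIds", []):
            if dependency == source.get("id"):
                problems.append(f"{dependency} depends on itself")
            elif dependency not in known:
                problems.append(
                    f"{source.get('id')} depends on {dependency}, "
                    "which is not defined in this file"
                )
    return problems


config = {
    "sources": [
        {"id": "my_microstrategy", "type": "MSTR_V2"},
        {"id": "my_microstrategy", "type": "MSTR_V2"},
    ]
}
print(check_source_ids(config))  # ['duplicate source id: my_microstrategy']
```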

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the configuration file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name. Note For SQL data sources, if this property is: `false`, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` property is used as the default system or server name.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder". Note You can add multiple data sources to the same configuration file.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.
type	The kind of data source. In this case, the value has to be `SqlDirectory`.
path	The full path to the folder where you added SQL files, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indication of the files you want to harvest: `false` (default): Only harvest the files directly under the folder in the SQL directory path. `true`: Harvest all files under the folder in the SQL directory path and its subdirectories.
dialect	The dialect of the database. For example, `oracle`. The value you specify for this property has to match the dialect of the SQL files in the directory.
database	The name of your database, which is the name of your Database asset. Note You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive. The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the `database` and `schema` properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the `database` and `schema` properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage.
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
databaseLinkMapping	If you are using DBLinks, this optional property allows you to configure, per data source, the database and schema to which each DBLink points. The configuration format is as follows: `"databaseLinkMapping": {"<dblink_name>": {"database":"<database>","schema":"<schema>"}, ...}` The schema provided here is only taken into consideration if a schema is not explicitly specified in the SQL query. As such, the schema specified here can be considered a default or fallback mapping. Tip If you’re using a DBLink to target another source, you need to share the database model between the targeted (independent) source and the dependent source. Use the `dependentSourceIds` property (in preview) to configure that dependency and share the database model. Example Let's say that you have two Oracle data sources, A and B. Source A You have the following database configuration: Database_A > Schema_A1 > Table_1 > Column_1, Column_2 Database_A > Schema_A2 > Table_2 > Column_x, Column_y Source B You have DBLinks with names `dblink.example.com`, `dblink2.example.com`, and `dblink3.example.com`, all of which point to Source A. Tip Configure the `dependentSourceIds` property (in preview), recognizing the fact that Source B is dependent on target (independent) Source A.
You want your DBLink aliases configured as follows: dblink.example.com → Database_A > Schema_A1 dblink2.example.com → Database_A > Schema_A1 dblink3.example.com → Database_A > Schema_A2 You specify this in the `databaseLinkMapping` property as follows: "databaseLinkMapping": { "dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}, "dblink2.example.com": {"database":"Database_A","schema":"Schema_A1"}, "dblink3.example.com": {"database":"Database_A","schema":"Schema_A2"} } With this done, queries on Source B of the following form are successfully analyzed by the Collibra Data Lineage service instance: select * from <table>@dblink.example.com select * from <table>@dblink2.example.com select * from <table>@dblink3.example.com Multiple DBLinks under the same database scope You can configure multiple DBLinks under the same database scope: "databaseLinkMapping": { "dbScope1": { "dblink.example.com": { "database": "Database_A", "schema": "Schema_A1" }, "dblink2.example.com": { "database": "Database_A", "schema": "Schema_A1" }, "dblink3.example.com": { "database": "Database_A", "schema": "Schema_A1" } } } Important If the same DBLink, for example `dblink.example.com`, exists in multiple databases, the formatting shown in the previous example still applies, but you need to enclose it in curly brackets and specify the relevant database, as follows: Basic formatting, as shown in the previous example: `"dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}` Formatting if the DBLink exists in multiple databases and you want to apply it only in a database named "dbScope1": `"dbScope1": {"dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}}` If a DBLink is referenced in multiple mappings, as shown in the following example, the first mapping is used.
"databaseLinkMapping": { "dbScope1": { "dblink.example.com": {"database":"DevDB_A","schema":"DevSch_A1"} }, "dblink.example.com": {"database":"Database_A","schema":"Schema_A1"} } In this case, occurrences of `dblink.example.com` in the database named "dbScope1" are mapped to: `"database":"DevDB_A","schema":"DevSch_A1"`
schema	The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset. Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
verbose	Indicates whether you want to enable verbose logging. By default, this is set to `True`. If you don't want to use verbose logging, set it to `False`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
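
The `databaseLinkMapping` rules in the table above can be easier to follow as code. The following Python sketch is one reading of those rules, not the actual implementation in the Collibra Data Lineage service: a mapping scoped to the current database is tried first, then the unscoped mapping, and the mapped schema is only used when the query does not name a schema. The function names and the `current_database` and `schema_in_query` parameters are hypothetical.

```python
def _is_target(value) -> bool:
    """A DBLink target is an object with a "database" string."""
    return isinstance(value, dict) and isinstance(value.get("database"), str)


def resolve_dblink(mapping: dict, dblink: str, current_database=None, schema_in_query=None):
    """Return (database, schema) for a DBLink, or None if it is not mapped."""
    target = None
    scoped = mapping.get(current_database) if current_database else None
    if isinstance(scoped, dict) and _is_target(scoped.get(dblink)):
        # A mapping scoped to the current database is used first.
        target = scoped[dblink]
    elif _is_target(mapping.get(dblink)):
        target = mapping[dblink]
    if target is None:
        return None
    # The mapped schema is only a fallback for queries without a schema.
    return target["database"], schema_in_query or target.get("schema")


mapping = {
    "dbScope1": {"dblink.example.com": {"database": "DevDB_A", "schema": "DevSch_A1"}},
    "dblink.example.com": {"database": "Database_A", "schema": "Schema_A1"},
}
print(resolve_dblink(mapping, "dblink.example.com", current_database="dbScope1"))
# ('DevDB_A', 'DevSch_A1')
print(resolve_dblink(mapping, "dblink.example.com", current_database="otherDb"))
# ('Database_A', 'Schema_A1')
```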

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example: `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. This is useful if you have multiple databases with the same name.
sources	This configuration section contains the required information for an Oracle database.
id	The unique ID of your Oracle database. For example, `my_oracle_db`.
type	The kind of data source. In this case, the value has to be `DatabaseOracle`.
hostname	The name of your database host.
username	The username that you use to sign in to your Oracle database.
port	The port number.
sids	One or more System Identifiers (SIDs). An SID is a unique name for an Oracle database instance on a specific host. You can use this property in conjunction with the `databaseNames` property, to preserve stitching. Important You must specify either one or more SIDs via this property, or one or more service names via the `serviceNames` property. You cannot include both properties in the configuration file. Show me examples of how to configure the sids property, with and without the databaseNames property Example 1: You include the `sids` property, but not the `databaseNames` property: { "id": "oracle1", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "sids": ["sid1", "sid2"] } Result: The database names in the technical lineage will be "sid1" and "sid2". If these don't match with your Database assets in Collibra, then stitching won't work. Example 2: You include the `sids` property and the `databaseNames` property: { "id": "oracle2", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "sids": ["sid1", "sid2"], "databaseNames": ["db1", "db2"] } Result: The SID "sid1" corresponds to the Database asset name "db1" in Collibra, therefore stitching is preserved. The same is true for SID "sid2" and Database asset name "db2".
serviceNames	One or more service names. A service name is the TNS alias that you give when you remotely connect to your database. You can use this property in conjunction with the `databaseNames` property, to preserve stitching. Important You must specify either one or more service names via this property, or one or more SIDs via the `sids` property. You cannot include both properties in the configuration file. Show me examples of how to configure the serviceNames property, with and without the databaseNames property Example 1: You include the `serviceNames` property, but not the `databaseNames` property: { "id": "oracle3", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "serviceNames": ["sn1", "sn2"] } Result: The database names in the technical lineage will be "sn1" and "sn2". If these don't match with your Database assets in Collibra, then stitching won't work. Example 2: You include the `serviceNames` property and the `databaseNames` property: { "id": "oracle4", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "serviceNames": ["sn1", "sn2"], "databaseNames": ["db1", "db2"] } Result: The service name "sn1" corresponds to the Database asset name "db1" in Collibra, therefore stitching is preserved. The same is true for service name "sn2" and Database asset name "db2".
databaseNames	The names of one or more Oracle databases. You can use this optional property in conjunction with the `sids` or `serviceNames` property, to preserve stitching. The value you specify has to match your Database asset (or assets) in Collibra. Enter the Oracle database names between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, `["MyFirstDatabase", "MySecondDatabase"]`. If you use this property, the database names that you specify have to correlate with the databases that you specify in the `sids` or `serviceNames` property. If you don't use this property, the database name in the technical lineage will be the value that you put for the `sids` or `serviceNames` property. Tip For examples of how to configure this property, see the `sids` or `serviceNames` property descriptions and examples.
jdbcUrl	Optional property to override the default JDBC URL used to connect to the database. Use this when you need to use connection properties. Example: `"jdbcUrl": "jdbc:oracle:thin:@db.example.com:1521/orclpdb1"`
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog. If the `useCollibraSystemName` property is: `false` (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` property is used as the default system or server name.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
databaseLinkMapping	If you are using DBLinks, this optional property allows you to configure, per data source, the database and schema to which DBLink points. The configuration format is as follows: `"databaseLinkMapping": {"<dblink_name>": {"database":"<database>","schema":"<schema>"}, ...}` The schema provided here is only taken into consideration if a schema is not explicitly specified in the SQL query. As such, the schema specified here can be considered a default or fallback mapping. Tip If you’re using a DBLink to target another source, you need to share the databasae model between the targeted (independent) source and the dependent source. Use the `dependentSourceIds (in preview)` property to configure that dependency and share the database model. Show examples Let's say that you have two Oracle data sources, A and B. Source A You have the following database configuration: Database_A > Schema_A1 > Table_1 > Column_1, Column_2 Database_A > Schema_A2 > Table_2 > Column_x, Column_y Source B You have DBLinks with names `dblink.example.com`, `dblink2.example.com`, and `dblink3.example.com`, all of which point to Source A. Tip Configure the `dependentSourceIds (in preview)` property, recognizing the fact that Source B is dependent on target (independent) Source A. 
You want your DBLink aliases configured as follows: dblink.example.com → Database_A > Schema_A1 dblink2.example.com → Database_A > Schema_A1 dblink3.example.com → Database_A > Schema_A2 You specify this in the `databaseLinkMapping` property as follows: "databaseLinkMapping": { "dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}, "dblink2.example.com": {"database":"Database_A","schema":"Schema_A1"}, "dblink3.example.com": {"database":"Database_A","schema":"Schema_A2"} } With this done, the following example queries on Source B are successfully analyzed by the Collibra Data Lineage service instance: select * from [email protected] select * from [email protected] select * from [email protected] Multiple dblinks under the same database scope You can configure multiple dblinks under the same database scope: "databaseLinkMapping": { "dbScope1": { "dblink.example.com": { "database": "Database_A", "schema": "Schema_A1" }, "dblink2.example.com": { "database": "Database_A", "schema": "Schema_A1" }, "dblink3.example.com": { "database": "Database_A", "schema": "Schema_A1" } } } Important If the same DBLink, for example `dblink.example.com`, exists in multiple databases, the formatting shown in the previous example still applies, but you need to enclose it in curly brackets and specify the relevant database, as follows: Basic formatting, as shown in the previous example: `"dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}` Formatting if the DBLink exists in multiple databases and you want to apply it only in a database named "dbScope1": `"dbScope1": {"dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}}` If a DBLink is referenced in multiple mappings, as shown in the following example, the first mapping is used. 
"databaseLinkMapping": { "dbScope1": { "dblink.example.com": {"database":"DevDB_A","schema":"DevSch_A1"} }, "dblink.example.com": {"database":"Database_A","schema":"Schema_A1"} } In this case, occurrences of `dblink.example.com` in the database named "dbScope1" are mapped to: `"database":"DevDB_A","schema":"DevSch_A1"`
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.

Save the configuration file.
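
As a rough sketch of how the DBLink mapping and the source dependency from the Source A and Source B example fit together, the fragment below shows only the relevant properties of the two source entries. The IDs `source_a` and `source_b` are hypothetical, the placement inside a `sources` array is an assumption, and every other property that these sources require is left out, so this is not a complete configuration.

```json
{
  "sources": [
    {
      "id": "source_a"
    },
    {
      "id": "source_b",
      "dependentSourceIds": ["source_a"],
      "databaseLinkMapping": {
        "dblink.example.com": {"database": "Database_A", "schema": "Schema_A1"},
        "dblink2.example.com": {"database": "Database_A", "schema": "Schema_A1"},
        "dblink3.example.com": {"database": "Database_A", "schema": "Schema_A2"}
      }
    }
  ]
}
```

Because comments are not supported in the configuration file, the fragment contains none; keep any notes about the mapping outside the file.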

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the necessary connection information.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This applies only to Collibra Platform for Government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This applies only to Collibra Platform for Government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This applies only to Collibra Platform for Government customers.
catalog	This section contains information that is necessary to connect to Data Catalog.
url	The URL of your Collibra environment. Note You can only enter the public URL of your Collibra DGC environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether you want to use the system or server name of a data source to match to the System asset in Data Catalog during automatic stitching. This is useful when you have multiple databases with the same name. By default, the `useCollibraSystemName` property is set to `false`. If you want to use it, set it to `true`. Important If you set this property to `true`, the lineage harvester reads the value of the `collibraSystemName` property in your Power BI <source ID> configuration file. If you set the `useCollibraSystemName` property to `false`, the lineage harvester ignores the `collibraSystemName` property in the Power BI <source ID> configuration file.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. Note You can add multiple data sources to the same configuration file, but you can't have multiple `sources` sections that refer to the same tenant.
scope	Optional property that is intended only for customers with a different scope, such as Chinese tenants. Example `"scope": "https://analysis.chinacloudapi.cn/powerbi/api/.default"` Important If you are a US government or national cloud Power BI customer, you must include and specify values for both this property and the `apiUrl` property. For complete information, consult Microsoft's documentation on Power BI for US government customers.
apiUrl	The API URL of your Power BI service. The default value is `https://api.powerbi.com`. Important This property is only relevant for US government or national cloud Power BI customers, in which case you must include and specify values for both this property and the `scope` property. For complete information, consult Microsoft's documentation on Power BI for US government customers.
type	The kind of data source. In this case, the value has to be PowerBI.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_power-bi`. Warning In the `sources` section of your lineage harvester configuration file, you can only specify one `id` property per Power BI service. If you have multiple `id` properties for a single Power BI service, ingestion will fail. If you have multiple `id` properties in the configuration file, it means you intend to ingest from multiple unique Power BI services. Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.
tenantDomain	The Power BI tenant domain is the domain associated with the Microsoft Azure tenant. It is either a default domain or a custom domain. You can specify this property with one of the following: The appropriate part of the URL, for example collibrapowerbi.onmicrosoft.com. Do not include the http:// part of the URL. The tenant ID, for example eb----1bd****4663. Tip Usually, you can find a list of Power BI tenant or server domains in your Azure Active Directory or in the upper-right menu.
loginFlow	This section describes the authentication information for accessing your Power BI metadata. The lineage harvester supports two authentication methods: service principal, and username and password. For complete information on your authentication options, see Authentication.
type	This depends on the authentication method you use. Service principal: The value should be `ServicePrincipal`. Username and password: The value should be `ResourceOwnerPasswordCredentials`.
applicationId	The Application (client) ID of your Microsoft Azure application.
username	The email address of your Azure Active Directory user. Tip This property only applies if you are using the username and password authentication method.
domainId	The reference ID of the domain in Collibra in which you want to ingest Power BI metadata.
useHttp1	Optional property to use HTTP/1.1 streams, in case file-size limitations are resulting in timeout errors when using the default HTTP/2 streams.
daxParserEnabled	Note This feature is not available on Collibra Platform for Government. Optional property for enabling DAX analysis via Collibra AI. This feature: Allows you to create column-level lineage that includes your calculated columns and measures in Power BI. Enables stitching between calculated columns in the technical lineage and the corresponding Power BI Column assets in Data Catalog. The default value is `false`. To enable DAX analysis, set the value to `"daxParserEnabled": true`. For complete information on DAX analysis, go to DAX analysis via Collibra AI.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.

Save the configuration file.
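
For orientation, a minimal sketch of a configuration file with one Power BI source that uses service principal authentication could look like the following. The nesting of `catalog` and `useCollibraSystemName` under `general`, the `sources` array, and the grouping of `type` and `applicationId` under `loginFlow` are inferred from the order of the table rows and are assumptions. All values in angle brackets are placeholders. The `techlin`, `scope`, and `apiUrl` properties are left out because they apply only to specific customers.

```json
{
  "general": {
    "catalog": {
      "url": "https://<your-environment>.collibra.com",
      "username": "<collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "type": "PowerBI",
      "id": "my_power-bi",
      "tenantDomain": "collibrapowerbi.onmicrosoft.com",
      "loginFlow": {
        "type": "ServicePrincipal",
        "applicationId": "<application-client-id>"
      },
      "domainId": "<domain-reference-id>",
      "daxParserEnabled": false,
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

With the username and password method, the `loginFlow` section would instead use `ResourceOwnerPasswordCredentials` as its `type` and add the `username` property.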

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the <source ID> file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the <source ID> file. Note Specify this property with the value of `true` only when you have multiple databases with the same name.
sources	This configuration section contains the required information to connect to SQL Server Integration Services (SSIS). Note Make sure that you have prepared a local folder with the SSIS files for which you want to create a technical lineage.
collibraSystemName (Deprecated)	This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the `useCollibraSystemName` property in the <source ID> file.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_ssis`.
type	The kind of data source. In this case, the value has to be `ExternalDirectory`.
dirType	The type of external directory. The value has to be `ssis`.
path	The full path to the folder where you stored the data source, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indicates whether you want to use recursive queries. By default, this is set to `False`. If you want to use recursive queries, set it to `True`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.

Save the configuration file.
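
As an illustration, a single SSIS entry in the `sources` section might look like the sketch below. It assumes that the file is parsed as JSON, which is why the backslashes in the Windows path are escaped and the Boolean is written in lowercase; the placement of the entry inside a `sources` array is also an assumption.

```json
{
  "id": "my_ssis",
  "type": "ExternalDirectory",
  "dirType": "ssis",
  "path": "C:\\path\\to\\config\\dir",
  "mask": "*",
  "recursive": false,
  "deleteRawMetadataAfterProcessing": false
}
```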

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. Specify one of the following values: `false` The lineage harvester ignores all system or server names that you specify on the `collibraSystemName` properties in the configuration file. This is the default value. `true` The lineage harvester reads the system and server names that you specify on the `collibraSystemName` properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name. Note For SQL data sources, if this property is: `false`, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name. `true`, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the `collibraSystemName` property is used as the default system or server name.
sources	This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder". Note You can add multiple data sources to the same configuration file.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.
type	The kind of data source. In this case, the value has to be `SqlDirectory`.
path	The full path to the folder where you added SQL files, for example, `C:\path\to\config\dir`.
mask	The pattern of the file names in the directory. By default, this is `*`.
recursive	Indicates which files you want to harvest: `false` (default): Only harvest the files directly under the folder in the SQL directory path. `true`: Harvest all files under the folder in the SQL directory path and its subdirectories.
dialect	The dialect of the database, for example, `snowflake`. The value that you specify for this property has to match the dialect of the SQL files in your directory.
database	The name of your database, which is the name of your Database asset. Note You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive. The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the `database` and `schema` properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the `database` and `schema` properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage.
collibraSystemName	The name of the data source's system or server. This is also the name of your System asset in Data Catalog. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
schema	The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset. Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
verbose	Indicates whether you want to enable verbose logging. By default, this is set to `True`. If you don't want to use verbose logging, set it to `False`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance for processing. You can use this optional property to specify whether the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
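
The following sketch shows how an SQL directory entry might combine the folder-related properties with a dependency on another source. The IDs `my_sql_dir` and `my_ddl_source` are hypothetical, the values for `database`, `schema`, and `collibraSystemName` are taken from the `databaseSystemMapping` example in the table, and the lowercase Booleans assume that the file is parsed as JSON.

```json
{
  "id": "my_sql_dir",
  "type": "SqlDirectory",
  "path": "/tmp/sqldir",
  "mask": "*",
  "recursive": true,
  "dialect": "snowflake",
  "database": "DB",
  "schema": "SCH",
  "collibraSystemName": "SystemA",
  "verbose": true,
  "dependentSourceIds": ["my_ddl_source"]
}
```

In this sketch, `recursive` is set to `true` so that files in subdirectories of the path are harvested as well, and `database` and `schema` act only as fallbacks for SQL statements that do not name a database or schema themselves.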

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example `"url": "https://techlin-gov.collibra.com"` Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. This is useful if you have multiple databases with the same name.
sources	This section contains the Snowflake connection properties. If you want to create the technical lineage for multiple data sources, create a `sources` section for each data source.
id	The unique ID that identifies the data source on a Collibra Data Lineage service instance, for example, `my_snowflake_2`.
type	The type of data source. The value must be `DatabaseSnowflake`.
mode	The Snowflake ingestion methods that Collibra Data Lineage uses to ingest metadata from Snowflake data sources. Specify one of the following values: `SQL` The SQL Snowflake ingestion mode. Collibra Data Lineage creates a column-level technical lineage based on SQL statements. This is the default value. `SQL-API` The SQL-API Snowflake ingestion mode. Collibra Data Lineage creates a column-level technical lineage based on Snowflake schemas and the access history. For more information, go to Technical lineage for Snowflake ingestion methods.
collibraSystemName	Use this property with the `useCollibraSystemName` property in the lineage harvester configuration file to override the default Collibra System asset name for this data source. Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
databaseSystemMapping	This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the `collibraSystemName` property. More information and example configuration Let’s say you use the `collibraSystemName` property to specify the system name SystemA. This system name will be associated with all of the databases mentioned in your SQL statements in your SQL files. Now let’s say that you have a SQL statement that selects data from a database DB2 (which is in SystemB) and inserts the data in database DB3 (which is part of SystemC). Both of these databases will be associated with SystemA, and both will appear under SystemA in the Browse tab pane. Therefore, stitching fails and duplicate databases might appear. To resolve this issue and obtain stitching, you can use this property to map DB2 to its rightful SystemB, and DB3 to its rightful SystemC. { "id": "sql_dir", "type": "SqlDirectory", "path": "/tmp/sqldir", "dialect": "bigquery", "collibraSystemName": "SystemA", "database": "DB", "schema": "SCH", "databaseSystemMapping": { "DB2": "SystemB", "DB3": "SystemC" } }
auth	This section indicates the authentication details to connect to the Snowflake database. Note The `username` and `auth` properties are mutually exclusive.
type	The authentication method. Specify one of the following values. The values are case-sensitive. `Basic` The username and password authentication method. Specify the `auth.username` property if you use this authentication method. `KeyPair` The key pair authentication method. Specify the `auth.username`, `auth.pathToPrivateKey`, and `auth.usePassword` properties if you use this authentication method.
username	The user name that you use to connect to the Snowflake database. This property is required for both the username and password authentication method and the key pair authentication method.
pathToPrivateKey	The path to your private key file. This property is required if you use the key pair authentication method. Ensure that the private key matches the public key; otherwise, an error occurs indicating that the JWT token is invalid. For more information about the error, go to Snowflake JDBC driver error at login: net.snowflake.client.jdbc.SnowflakeSQLException: JWT token is invalid in the Collibra Support Portal.
usePassword	Indicates whether the private key file requires a password. This property is required if you use the key pair authentication method. Specify one of the following values: `true` The password is required. `false` The password is not required. This is the default value.
username	The username that you use to sign in to your Snowflake data source. Note This property is deprecated. Use the `auth` property instead. This property and the `auth` property are mutually exclusive.
hostname	The URL that you use to access the Snowflake web console. When you enter the URL, do not include `https://` or the trailing slash (/). For example, specify `<accountName>.snowflakecomputing.com`.
databaseNames	An array of database names. Ensure that the database names you specify match the Database asset names that you created when you prepared the physical data layer in Data Catalog. Enter the database names of your data source between double quotes ("") and put everything between square brackets ([]). If you want to include more than one database, separate them by a comma, for example, `["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"]`.
extraDatabaseDefinitions	Important This field is only valid if you're using the SQL-API ingestion method. An array of database names. Collibra Data Lineage collects metadata from the specified databases, but excludes these databases from the technical lineage that is created. This property is useful for stitching across databases. You can specify cross-referenced databases to ensure correct lineage across all databases that Collibra Data Lineage processes to create the technical lineage. This property is optional. To specify this property, enter the database names between double quotes ("") and put everything between square brackets ([]). If you want to include more than one database, separate them by a comma, for example, `["MyFirstSnowflakeExternalDatabase", "MySecondSnowflakeExternalDatabase"]`.
schemaNames	An array of schema names of your data sources. This property takes effect only when you use the SQL-API Snowflake ingestion mode. You can use this property as a filter to include lineage for objects only in the specified schemas. Ensure that the schema names you specify match the Schema asset names that you created when you registered the data source in Data Catalog. Enter the schema names between double quotes ("") and put everything between square brackets ([]). If you want to include more than one schema, separate them by a comma, for example, `["MyFirstSnowflakeSchema", "MySecondSnowflakeSchema"]`.
warehouse	The name of your virtual warehouse. This property is optional.
days	The number of days of the user access history that Collibra Data Lineage collects and processes. For example, if you set the value to 20, Collibra Data Lineage collects the last 20 days of user access history. You can use this property to limit data retrieval from the ACCESS_HISTORY table. This property is optional and takes effect only when you use the SQL-API Snowflake ingestion mode. Specify a value in the range of 1 - 366. If you do not enter a value, all user access history is collected by default. Note A higher value of this property results in Collibra Data Lineage retrieving more data from Snowflake. This might cause a `413 Payload Too Large` error when Collibra Data Lineage analyzes the metadata to create the technical lineage.
customConnectionProperties	An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file. Example If you get an OCSP scan error, you can turn OCSP checking off by using the following value: `insecureMode=true`.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
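
As a rough illustration of the value rules in the table above, the following Python sketch checks the optional Snowflake properties of one source entry before you save the file. The helper name `check_snowflake_options` is hypothetical and is not part of the lineage harvester. The sketch also assumes that `days` is written as a JSON number, which the table does not state explicitly.

```python
def check_snowflake_options(source: dict) -> list[str]:
    """Return a list of problems found in the optional Snowflake properties.

    Hypothetical helper for illustration only; the lineage harvester
    performs its own validation when it reads the configuration file.
    """
    problems = []

    # These properties are described as arrays of double-quoted names.
    for key in ("extraDatabaseDefinitions", "schemaNames", "dependentSourceIds"):
        value = source.get(key)
        if value is None:
            continue  # all three are optional
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{key} must be an array of strings")

    # days is optional; when present it must be in the range 1 - 366.
    # Assumption: the value is a JSON number, not a quoted string.
    days = source.get("days")
    if days is not None:
        is_whole_number = isinstance(days, int) and not isinstance(days, bool)
        if not is_whole_number or not 1 <= days <= 366:
            problems.append("days must be a whole number from 1 to 366")

    # deleteRawMetadataAfterProcessing is true or false (default false).
    flag = source.get("deleteRawMetadataAfterProcessing")
    if flag is not None and not isinstance(flag, bool):
        problems.append("deleteRawMetadataAfterProcessing must be true or false")

    return problems


example = {"days": 20, "schemaNames": ["MyFirstSnowflakeSchema"]}
print(check_snowflake_options(example))  # prints []
```

An empty list means that none of the checked rules were violated; it does not guarantee that the harvester accepts the file.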

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties Description

general

This section describes the connection information between the lineage harvester and Data Catalog.

techlin

This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

Warning This applies only for Collibra Platform for Government customers.

url

The URL of the Collibra Data Lineage service instance. Example "url": "https://techlin-gov.collibra.com"

Warning This applies only for Collibra Platform for Government customers.

userKey

The unique API key to connect to the Collibra Data Lineage service instance.

A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team.

Warning This applies only for Collibra Platform for Government customers.

catalog

This section contains information that is necessary to connect to Data Catalog.

url

The URL of your Collibra Platform environment.

Note You can only enter the public URL of your Collibra Platform environment. Other URLs will not be accepted.

username

The username that you use to sign in to Collibra.

useCollibraSystemName

Indication whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

Important

If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your SSRS-PBRS <source-ID> configuration file.
If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source-ID> configuration file.

sources

This section contains the SSRS connection properties.

id

This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per SQL Server Reporting Services (SSRS) or Power BI Report Server (PBRS) instance. If you have multiple id properties for a single SSRS or PBRS instance, ingestion will fail. Multiple id properties in the configuration file mean that you intend to ingest from multiple unique SSRS or PBRS instances.

Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.

type

The kind of data source. In this case, the value has to be SSRS or PBIRS.

Note There is no difference between type SSRS or PBIRS.

url

The URL to the server's web portal. By default, the URL is http://<computer-name>/reports. For example, "http://1.23.45.678/PowerBIReports".

username

The username you use to sign in to the web portal.

Tip If you use NTLM authentication, your username also contains the NTLM domain name. For example, MyDomain\\username.

domainId

The unique ID of the domain in Collibra Platform in which you want to ingest the assets.

folderFilter

This property allows you to include only specific folders that contain reports or KPIs in the ingestion process.

Important This is a mandatory property and you must provide a value. If you want to ingest all folders, use *, for example: "folderFilter":["*"].

You can filter on multiple folders by:

Specifying folder names.
Specifying the full path to folders.
Using a wildcard.
Using a combination of these approaches. For example: ["folder1", "/database/folder2", "/folder3/*"]

Examples

Scenario	Configuration
Ingest all folders with the name Folder3, anywhere in the folder hierarchy.	`"folderFilter":["Folder3"]` Note Reports in child folders of Folder3 are not included in the ingestion. As such: Reports in `/Folder1/Folder2/Folder3` are included in the ingestion. Reports in `/Folder3/ChildFolder` are not included in the ingestion.
Ingest Folder1 and Folder2.	`"folderFilter":["Folder1", "Folder2"]`
Ingest two folders that are both named Folder1.	In this case, specify the full paths to the folders, for example: `"folderFilter":["/Database1/Folder1", "/Database2/Database3/Folder1"]`
Use a wildcard to ingest all child folders of Folder1.	`"folderFilter":["/Folder1/*"]` Note The reports in all child folders of Folder1 are ingested, but the reports in Folder1 itself are not ingested.
Ingest all reports from Folder1 and all of the reports in the child folders of Folder1.	`"folderFilter":["/Folder1/*", "/Folder1"]`
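
The matching rules in the scenarios above can be sketched in Python as follows. This is only an approximation for reasoning about your filter values, not the harvester's actual implementation. The function name `folder_matches` is hypothetical, and the sketch assumes that matching is case-sensitive and that `/Folder1/*` covers child folders at any depth, neither of which is stated in the table.

```python
def folder_matches(folder_path: str, filters: list[str]) -> bool:
    """Return True if a folder path would be selected by a folderFilter value.

    Hypothetical sketch of the documented scenarios:
      "*"           selects every folder
      "/A/*"        selects child folders of /A, but not /A itself
      "/A/B"        selects exactly that path
      "B"           selects any folder named B, but not its child folders
    """
    # The folder name is the last segment of the path.
    name = folder_path.rstrip("/").rsplit("/", 1)[-1]

    for pattern in filters:
        if pattern == "*":
            return True
        if pattern.endswith("/*"):
            # Assumption: the wildcard covers descendants at any depth.
            parent = pattern[:-2]
            if folder_path.startswith(parent + "/"):
                return True
        elif pattern.startswith("/"):
            if folder_path == pattern:
                return True
        elif name == pattern:
            return True
    return False


print(folder_matches("/Folder1/Folder2/Folder3", ["Folder3"]))  # prints True
print(folder_matches("/Folder3/ChildFolder", ["Folder3"]))      # prints False
print(folder_matches("/Folder1", ["/Folder1/*"]))               # prints False
```

The last call shows why the final scenario lists both `/Folder1/*` and `/Folder1`: the wildcard alone does not select the parent folder.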

Tip For more information about connecting to an SSRS or PBRS folder, see the Microsoft documentation.

dependentSourceIds

Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources.

If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

"dependentSourceIds": ["<source ID of Database1>"]

If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects:

An analyze error is raised, prompting you to provide the DDL file.
The only workaround is to consolidate your SQL statements and DDL file in a single data source.

For complete information, go to Sharing database models across data sources.

deleteRawMetadataAfterProcessing

The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

The default value is false.

If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties Description

general

This section describes the connection information between the lineage harvester and Data Catalog.

techlin

This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

Warning This applies only for Collibra Platform for Government customers.

url

The URL of the Collibra Data Lineage service instance. Example "url": "https://techlin-gov.collibra.com"

Warning This applies only for Collibra Platform for Government customers.

userKey

The unique API key to connect to the Collibra Data Lineage service instance.

A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team.

Warning This applies only for Collibra Platform for Government customers.

catalog

This section contains information that is necessary to connect to Data Catalog.

url

The URL of your Collibra Platform environment.

Note You can only enter the public URL of your Collibra Platform environment. Other URLs will not be accepted.

username

The username that you use to sign in to Collibra.

useCollibraSystemName

By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

Important

If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Tableau <source-ID> configuration file.
If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source-ID> configuration file.

Note If you set the useCollibraSystemName property to true, but you don't define the system name in the Tableau <source ID> configuration file, the system name in the technical lineage is DEFAULT.

type

The kind of data source. In this case, the value has to be Tableau.

sources

This section contains the Tableau connection properties.

id

This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_tableau.

Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per Tableau Server or Tableau Online account. If you have multiple id properties for a single Tableau Server or Tableau Online account, ingestion will fail. Multiple id properties in the configuration file mean that you intend to ingest from multiple unique Tableau Servers or Tableau Online accounts.

Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.

url

The link to the data in Tableau.

username

The username you use to sign in to the Tableau server.

Warning As of October 2022, Tableau is enforcing multi-factor authentication for Tableau Cloud Admin users. However, the lineage harvester (deprecated) doesn’t support multi-factor authentication. Therefore, Tableau Cloud users with an Admin role must use token-based authentication. This does not affect Tableau Server users or Tableau Cloud users with an Explorer role.

Important If you want to use token-based authentication, you need to replace username with tokenName. You must specify either username or tokenName; if both exist, then tokenName is used.

tokenName

The lineage harvester authentication token.

Note For token-based authentication, use this property in your lineage harvester configuration file, instead of the username property. If both properties are present, tokenName is used.

siteIds

The site IDs of the Tableau sites that you want to include in the ingestion process.

If you want to ingest the metadata in a Tableau site in a specific domain, specify the following properties:

This property.
The site_name: domain_id property in the filters section in the Tableau <source ID> configuration file.

Important The site ID is the URL of the site to which you want to sign in. When you manually sign in to Tableau Server or Tableau Online, the site ID is the value that appears after /site/ in the browser address bar. In the following example URLs, the site ID is MarketingTeam:

Tableau Server: http://MyServer/#/site/MarketingTeam/projects
Tableau Online: https://10ay.online.tableau.com/#/site/MarketingTeam/workbooks

On Tableau Server, however, the URL of the Default site does not specify the site. For example, the URL for a view named Profits, on a site named Sales, is http://localhost/#/site/sales/views/profits. The URL for this same view on the Default site is http://localhost/#/views/profits. The site name Sales does not figure in the URL. If you can't see the site ID, leave this property empty: "siteIds": [""]

Example If you want to ingest two Tableau sites "Site 1" and "Site 2", you can enter the following information in the siteIds property: ["site ID of Site 1", "site ID of Site 2"].
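
To illustrate the rule about where the site ID appears in the browser address, here is a small Python sketch that reads the value after /site/ from a Tableau URL. The function name `site_id_from_url` is hypothetical; the sketch relies only on the URL shapes shown in the examples above and returns an empty string for the Default site.

```python
from urllib.parse import urlsplit


def site_id_from_url(url: str) -> str:
    """Return the value after /site/ in a Tableau URL, or "" for the Default site.

    Hypothetical helper; it assumes the site appears in the part of the
    URL after the # character, as in the documented examples.
    """
    # For "http://MyServer/#/site/MarketingTeam/projects" the fragment
    # is "/site/MarketingTeam/projects".
    fragment = urlsplit(url).fragment
    parts = [part for part in fragment.split("/") if part]
    if len(parts) >= 2 and parts[0] == "site":
        return parts[1]
    return ""  # Default site: the URL does not name a site


urls = [
    "http://MyServer/#/site/MarketingTeam/projects",
    "http://localhost/#/views/profits",
]
site_ids = [site_id_from_url(url) for url in urls]
print(site_ids)  # prints ['MarketingTeam', '']
```

The resulting list has the same shape as the value of the siteIds property, including the empty string for the Default site.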

siteNames

The site names of the corresponding site IDs.

Important This property is:

Optional for Tableau Server
Mandatory for Tableau Online.

Warning If you have Tableau Server and you don't use this property, you must delete it from your configuration file. Don't leave the property in the configuration file without a value.

restOnly

Indication whether or not you would like to use both the Tableau REST API and Tableau Metadata API to harvest Tableau metadata.

false (default): The lineage harvester will use the REST API and Metadata API to harvest Tableau metadata.
true: The lineage harvester will only use the REST API to harvest Tableau metadata.

Note This property must be set to false, to:

Enable technical lineage and the automatic stitching of Column assets to Tableau Data Attribute assets.
Harvest owner information for Tableau projects, workbooks and data models.

domainId

The unique reference ID of the domain in Collibra Platform in which you want to ingest the Tableau assets. This property represents the default domain.

excludeImages

Optional property for excluding the downloading of images.

To exclude the downloading of images, set this property to true.

To indicate the projects that you want to ingest in different domains, specify the filters section in your Tableau <source ID> configuration file.

Note The maximum number of images that can be uploaded to Collibra per day is determined by the configuration of the file upload service, in Collibra Console. For complete details, see the Upload configuration settings in DGC service configuration: options.

concurrencyLevel

This optional property is intended to help if you are experiencing HTTP 401 Unauthorized errors caused by too many concurrent HTTP calls that use the same token. It lets you specify the internal sizing, meaning the number of tasks that can be executed at the same time.

The default value is "10", meaning as many as 10 HTTP requests can take place in parallel. Consider reducing the value if you are experiencing HTTP 401 Unauthorized errors. Setting the value to "1" effectively disables concurrency, so that HTTP requests run one after another instead of in parallel.

Example "concurrencyLevel": 5

dependentSourceIds

If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

"dependentSourceIds": ["<source ID of Database1>"]

If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects:

An analyze error is raised, prompting you to provide the DDL file.
The only workaround is to consolidate your SQL statements and DDL file in a single data source.

For complete information, go to Sharing database models across data sources.

deleteRawMetadataAfterProcessing

The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

The default value is false.

If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

paging

This option allows you to customize the Tableau API pagination settings.

The default values are sufficient in most cases; however, you can decrease them to help mitigate node limit errors, or increase them to speed up API calls.

If the integration fails because of timeout errors due to page sizing limits, Collibra Data Lineage automatically adjusts the limits and retries. For example, if failure occurs with worksheetsPageSize set to 100, the value is automatically reduced to 50 and another integration attempt is automatically started. If it fails again, the value is again halved. If integration is still unsuccessful with an adjusted value of 1, an error is thrown and no further attempts are started. If integration is eventually successful, the page size value is restored to its original value, in this example 100, for the next synchronization.
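
The retry behavior described above can be sketched in Python as follows. This is an illustration of the halving pattern only, not Collibra's implementation. The names `run_with_halving` and `attempt` are hypothetical, and the sketch assumes that odd values are halved with integer division, which the description does not specify.

```python
from typing import Callable


def run_with_halving(page_size: int, attempt: Callable[[int], bool]) -> list[int]:
    """Try an integration with a page size, halving it after each failure.

    Returns the list of page sizes that were tried. Raises RuntimeError
    when the attempt still fails at a page size of 1. The caller's
    configured page_size is never modified, so the next synchronization
    starts from the original value again.
    """
    tried = []
    size = page_size
    while True:
        tried.append(size)
        if attempt(size):
            return tried
        if size == 1:
            raise RuntimeError("integration failed even with a page size of 1")
        # Assumption: halving rounds down, and never goes below 1.
        size = max(1, size // 2)


# Simulated integration that only succeeds with 25 or fewer items per page.
print(run_with_halving(100, lambda size: size <= 25))  # prints [100, 50, 25]
```

In this simulation the configured value of 100 is tried first, then 50, then 25, which succeeds.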

Pagination settings and default values

"paging": {
	"databasesPageSize": 100,
	"tablesPageSize": 100,
	"tablesColumnsPageSize": 100,
	"tableColumnsPageSize": 1000,
	"datasourcesPageSize": 50,
	"datasourcesFieldsPageSize": 50,
	"datasourceFieldsPageSize": 100,
	"worksheetsPageSize": 100,
	"worksheetsFieldsPageSize": 100,
	"worksheetFieldsPageSize": 1000,
	"parametersPageSize": 1000,
	"usersPageSize": 100,
	"dashboardsPageSize": 100,
	"columnsLimit": 20,
	"fieldsLimit": 20
}

Settings per metadata type and descriptions

Metadata type	Setting and description
Dashboard	`dashboardsPageSize`: The number of dashboards per page.
Worksheet	`worksheetsPageSize`: The number of worksheets per page. `worksheetsFieldsPageSize`: The number of worksheet fields per page.
Database	`databasesPageSize`: The number of databases per page.
Table	`tablesPageSize`: The number of tables per page. `tablesColumnsPageSize`: The number of table columns per page.
Table columns	`tableColumnsPageSize`: The number of table columns per page.
Parameter	`parametersPageSize`: The number of parameters per page.
Users	`usersPageSize`: The number of users per page.
Data source	`datasourcesPageSize`: The number of data sources per page. `datasourcesFieldsPageSize`: The number of data source fields per page. `columnsLimit`: The number of data source field columns per page. `fieldsLimit`: The number of referenced data source fields per page.
Data source field	`datasourceFieldsPageSize`: The number of data source fields per page. `columnsLimit`: The number of data source field columns per page. `fieldsLimit`: The number of referenced data source fields per page.
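
If you only want to change one or two pagination settings, a small Python sketch such as the following can merge your changes with the default values listed above and reject misspelled setting names. The function name `paging_with_overrides` is hypothetical. Whether the harvester accepts a partial paging section is not stated here, so this sketch always emits the full set of settings.

```python
import json

# Default values as listed in the paging section above.
DEFAULT_PAGING = {
    "databasesPageSize": 100,
    "tablesPageSize": 100,
    "tablesColumnsPageSize": 100,
    "tableColumnsPageSize": 1000,
    "datasourcesPageSize": 50,
    "datasourcesFieldsPageSize": 50,
    "datasourceFieldsPageSize": 100,
    "worksheetsPageSize": 100,
    "worksheetsFieldsPageSize": 100,
    "worksheetFieldsPageSize": 1000,
    "parametersPageSize": 1000,
    "usersPageSize": 100,
    "dashboardsPageSize": 100,
    "columnsLimit": 20,
    "fieldsLimit": 20,
}


def paging_with_overrides(overrides: dict) -> dict:
    """Return the full paging section with the given overrides applied.

    Hypothetical helper: unknown setting names and values below 1 are
    rejected so that typos do not end up in the configuration file.
    """
    unknown = sorted(set(overrides) - set(DEFAULT_PAGING))
    if unknown:
        raise KeyError(f"unknown paging settings: {unknown}")
    for name, value in overrides.items():
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            raise ValueError(f"{name} must be a whole number of at least 1")
    return {**DEFAULT_PAGING, **overrides}


# Example: lower the worksheet page size to help mitigate node limit errors.
paging = paging_with_overrides({"worksheetsPageSize": 50})
print(json.dumps({"paging": paging}, indent=2))
```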

Save the configuration file.

Steps

Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

Properties	Description
general	This section describes the connection between Collibra Data Lineage and Data Catalog.
techlin	This section contains information that is necessary to connect to the Collibra Data Lineage service instance. Warning This section applies only to US government customers.
url	The URL of the Collibra Data Lineage service instance. Example "url": "https://techlin-gov.collibra.com" Warning This section applies only to US government customers.
userKey	The unique API key to connect to the Collibra Data Lineage service instance. A unique user key is needed for each Collibra environment. If you're not sure what your user key is, contact your Collibra Account Team. Warning This section applies only to US government customers.
catalog	This section contains information that is necessary to connect to Data Catalog. Note Versions of the lineage harvester older than 1.1.2 show `collibra` instead of `catalog`.
url	The URL of your Collibra environment. Note Enter the public URL of your Collibra environment. Other URLs are not accepted.
username	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name.
sources	This configuration section contains the required information for SQL files of a data source that were previously downloaded by the lineage harvester and are stored in the lineage harvester output folder.
type	The kind of data source. In this case, the value has to be `LoadedSource`.
id	This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, `my_loaded_snowflake_source`.
zipFile	The full path to the ZIP file that was created in the lineage harvester folder.
dependentSourceIds	Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. If Database2 is dependent on Database1, include the `dependentSourceIds` property and specify the Source ID of Database1: `"dependentSourceIds": ["<source ID of Database1>"]` If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows: `"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]` Important If a dependent data source contains lowercase column names, this feature will only work for the following dialects: Oracle, Snowflake, and Teradata. For all other dialects: An analyze error is raised, prompting you to provide the DDL file. The only workaround is to consolidate your SQL statements and DDL file in a single data source. For complete information, go to Sharing database models across data sources.
deleteRawMetadataAfterProcessing	The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing. You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed. The default value is `false`. If the property is set to `true`, the raw source metadata is deleted after processing. If set to `false`, it is stored in the Collibra infrastructure.
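
To show how the properties in this table might fit together, the following Python sketch assembles a configuration with a single `LoadedSource` entry and prints it as JSON. The nesting of `catalog` and `useCollibraSystemName` under `general` is inferred from the table and should be compared with the file that the installer or the configuration file generator produces. The function name and all placeholder values in angle brackets are hypothetical. The output contains no comments, because comments are not supported in the configuration file.

```python
import json


def build_loaded_source_config(catalog_url, username, source_id, zip_file,
                               dependent_source_ids=None,
                               delete_raw_metadata=False):
    """Assemble a configuration dictionary with one LoadedSource entry.

    Hypothetical helper; the section layout is an assumption based on
    the property table, not a confirmed schema.
    """
    source = {
        "type": "LoadedSource",
        "id": source_id,
        "zipFile": zip_file,
        "deleteRawMetadataAfterProcessing": delete_raw_metadata,
    }
    # dependentSourceIds is optional, so it is only added when provided.
    if dependent_source_ids:
        source["dependentSourceIds"] = list(dependent_source_ids)

    return {
        "general": {
            "catalog": {"url": catalog_url, "username": username},
            "useCollibraSystemName": False,
        },
        "sources": [source],
    }


config = build_loaded_source_config(
    catalog_url="https://<your Collibra environment>",
    username="<your Collibra username>",
    source_id="my_loaded_snowflake_source",
    zip_file="<full path to the ZIP file>",
)
print(json.dumps(config, indent=2))
```

Writing the file through a JSON serializer rather than by hand helps avoid stray commas and unbalanced brackets.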

Save the configuration file.

What's next

Run the lineage harvester.