Prepare the lineage harvester configuration file

Before you can visualize the technical lineage, you have to create a configuration file for the (meta)data sources that you want to process. The lineage harvester uses this configuration file to extract data from the (meta)data sources that you want to ingest or for which you want to create a technical lineage.

If you use multiple lineage harvesters on different servers, you can create a separate configuration file for the lineage harvester on each server and configure different data sources in each configuration file.

Note 
  • Technical lineage supports a limited list of (meta)data sources.
  • In all lineage harvester files, you must use UTF-8 or ISO-8859-1 characters, with the exception of SQL files, which can only be UTF-8 encoded.
  • Each data source has an ID property. The ID string must be unique and human readable. The ID can be anything and is only used to identify the batch of metadata that is processed on the Collibra Data Lineage service.
  • The lineage harvester connects to different Collibra Data Lineage service instances based on your geographical location and cloud provider. Make sure you have the correct system requirements before you run the lineage harvester. If your location or cloud provider changes, the lineage harvester rescans all your data sources.
  • Comments in the lineage harvester configuration file are not supported.
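
The notes about unique IDs and unsupported comments can be checked before you run the lineage harvester. The following is a minimal sketch, not a Collibra tool. It assumes that the configuration file is plain JSON, as the key/value examples in this topic suggest, and that sources is a list of objects that each have an id property. The function name check_harvester_config is hypothetical.

```python
import json


def check_harvester_config(text):
    """Return a list of problems found in the configuration text."""
    try:
        # json.loads is strict: a // or # comment makes parsing fail.
        config = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]

    problems = []
    seen_ids = set()
    # Assumption: "sources" is a list of objects that each carry an "id".
    for source in config.get("sources", []):
        source_id = source.get("id")
        if not source_id:
            problems.append("a source has no id")
        elif source_id in seen_ids:
            problems.append(f"duplicate source id: {source_id}")
        else:
            seen_ids.add(source_id)
    return problems


sample = '{"sources": [{"id": "my_adf"}, {"id": "my_adf"}]}'
print(check_harvester_config(sample))
```

An empty list means that the text parsed as strict JSON and that no source ID is repeated.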

Before you begin

Tip You can use the configuration file generator to create an example configuration file to accommodate the data sources you specify in the generator. You can then copy the example code to your configuration file and replace the values of the properties to suit your needs.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.

    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.

    id
    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_adf.
    type

    The type of data source. The value must be AzureDataFactory.

    collibraSystemName (Deprecated)
    This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the useCollibraSystemName property in the source id file.
    tenantDomain
    The directory ID of the Azure Data Factory instance.
    loginFlow
    This section contains the login application information.
    applicationId
    The application ID of the Azure Data Factory instance.
    type
    The identity of the application. The value has to be ServicePrincipal.
    resourceGroupName
    The name of the resource group with the Reader role for the Azure Data Factory instance.
    subscriptionId
    The subscription ID of the resource group.
    factories

    The Azure Data Factory factories that the lineage harvester collects and processes. Specify this property with an array of Azure Data Factory factory names. This property is optional.

    The following rules apply when you specify this property:

    • Enter the factory names in square brackets ([ ]), enclose each factory name in double quotes (" "), and separate them by a comma, for example, ["MyFirstFactory", "MySecondFactory"].
    • The factory name is not case-sensitive. For example, the MyFactory and myfactory factories are considered the same by Azure Data Factory and the lineage harvester.
    • If you do not specify any factory name, the lineage harvester collects and processes all factories that contain datasets and pipelines.
    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

  2. Save the configuration file.
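
As a rough illustration of the Azure Data Factory properties described above, the following sketch builds one source section and prints it as JSON. The nesting of applicationId and type inside loginFlow is inferred from the property descriptions and is an assumption, as are all placeholder values. The helper duplicate_factories is hypothetical and only reflects the rule that factory names are not case-sensitive.

```python
import json

# All values are hypothetical placeholders; replace them with your own.
adf_source = {
    "id": "my_adf",
    "type": "AzureDataFactory",
    "tenantDomain": "<directory ID>",
    # Assumption: loginFlow is a nested section with these two properties.
    "loginFlow": {
        "applicationId": "<application ID>",
        "type": "ServicePrincipal",
    },
    "resourceGroupName": "<resource group name>",
    "subscriptionId": "<subscription ID>",
    "factories": ["MyFirstFactory", "MySecondFactory"],  # optional
    "deleteRawMetadataAfterProcessing": False,  # optional, false by default
}


def duplicate_factories(names):
    """Return factory names that repeat when compared case-insensitively."""
    seen, repeated = set(), []
    for name in names:
        key = name.lower()
        if key in seen:
            repeated.append(name)
        seen.add(key)
    return repeated


print(json.dumps(adf_source, indent=2))
print(duplicate_factories(["MyFactory", "myfactory"]))
```

Because MyFactory and myfactory refer to the same factory, listing both is redundant, and the helper reports the second spelling.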

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.
    Note  For SQL data sources, if this property is:
    • false, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    Note You can add multiple data sources to the same configuration file.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    type

    The kind of data source. In this case, the value has to be SqlDirectory.

    path

    The full path to the folder where you added SQL files, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates which files you want to harvest:

    • false (default): Only harvest the files directly under the folder in the SQL directory path.
    • true: Harvest all files under the folder in the SQL directory path and its subdirectories.
    dialect

    The dialect of the database: redshift, azure, bigquery, greenplum, hive, db2, oracle, postgres, mssql, mysql, netezza, snowflake, sybase, spark, teradata.

    hana, for an SAP HANA data source.

    hana-cviews, for getting lineage from calculated views in an SAP HANA Classic on-premises data source.

    hana-cviews-v2, for getting lineage from calculated views in an SAP HANA Cloud/Advanced data source.

    Important 

    To get technical lineage including calculated views, you must harvest SAP HANA by specifying two data sources in the lineage harvester configuration file. In one data source, specify the hana dialect, and in the other, specify the hana-cviews or hana-cviews-v2 dialect.

    The value that you specify for this property has to match the dialect of the SQL files in your directory.

    database

    The name of your database, which is the name of your Database asset.

    Note 
    • You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive.
    • The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the database and schema properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the database and schema properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage.
    Important 

    HiveQL data sources don't have schemas. Therefore, HiveQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • The database name is the name that you enter for the collibraSystemName property.
    • The schema name is the name that you enter for the database property.
    Important 

    MySQL data sources don't have schemas. Therefore, MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • The database name is the name that you enter for the database property.
    Important 

    Teradata data sources don't have schemas. Therefore, Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • The database name is the name that you enter for the collibraSystemName property.
    • The schema name is the name that you enter for the database property.
    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the System asset that you created when you prepared the physical data layer in Data Catalog or registered the data source. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    databaseSystemMapping

    This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the collibraSystemName property.

    schema

    The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset.

    Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
    verbose

    Indicates whether you want to enable verbose logging.

    By default, this is set to True. If you don't want to use verbose logging, set it to False.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
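
To show how the SQL directory properties above might fit together, the following sketch assembles a general section and one SqlDirectory source, then serializes them. The overall nesting (catalog inside general, sources as a list) is inferred from the property table and is an assumption. All names and values are hypothetical placeholders. The techlin section is left out because it applies only to US government customers.

```python
import json

# Hypothetical values throughout; the structure is inferred, not confirmed.
config = {
    "general": {
        "catalog": {
            "url": "https://<your Collibra URL>",
            "username": "<Collibra username>",
        },
        "useCollibraSystemName": False,
    },
    "sources": [
        {
            "id": "my_sql_files",
            "type": "SqlDirectory",
            "path": "C:\\path\\to\\config\\dir",  # one backslash per separator
            "mask": "*",
            "recursive": False,
            "dialect": "mssql",
            "database": "<Database asset name>",
            "schema": "<Schema asset name>",
            "collibraSystemName": "<System asset name>",
        }
    ],
}

# json.dumps escapes each backslash, which JSON requires inside strings.
text = json.dumps(config, indent=2)
print(text)
```

Note that a Windows path such as C:\path\to\config\dir appears in the serialized file with doubled backslashes, because a single backslash starts an escape sequence in a JSON string.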

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.
    Note  For SQL data sources, if this property is:
    • false, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.

    This section contains the required information of one individual data source with connection type "JDBC".

    Note You can add multiple data sources to the same configuration file.

    id

    The unique ID of the data source. For example, my_first_data_source.

    type

    The kind of data source. In this case, the value has to be Database.

    username

    The username that you use to sign in to your data source.

    dialect

    The dialect of the database: redshift, azure, bigquery, greenplum, hive, db2, oracle, postgres, mssql, mysql, netezza, snowflake, sybase, spark, teradata.

    hana, for an SAP HANA data source.

    hana-cviews, for getting lineage from calculated views in an SAP HANA Classic on-premises data source.

    hana-cviews-v2, for getting lineage from calculated views in an SAP HANA Cloud/Advanced data source.

    Important 

    To get technical lineage including calculated views, you must harvest SAP HANA by specifying two data sources in the lineage harvester configuration file. In one data source, specify the hana dialect, and in the other, specify the hana-cviews or hana-cviews-v2 dialect.

    The value that you specify for this property has to match the dialect of the SQL files in your directory.

    databaseNames

    The names or IDs of your databases.

    Enter the database names of your data source between double quotes (") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, ["MyFirstDatabase", "MySecondDatabase"].

    Note Ensure that you use the same database names as the names of the Database assets. The names are case-sensitive.
    Important 

    HiveQL, Spark SQL, and Teradata are database-less data sources. Therefore, HiveQL, Spark SQL, and Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • The database name is the name that you enter for the externalDbName property.
    • The schema name is the name that you enter for the database property.

    If you do not specify a value for the externalDbName property, Collibra Data Lineage uses the value of the collibraSystemName property as the database name. For details, see the externalDbName property.

    externalDbName

    This property can be considered a means of database mapping, to help preserve stitching.

    Note This property is relevant only for HiveQL, Spark SQL, and Teradata data sources, specifically because they are database-less data sources.

    You can add the key/value pair to the configuration file, as follows: "externalDbName": "<dbname>", where <dbname> is one of the following values:

    • CData, which is the placeholder that CData drivers return. Use this value if you did not create a custom database name by using the CustomizedDefaultCatalogName property when you registered your data source.
    • The custom database name that you specified for the CustomizedDefaultCatalogName property when you registered your data source.

    For more information about the CustomizedDefaultCatalogName connection property, go to Customizing the database name for database-less data sources.

    hostname
    The name of your database host.
    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the System asset that you created when you prepared the physical data layer in Data Catalog or registered the data source. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    If the useCollibraSystemName property is:

    • false (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName field is used as the default system or server name.

    port

    The port number.

    customConnectionProperties

    An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

  2. Save the configuration file.
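
The JDBC properties and the dependentSourceIds property described above can be sketched as follows. This is an illustration only. It assumes that both data sources are configured in the same file, that port is a number, and that every value shown is a hypothetical placeholder. The helper unknown_dependencies is not part of the lineage harvester.

```python
import json

# Two hypothetical JDBC sources; the second depends on the first.
sources = [
    {
        "id": "my_first_data_source",
        "type": "Database",
        "username": "<data source username>",
        "dialect": "oracle",
        "databaseNames": ["MyFirstDatabase"],
        "hostname": "<database host>",
        "collibraSystemName": "<System asset name>",
        "port": 1521,  # assumption: numeric; use your data source's port
    },
    {
        "id": "my_second_data_source",
        "type": "Database",
        "username": "<data source username>",
        "dialect": "oracle",
        "databaseNames": ["MySecondDatabase"],
        "hostname": "<database host>",
        "collibraSystemName": "<System asset name>",
        "port": 1521,
        # The dependent source lists the IDs of the sources it relies on.
        "dependentSourceIds": ["my_first_data_source"],
    },
]


def unknown_dependencies(sources):
    """Return (source id, dependency id) pairs that match no configured id."""
    known = {source["id"] for source in sources}
    return [
        (source["id"], dependency)
        for source in sources
        for dependency in source.get("dependentSourceIds", [])
        if dependency not in known
    ]


print(json.dumps(sources, indent=2))
print(unknown_dependencies(sources))
```

An empty result means that every ID listed in dependentSourceIds matches the id of a source in the same list.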

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description Required
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    Yes
    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    Yes for US government customers
    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    Yes for US government customers
    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    Yes for US government customers
    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    Yes
    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    Yes
    username

    The username that you use to sign in to Collibra.

    Yes
    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.
    No
    sources

    This configuration section contains the required information of a dbt Core data source.

    Note Make sure that you have prepared a local folder with the SQL files and Manifest JSON file for which you want to create a technical lineage.

    Yes
    collibraSystemName

    The system or server name of the data source.

    Use this property with the useCollibraSystemName property in the configuration file to override the default Collibra System asset name for this data source.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    No
    id

    The unique ID that is used to identify the data source on the Collibra Data Lineage service instances. For example, my_dbt.

    Yes
    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    Yes
    dirType

    The type of external directory. The value has to be dbt.

    Yes
    path

    The full path to the external directory that you created, for example, /opt/dbt/my-project/ or /opt/Collibra/techlin/dbt-core-files. Ensure that the target/ directory is in the external directory.

    Yes
    mask

    The pattern of the file names in the directory. By default, this is *, which sends the SQL and JSON files to the Collibra Data Lineage service instance.

    No
    recursive

    Indicates whether you want to use recursive queries.

    You must set the value to true. By default, this is set to false.

    Yes
    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

    No
  2. Save the configuration file.
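
The required values for a dbt Core data source, as listed in the table above, can be sketched and checked as follows. The helper dbt_source_problems and the placeholder values are hypothetical. The checks only mirror the Required column and the fixed values stated in this topic.

```python
import json

# Hypothetical dbt Core source section.
dbt_source = {
    "id": "my_dbt",
    "type": "ExternalDirectory",
    "dirType": "dbt",
    "path": "/opt/dbt/my-project/",  # must contain the target/ directory
    "mask": "*",                     # optional, * by default
    "recursive": True,               # must be true for dbt Core
}


def dbt_source_problems(source):
    """Return a list of rule violations for a dbt Core source section."""
    problems = []
    if source.get("type") != "ExternalDirectory":
        problems.append("type must be ExternalDirectory")
    if source.get("dirType") != "dbt":
        problems.append("dirType must be dbt")
    if source.get("recursive") is not True:
        problems.append("recursive must be true")
    for key in ("id", "path"):
        if not source.get(key):
            problems.append(f"{key} is required")
    return problems


print(json.dumps(dbt_source, indent=2))
print(dbt_source_problems(dbt_source))
```

Leaving out recursive is reported as a problem, because the default value is false and this data source requires true.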

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.
    Note  For SQL data sources, if this property is:
    • false, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    Note You can add multiple data sources to the same configuration file.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    type

    The kind of data source. In this case, the value has to be SqlDirectory.

    path

    The full path to the folder where you added SQL files, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indication of the files you want to harvest:

    • false (default): Only harvest the files directly under the folder in the SQL directory path.
    • true: Harvest all files under the folder in the SQL directory path and subdirectories.
    dialect

    The dialect of the database. For example, bigquery.

    The value that you specify for this property must match the dialect of the SQL files in the directory.

    database

    The name of your database, which is the name of your Database asset.

    Note 
    • You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive.
    • The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the database and schema properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the database and schema properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage.
    Important 

    MySQL data sources don't have schemas. Therefore, MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • The database name is the name that you enter for the database property.
    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    Specify this property with the same name as the name of the System asset that you created when you registered the data source.

    databaseSystemMapping

    This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the collibraSystemName property.

    schema

    The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset.

    Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
    verbose

    Indicates whether you want to enable verbose logging.

    By default this is set to True. If you don't want to use verbose logging, set it to False.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
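
    The properties described above could be combined as in the following minimal sketch of a configuration file for one SQL directory. It is an illustration under stated assumptions, not a definitive template: the id value my_sql_directory is hypothetical, the values in angle brackets are placeholders that you replace, and nesting catalog and useCollibraSystemName inside general and listing the data sources in a sources array are assumptions based on the order of the property table. The techlin section is omitted because it applies only to US government customers. Because comments are not supported in the configuration file, these assumptions are stated here and not in the fragment.

    ```json
    {
      "general": {
        "catalog": {
          "url": "<public URL of your Collibra environment>",
          "username": "<your Collibra username>"
        },
        "useCollibraSystemName": true
      },
      "sources": [
        {
          "id": "my_sql_directory",
          "type": "SqlDirectory",
          "path": "<full path to the folder with your SQL files>",
          "mask": "*",
          "recursive": false,
          "dialect": "bigquery",
          "database": "<name of your Database asset>",
          "schema": "<name of your Schema asset>",
          "collibraSystemName": "<name of your System asset>",
          "verbose": true,
          "deleteRawMetadataAfterProcessing": false
        }
      ]
    }
    ```

    If you enter a Windows path such as C:\path\to\config\dir and the file is parsed as standard JSON, each backslash most likely has to be doubled.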

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether you want to use the system or server name of a data source to match the System asset that you created when you prepared the Data Catalog physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name.

    sources

    This configuration section contains the required information for a Google BigQuery database.
    id

    The unique ID of your data source. For example, my_third_data_source.

    type

    The kind of data source. In this case, the value has to be DatabaseBigQuery.

    projectIDs

    The IDs of your Google BigQuery projects. You can add multiple projects, for example, [ "first-project", "second-project", "third-project" ].

    Note You have to use the same project ID as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
    region

    The location of your BigQuery data. This is the region that you specified when you created a data set.

    If the region that you specify here doesn't match the region you specified when you created a data set, then:

    • The metadata of that data set will not be harvested.
    • Metadata of the data sets in the region you specify here will be harvested.

    If you don't specify a region, the region defaults to US, meaning that metadata (and lineage) is harvested only for data sets located in the US region.

    You can only add one location as value. However, you can create separate BigQuery entries per location in the configuration file. As a result, you create a complete technical lineage with Google BigQuery data from different locations.

    Note This property is optional.

    auth

    The path to a JSON file that contains authentication information.

    Tip For more information about setting up the authentication, see the Google BigQuery user guide.

    collibraSystemName

    The name of the Google BigQuery system. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    Specify this property with the same name as the name of the System asset that you created when you registered the data source.
    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
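
    Because the region property accepts only one location, data sets in different locations need separate entries. The following fragment is a minimal sketch of that setup, not a definitive template: the id values and project IDs are hypothetical, the values in angle brackets are placeholders, and listing the entries in a sources array is an assumption. The general section is omitted for brevity. Because comments are not supported in the configuration file, these assumptions are stated here and not in the fragment.

    ```json
    {
      "sources": [
        {
          "id": "my_bigquery_us",
          "type": "DatabaseBigQuery",
          "projectIDs": ["first-project", "second-project"],
          "region": "US",
          "auth": "<path to the JSON file with authentication information>",
          "collibraSystemName": "<name of your System asset>"
        },
        {
          "id": "my_bigquery_eu",
          "type": "DatabaseBigQuery",
          "projectIDs": ["third-project"],
          "region": "EU",
          "auth": "<path to the JSON file with authentication information>",
          "collibraSystemName": "<name of your System asset>"
        }
      ]
    }
    ```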

  2. Save the configuration file.

For complete information on creating custom technical lineage by using the lineage harvester, go to Working with custom technical lineage.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName
    The lineage harvester ignores this property for custom technical lineage.

    To use the system or server name of your data source to match the System asset in Data Catalog, specify the system data object in:

    sources

    Contains the required information to retrieve a custom lineage. Use this property to locate the JSON file that defines the custom technical lineage.

    If you want to create the technical lineage for multiple data sources, create a sources section for each data source.

    type

    The kind of data source. The value must be ExternalDirectory.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, MyCustomLineage.

    dirType

    The type of external directory. The value is custom-lineage.

    collibraSystemName
    The lineage harvester ignores this property for custom technical lineage.

    To use the system or server name of your data source to match the System asset in Data Catalog, specify the system data object in:

    path

    The full path to the folder of the custom technical lineage JSON file, for example C:\path\to\custom-lineage\dir.

    If you are using the single-file definition method, there can be only one JSON file that defines the lineage, and the JSON file must be named lineage.json. You can, however, add other files in the harvested directory and subdirectories and refer to those files from within the JSON file.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
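
    For custom technical lineage, a source entry could look like the following minimal sketch. The id value is the example from the table above, the path is a placeholder, and listing the entry in a sources array is an assumption. The general section is omitted for brevity. Because comments are not supported in the configuration file, these assumptions are stated here and not in the fragment.

    ```json
    {
      "sources": [
        {
          "type": "ExternalDirectory",
          "id": "MyCustomLineage",
          "dirType": "custom-lineage",
          "path": "<full path to the folder that contains lineage.json>"
        }
      ]
    }
    ```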

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.

    sources

    This configuration section contains the required information to connect to IBM InfoSphere DataStage.

    Note Make sure that you have prepared a local folder with the DataStage files for which you want to create a technical lineage.

    collibraSystemName (Deprecated)
    This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the useCollibraSystemName property in the source id file.
    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_datastage.

    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    dirType

    The type of external directory. The value has to be datastage.

    path

    The full path to the folder where you stored the data source, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates whether you want to use recursive queries.

    By default, this is set to False. If you want to use recursive queries, set it to True.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
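
    For IBM InfoSphere DataStage, a source entry could look like the following minimal sketch. The id value is the example from the table above, the path is a placeholder, and listing the entry in a sources array is an assumption. The general section is omitted for brevity. Because comments are not supported in the configuration file, these assumptions are stated here and not in the fragment.

    ```json
    {
      "sources": [
        {
          "type": "ExternalDirectory",
          "id": "my_datastage",
          "dirType": "datastage",
          "path": "<full path to the folder with your DataStage files>",
          "mask": "*",
          "recursive": false
        }
      ]
    }
    ```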

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description

    Required?
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    Yes
    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    Yes
    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    Yes
    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    Yes
    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    Yes
    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    Yes
    username

    The username that you use to sign in to Collibra.

    Yes
    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.

    No
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    Note You can add multiple data sources to the same configuration file.

    Yes
    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    Yes
    type

    The kind of data source. The value must be dbt.

    Yes
    collibraSystemName (Deprecated)
    This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the useCollibraSystemName property in the source id file.

    No
    tokenName

    The name of the service token. It can be any unique meaningful name.

    When you run the lineage harvester, you will be prompted for a token. Enter the token value for the service token.

    Yes
    adminUrl
    The URL of the dbt Cloud Administrative API that Collibra Data Lineage uses to download job artifacts. The default value is https://cloud.getdbt.com/api/v2.

    This property is used if you do not specify the environmentIds property.

    If you specify both the adminUrl and environmentIds properties, the environmentIds property takes precedence.

    No
    environmentIds

    The IDs of the environments that Collibra Data Lineage uses to download job artifacts.

    Specify this property with an array of environment IDs, for example [123456, 987654]. This property is required if you do not specify the adminUrl property.

    If you specify both the adminUrl and environmentIds properties, the environmentIds property takes precedence.

    No
    metadataUrl

    The URL of the dbt Cloud Discovery API. The default value is https://metadata.cloud.getdbt.com/graphql.

    For details, go to Query the Discovery API in dbt documentation.

    No
    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
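
    For dbt Cloud, a source entry could look like the following minimal sketch. The id value my_dbt_cloud is hypothetical, the token name is a placeholder, the environment IDs are the example values from the table above, and listing the entry in a sources array is an assumption. The adminUrl property is left out because environmentIds takes precedence when both are specified. The general section is omitted for brevity. Because comments are not supported in the configuration file, these assumptions are stated here and not in the fragment.

    ```json
    {
      "sources": [
        {
          "id": "my_dbt_cloud",
          "type": "dbt",
          "tokenName": "<name of your service token>",
          "environmentIds": [123456, 987654],
          "metadataUrl": "https://metadata.cloud.getdbt.com/graphql"
        }
      ]
    }
    ```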

     
  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.

    sources

    This configuration section contains the required information to enable the lineage harvester to collect and process Data Integration objects.

    For a large data source, you can create several Informatica Intelligent Cloud Services <source ID> configuration files to avoid errors that might occur when the lineage harvester ingests a large amount of metadata from a single source. To reduce the size of a source, move some of its projects to a separate source that has its own <source ID> configuration file.

    Tip Make sure you have READ permission on all data objects that you want to harvest.

    type

    The kind of data source. In this case, the value has to be IICS.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_data_integration.

    collibraSystemName (Deprecated)
    This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the useCollibraSystemName property in the source id file.
    loginURL

    The URL of the Informatica Intelligent Cloud Services environment sign-in page. For example: https://dm-us.informaticaintelligentcloud.com.

    username

    The username you use to sign in to Informatica Intelligent Cloud Services.

    objects

    The objects that you want to retrieve. Each object requires a path and a type, for example:

    Tip For more information about the objects that you can export and the required information, see the Informatica documentation.

    path

    The full path to the object, for example, C:\path\to\object-dir.

    type

    The type of the object, for example, Taskflow.

    The starting point of the IICS scanner is a Taskflow or Linear Taskflow (Workflow). Therefore, the only meaningful types to retrieve are Taskflow, Workflow, Project, and Folder.

    The types are not case sensitive.
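
    As a minimal sketch of the objects property described above, the following fragment requests one Taskflow and one Folder. The layout as an array of path and type pairs is assumed from the description, and both path values are placeholders rather than real paths.

    ```json
    "objects": [
      {
        "path": "<full path to a taskflow>",
        "type": "Taskflow"
      },
      {
        "path": "<full path to a folder>",
        "type": "Folder"
      }
    ]
    ```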

    paramFiles

    The full path to the directory in which your parameter files are stored.

    This is an optional parameter that allows you to harvest parameter files in Informatica Intelligent Cloud Services data sources.

    Important The hierarchy of the files in the directory must be an exact match of the hierarchy of the files in your file system.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
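
    To illustrate how a large Informatica Intelligent Cloud Services source could be split, here is a sketch of a sources section with two IICS entries that cover different projects. The property names are written as they appear in the table, the layout is assumed from the descriptions, and the two id values and all values in angle brackets are hypothetical placeholders.

    ```json
    "sources": [
      {
        "type": "IICS",
        "id": "my_data_integration_part_1",
        "loginURL": "https://dm-us.informaticaintelligentcloud.com",
        "username": "<your IICS username>",
        "objects": [
          { "path": "<full path to the first project>", "type": "Project" }
        ],
        "deleteRawMetadataAfterProcessing": false
      },
      {
        "type": "IICS",
        "id": "my_data_integration_part_2",
        "loginURL": "https://dm-us.informaticaintelligentcloud.com",
        "username": "<your IICS username>",
        "objects": [
          { "path": "<full path to the second project>", "type": "Project" }
        ],
        "deleteRawMetadataAfterProcessing": false
      }
    ]
    ```

    Because each id identifies a separate batch of metadata, each entry in this sketch would be processed as its own, smaller source.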

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.
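
    Put together, the general section might look like the following sketch. The nesting of techlin, catalog, and useCollibraSystemName under general is assumed from the order of the table, and the values in angle brackets are placeholders. The techlin block is shown only for completeness; per the warnings above, it applies only to US government customers.

    ```json
    {
      "general": {
        "techlin": {
          "url": "https://techlin-gov.collibra.com",
          "userKey": "<your user key>"
        },
        "catalog": {
          "url": "https://<public URL of your Collibra environment>",
          "username": "<your Collibra username>"
        },
        "useCollibraSystemName": false
      },
      "sources": []
    }
    ```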

    sources

    This configuration section contains the required information to connect to Informatica PowerCenter.

    Note Make sure that you have prepared a local folder with the Informatica objects for which you want to create a technical lineage.

    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Use this property with the useCollibraSystemName property in the lineage harvester configuration file to override the default Collibra System asset name for this data source.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    The following rules apply when you specify the collibraSystemName properties in this file and the source ID file:

    • If you specify the collibraSystemName property for a database or connection in the source ID file, the value in the source ID file overrides the value of this property for that database or connection.
    • For any databases or connections that do not have a Collibra system name specified in the source ID file, the value of this property is used as a global value.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_informatica.

    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    dirType

    The type of external directory. The value must be powercenter.

    path

    The full path to the folder where you stored the data source, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates whether to use recursive queries.

    Specify one of the following values:

    false
    The lineage harvester collects only the files in the folder specified by the path property. Files in subfolders of that folder are not collected. This is the default value.
    true
    The lineage harvester collects files in the folder specified by the path property and also files in its subfolders. Use this value if the folder specified by the path property contains subfolders.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
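
    As a rough sketch, a single Informatica PowerCenter entry in the sources section could combine the properties above as follows. The layout is assumed from the table, the collibraSystemName value is a placeholder, and the doubled backslashes assume that the file is parsed as JSON, where a backslash inside a string has to be escaped.

    ```json
    {
      "id": "my_informatica",
      "type": "ExternalDirectory",
      "dirType": "powercenter",
      "path": "C:\\path\\to\\config\\dir",
      "mask": "*",
      "recursive": true,
      "collibraSystemName": "<name of your System asset>",
      "deleteRawMetadataAfterProcessing": false
    }
    ```

    With recursive set to true, files in subfolders of the path folder are collected as well.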

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description

    general

    This section describes the connection information between the lineage harvester and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This applies only to Collibra Cloud for Government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This applies only to Collibra Cloud for Government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This applies only to Collibra Cloud for Government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    url

    The URL of your Collibra Data Intelligence Platform environment.

    Note You can only enter the public URL of your Collibra DGC environment. Other URLs will not be accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog. Collibra Data Lineage uses the system names to match the structure of databases in Looker to assets in Data Catalog. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Looker <source ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the Looker <source ID> configuration file.

    sources

    This section contains the Looker connection properties.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per Looker instance. If you have multiple id properties for a single Looker instance, ingestion will fail. If you have multiple id properties in the configuration file, it means you intend to ingest from multiple unique Looker instances.

    Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capability.

    type

    The kind of data source. In this case, the value has to be Looker.

    lookerUrl

    The URL to your Looker API.

    Tip There are two ways to find the Looker API URL:
    • In the API Host URL field in the Looker Admin menu. If this field is empty, you can use the default Looker API URL which you can find in the interactive API documentation.
    • In the interactive API documentation URL. It is the part of the URL before /api-docs/.

    Note Looker 3.1 APIs are deprecated; however, the API3 credentials for authorization and access control remain valid.

    clientId

    The username you use to access the Looker API.

    domainId

    The unique ID of the domain in Collibra Data Intelligence Platform in which you want to ingest the Looker assets.

    This is the default domain.

    If you want to ingest the contents of specific Looker Folders into specific domains in Collibra, you specify the domain reference IDs in the filters section of the Looker <source ID> configuration file.

    pagingLimit

    Optional property for customizing the Looker API pagination settings. The default value of "50" is sufficient in most cases; however, you can decrease it to help mitigate node limit errors, or increase it to speed up API calls.

    Note The paging limit option is known to cause issues when used with Looker Core instances. If you experience issues, for example a Received RST_STREAM: Protocol error, we recommend disabling pagination by setting the value to "0".

    Example "pagingLimit": 10

    concurrencyLevel

    This optional property can help if you experience HTTP 401 Unauthorized errors caused by too many concurrent HTTP calls that use the same token. It sets the internal sizing, meaning the number of tasks that can be executed at the same time.

    The default value is "15", meaning as many as 15 HTTP requests can take place in parallel. Consider reducing the value if you are experiencing HTTP 401 Unauthorized errors. Setting the value to "1" effectively disables the concurrency level, so that HTTP requests will be run in a synchronous manner, instead of in parallel.

    Example "concurrencyLevel": 5

    connectionTimeoutSeconds

    This optional property is intended to help avoid timeout errors, when the lineage harvester attempts to connect to your Looker instance. The default value is "30", meaning a timeout error is thrown if a connection is not established within 30 seconds.

    If timeout errors persist, try adding this property to your lineage harvester configuration file and setting the value to 60 or 90.

    Example "connectionTimeoutSeconds": 60

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
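
    The Looker properties above could be combined into one entry of the sources section, roughly as in this sketch. The layout is assumed from the table, all values in angle brackets are placeholders, and the numeric values are written without quotes to match the examples above. The values chosen for concurrencyLevel and connectionTimeoutSeconds are only illustrations of the tuning advice, not recommendations.

    ```json
    {
      "id": "<source ID>",
      "type": "Looker",
      "lookerUrl": "<URL of your Looker API>",
      "clientId": "<your Looker API client ID>",
      "domainId": "<ID of the default domain>",
      "pagingLimit": 50,
      "concurrencyLevel": 5,
      "connectionTimeoutSeconds": 60
    }
    ```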

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.

    sources

    This section contains the required information for Matillion.

    Tip When you create a new project in Matillion, you define in which group you want to create the project, the project name and the environment name. This information is needed to enable the lineage harvester to access Matillion and scan your metadata.

    Important Currently, you can only create a technical lineage for Snowflake and Redshift projects in Matillion.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_matillion_data_integration.

    type

    The kind of data source. In this case, the value has to be Matillion.

    url

    The URL of your Matillion environment. For example, https://<domain name> or https://<IP address>.

    groupName

    The name of your group in Matillion.

    projectName

    The name of your project in Matillion.

    You can only add the name of one project. If you want to create a technical lineage for other projects within the same group, create a new section in the lineage harvester configuration file.

    environmentName

    The name of your environment in Matillion.

    You can only add the name of one environment. If you want to create a technical lineage for other environments within the same project, create a new section in the lineage harvester configuration file.

    dialect

    The dialect of the database.

    startTimestamp

    The timestamp of tasks in Matillion. You can use this parameter to limit the amount of metadata that the lineage harvester scans.

    Specify this property with a UNIX timestamp in milliseconds.

    If this property remains empty or is deleted from the configuration file, all accessible tasks are scanned. Matillion provides seven days of history by default and automatically removes entries older than seven days.

    httpTimeout

    Sets the HTTP timeout duration in seconds. You can enter a value in the range of 0 to 3600. The default value is 15.

    collibraSystemName (Deprecated)

    This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the useCollibraSystemName property in the source id file.

    auth

    This section contains the authentication details for signing in to Matillion.

    type

    The authentication method you want to use to sign in to Matillion.

    The value must be either:

    • Basic, for username and password authentication.
    • Token, for token-based authentication.

    Important These values are case-sensitive.

    username

    The username that you use to sign in to Matillion.

    Important This property is only required if you are using the username and password authentication method. If you are using token-based authentication, do not include this property.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
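
    A Matillion entry in the sources section could look roughly like the following sketch, which uses username and password authentication. Several details are assumptions rather than confirmed syntax: the layout of the auth block, the spelling of the dialect value, and startTimestamp being written as an unquoted number. The timestamp 1672531200000 is 1 January 2023, 00:00:00 UTC, expressed in milliseconds, and is only an illustration of the format.

    ```json
    {
      "id": "my_matillion_data_integration",
      "type": "Matillion",
      "url": "https://<domain name>",
      "groupName": "<your group name>",
      "projectName": "<your project name>",
      "environmentName": "<your environment name>",
      "dialect": "snowflake",
      "startTimestamp": 1672531200000,
      "httpTimeout": 15,
      "auth": {
        "type": "Basic",
        "username": "<your Matillion username>"
      }
    }
    ```

    For token-based authentication, the type would be Token and the username property would be left out, as described above.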

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description

    general

    This section describes the connection information between the lineage harvester and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This applies only to Collibra Cloud for Government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This applies only to Collibra Cloud for Government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This applies only to Collibra Cloud for Government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    url

    The URL of your Collibra Data Intelligence Platform environment.

    Note You can only enter the public URL of your Collibra DGC environment. Other URLs will not be accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog during automatic stitching. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your MicroStrategy <source ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source ID> configuration file.

    sources

    This section contains the MicroStrategy connection properties.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_microstrategy.

    Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per MicroStrategy Intelligence Server. If you have multiple id properties for a single MicroStrategy Intelligence Server, ingestion will fail. If you have multiple id properties in the configuration file, it means you intend to ingest from multiple unique MicroStrategy Intelligence Servers.

    Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capability.

    type

    The kind of data source. In this case, the value has to be MSTR_V2.

    url

    The URL of your MicroStrategy account.

    username

    The username that you use to sign in to MicroStrategy.

    microStrategyLibraryUrl

    If you are using a custom URL to connect to the MicroStrategy Library Server, use this property to specify the custom library URL.

    Important You only need to specify the URL if both of the following are true:
    • You are connecting to a proxy server.
    • You are not using the default, hardcoded URL to the MicroStrategy Library Server.

      Example If the URL to your MicroStrategy Library is https://collibra.microstrategy.com/MicroStrategyLibrary/api, you don't need to use this property, as that is the default, hardcoded URL. However, if the URL is something like https://collibra.microstrategy.com/MicroStrategyLibraryProd/api, then use this property and configure it as follows:
      "microStrategyLibraryUrl": "MicroStrategyLibraryProd"

    maxParallelRequests

    This optional property allows you to specify the internal sizing, meaning the number of tasks that can be executed at the same time.

    The default value is "1", which means that HTTP requests are run in a synchronous manner, instead of in parallel. A value of "5", for example, means that as many as 5 HTTP requests can take place in parallel.

    A lower value reduces the chances of experiencing HTTP 401 Unauthorized errors.

    requestTimeoutMs

    This optional property allows you to specify the maximum time, in milliseconds (ms), that the MicroStrategy Intelligence Server will wait for a request from the lineage harvester, before closing the connection.

    Tip A "connection timeout" refers to the amount of time that the lineage harvester will wait for a response from MicroStrategy. A "request timeout" is the converse: the amount of time that MicroStrategy will wait for a request from the lineage harvester.

    The default value is "30000", or 30 seconds. A higher value reduces the chances of experiencing HTTP 408 Request Timeout errors.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

    appUrlSuffix

    This optional property ensures that the correct URL to data objects in MicroStrategy is included on the asset pages of corresponding MicroStrategy assets. The required value depends on which platform you run MicroStrategy:

    • For J2EE, use: "appUrlSuffix": "MicroStrategy/servlet/mstrWeb"
    • For .NET, use: "appUrlSuffix": "MicroStrategy/asp/Main.aspx"
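
    As a sketch of how the MicroStrategy properties above fit together, one entry in the sources section might look like this. The layout is assumed from the table, the values in angle brackets are placeholders, and the numeric values are assumed to be written without quotes. The example uses the custom library URL and the J2EE suffix from the descriptions above, and a request timeout of 60000 ms purely as an illustration.

    ```json
    {
      "id": "my_microstrategy",
      "type": "MSTR_V2",
      "url": "<URL of your MicroStrategy account>",
      "username": "<your MicroStrategy username>",
      "microStrategyLibraryUrl": "MicroStrategyLibraryProd",
      "maxParallelRequests": 1,
      "requestTimeoutMs": 60000,
      "appUrlSuffix": "MicroStrategy/servlet/mstrWeb"
    }
    ```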

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.

    Note For SQL data sources, if this property is:
    • false, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.

    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    Note You can add multiple data sources to the same configuration file.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    type

    The kind of data source. In this case, the value has to be SqlDirectory.

    path

    The full path to the folder where you added SQL files, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates which files you want to harvest:

    • false (default): Only harvest the files directly under the folder in the SQL directory path.
    • true: Harvest all files under the folder in the SQL directory path and subdirectories.
    dialect

    The dialect of the database. For example, oracle.

    The value that you specify for this property has to match the dialect of the SQL files in your directory.

    database

    The name of your database, which is the name of your Database asset.

    Note 
    • You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive.
    • The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the database and schema properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the database and schema properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage.
    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    Specify this property with the same name as the name of the System asset that you created when you registered the data source.

    databaseSystemMapping

    This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the collibraSystemName property.

    schema

    The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset.

    Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
    verbose

    Indicates whether you want to enable verbose logging.

    By default, this is set to true. If you don't want to use verbose logging, set it to false.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
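
    As a minimal sketch, the properties above could be combined into one entry of the sources section as follows. The ID, path, and asset names are hypothetical placeholders, and the sketch assumes that sources is a JSON array with one object per data source. Because the configuration file is JSON, the backslashes in a Windows path are escaped. Optional properties are shown with their documented default values.

    ```json
    {
      "sources": [
        {
          "id": "my_sql_directory",
          "type": "SqlDirectory",
          "path": "C:\\path\\to\\config\\dir",
          "mask": "*",
          "recursive": false,
          "dialect": "oracle",
          "database": "<name of your Database asset>",
          "schema": "<name of your Schema asset>",
          "collibraSystemName": "<name of your System asset>",
          "verbose": true,
          "deleteRawMetadataAfterProcessing": false
        }
      ]
    }
    ```

    Replace each placeholder in angle brackets with your own value. The file cannot contain comments, so keep any notes about the values outside the configuration file.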

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. This is useful if you have multiple databases with the same name.
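
    For illustration, a general section might look like the following sketch. The nesting of techlin and catalog inside general follows the order of the properties above and is an assumption; the user key, Collibra URL, and username are placeholders. Include the techlin section only if you are a US government customer.

    ```json
    {
      "general": {
        "techlin": {
          "url": "https://techlin-gov.collibra.com",
          "userKey": "<your user key>"
        },
        "catalog": {
          "url": "<public URL of your Collibra environment>",
          "username": "<your Collibra username>"
        },
        "useCollibraSystemName": false
      }
    }
    ```

    Set useCollibraSystemName to true only if you have multiple databases with the same name.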

    sources

    This configuration section contains the required information for an Oracle database.

    id

    The unique ID of your Oracle database. For example, my_oracle_db.

    type

    The kind of data source. In this case, the value has to be DatabaseOracle.

    hostname
    The name of your database host.
    username
    The username that you use to sign in to your Oracle database.
    port
    The port number.
    sids

    One or more System Identifiers (SIDs). An SID is a unique name for an Oracle database instance on a specific host. You can use this property in conjunction with the databaseNames property, to preserve stitching.

    Important You must specify either one or more SIDs via this property, or one or more service names via the serviceNames property. You cannot include both properties in the configuration file.
    serviceNames

    One or more service names. A service name is the TNS alias that you give when you remotely connect to your database. You can use this property in conjunction with the databaseNames property, to preserve stitching.

    Important You must specify either one or more service names via this property, or one or more SIDs via the sids property. You cannot include both properties in the configuration file.
    databaseNames

    The names of one or more Oracle databases. You can use this optional property in conjunction with the sids or serviceNames property, to preserve stitching. The value you specify has to match your Database asset (or assets) in Collibra.

    Enter the Oracle database names between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, ["MyFirstDatabase", "MySecondDatabase"].

    • If you use this property, the database names that you specify have to correlate with the databases that you specify in the sids or serviceNames property.
    • If you don't use this property, the database name in the technical lineage will be the value that you put for the sids or serviceNames property.

    Tip For examples of how to configure this property, see the sids or serviceNames property descriptions and examples.
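
    A minimal sketch of the relevant part of an Oracle source that pairs one SID with one database name. The SID value ORCL is hypothetical, and the sketch assumes that sids takes a JSON array in the same way as databaseNames. The serviceNames property is left out because it cannot be combined with sids.

    ```json
    {
      "sids": ["ORCL"],
      "databaseNames": ["MyFirstDatabase"]
    }
    ```

    In this sketch, MyFirstDatabase has to match the name of your Database asset in Collibra. Without the databaseNames property, the SID itself would be used as the database name in the technical lineage.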

    jdbcUrl

    Optional property to override the default JDBC URL used to connect to the database.

    Use this when you need to use connection properties.

    Example: "jdbcUrl": "jdbc:oracle:thin:@db.example.com:1521/orclpdb1"

    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    Specify this property with the same name as the name of the System asset that you created when you registered the data source.

    If the useCollibraSystemName property is:

    • false (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName field is used as the default system or server name.

    databaseSystemMapping

    This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the collibraSystemName property.

    databaseLinkMapping

    If you are using DBLinks, this optional property allows you to configure, per data source, the database and schema to which each DBLink points.

    The configuration format is as follows:

    "databaseLinkMapping": {"<dblink_name>": {"database":"<database>","schema":"<schema>"}, ...}

    Tip  If you’re using a DBLink to target another source, you need to share the database model between the targeted (independent) source and the dependent source. Use the dependentSourceIds (Beta) property to configure that dependency and share the database model.

    Important If the same DBLink, for example dblink.example.com, exists in multiple databases, the formatting shown in the previous example still applies, but you need to enclose it in curly brackets and specify the relevant database, as follows:
    • Basic formatting, as shown in the previous example:
      "dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}
    • Formatting if the DBLink exists in multiple databases and you want to apply it only in a database named "dbScope1":
      "dbScope1": {"dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}}

    If a DBLink is referenced in multiple mappings, as shown in the following example, the first mapping is used.

    "databaseLinkMapping": {
       "dbScope1": {
          "dblink.example.com": {"database":"DevDB_A","schema":"DevSch_A1"}
       },
       "dblink.example.com": {"database":"Database_A","schema":"Schema_A1"}
    }

    In this case, occurrences of dblink.example.com in the database named "dbScope1" are mapped to:

    "database":"DevDB_A","schema":"DevSch_A1"

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
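
    Putting the Oracle properties together, a source entry could look like the following sketch. The host name, port, and service name are taken from the jdbcUrl example above; the remaining values are placeholders. Two things are assumptions: that sources is a JSON array, and that port is written as a number rather than a string.

    ```json
    {
      "sources": [
        {
          "id": "my_oracle_db",
          "type": "DatabaseOracle",
          "hostname": "db.example.com",
          "port": 1521,
          "username": "<your Oracle username>",
          "serviceNames": ["orclpdb1"],
          "databaseNames": ["MyFirstDatabase"],
          "collibraSystemName": "<name of your System asset>",
          "databaseLinkMapping": {
            "dblink.example.com": {"database": "Database_A", "schema": "Schema_A1"}
          },
          "deleteRawMetadataAfterProcessing": false
        }
      ]
    }
    ```

    The sketch uses serviceNames and therefore omits sids, because only one of the two properties can be present. The databaseLinkMapping entry is only needed if your SQL code references a DBLink.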

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.
    Properties

    Description
    general

    This section describes the necessary connection information.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This applies only for Collibra Cloud for Government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This applies only for Collibra Cloud for Government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This applies only for Collibra Cloud for Government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    url

    The URL of your Collibra environment.

    Note You can only enter the public URL of your Collibra DGC environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog during automatic stitching. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Power BI <source ID> configuration file.
    • If you set this property to false, the lineage harvester ignores the collibraSystemName property in the Power BI <source ID> configuration file.
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.

    Note You can add multiple data sources to the same configuration file, but you can't have multiple sources sections that refer to the same tenant.

    scope

    Optional property that is intended only for customers with a different scope, such as Chinese tenants.

    Example "scope": "https://analysis.chinacloudapi.cn/powerbi/api/.default"

    Important If you are a US government or national cloud Power BI customer, you must include and specify values for both this property and the apiUrl property. For complete information, consult Microsoft's documentation on Power BI for US government customers.

    apiUrl

    The API URL of your Power BI service.

    The default value is https://api.powerbi.com.

    Important This property is only relevant for US government or national cloud Power BI customers, in which case you must include and specify values for both this property and the scope property. For complete information, consult Microsoft's documentation on Power BI for US government customers.

    type
    The kind of data source. In this case, the value has to be PowerBI.
    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_power-bi.

    Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per Power BI service. If you have multiple id properties for a single Power BI service, ingestion will fail. If you have multiple id properties in the configuration file, it means you intend to ingest from multiple unique Power BI services.

    Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.

    tenantDomain

    The Power BI tenant domain is the domain associated with the Microsoft Azure tenant. It is either a default domain or a custom domain. You can specify this property with one of the following:

    • The appropriate part of the URL, for example collibrapowerbi.onmicrosoft.com. Do not include the http:// part of the URL.
    • The tenant ID, for example e**b****-****-****-****-1b**d****4663.

    Tip Usually, you can find a list of Power BI tenant or server domains in your Azure Active Directory or in the upper-right menu.

    loginFlow

    This section describes the authentication information for accessing your Power BI metadata.

    The lineage harvester supports two authentication methods: service principal, and username and password. For complete information on your authentication options, see Authentication.

    type

    This depends on the authentication method you use.

    • Service principal: The value should be ServicePrincipal.
    • Username and password: The value should be ResourceOwnerPasswordCredentials.
    applicationId
    The Application (client) ID of your Microsoft Azure application.
    username

    The email address of your Azure Active Directory user.

    Tip This property only applies if you are using the username and password authentication method.
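
    As a sketch, a loginFlow section for the username and password method could look as follows. Both values are placeholders. For the service principal method, set type to ServicePrincipal and leave out the username property.

    ```json
    {
      "loginFlow": {
        "type": "ResourceOwnerPasswordCredentials",
        "applicationId": "<Application (client) ID>",
        "username": "<email address of your Azure Active Directory user>"
      }
    }
    ```

    This sketch contains only the properties described in this section.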

    domainId
    The reference ID of the domain in Collibra in which you want to ingest Power BI metadata.
    useHttp1
    Optional property to use HTTP/1.1 streams, in case file-size limitations are resulting in timeout errors when using the default HTTP/2 streams.
    daxParserEnabled

    Note This feature is not available on Collibra Cloud for Government.

    Optional property for enabling DAX analysis via Collibra AI. This feature:

    • Allows you to create column-level lineage that includes your calculated columns and measures in Power BI.
    • Enables stitching between calculated columns in the technical lineage and the corresponding Power BI Column assets in Data Catalog.

    The default value is false. To enable DAX analysis, set the value to "daxParserEnabled": true.

    For complete information on DAX analysis, go to DAX analysis via Collibra AI.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
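
    The Power BI properties above could be combined into a single source entry, as in the following sketch. The ID and tenant domain reuse the examples given in the property descriptions; the application ID and domain ID are placeholders, and sources is assumed to be a JSON array. The scope and apiUrl properties are omitted because they are only relevant for US government or national cloud customers.

    ```json
    {
      "sources": [
        {
          "type": "PowerBI",
          "id": "my_power-bi",
          "tenantDomain": "collibrapowerbi.onmicrosoft.com",
          "loginFlow": {
            "type": "ServicePrincipal",
            "applicationId": "<Application (client) ID>"
          },
          "domainId": "<reference ID of the Collibra domain>",
          "daxParserEnabled": false,
          "deleteRawMetadataAfterProcessing": false
        }
      ]
    }
    ```

    Use exactly one entry, with one id, per Power BI service. To enable DAX analysis, change daxParserEnabled to true.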

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the <source ID> file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the <source ID> file.

    Note Specify this property with the value of true only when you have multiple databases with the same name.

    sources

    This configuration section contains the required information to connect to SQL Server Integration Services (SSIS).

    Note Make sure that you have prepared a local folder with the SSIS files for which you want to create a technical lineage.

    collibraSystemName (Deprecated)
    This property is deprecated. If you specify a value for this property, it is ignored. To override the default Collibra System asset name, use the useCollibraSystemName property in the <source ID> file.
    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_ssis.

    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    dirType

    The type of external directory. The value has to be ssis.

    path

    The full path to the folder where you stored the data source, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates whether you want to use recursive queries.

    By default, this is set to false. If you want to use recursive queries, set it to true.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.
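
    A minimal sketch of an SSIS source entry that uses the properties above. The ID reuses the example from the id description, the path is the example path with its backslashes escaped for JSON, and sources is assumed to be a JSON array. The deprecated collibraSystemName property is left out.

    ```json
    {
      "sources": [
        {
          "id": "my_ssis",
          "type": "ExternalDirectory",
          "dirType": "ssis",
          "path": "C:\\path\\to\\config\\dir",
          "mask": "*",
          "recursive": false,
          "deleteRawMetadataAfterProcessing": false
        }
      ]
    }
    ```

    The mask, recursive, and deleteRawMetadataAfterProcessing values shown are the documented defaults.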

  2. Save the configuration file.

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file. Only specify this value when you have multiple databases with the same name.
    Note  For SQL data sources, if this property is:
    • false, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source. This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    Note You can add multiple data sources to the same configuration file.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    type

    The kind of data source. In this case, the value has to be SqlDirectory.

    path

    The full path to the folder where you added SQL files, for example, C:\path\to\config\dir.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates which files you want to harvest:

    • false (default): Only harvest the files directly under the folder in the SQL directory path.
    • true: Harvest all files under the folder in the SQL directory path and subdirectories.
    dialect

    The dialect of the database. For example, snowflake.

    The value that you specify for this property has to match the dialect of the SQL files in your directory.

    database

    The name of your database, which is the name of your Database asset.

    Note 
    • You have to use the same database name as the name of the Database asset that you create when you prepare the physical data layer in Data Catalog. The names are case-sensitive.
    • The database and schema names in the SQL statements in your SQL files take precedence over the values that you provide for the database and schema properties in the lineage harvester configuration file. If your SQL statements contain database and schema names, Collibra Data Lineage uses them for stitching. If your SQL statements do not contain database and schema names, Collibra Data Lineage uses the values of the database and schema properties in the configuration file for stitching. For more information, go to Prepare an SQL directory and Automatic stitching for technical lineage.
    collibraSystemName

    The name of the data source's system or server. This is also the name of your System asset in Data Catalog.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    databaseSystemMapping

    This optional property allows you to map databases to their rightful systems, to obtain stitching. This resolves missing stitching, which occurs when Collibra Data Lineage associates multiple databases with the default system name that you provide in the collibraSystemName property.

    schema

    The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset.

    Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
    verbose

    Indicates whether you want to enable verbose logging.

    By default, this is set to true. If you don't want to use verbose logging, set it to false.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. The names are case-sensitive. This is useful if you have multiple databases with the same name.

    sources

    This section contains the Snowflake connection properties. If you want to create the technical lineage for multiple data sources, create a sources section for each data source.

    id

    The unique ID that identifies the data source on a Collibra Data Lineage service instance, for example, my_snowflake_2.

    type

    The type of data source. The value must be DatabaseSnowflake.

    mode

    The Snowflake ingestion methods that Collibra Data Lineage uses to ingest metadata from Snowflake data sources.

    Specify one of the following values:

    SQL
    The SQL Snowflake ingestion mode. Collibra Data Lineage creates a column-level technical lineage based on SQL statements.
    This is the default value.
    SQL-API
    The SQL-API Snowflake ingestion mode. Collibra Data Lineage creates a column-level technical lineage based on Snowflake schemas and the access history.

    For more information, go to Technical lineage for Snowflake ingestion methods.

    collibraSystemName

    Use this property with the useCollibraSystemName property in the lineage harvester configuration file to override the default Collibra System asset name for this data source.

    Specify this property with the same name as the name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    Note The collibraSystemName property in the source ID file takes precedence over the collibraSystemName property in the lineage harvester configuration file. If you specify the collibraSystemName property in the source ID file, you can omit the property in the configuration file.
    auth

    This section indicates the authentication details to connect to the Snowflake database.

    Note The username and auth properties are mutually exclusive.

    type

    The authentication method.

    Specify one of the following values. The values are case-sensitive.

    Basic
    The username and password authentication method. Specify the auth.username property if you use this authentication method.
    KeyPair
    The key pair authentication method. Specify the auth.username, auth.pathToPrivateKey, and auth.usePassword properties if you use this authentication method.
    username
    The user name that you use to connect to the Snowflake database. This property is required for both the username and password authentication method and the key pair authentication method.
    pathToPrivateKey

    The path to your private key file. This property is required if you use the key pair authentication method.

    Ensure that the private key matches the public key; otherwise, an error occurs indicating that the JWT token is invalid. For more information about the error, go to Snowflake JDBC driver error at login: net.snowflake.client.jdbc.SnowflakeSQLException: JWT token is invalid in Collibra Support Portal.

    usePassword

    The private key file password.

    This property is required if you use the key pair authentication method. Specify one of the following values:

    true
    The password is required.
    false
    The password is not required. This is the default value.
    username

    The username that you use to sign in to your Snowflake data source.

    Note This property is deprecated. Use the auth property instead. The username property and the auth property are mutually exclusive.

    hostname

    The URL that you use to access the Snowflake web console. When you enter the URL, do not include https:// or the trailing slash (/). For example, specify <accountName>.snowflakecomputing.com.

    databaseNames

    An array of database names. Ensure that the database names you specify match the Database asset names that you created when you prepared the physical data layer in Data Catalog.

    Enter the database names of your data source between double quotes ("") and put everything between square brackets ([]). If you want to include more than one database, separate them by a comma, for example, ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"].

    extraDatabaseDefinitions

    An array of database names. Collibra Data Lineage collects metadata from the specified databases, but excludes these databases from the technical lineage that is created. This property is useful for stitching across databases. You can specify cross-referenced databases to ensure correct lineage across all databases that Collibra Data Lineage processes to create the technical lineage.

    This property is optional. To specify this property, enter the database names between double quotes ("") and put everything between square brackets ([]). If you want to include more than one database, separate them by a comma, for example, ["MyFirstSnowflakeExternalDatabase", "MySecondSnowflakeExternalDatabase"].

    schemaNames

    An array of schema names of your data sources. This property takes effect only when you use the SQL-API Snowflake ingestion mode. You can use this property as a filter to include lineage for objects only in the specified schemas.

    Ensure that the schema names you specify match the Schema asset names that you created when you registered the data source in Data Catalog.

    Enter the schema names between double quotes ("") and put everything between square brackets ([]). If you want to include more than one schema, separate them by a comma, for example, ["MyFirstSnowflakeSchema", "MySecondSnowflakeSchema"].

    warehouse

    The name of your virtual warehouse. This property is optional.

    days

    The number of days of the user access history that Collibra Data Lineage collects and processes. For example, if you set the value to 20, Collibra Data Lineage collects the last 20 days of user access history.

    You can use this property to limit data retrieval from the ACCESS_HISTORY table. This property is optional and takes effect only when you use the SQL-API Snowflake ingestion mode.

    Specify a value in the range of 1 - 366. If you do not enter a value, all user access history is collected by default.

    Note A higher value of this property results in Collibra Data Lineage retrieving more data from Snowflake. This might cause a 413 Payload Too Large error when Collibra Data Lineage analyzes the metadata to create the technical lineage.
    customConnectionProperties

    An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.

    Example If you get an OCSP scan error, you can turn OCSP checking off by using the following value: insecureMode=true.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

  2. Save the configuration file.
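
The Snowflake properties described in step 1 might fit together as in the following sketch. The nesting (a general object and a sources array with one object per data source) is an assumption inferred from the section names above, and every value in angle brackets is a placeholder. Because comments are not supported in the configuration file, these assumptions are noted here rather than in the JSON itself. Compare the result with the output of the configuration file generator before you use it.

```json
{
  "general": {
    "catalog": {
      "url": "https://<your-collibra-environment>",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_snowflake_2",
      "type": "DatabaseSnowflake",
      "mode": "SQL-API",
      "auth": {
        "type": "KeyPair",
        "username": "<your-snowflake-username>",
        "pathToPrivateKey": "<path to your private key file>",
        "usePassword": false
      },
      "hostname": "<accountName>.snowflakecomputing.com",
      "databaseNames": ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"],
      "schemaNames": ["MyFirstSnowflakeSchema", "MySecondSnowflakeSchema"],
      "warehouse": "<your-warehouse>",
      "days": 20,
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

This sketch uses the SQL-API mode so that schemaNames and days take effect, and it omits the techlin section, which applies only to US government customers.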

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.
    Properties

    Description
    general

    This section describes the connection information between the lineage harvester and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This applies only to Collibra Cloud for Government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This applies only to Collibra Cloud for Government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This applies only to Collibra Cloud for Government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    url

    The URL of your Collibra Data Intelligence Platform environment.

    Note You can only enter the public URL of your Collibra Data Intelligence Platform environment. Other URLs will not be accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indication whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your SSRS-PBRS <source-ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source-ID> configuration file.
    sources

    This section contains the SSRS connection properties.
    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable.

    Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per SQL Server Reporting Services (SSRS) or Power BI Report Server (PBRS) instance. If you have multiple id properties for a single SSRS or PBRS instance, ingestion will fail. If you have multiple id properties in the configuration file, it means you intend to ingest from multiple unique SSRS or PBRS instances.

    Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.

    type
    The kind of data source. In this case, the value has to be SSRS or PBIRS.

    Note There is no difference between type SSRS or PBIRS.

    url

    The URL to the server's web portal. By default, the URL is http://<computer-name>/reports. For example, "http://1.23.45.678/PowerBIReports".

    username

    The username you use to sign in to the web portal.

    Tip If you use NTLM authentication, your username also contains the NTLM domain name. For example, MyDomain\\username.

    domainId

    The unique ID of the domain in Collibra Data Intelligence Platform in which you want to ingest the assets.

    folderFilter

    This property allows you to include only specific folders that contain reports or KPIs in the ingestion process.

    Important This is a mandatory property and you must provide a value. If you want to ingest all folders, use *, for example: "folderFilter":["*"].

    You can filter on multiple folders by:

    • Specifying folder names.
    • Specifying the full path to folders.
    • Using a wildcard.
    • Using a combination of these approaches. For example: ["folder1", "/database/folder2", "/folder3/*"]

    Tip For more information about connecting to an SSRS or PBRS folder, see the Microsoft documentation.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

  2. Save the configuration file.
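
As a rough illustration, the SSRS and PBRS properties from step 1 could be combined as follows. The overall nesting (a general object and a sources array) is an assumption inferred from the section names, the ID my_ssrs is a hypothetical value, and everything in angle brackets is a placeholder. Comments are not supported in the configuration file, so these caveats are stated here instead of in the JSON.

```json
{
  "general": {
    "catalog": {
      "url": "https://<your-collibra-environment>",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_ssrs",
      "type": "SSRS",
      "url": "http://<computer-name>/reports",
      "username": "MyDomain\\username",
      "domainId": "<ID of the target domain>",
      "folderFilter": ["folder1", "/database/folder2", "/folder3/*"],
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

The backslash in the NTLM username is doubled because a single backslash starts an escape sequence in a JSON string. To ingest all folders, folderFilter would be ["*"], as described above.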

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.
    Properties

    Description
    general

    This section describes the connection information between the lineage harvester and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This applies only to Collibra Cloud for Government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This applies only to Collibra Cloud for Government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This applies only to Collibra Cloud for Government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    url

    The URL of your Collibra Data Intelligence Platform environment.

    Note You can only enter the public URL of your Collibra Data Intelligence Platform environment. Other URLs will not be accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indication whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Tableau <source-ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source-ID> configuration file.
    Note If you set the useCollibraSystemName property to true, but you don't define the system name in the Tableau <source ID> configuration file, the system name in the technical lineage is DEFAULT.
    type

    The kind of data source. In this case, the value has to be Tableau.

    sources

    This section contains the Tableau connection properties.
    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_tableau.

    Warning In the sources section of your lineage harvester configuration file, you can only specify one id property per Tableau server or Tableau online account. If you have multiple id properties for a single Tableau server or Tableau online account, ingestion will fail. If you have multiple id properties in the configuration file, it means you intend to ingest from multiple unique Tableau servers or Tableau online accounts.

    Warning If you are switching between the lineage harvester and Edge, the value of this property must exactly match the value of the Source ID field in your Edge capacity.

    url

    The link to the data in Tableau.

    username

    The username you use to sign in to the Tableau server.

    Warning As of October 2022, Tableau is enforcing multi-factor authentication for Tableau Cloud Admin users. However, the lineage harvester doesn’t support multi-factor authentication. Therefore, Tableau Cloud users with an Admin role must use token-based authentication. This does not affect Tableau Server users or Tableau Cloud users with an Explorer role.

    Important If you want to use token-based authentication, you need to replace username with tokenName. You must specify either username or tokenName; if both exist, then tokenName is used.

    tokenName

    The lineage harvester authentication token.

    Note For token-based authentication, use this property in your lineage harvester configuration file, instead of the username property. If both properties are present, tokenName is used.

    siteIds

    The site IDs of the Tableau sites that you want to include in the ingestion process.

    If you want to ingest the metadata in a Tableau site in a specific domain, specify the following properties:

    Important The site ID is the URL of the site to which you want to sign in. When you manually sign in to Tableau Server or Tableau Online, the site ID is the value that appears after /site/ in the browser address bar. In the following example URLs, the site ID is MarketingTeam:
    • Tableau Server: http://MyServer/#/site/MarketingTeam/projects
    • Tableau Online: https://10ay.online.tableau.com/#/site/MarketingTeam/workbooks

    On Tableau Server, however, the URL of the Default site does not specify the site. For example, the URL for a view named Profits, on a site named Sales, is http://localhost/#/site/sales/views/profits. The URL for this same view on the Default site is http://localhost/#/views/profits. The site name Sales does not figure in the URL. If you can't see the site ID, leave this property empty: "siteIds": [""]

    Example If you want to ingest two Tableau sites "Site 1" and "Site 2", you can enter the following information in the siteIds property: ["site ID of Site 1", "site ID of Site 2"].
    siteNames

    The site names of the corresponding site IDs.

    Important This property is:
    • Optional for Tableau Server
    • Mandatory for Tableau Online.
    Warning If you have Tableau Server and you don't use this property, you must delete it from your configuration file. Don't leave the property in the configuration file without a value.
    restOnly

    Indication whether or not you would like to use both the Tableau REST API and Tableau Metadata API to harvest Tableau metadata.

    • false (default): The lineage harvester will use the REST API and Metadata API to harvest Tableau metadata.
    • true: The lineage harvester will only use the REST API to harvest Tableau metadata.
    Note This property must be set to false, to:
    • Enable technical lineage and the automatic stitching of Column assets to Tableau Data Attribute assets.
    • Harvest owner information for Tableau projects, workbooks and data models.
    domainId

    The unique reference ID of the domain in Collibra Data Intelligence Platform in which you want to ingest the Tableau assets. This property represents the default domain.

    excludeImages

    Optional property for excluding the downloading of images.

    To exclude the downloading of images, set this property to true.

    To indicate the projects that you want to ingest in different domains, specify the filters section in your Tableau <source ID> configuration file.

    Note The maximum number of images that can be uploaded to Collibra per day is determined by the configuration of the file upload service, in Collibra Console. For complete details, see the Upload configuration settings in DGC service configuration: options.

    concurrencyLevel

    This optional property is intended to help if you are experiencing HTTP 401 Unauthorized errors due to too many concurrent HTTP calls that use the same token. It allows you to specify the internal sizing, meaning the number of tasks that can be executed at the same time.

    The default value is 10, meaning as many as 10 HTTP requests can take place in parallel. Consider reducing the value if you are experiencing HTTP 401 Unauthorized errors. Setting the value to 1 effectively disables concurrency, so that HTTP requests run sequentially instead of in parallel.

    Example "concurrencyLevel": 5

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

    paging

    This property allows you to customize the Tableau API pagination settings.

    The default values are sufficient in most cases; however, you can decrease them to help mitigate node limit errors, or increase them to speed up API calls.

    If the integration fails because of timeout errors due to page sizing limits, Collibra Data Lineage automatically adjusts the limits and retries. For example, if failure occurs with worksheetsPageSize set to 100, the value is automatically reduced to 50 and another integration attempt is automatically started. If it fails again, the value is again halved. If integration is still unsuccessful with an adjusted value of 1, an error is thrown and no further attempts are started. If integration is eventually successful, the page size value is restored to its original value, in this example 100, for the next synchronization.

  2. Save the configuration file.
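
A Tableau configuration that uses token-based authentication might look like the following sketch. The nesting (a general object and a sources array), and the placement of type inside the source entry, are assumptions inferred from the property list; values in angle brackets are placeholders. The paging property is left out because its sub-properties are not listed above. Comments are not supported in the configuration file, so the assumptions are flagged here rather than in the JSON.

```json
{
  "general": {
    "catalog": {
      "url": "https://<your-collibra-environment>",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "id": "my_tableau",
      "type": "Tableau",
      "url": "<URL of your Tableau server or Tableau Online>",
      "tokenName": "<name of your authentication token>",
      "siteIds": ["MarketingTeam"],
      "siteNames": ["<site name of MarketingTeam>"],
      "restOnly": false,
      "domainId": "<ID of the default domain>",
      "excludeImages": true,
      "concurrencyLevel": 5
    }
  ]
}
```

The sketch uses tokenName instead of username, which matches the guidance for Tableau Cloud users with an Admin role. On Tableau Server, siteNames is optional and would be deleted entirely rather than left empty.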

Steps

  1. Open the lineage-harvester.conf file that was created when you installed the lineage harvester, and enter the values for each property.

    Properties

    Description
    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    techlin

    This section contains information that is necessary to connect to the Collibra Data Lineage service instance.

    Warning This section applies only to US government customers.

    url

    The URL of the Collibra Data Lineage service instance.

    Example "url": "https://techlin-gov.collibra.com"

    Warning This section applies only to US government customers.

    userKey

    The unique API key to connect to the Collibra Data Lineage service instance.

    A unique user key is needed for each Collibra environment. If you're not sure what your user key is, please contact your Collibra Customer Success Manager.

    Warning This section applies only to US government customers.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. The names are case-sensitive. This is useful when you have multiple databases with the same name.

    sources

    This configuration section contains the required information for SQL files of a data source that were previously downloaded by the lineage harvester and are stored in the lineage harvester output folder.

    type

    The kind of data source. In this case, the value has to be LoadedSource.

    id

    This property is used to identify the batch of harvested metadata on the Collibra Data Lineage service instance. The value can be anything as long as it is unique and human readable. For example, my_loaded_snowflake_source.

    zipFile

    The full path to the ZIP file that was created in the lineage harvester folder.

    dependentSourceIds

    Option to specify data source dependencies for the sharing of database models. Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources. For complete information, go to Sharing database models across data sources.

    If Database2 is dependent on Database1, include the dependentSourceIds property and specify the Source ID of Database1:

    "dependentSourceIds": ["<source ID of Database1>"]

    If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:

    "dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests raw metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether the raw metadata is deleted from the Collibra Data Lineage service instance after the metadata that is targeted for ingestion in Data Catalog is processed.

    The default value is false.

    If the property is set to true, the raw source metadata is deleted after processing. If set to false, it is stored in the Collibra infrastructure.

  2. Save the configuration file.
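
For a previously downloaded source, the configuration could be as small as the following sketch. The nesting (a general object and a sources array) is an assumption inferred from the section names, and values in angle brackets are placeholders. Comments are not supported in the configuration file, so these caveats appear here instead of in the JSON.

```json
{
  "general": {
    "catalog": {
      "url": "https://<your-collibra-environment>",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": [
    {
      "type": "LoadedSource",
      "id": "my_loaded_snowflake_source",
      "zipFile": "<full path to the ZIP file in the lineage harvester folder>",
      "deleteRawMetadataAfterProcessing": false
    }
  ]
}
```

If the ZIP file path contains backslashes, for example on Windows, each backslash must be doubled to remain valid JSON.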

What's next

Run the lineage harvester.