Prepare the lineage harvester configuration file

Before you can visualize the technical lineage or ingest a BI source, you have to create a configuration file for the (meta)data sources that you want to process. The lineage harvester uses this configuration file to extract data from the (meta)data sources for which you want to create a technical lineage or that you want to ingest.

Note 
  • Technical lineage only supports a limited list of (meta)data sources.
  • In all lineage harvester files, you must use UTF-8 or ISO-8859-1 characters, with the exception of SQL files, which can only be UTF-8 encoded.
  • Each data source has an ID property. The ID string must be unique and human readable. Apart from that, it can be any string; it is only used to identify the batch of metadata that is processed on the Collibra Data Lineage service.
  • The lineage harvester connects to different Collibra Data Lineage service instances based on your geographical location and cloud provider. Make sure that you meet the system requirements before you run the lineage harvester. If your location or cloud provider changes, the lineage harvester rescans all your data sources.
  • Technical lineage supports the following means of authentication:
    • For all data sources, except for external directories: username and password.
    • Google BigQuery data sources: username and password or a service account key file. For more information, see the Google BigQuery documentation.
    • Power BI: username and password or service principal.
    • Snowflake: username and password or key pair authentication.
    • Tableau: username and password or token-based authentication.
    • No other authentication methods are supported.
  • The lineage harvester does not support proxy server authentication, but you can manually connect to a proxy server via command line. For more information, see Connecting to a proxy server.
  • Comments in the lineage harvester configuration file are not supported.
  • If you upgrade to lineage harvester 1.3.0 or newer, you have to follow an upgrade procedure.
Tip For complete information on ingesting metadata from the following BI tools and creating a technical lineage, see the dedicated sections:

Before you begin

Requirements and permissions

  • A global role with the following global permissions:
    • Data Stewardship Manager
    • Manage all resources
    • System administration
    • Technical lineage
  • A resource role with the following resource permission on the community level in which you created the BI Data Catalog domain:
    • Asset: add
    • Attribute: add
    • Domain: add
    • Attachment: add
  • Necessary permissions to all database objects that the lineage harvester accesses.
    Tip 

    Some data sources require specific permissions.

    You need read access on the SYS schema.

    You need read access on the SYS schema and the View Definition Permission in your SQL Server.

    You need read access on information_schema:

    • bigquery.datasets.get
    • bigquery.tables.get
    • bigquery.tables.list
    • bigquery.jobs.create
    • bigquery.routines.get
    • bigquery.routines.list

    GRANT SELECT, at table level. Grant this to every table for which you want to create a technical lineage.

    The role of the user that you specify in the username property in lineage harvester configuration file must be the owner of the views in PostgreSQL.

    You need read access on information_schema. Only views that you own are processed.

    SELECT, at table level. Grant this to every table for which you want to create a technical lineage.

    A role with the LOGIN option.

    SELECT WITH GRANT OPTION, at Table level.

    CONNECT ON DATABASE

    You need a role that can access the Snowflake shared read-only database. To access the shared database, the account administrator must grant the IMPORTED PRIVILEGES privilege on the shared database to the user that runs the lineage harvester.

    Tip If the default role in Snowflake does not have the IMPORTED PRIVILEGES privilege, you can use the customConnectionProperties property in the lineage harvester configuration file to assign the appropriate role to the user. For example:
    "customConnectionProperties": "role=METADATA"

    You need read access on the DBC.

    You need read access to the following dictionary views:

    • all_tab_cols
    • all_col_comments
    • all_objects
    • ALL_DB_LINKS
    • all_mviews
    • all_source
    • all_synonyms
    • all_views

    You need read access on definition_schema.

    You need Admin permission on all objects that you want to harvest.

    You have added the Matillion certificate to a Java truststore.

    You have at least a Matillion Enterprise license.

    You need a role with user access to the server from which you want to ingest:

    • You have a system-level role, which is at least a System user role.
    • You have an item-level role, which is at least a Content Manager role.

    You need a role with user access to the relevant server and be able to access the metadata that is stored there.

    Make sure that the lineage harvester can reach Power BI by registering Power BI in Azure and setting the necessary permission to harvest the metadata.

    We highly recommend that you read about supported authentication methods before you register Power BI in Microsoft Azure. For more details, see Register Power BI in Microsoft Azure and set permissions.

    You need the following minimum roles and permissions to harvest Tableau metadata:

    • You have a View permission on Tableau projects, workbooks and data sources you want to ingest.
    • You have a Viewer or Explorer (can publish) role with access to the Tableau REST API.

    For a full ingestion, we recommend the following roles and permissions in Tableau:

    • You have at least a View permission on Tableau projects, workbooks and data sources you want to ingest.
    • You have the Explorer role with the Data Management Add-on.

Steps

  1. Start the lineage harvester to create an empty lineage harvester configuration file by entering the following command:
    • Windows: .\bin\lineage-harvester.bat
    • For other operating systems: chmod +x bin/lineage-harvester and then bin/lineage-harvester
    An empty configuration file is created in the config folder.
  2. Open the configuration file and enter the values for each property.

    Tip You can use the configuration file generator to create an example configuration file with the properties of your choosing. You can easily copy this example to your configuration file and replace the values of the properties to match your data source information.

    Properties

    Description
    general

    This section describes the connection between Collibra lineage and Data Catalog.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note Enter the public URL of your Collibra environment. Other URLs are not accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false.

    Note  For SQL data sources, if this property is:
    • false, system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName property is used as the default system or server name.

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog. Collibra Data Lineage uses the system names to match the structure of databases in Looker to assets in Data Catalog. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Looker <source-ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the Looker <source-ID> configuration file.

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset in Data Catalog during automatic stitching. This is useful when you have multiple databases with the same name or if you want to specify the Power BI workspaces from which you want to ingest.

    By default, this property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Power BI <source ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the Power BI <source-ID> configuration file.

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, this property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your Tableau <source-ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source-ID> configuration file.

    Note If you set the useCollibraSystemName property to true, but you don't define the system name in the Tableau <source ID> configuration file, the system name in the technical lineage is DEFAULT.

    Indicates whether or not you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, this property is set to false. If you want to use it, set it to true.

    Important 
    • If you set this property to true, the lineage harvester reads the value of the collibraSystemName property in your SSRS-PBRS <source-ID> configuration file.
    • If you set the useCollibraSystemName property to false, the lineage harvester ignores the collibraSystemName property in the <source-ID> configuration file.

    By default, this property is set to false. This property is not valid for MicroStrategy integration. We recommend that you leave it set to false.

    Indicates whether or not you intend to use a Matillion <source ID> configuration file to specify the system name of a data source. This is useful if you have multiple databases with the same name, or if you want to group a number of databases under one system.

    By default, this property is set to false.

    If you set this property to true, you must prepare a Matillion <source ID> configuration file.

    Indicates whether or not you want to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer.

    Specify one of the following values:

    false
    The lineage harvester ignores all system or server names that you specify on the collibraSystemName properties in the configuration file. This is the default value.
    true
    The lineage harvester reads the system and server names that you specify on the collibraSystemName properties in all sections of the configuration file.
    Note Specify this property with the value of true only when you have multiple databases with the same name.
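
As an illustration, the general section described above might look as follows. This is a minimal sketch, not a definitive layout: the nesting of catalog inside general and the position of useCollibraSystemName are inferred from the order of the property descriptions, and the URL and username are placeholder values. Because the configuration file does not support comments, these assumptions are stated here rather than in the JSON itself.

```json
{
  "general": {
    "catalog": {
      "url": "https://example.collibra.com",
      "username": "lineage_user"
    },
    "useCollibraSystemName": false
  }
}
```

With useCollibraSystemName left at its default of false, any collibraSystemName values in the sources section are ignored for JDBC data sources, as described above.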
    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.

    Note You can add multiple data sources to the same configuration file.

    <SQL directory properties>

    This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    id

    The unique ID of the data source. For example, my_first_data_source.

    type

    The kind of data source. In this case, the value has to be SqlDirectory.

    path

    The full path to the SQL directory.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indication of the files you want to harvest:

    • false (default): Only harvest the files directly under the folder in the SQL directory path.
    • true: Harvest all files under the folder in the SQL directory path and subdirectories.
    dialect
    The dialect of the database.
    database

    The name of your database, which is the full name of your Database asset.

    Note You have to use the same database name as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
    Important 

    HiveQL, MySQL and Teradata data sources don't have schemas. Therefore, HiveQL, MySQL and Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • For HiveQL and Teradata:
      • The database name is the name that you enter for the collibraSystemName property.
      • The schema name is the name that you enter for the database property.
    • For MySQL:
      • The database name is the name that you enter for the database property.
    externalDbName

    This property can be considered a means of database mapping, to help preserve stitching. It is relevant only for HiveQL, MySQL and Teradata data sources, specifically because they are database-less data sources.

    You can add the key/value pair to the configuration file as follows: "externalDbName": "CDATA"

    collibraSystemName

    The name of the data source's system or server. This is also the full name of your System asset in Data Catalog.

    Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    schema

    The name of the default schema, if not specified in the data source itself. This corresponds to the name of your Schema asset.

    Note You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
    verbose

    Indicates whether you want to enable verbose logging.

    By default, this is set to true. If you don't want to use verbose logging, set it to false.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
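
To make the SQL directory properties concrete, here is a sketch of what one entry in the sources section could look like. All values are hypothetical placeholders. The dialect value snowflake is borrowed from the values listed for Matillion and is only assumed to be accepted here as well. The deleteRawMetadataAfterProcessing property is left out because it cannot be used yet.

```json
{
  "id": "my_first_data_source",
  "type": "SqlDirectory",
  "path": "/data/sql/reporting",
  "mask": "*",
  "recursive": false,
  "dialect": "snowflake",
  "database": "MyFirstDatabase",
  "collibraSystemName": "MySystem",
  "schema": "MySchema",
  "verbose": true
}
```

The database, schema and collibraSystemName values would have to match the full names of the corresponding Database, Schema and System assets in Data Catalog for stitching to work.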
    <External directories>

    This configuration section contains the required information to connect to the following data sources:

    • Informatica PowerCenter
    • SQL Server Integration Services (SSIS)
    • IBM InfoSphere DataStage

    Note Make sure that you have prepared a local folder with the Informatica objects, SSIS files or DataStage files for which you want to create a technical lineage.

    collibraSystemName

    The name of the data source's system or server. If the useCollibraSystemName property is set to true, you must prepare a configuration file to provide the system information.

    id

    The unique ID of your data source. For example, my_informatica.

    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    dirType

    The type of external directory. The value has to be one of the following:

    • infa, for an Informatica PowerCenter data source.
    • ssis, for a SQL Server Integration Services data source.
    • datastage, for an IBM InfoSphere DataStage data source.
    path

    The full path to the folder where you stored the data source.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indicates whether you want to use recursive queries.

    By default, this is set to false. If you want to use recursive queries, set it to true.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
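
The external directory properties above could be combined into a source entry along these lines. This is a sketch for an Informatica PowerCenter folder: the path and system name are hypothetical placeholders, and the property order is arbitrary.

```json
{
  "id": "my_informatica",
  "type": "ExternalDirectory",
  "dirType": "infa",
  "path": "/data/informatica/exports",
  "mask": "*",
  "recursive": false,
  "collibraSystemName": "MySystem"
}
```

For SSIS or DataStage files, the same shape would apply with dirType set to ssis or datastage.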
    <Informatica Intelligent Cloud Services Data Integration>

    This configuration section contains the required information to enable the lineage harvester to collect and process Data Integration objects.

    For a large data source, you can create several Informatica Intelligent Cloud Services <source ID> configuration files to avoid errors that might occur when the lineage harvester ingests metadata from a single, large source. To reduce the size of a source, move some of its projects to a separate source that has its own <source ID> configuration file.

    Tip Make sure you have READ permission on all data objects that you want to harvest.

    type

    The kind of data source. In this case, the value has to be IICS.

    id

    The unique ID that is used to identify the data source on the Collibra Data Lineage service. For example, my_data_integration.

    collibraSystemName

    The name of the Informatica server or system.

    Important You must prepare a <source ID> configuration file to provide this system information. This is true regardless of whether the useCollibraSystemName property is set to true or false.

    loginURL

    The URL of the Informatica Intelligent Cloud Services environment sign-in page. For example: https://dm-us.informaticaintelligentcloud.com.

    username

    The username you use to sign in to Informatica Intelligent Cloud Services.

    objects

    The objects that you want to export. Each object requires a path and a type, for example:

    "objects": [
    	{
    		"path" : "Sales",
    		"type" : "Project"
    	}, 
    	{
    		"path" : "Finance/Task_Flows",
    		"type" : "Folder"
    	},
    	{
    		"path" : "Common/Task_Flows/tf_CalendarDimension",
    		"type" : "Taskflow"
    	}
    ]

    The following section provides information to identify and access Data Integration objects.

    Tip For more information about the objects that you can export and the required information, see the Informatica documentation.

    path

    The full path to the object.

    type

    The type of the object. For example, Taskflow.

    The IICS scanner's starting point is a Taskflow. Therefore, the only meaningful types to export are Taskflow, Project and Folder.

    Note The types are not case sensitive.

    paramFiles

    The full path to the directory in which your parameter files are stored.

    This is an optional parameter that allows you to harvest parameter files in Informatica Intelligent Cloud Services data sources.

    Important The hierarchy of the files in the directory must be an exact match of the hierarchy of the files in your file system.
    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
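
Putting the Data Integration properties together, a source entry might look like the following sketch. The loginURL is the example given above; the username, system name and parameter file directory are hypothetical placeholders, and the objects reuse two of the paths from the earlier example.

```json
{
  "type": "IICS",
  "id": "my_data_integration",
  "collibraSystemName": "MyInformaticaSystem",
  "loginURL": "https://dm-us.informaticaintelligentcloud.com",
  "username": "iics_user",
  "objects": [
    { "path": "Sales", "type": "Project" },
    { "path": "Finance/Task_Flows", "type": "Folder" }
  ],
  "paramFiles": "/data/iics/param_files"
}
```

The paramFiles property is optional and can be dropped if you do not harvest parameter files.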

    <Matillion>

    This section contains the required information for Matillion.

    Tip When you create a new project in Matillion, you define in which group you want to create the project, the project name and the environment name. This information is needed to enable the lineage harvester to access Matillion and scan your metadata.

    Important Currently, you can only create a technical lineage for Snowflake and Redshift projects in Matillion.

    id

    The unique ID that is used to identify the data source on the Collibra Data Lineage service instance. For example, my_matillion_data_integration.

    type

    The kind of data source. In this case, the value has to be Matillion.

    url

    The URL of your Matillion environment. For example, https://<domain name> or https://<IP address>.

    groupName

    The name of your group in Matillion.

    projectName

    The name of your project in Matillion.

    You can only add the name of one project. If you want to create a technical lineage for other projects within the same group, create a new section in the lineage harvester configuration file.

    environmentName

    The name of your environment in Matillion.

    You can only add the name of one environment. If you want to create a technical lineage for other environments within the same project, create a new section in the lineage harvester configuration file.

    dialect

    The dialect of the database.

    You can enter one of the following values:

    • redshift, for an Amazon Redshift data source.
    • snowflake, for a Snowflake data source.
    startTimestamp

    The timestamp of tasks in Matillion. You can use this parameter to limit the amount of metadata that the lineage harvester scans.

    If the startTimestamp field remains empty or is deleted from the configuration file, all accessible tasks are scanned.

    Matillion automatically removes entries older than seven days.

    collibraSystemName

    Regardless of the value set for the useCollibraSystemName property, the following is true:

    • You must include this property in your configuration file.
    • You can leave this property empty.
    • Any value that you give is ignored.

    If the useCollibraSystemName property is set to true, you must prepare a Matillion <source ID> configuration file. In that case, the collibraSystemName property in the <source ID> configuration file is taken into account.

    Note This is a legacy property that will be deprecated in a future release.

    auth

    This section contains the authentication details for signing in to Matillion.

    type

    The authentication method you want to use to sign in to Matillion.

    The value must be either:

    • Basic, for username and password authentication.
    • Token, for token-based authentication.

    Important These values are case-sensitive.

    username

    The username that you use to sign in to Matillion.

    Important This property is only required if you are using the username and password authentication method. If you are using token-based authentication, do not include this property.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
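
As a sketch of how the Matillion properties fit together, a source entry using username and password authentication might look like this. The URL, group, project, environment and username are hypothetical placeholders, and the nesting of type and username inside auth is inferred from the description of the auth section. The startTimestamp property is omitted, so all accessible tasks would be scanned.

```json
{
  "id": "my_matillion_data_integration",
  "type": "Matillion",
  "url": "https://matillion.example.com",
  "groupName": "MyGroup",
  "projectName": "MyProject",
  "environmentName": "MyEnvironment",
  "dialect": "snowflake",
  "collibraSystemName": "",
  "auth": {
    "type": "Basic",
    "username": "matillion_user"
  }
}
```

For token-based authentication, the auth type would be Token and the username property would be left out, as noted above. A second project or environment needs its own entry with a different id.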
    <Custom lineage>

    This section contains the required information to retrieve a custom lineage. To create a custom technical lineage, use this property to locate the JSON file that defines the custom technical lineage.

    The JSON file must be named lineage.json. Ensure that you have prepared a local folder with the lineage.json file.

    Note In the local folder that you need to create, you can only have one JSON file. You can, however, add other files in the harvested directory and subdirectories and refer to those files from within the JSON file.

    collibraSystemName

    The system or server name of the data source. You can use this property to distinguish data objects with the same name.

    id

    The unique ID of your custom technical lineage, for example, MyCustomLineage.

    type

    The kind of data source. The value must be ExternalDirectory.

    dirType

    The type of external directory. The value is custom-lineage.

    path

    The full path to the folder of the JSON file that contains the custom technical lineage definition. There must be only one JSON file that defines the lineage, and the JSON file must be named lineage.json.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
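
For a custom technical lineage, the properties above could be combined as in the following sketch. The path and system name are hypothetical placeholders; the folder that path points to is assumed to contain the lineage.json file.

```json
{
  "id": "MyCustomLineage",
  "type": "ExternalDirectory",
  "dirType": "custom-lineage",
  "path": "/data/custom-lineage",
  "collibraSystemName": "MySystem"
}
```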
    <database properties>

    This configuration section contains the required information of one individual data source with connection type "JDBC".

    id

    The unique ID of your data source. For example, my_second_data_source.

    type

    The kind of data source. In this case, the value has to be Database.

    username

    The username that you use to sign in to your data source.

    dialect

    The dialect of the database.

    databaseNames

    The names or IDs of your databases.

    Enter the database names of your data source between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, ["MyFirstDatabase", "MySecondDatabase"].

    Note Ensure that you use the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog.
    Important 

    HiveQL, MySQL and Teradata data sources don't have schemas. Therefore, HiveQL, MySQL and Teradata databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • For HiveQL and Teradata:
      • The database name is the name that you enter for the collibraSystemName property.
      • The schema name is the name that you enter for the database property.
    • For MySQL:
      • The database name is the name that you enter for the database property.

    Workaround for Oracle

    When ingesting an Oracle data source, the value of the databaseNames property in your configuration file must be either the Oracle SID or service name, depending on whether you set the connectAsServiceName property to true or false. This means that the database in the technical lineage will have the name of the Oracle SID or service name. However, if the database asset in Data Catalog reflects the true name of the database, stitching will break. To resolve this issue and preserve stitching, you need to rename the database asset in Data Catalog to match the value you put for the databaseNames property. This is a known issue that we will fix in a future version of Collibra.

    Tip To avoid this workaround, you can use the "type": "DatabaseOracle" and related properties in your configuration file. That allows you to specify the Oracle database name and preserve stitching in cases where the database name is not the same as the SID or service name.

    externalDbName

    This property can be considered a means of database mapping, to help preserve stitching. It is relevant only for HiveQL, MySQL and Teradata data sources, specifically because they are database-less data sources.

    You can add the key/value pair to the configuration file as follows: "externalDbName": "CDATA"

    hostname

    The name of your database host.

    collibraSystemName

    The name of the data source's system or server. This is also the full name of your System asset in Data Catalog.

    Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    If the useCollibraSystemName property is:

    • false (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName field is used as the default system or server name.

    port

    The port number.

    customConnectionProperties

    An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
    <Oracle>

    This configuration section contains the required information for an Oracle database.

    Tip We recommend the "type": "DatabaseOracle" configuration described in this section, because it allows you to specify the Oracle database name and preserve stitching in cases where the database name is not the same as the SID or service name. You can, however, still use the legacy "type": "Database" configuration to ingest Oracle databases.

    id

    The unique ID of your Oracle database. For example, my_oracle_db.

    type

    The kind of data source. In this case, the value has to be DatabaseOracle.

    hostname

    The name of your database host.

    username

    The username that you use to sign in to your Oracle database.

    port

    The port number.

    sids

    One or more system identifiers (SIDs). A SID is a unique name for an Oracle database instance on a specific host. You can use this property in conjunction with the databaseNames property, to preserve stitching.

    Important You must specify either one or more SIDs via this property, or one or more service names via the serviceNames property. You cannot include both properties in the configuration file.
    serviceNames

    One or more service names. A service name is the TNS alias that you give when you remotely connect to your database. You can use this property in conjunction with the databaseNames property, to preserve stitching.

    Important You must specify either one or more service names via this property, or one or more SIDs via the sids property. You cannot include both properties in the configuration file.
    databaseNames

    The names of one or more Oracle databases. You can use this optional property in conjunction with the sids or serviceNames property, to preserve stitching. The value you specify has to match your Database asset (or assets) in Collibra.

    Enter the Oracle database names between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, ["MyFirstDatabase", "MySecondDatabase"].

    • If you use this property, the database names that you specify have to correlate with the databases that you specify in the sids or serviceNames property.
    • If you don't use this property, the database name in the technical lineage will be the value that you put for the sids or serviceNames property.

    Tip For examples of how to configure this property, see the sids or serviceNames property descriptions and examples.

    collibraSystemName

    The name of the data source's system or server. This is also the full name of your System asset in Data Catalog.

    Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

    If the useCollibraSystemName property is:

    • false (default), system or server names in table references in analyzed SQL code are ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name.
    • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the collibraSystemName field is used as the default system or server name.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
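
    As an illustration of how the Oracle properties above fit together, the following is a minimal sketch of one entry in the sources section. It is not an authoritative template: the values in angle brackets are placeholders, 1521 is only a commonly used Oracle listener port, and writing port as a number and sids as an array of strings is an assumption. The sketch also assumes that each entry in databaseNames corresponds to the SID in the same position. Because comments are not supported in the configuration file, these assumptions are stated here rather than in the fragment.

    ```json
    {
      "id": "my_oracle_db",
      "type": "DatabaseOracle",
      "hostname": "<database-host>",
      "username": "<your-username>",
      "port": 1521,
      "sids": ["<sid-1>", "<sid-2>"],
      "databaseNames": ["MyFirstDatabase", "MySecondDatabase"],
      "collibraSystemName": "<system-asset-full-name>"
    }
    ```

    To connect by service name instead, replace the sids property with serviceNames. The two properties cannot appear together.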
    <Google BigQuery database>

    This configuration section contains the required information for a Google BigQuery database.

    id

    The unique ID of your data source. For example, my_third_data_source.

    type

    The kind of data source. In this case, the value has to be DatabaseBigQuery.

    projectIDs

    The IDs of your Google BigQuery projects. You can add multiple projects. For example, [ "first-project", "second-project", "third-project" ].

    Note You have to use the same project ID as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
    region

    The location of your BigQuery data. This is the region that you specified when you created the dataset.

    You can specify only one location as the value. However, you can create a separate BigQuery entry for each location in the configuration file. This way, you can create a complete technical lineage with Google BigQuery data from different locations.

    Note This property is optional.

    auth

    The path to a JSON file that contains authentication information.

    Tip For more information about setting up the authentication, see the Google BigQuery user guide.

    collibraSystemName

    The name of the Google BigQuery system. This is also the full name of your System asset in Data Catalog.

    Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
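
    The Google BigQuery properties above could be combined as in the following sketch of a single sources entry. The region value and the values in angle brackets are hypothetical placeholders, not values taken from this guide. Replace them with your own dataset location, key file path and System asset name.

    ```json
    {
      "id": "my_third_data_source",
      "type": "DatabaseBigQuery",
      "projectIDs": ["first-project", "second-project"],
      "region": "<dataset-location>",
      "auth": "<path-to-authentication-json-file>",
      "collibraSystemName": "<system-asset-full-name>"
    }
    ```

    Because region accepts a single value, data in a second location would need its own entry with a different id.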
    sources

    This section contains the Snowflake connection properties. If you want to create the technical lineage for multiple data sources, create a sources section for each data source.

    id

    The unique ID that identifies the data source on a Collibra Data Lineage service instance, for example, my_snowflake_2.

    type

    The kind of data source. The value must be DatabaseSnowflake.

    mode

    The Snowflake ingestion method.

    Specify one of the following values:

    SQL
    Collibra Data Lineage uses the SQL mode Snowflake ingestion method to ingest metadata from Snowflake data sources. This is the default value.
    SQL-API
    Collibra Data Lineage uses the SQL-API mode Snowflake ingestion (beta) method to ingest metadata from Snowflake data sources.

    For more information, go to Technical lineage for Snowflake ingestion methods.

    collibraSystemName

    The system or server name of the data source.

    This property is optional. Use this property with the useCollibraSystemName property to override the default Collibra System asset name for this data source.

    Specify this property with the same name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    auth

    This section indicates the authentication details to connect to the Snowflake database.

    Note The username and auth properties are mutually exclusive.

    type

    The authentication method.

    Specify one of the following values. The values are case-sensitive.

    Basic
    The username and password authentication method. Specify the auth.username property if you use this authentication method.
    KeyPair
    The key pair authentication method. Specify the auth.pathToPrivateKey and auth.usePassword properties if you use this authentication method.
    username
    The username that you use to connect to the Snowflake database.
    pathToPrivateKey

    The path to your private key file. This property is required if you use the key pair authentication method.

    Ensure that the private key matches the public key; otherwise, an error occurs indicating that the JWT token is invalid. For more information about the error, go to Snowflake JDBC driver error at login: net.snowflake.client.jdbc.SnowflakeSQLException: JWT token is invalid in Collibra Support Portal.

    usePassword

    The private key file password.

    This property is required if you use the key pair authentication method. Specify one of the following values:

    true
    The password is required.
    false
    The password is not required. This is the default value.
    username

    The username that you use to sign in to your Snowflake data source.

    Note This property is deprecated. Use the auth property instead. The username property and the auth property are mutually exclusive.

    hostname

    The URL that you use to access the Snowflake web console. When you enter the URL, do not include https:// or the trailing slash (/). For example, specify <accountName>.snowflakecomputing.com.

    databaseNames

    The names of your databases.

    Specify this property with the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog.

    Enter the database names of your data source between double quotes ("") and put everything between square brackets ([]). If you want to include more than one database, separate them by a comma, for example, ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"].

    warehouse

    The name of your virtual warehouse. This property is optional.

    customConnectionProperties

    An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.

    Example If you get an OCSP scan error, you can turn OCSP checking off by using the following value: insecureMode=true.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
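
    To show how the nested auth section relates to the other Snowflake properties, here is a hedged sketch of one sources entry that uses key pair authentication. The values in angle brackets are placeholders. Including auth.username alongside the key pair properties is an assumption, as is writing usePassword as a JSON boolean.

    ```json
    {
      "id": "my_snowflake_2",
      "type": "DatabaseSnowflake",
      "mode": "SQL",
      "collibraSystemName": "<system-asset-full-name>",
      "auth": {
        "type": "KeyPair",
        "username": "<your-username>",
        "pathToPrivateKey": "<path-to-private-key-file>",
        "usePassword": false
      },
      "hostname": "<accountName>.snowflakecomputing.com",
      "databaseNames": ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"],
      "warehouse": "<warehouse-name>"
    }
    ```

    Note that the sketch contains no top-level username property, because that property and the auth section are mutually exclusive.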
    <SQL files in the lineage harvester output folder>

    This configuration section contains the required information for the SQL files of a data source that were previously downloaded by the lineage harvester and are stored in the lineage harvester output folder.

    type

    The kind of data source. In this case, the value has to be LoadedSource.

    id

    The unique ID of the data source that you uploaded to the lineage harvester folder. For example, my_loaded_snowflake_source.

    zipFile

    The full path to the ZIP file that was created in the lineage harvester folder.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
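
    A previously harvested source could be referenced with an entry along these lines. This is only a sketch: the ZIP file path is a placeholder for the full path on your own machine.

    ```json
    {
      "type": "LoadedSource",
      "id": "my_loaded_snowflake_source",
      "zipFile": "<full-path-to-zip-file>"
    }
    ```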
    <Tableau>

    This configuration section contains the required information for Tableau integration.

    sources

    This section contains all Tableau connection properties.

    type

    The kind of data source. In this case, the value has to be Tableau.

    id

    The unique ID to identify the Tableau metadata that was uploaded to the Collibra Data Lineage service instance.

    Tip This value can be anything as long as it is unique. The lineage harvester uses the ID to identify a batch of data on the Collibra Data Lineage service instance.

    url

    The link to the data in Tableau.

    username

    The username you use to sign in to the Tableau server.

    Warning As of October 2022, Tableau is enforcing multi-factor authentication for Tableau Cloud Admin users. However, the lineage harvester doesn’t support multi-factor authentication. Therefore, Tableau Cloud users with an Admin role must use token-based authentication. This does not affect Tableau Server users or Tableau Cloud users with an Explorer role.

    Important If you want to use token-based authentication, you need to replace username with tokenName. You must specify either username or tokenName; if both exist, then tokenName is used.

    tokenName

    The lineage harvester authentication token.

    Note For token-based authentication, use this property in your lineage harvester configuration file, instead of the username property. If both properties are present, tokenName is used.

    siteIds

    The site IDs of the Tableau sites that you want to include in the ingestion process.

    Warning Ensure that you specify the correct value. The site ID is the identifier of the site as it appears in the URL of the site to which you want to sign in. When you manually sign in to Tableau Server or Tableau Online, the site ID is the value that appears after /site/ in the browser address bar. In the following example URLs, the site ID is MarketingTeam:
    • Tableau Server: http://MyServer/#/site/MarketingTeam/projects
    • Tableau Online: https://10ay.online.tableau.com/#/site/MarketingTeam/workbooks

    On Tableau Server, however, the URL of the Default site does not specify the site. For example, the URL for a view named Profits, on a site named Sales, is http://localhost/#/site/sales/views/profits. The URL for this same view on the Default site is http://localhost/#/views/profits, which contains no site ID. If the URL does not contain a site ID, leave this property empty: "siteIds": [""]

    Example If you want to ingest two Tableau sites, "Site 1" and "Site 2", you can enter the following information in the siteIds property: ["site ID of Site 1", "site ID of Site 2"].
    siteNames

    The site names of the corresponding site IDs.

    Important This property is:
    • Optional for Tableau Server
    • Mandatory for Tableau Online.
    Warning If you have Tableau Server and you don't use this property, you must delete it from your configuration file. Don't leave the property in the configuration file without a value.
    restOnly

    Indicates whether the lineage harvester uses only the Tableau REST API, or both the Tableau REST API and the Tableau Metadata API, to harvest Tableau metadata.

    • false (default): The lineage harvester will use the REST API and Metadata API to harvest Tableau metadata.
    • true: The lineage harvester will only use the REST API to harvest Tableau metadata.

    Warning If you only allow the lineage harvester to use the Tableau REST API, the harvester won't be able to process the necessary information for the technical lineage and the automatic stitching of Column assets to Tableau Data Attribute assets will not be possible.

    domainId

    The unique reference ID of the domain in Collibra Data Intelligence Cloud in which you want to ingest the Tableau assets.

    excludeImages

    Optional parameter for excluding the downloading of images.

    To exclude the downloading of images, set this property to true.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
    paging

    Optional parameter for customizing the Tableau API pagination settings.
    The default values are sufficient in most cases; however, you can decrease them to help mitigate node limit errors, or increase them to speed up API calls.
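
    The Tableau properties described above might be combined as in the following sketch, which uses token-based authentication. The id value and the values in angle brackets are hypothetical, and writing siteNames as an array that parallels siteIds is an assumption. The paging property is left out because its structure is not described in this section.

    ```json
    {
      "type": "Tableau",
      "id": "my_tableau",
      "url": "<tableau-url>",
      "tokenName": "<token-name>",
      "siteIds": ["MarketingTeam"],
      "siteNames": ["<site-name-for-MarketingTeam>"],
      "restOnly": false,
      "domainId": "<domain-id>",
      "excludeImages": true
    }
    ```

    For username and password authentication, tokenName would be replaced with username. On Tableau Server, siteNames can be removed entirely, but it should not be left in the file without a value.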

    <Power BI (deprecated)>

    This configuration section contains the required information for Power BI integration.

    Note 
    • You have to purchase the Power BI connector and lineage feature. Then you need to add the Power BI connection properties to both the lineage harvester configuration file and the Power BI harvester configuration file to ingest Power BI metadata into Data Catalog.
    • This integration method is deprecated. We will continue to fix issues, but the development of new features and improvements is discontinued.
    type

    The kind of data source. In this case, the value has to be ExistingLineage.

    id

    The unique ID of the Power BI metadata you harvested via the Power BI harvester.

    You must use the same ID as the value you used in the Power BI configuration file sourceID property.

    <Looker>

    This configuration section contains the required information for Looker integration.

    id

    The unique ID of your Looker metadata. For example, my_looker.

    Tip This value can be anything as long as it is unique and human readable. The ID identifies the batch of Looker metadata on the Collibra Data Lineage service.

    type

    The kind of data source. In this case, the value has to be Looker.

    lookerUrl

    The URL to your Looker API.

    Tip There are two ways to find the Looker API URL:
    • In the API Host URL field in the Looker Admin menu. If this field is empty, you can use the default Looker API URL which you can find in the interactive API documentation.
    • In the interactive API documentation URL. It is the part of the URL before /api-docs/.
    clientId

    The username you use to access the Looker API.

    domainId

    The unique ID of the domain in Collibra Data Intelligence Cloud in which you want to ingest the Looker assets.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
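
    For orientation, a Looker entry built from the properties above could look like the following sketch. All values in angle brackets are placeholders for your own Looker API URL, client ID and domain ID.

    ```json
    {
      "id": "my_looker",
      "type": "Looker",
      "lookerUrl": "<looker-api-url>",
      "clientId": "<your-client-id>",
      "domainId": "<domain-id>"
    }
    ```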
    <MicroStrategy>
    This configuration section contains the required information for MicroStrategy integration.
    type

    The kind of data source. In this case, the value has to be MicroStrategy.

    id

    The unique ID of your MicroStrategy metadata. For example, my_microstrategy.

    Tip This value can be anything as long as it is unique and human readable. The ID identifies the batch of MicroStrategy metadata on the Collibra Data Lineage service instance.

    domainId

    The unique reference ID of the domain in Collibra Data Intelligence Cloud in which you want to ingest the MicroStrategy assets.

    username
    The username that you use to sign in to MicroStrategy.
    hostname

    The endpoint that you use to access the PostgreSQL repository or remote data source, depending on where you installed the lineage harvester.

    For example, remote.postgres.com.

    port
    The port number.
    databaseName

    Optionally, the name of your database. For example, poc_metadata.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
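
    The MicroStrategy properties above could be assembled as in this sketch. The values in angle brackets are placeholders. The port value 5432 is only the usual PostgreSQL default, and writing it as a number is an assumption.

    ```json
    {
      "type": "MicroStrategy",
      "id": "my_microstrategy",
      "domainId": "<domain-id>",
      "username": "<your-username>",
      "hostname": "remote.postgres.com",
      "port": 5432,
      "databaseName": "poc_metadata"
    }
    ```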
    <SQL Server Reporting Services and Power BI Report Server>

    This configuration section contains the required information for SQL Server Reporting Services and Power BI Report Server integration.

    id

    The unique ID to identify the SSRS metadata that was uploaded to the Collibra Data Lineage service.

    Tip This value can be anything as long as it is unique. The lineage harvester uses the ID to identify a batch of data on the Collibra Data Lineage service.

    type
    The kind of data source. In this case, the value has to be SSRS or PBIRS.

    Note There is no difference between the SSRS and PBIRS types.

    url

    The URL to the server's web portal. By default, the URL is http://<computer-name>/reports. For example, "http://1.23.45.678/PowerBIReports".

    username

    The username you use to sign in to the web portal.

    Tip If you use NTLM authentication, your username also contains the NTLM domain name. For example MyDomain\\username.

    domainId

    The unique ID of the domain in Collibra Data Intelligence Cloud in which you want to ingest the SSRS assets.

    folderFilter

    An option to limit the ingestion process to specific folders that contain reports or KPIs.

    You can add multiple folders by listing folder names, providing the full path to folders or by using a wildcard:

    • Use folder names when the folder name is unique: ["folder 1", "folder 2"]
    • Use the full path to the folder to only ingest a specific folder: ["/database1/folder1", "/database2/folder2"]
    • Use a wildcard to ingest all child folders of a specific folder: ["/folder1/*", "/folder2/*"]

    You can also use a combination of these methods. For example, ["folder 1", "/database/folder2", "/folder3/*"]

    Important This property must be included in your configuration file and it cannot be empty. If you want to ingest all folders, use *, for example: "folderFilter":["*"].

    Tip For more information about connecting to an SSRS or PBIRS folder, see the Microsoft documentation.

    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
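
    As a sketch of how the SSRS and PBIRS properties above combine, the following entry uses an NTLM-style username and a mixed folderFilter. The id value and the values in angle brackets are hypothetical. The backslash in the username is doubled because a single backslash must be escaped in JSON.

    ```json
    {
      "id": "my_ssrs",
      "type": "SSRS",
      "url": "http://<computer-name>/reports",
      "username": "MyDomain\\username",
      "domainId": "<domain-id>",
      "folderFilter": ["folder 1", "/database/folder2", "/folder3/*"]
    }
    ```

    To ingest every folder, the folderFilter value would be ["*"].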
    <Power BI>

    This configuration section contains the required information for Power BI integration via the lineage harvester.

    type
    The kind of data source. In this case, the value has to be PowerBI.
    id

    The unique ID to identify the Power BI service metadata that was uploaded to the Collibra Data Lineage service instance.

    tenantDomain

    The Power BI tenant domain is the domain associated with the Microsoft Azure tenant.

    This domain is either a default domain or a custom domain. For example, collibrapowerbi.onmicrosoft.com.

    Note Usually, you can find a list of Power BI tenant or server domains in your Azure Active Directory or in the top right menu.

    loginFlow

    This section describes the authentication information for accessing your Power BI metadata.

    The lineage harvester supports two authentication methods: service principal, and username and password. For complete information on your authentication options, see Authentication.

    type

    This depends on the authentication method you use.

    • Service principal: The value should be ServicePrincipal.
    • Username and password: The value should be ResourceOwnerPasswordCredentials.
    applicationId
    The Application (client) ID of your Microsoft Azure application.
    username

    The email address of your Azure Active Directory user.

    Tip This property only applies if you are using the username and password authentication method.

    domainId
    The reference ID of the domain in Collibra in which you want to ingest Power BI metadata.
    deleteRawMetadataAfterProcessing

    The lineage harvester harvests metadata from specified data sources and uploads it in a ZIP file to a Collibra Data Lineage service instance, for processing.

    You can use this optional property to specify whether or not the raw metadata should be deleted after it has been processed.

    The default value is false.

    If the property is set to true, the raw metadata is deleted after processing. If set to false, it is stored in an Amazon S3 bucket.

    Note 
    • Setting this property to true can negatively impact performance.
    • This property is not yet supported by the technical lineage backend, so it can't be used yet. Backend support is coming soon.
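
    The following sketch shows one way the Power BI properties above could be arranged for service principal authentication. It assumes that type and applicationId are nested inside the loginFlow section, which the property order above suggests but does not state outright. The id value and the values in angle brackets are placeholders.

    ```json
    {
      "type": "PowerBI",
      "id": "my_powerbi",
      "tenantDomain": "collibrapowerbi.onmicrosoft.com",
      "loginFlow": {
        "type": "ServicePrincipal",
        "applicationId": "<application-client-id>"
      },
      "domainId": "<domain-id>"
    }
    ```

    For username and password authentication, the loginFlow type would be ResourceOwnerPasswordCredentials, and the username property would be added with the email address of your Azure Active Directory user.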
  3. Save the configuration file.

What's next

Run the lineage harvester. If you encounter errors related to the lineage harvester configuration file when you run it, use the technical lineage troubleshooting guide or the Collibra Support Portal to fix them.