Prepare the lineage harvester configuration file

Before you can visualize the technical lineage or ingest a BI source, you have to create a configuration file for the (meta)data sources that you want to process. The lineage harvester uses this configuration file to extract data from the (meta)data sources for which you want to create a technical lineage or that you want to ingest.

Tip   If you want to ingest and create a technical lineage for Looker or Power BI, we highly advise you to read the dedicated sections.

Prerequisites

Steps

  1. Run the following command to start the lineage harvester:
    • Windows: .\bin\lineage-harvester.bat
    • Other operating systems: chmod +x bin/lineage-harvester and then bin/lineage-harvester
    An empty configuration file is created in the lineage harvester config folder.
  2. Open the configuration file and enter the values for each property.

    Tip   You can use the configuration file generator to create an example configuration file with the properties of your choosing. You can easily copy this example to your configuration file and replace the values of the properties to match your data source information.

    Properties and descriptions

    general

    This section describes the connection between Collibra Data Lineage and Data Catalog.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    Note   Versions of the lineage harvester older than 1.1.2 show collibra instead of catalog.

    url

    The URL of your Collibra environment.

    Note   You can only enter the public URL of your Collibra environment. Other URLs will not be accepted.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indication whether you want to use the system or server name of a data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    Warning   Unless you have multiple databases with the same name, we highly recommend that you don't change the default value.
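
    As an illustration, the general section could look like the following sketch. The nesting of catalog inside general and all values in angle brackets are assumptions derived from the property descriptions above. The configuration file generator shows the exact layout for your lineage harvester version.

    ```json
    {
      "general": {
        "catalog": {
          "url": "https://<your-collibra-url>",
          "username": "<your-collibra-username>"
        },
        "useCollibraSystemName": false
      }
    }
    ```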

    sources

    This section describes the data sources for which you want to create the technical lineage. You have to create a configuration section for each data source.

    Note   You can add multiple data sources to the same configuration file.

    <SQL directory properties>

    This configuration section contains the required information of one individual SQL directory with connection type "Folder".

    id

    The unique ID of the data source. For example, my_first_data_source.

    type

    The kind of data source. In this case, the value has to be SqlDirectory.

    path

    The full path to the SQL directory.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indication of the files you want to harvest:

    • false (default): Only harvest the files directly under the folder in the SQL directory path.
    • true: Harvest all files under the folder in the SQL directory path and subdirectories.

    dialect

    The dialect of the database.

    database

    The name of your database, which is the full name of your Database asset.

    Note   You have to use the same database name as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
    Important   

    Teradata and MySQL data sources do not have schemas. As a result, Teradata and MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • For Teradata:
      • The database name is the name that you enter in the collibraSystemName property.
      • The schema name is the name that you enter in the database property.
    • For MySQL:
      • The database name is the name that you enter in the database property.
    collibraSystemName

    The name of the data source's system or server. This is also the full name of your System asset.

    Note   You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    schema

    The name of the schema in your data source, which is the name of your Schema asset.

    Note   You must use the same schema name as the name of the Schema asset that you create when you prepare the physical data layer in Data Catalog.
    verbose

    Indication whether you want to enable verbose logging.

    By default, this is set to true. If you don't want to use verbose logging, set it to false.
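
    A minimal sketch of one SQL directory entry, assuming that each data source is a JSON object inside the sources section. The property names come from the descriptions above. The values in angle brackets are placeholders, and the dialect value in particular depends on your data source.

    ```json
    {
      "id": "my_first_data_source",
      "type": "SqlDirectory",
      "path": "<full-path-to-sql-directory>",
      "mask": "*",
      "recursive": false,
      "dialect": "<sql-dialect>",
      "database": "<database-asset-full-name>",
      "collibraSystemName": "<system-asset-full-name>",
      "schema": "<schema-asset-name>",
      "verbose": true
    }
    ```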

    <External directories>

    This configuration section contains the required information to connect to the following data sources:

    • Informatica PowerCenter
    • SQL Server Integration Services (SSIS)
    • IBM InfoSphere DataStage

    Note   Make sure that you have prepared a local folder with the Informatica objects, SSIS files or DataStage files for which you want to create a technical lineage.

    collibraSystemName

    The name of the data source's system or server. If the useCollibraSystemName property is set to true, you must prepare a configuration file to provide the system information.

    id

    The unique ID of your data source. For example, my_informatica.

    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    dirType

    The type of external directory. The value has to be one of the following:

    • infa, for an Informatica PowerCenter data source.
    • ssis, for a SQL Server Integration Services data source.
    • datastage, for an IBM InfoSphere DataStage data source.

    path

    The full path to the folder where you stored the data source.

    mask

    The pattern of the file names in the directory. By default, this is *.

    recursive

    Indication whether you want to use recursive queries.

    By default, this is set to false. If you want to use recursive queries, set it to true.
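
    For example, an Informatica PowerCenter entry could be sketched as follows. This assumes that the entry is one JSON object in the sources section. The path and system name are placeholders, and dirType would be ssis or datastage for the other two data sources.

    ```json
    {
      "collibraSystemName": "<system-name>",
      "id": "my_informatica",
      "type": "ExternalDirectory",
      "dirType": "infa",
      "path": "<full-path-to-folder>",
      "mask": "*",
      "recursive": false
    }
    ```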

    <Informatica Intelligent Cloud Services Data Integration>

    This configuration section contains the required information to enable the lineage harvester to collect and process Data Integration objects.

    Tip   Make sure you have READ permission on all data objects that you want to harvest.

    type

    The kind of data source. In this case, the value has to be IICS.

    id

    The unique ID that is used to identify the data source on the Collibra Data Lineage server. For example, my_data_integration.

    collibraSystemName

    The name of the Informatica server or system. If the useCollibraSystemName property is set to true, you must prepare a configuration file to provide the system information.

    loginURL

    The URL of the Informatica Intelligent Cloud Services environment sign-in page. For example: https://dm-us.informaticaintelligentcloud.com.

    username

    The username you use to sign in to Informatica Intelligent Cloud Services.

    objects

    The objects that you want to export. Each object requires a path and a type, for example:

    "objects": [
        {
            "path": "Sales",
            "type": "Project"
        },
        {
            "path": "Finance/Task_Flows",
            "type": "Folder"
        },
        {
            "path": "Common/Task_Flows/tf_CalendarDimension",
            "type": "Taskflow"
        }
    ]

    The following section provides information to identify and access Data Integration objects.

    Tip   For more information about the objects that you can export and the required information, see the Informatica documentation.

    path

    The full path to the object.

    type

    The type of the object. For example, Taskflow.

    The starting point of the IICS scanner is a Taskflow. Therefore, the only meaningful types to export are Taskflow, Project and Folder.

    Note   The types are not case sensitive.
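
    Put together, a complete Data Integration entry might look like the sketch below. The placement of objects next to the other properties is an assumption, the login URL is the example value given above, and the username and system name are placeholders.

    ```json
    {
      "type": "IICS",
      "id": "my_data_integration",
      "collibraSystemName": "<informatica-system-name>",
      "loginURL": "https://dm-us.informaticaintelligentcloud.com",
      "username": "<your-informatica-username>",
      "objects": [
        {
          "path": "Sales",
          "type": "Project"
        },
        {
          "path": "Common/Task_Flows/tf_CalendarDimension",
          "type": "Taskflow"
        }
      ]
    }
    ```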

    <Custom lineage>

    This section contains the required information to connect to a custom lineage. You create a custom lineage by adding connection properties to a JSON file containing a predefined technical lineage.

    Make sure that you have prepared a local folder with the JSON file that contains the predefined technical lineage.

    Note   You can only create a local folder with one JSON file. However, you can add other files in the harvested directory and subdirectories to which you can refer in the JSON file.

    id

    The unique ID of your custom technical lineage. For example, MyCustomLineage.

    type

    The kind of data source. In this case, the value has to be ExternalDirectory.

    dirType

    The type of external directory. In this case, the value is custom-lineage.

    path

    The full path to the folder where you stored the data source or JSON file.
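
    A custom lineage entry uses the same ExternalDirectory type with a different dirType. The following sketch assumes one JSON object per data source in the sources section. The path is a placeholder.

    ```json
    {
      "id": "MyCustomLineage",
      "type": "ExternalDirectory",
      "dirType": "custom-lineage",
      "path": "<full-path-to-folder-with-json-file>"
    }
    ```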

    <database properties>

    This configuration section contains the required information of one individual data source with connection type "JDBC".

    id

    The unique ID of your data source. For example, my_second_data_source.

    type

    The kind of data source. In this case, the value has to be Database.

    username

    The username that you use to sign in to your data source.

    dialect

    The dialect of the database.

    databaseNames

    The names or IDs of your databases.

    Enter the database names of your data source between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, ["MyFirstDatabase", "MySecondDatabase"].

    Note   You have to use the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog.
    Important   

    Teradata and MySQL data sources do not have schemas. As a result, Teradata and MySQL databases are stored in Data Catalog and technical lineage as Schema assets. The technical lineage Browse tab pane shows the following names:

    • For Teradata:
      • The database name is the name that you enter in the collibraSystemName property.
      • The schema name is the name that you enter in the databaseNames property.
    • For MySQL:
      • The database name is the name that you enter in the databaseNames property.
    connectAsServiceName

    The option to determine whether your Oracle database uses an Oracle service name or SID.

    • true: Connect to an Oracle database that uses an Oracle service name. Enter the service name in the databaseNames property.
    • false: Connect to an Oracle database that uses an SID. Enter the SID in the databaseNames property.

    Note   This property is only valid for Oracle databases. It will be ignored for all other databases.

    hostname

    The name of your database host.

    collibraSystemName

    The name of the data source's system or server. This is also the full name of your System asset in Data Catalog.

    Note   You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    port

    The port number of your database host.

    customConnectionProperties

    An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.

    Note   You can currently only use this property for the following data sources:

    • HiveQL
    • IBM DB2
    • Netezza
    • PostgreSQL
    • Redshift
    • SAP Hana
    • Snowflake
    • Spark SQL
    • Sybase
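
    The sketch below shows what a JDBC entry for an Oracle database that uses a service name could look like. Several details are assumptions: that port is a JSON number, that connectAsServiceName is a JSON boolean, and the port value itself (1521 is the default Oracle listener port and is used only for illustration). Values in angle brackets are placeholders.

    ```json
    {
      "id": "my_second_data_source",
      "type": "Database",
      "username": "<your-database-username>",
      "dialect": "<sql-dialect>",
      "databaseNames": ["<oracle-service-name>"],
      "connectAsServiceName": true,
      "hostname": "<database-hostname>",
      "collibraSystemName": "<system-asset-full-name>",
      "port": 1521
    }
    ```

    For data sources other than Oracle, connectAsServiceName is ignored, so you can leave it out. The same applies to customConnectionProperties when you do not need additional connection parameters.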

    <Google BigQuery database>

    This configuration section contains the required information for a Google BigQuery database.

    id

    The unique ID of your data source. For example, my_third_data_source.

    type

    The kind of data source. In this case, the value has to be DatabaseBigQuery.

    projectIDs

    The IDs of your Google BigQuery projects. You can add multiple projects. For example, [ "first-project", "second-project", "third-project" ].

    Note   You have to use the same project ID as the full name of the Database asset that you create when you prepare the physical data layer in Data Catalog.
    region

    The location of your BigQuery data. This is the region that you specified when you created the dataset.

    You can only add one location as value. However, you can create separate BigQuery entries per location in the configuration file. As a result, you create a complete technical lineage with Google BigQuery data from different locations.

    Note   This property is optional.

    auth

    The path to a JSON file that contains authentication information.

    Tip   For more information about setting up the authentication, see the Google BigQuery user guide.

    collibraSystemName

    The name of the Google BigQuery system. This is also the full name of your System asset in Data Catalog.

    Note   You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
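
    A Google BigQuery entry could be sketched as follows, again assuming one JSON object per data source in the sources section. The project IDs are the example values from above. The region, authentication file path and system name are placeholders, and region can be omitted because it is optional.

    ```json
    {
      "id": "my_third_data_source",
      "type": "DatabaseBigQuery",
      "projectIDs": ["first-project", "second-project"],
      "region": "<bigquery-location>",
      "auth": "<path-to-authentication-json-file>",
      "collibraSystemName": "<system-asset-full-name>"
    }
    ```
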
    <Snowflake database>

    This configuration section contains the required information for a Snowflake database.

    id

    The unique ID of your data source. For example, my_fourth_data_source.

    type

    The kind of data source. In this case, the value has to be DatabaseSnowflake.

    username

    The username that you use to sign in to your data source.

    hostname

    The URL that you use to access the Snowflake web console. For example, <AccountName>.snowflakecomputing.com.

    collibraSystemName

    The name of the Snowflake system. This is also the full name of your System asset in Data Catalog.

    Note   You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
    databaseNames

    The names of your databases.

    Enter the database names of your data source between double quotes ("") and put everything between square brackets. If you want to include more than one database, separate them by a comma. For example, ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"].

    Note   You have to use the same database names as the full names of the Database assets that you create when you prepare the physical data layer in Data Catalog.
    warehouse

    The name of your virtual warehouse.

    Note   This property is optional.

    customConnectionProperties

    An option to enable the lineage harvester to read additional connection parameters. This parameter is only required in very specific situations. If you don't need it, you can remove it from the configuration file.

    Example   If you get an OCSP scan error, you can turn OCSP checking off by using the following value: insecureMode=true.
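
    As a sketch, a Snowflake entry that includes the insecureMode=true value from the example above might look like this. Passing customConnectionProperties as a single string is an assumption, and the values in angle brackets are placeholders. Leave out warehouse and customConnectionProperties if you do not need them.

    ```json
    {
      "id": "my_fourth_data_source",
      "type": "DatabaseSnowflake",
      "username": "<your-snowflake-username>",
      "hostname": "<AccountName>.snowflakecomputing.com",
      "collibraSystemName": "<system-asset-full-name>",
      "databaseNames": ["MyFirstSnowflakeDatabase", "MySecondSnowflakeDatabase"],
      "warehouse": "<virtual-warehouse-name>",
      "customConnectionProperties": "insecureMode=true"
    }
    ```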

    <SQL files in the lineage harvester output folder>

    This configuration section contains the required information for SQL files of a data source that were previously downloaded by the lineage harvester and are stored in the lineage harvester output folder.

    type

    The kind of data source. In this case, the value has to be LoadedSource.

    id

    The unique ID of the data source that you uploaded to the lineage harvester folder. For example, my_loaded_snowflake_source.

    zipFile

    The full path to the ZIP file that was created in the lineage harvester folder.
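
    An entry for previously downloaded SQL files needs only the three properties above. In this sketch, the ZIP file path is a placeholder.

    ```json
    {
      "type": "LoadedSource",
      "id": "my_loaded_snowflake_source",
      "zipFile": "<full-path-to-zip-file>"
    }
    ```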

    <Power BI>

    This configuration section contains the required information for Power BI integration.

    Note   You have to purchase the Power BI connector and lineage feature. Then you need to add the Power BI connection properties to both the lineage harvester configuration file and the Power BI harvester configuration file to ingest Power BI metadata into Data Catalog.

    type

    The kind of data source. In this case, the value has to be ExistingLineage.

    id

    The unique ID of the Power BI metadata you harvested via the Power BI harvester.

    You must use the same ID as the value you used in the Power BI configuration file sourceID property.
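
    The Power BI entry in the lineage harvester configuration file is therefore short. In the sketch below, the id placeholder stands for the value of the sourceID property in your Power BI harvester configuration file.

    ```json
    {
      "type": "ExistingLineage",
      "id": "<power-bi-source-id>"
    }
    ```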

    <Looker>

    This configuration section contains the required information for Looker integration.

    collibraSystemName

    The name of the Looker system or server. If the useCollibraSystemName property is set to true, you must prepare a configuration file to provide the system information.

    id

    The unique ID of your Looker metadata. For example, my_looker.

    Tip   This value can be anything as long as it is unique and human readable. The ID identifies the batch of Looker metadata on the Collibra Data Lineage server.

    type

    The kind of data source. In this case, the value has to be Looker.

    lookerUrl

    The URL to your Looker API.

    Tip   There are two ways to find the Looker API URL:
    • In the API Host URL field in the Looker Admin menu. If this field is empty, you can use the default Looker API URL which you can find in the interactive API documentation.
    • In the interactive API documentation URL. It is the part of the URL before /api-docs/.
    clientId

    The username you use to access the Looker API.

    domainId

    The unique ID of the domain in Collibra Data Intelligence Cloud in which you want to ingest the Looker assets.
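
    A Looker entry could be sketched as follows, assuming one JSON object per data source in the sources section. All values in angle brackets are placeholders for your own Looker and Collibra details.

    ```json
    {
      "collibraSystemName": "<looker-system-name>",
      "id": "my_looker",
      "type": "Looker",
      "lookerUrl": "<looker-api-url>",
      "clientId": "<looker-api-username>",
      "domainId": "<collibra-domain-id>"
    }
    ```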

  3. Save the configuration file.
  4. Start the lineage harvester again and do one of the following:
    • To process data from all data sources in the configuration file, run the following command:
      For Windows:
      .\bin\lineage-harvester.bat full-sync
      For other operating systems:
      ./bin/lineage-harvester full-sync
    • To process data from specific data sources in the configuration file, run the following command:
      For Windows:
      .\bin\lineage-harvester.bat full-sync -s "ID of the data source"
      For other operating systems:
      ./bin/lineage-harvester full-sync -s "ID of the data source"
    The lineage harvester uses the Collibra REST API to send the data source information to a Collibra Data Lineage server, where it is parsed and analyzed. As a result, the technical lineage is created and shown in Data Catalog.
  5. When prompted, enter the passwords to connect to Collibra and your data sources. Do one of the following:
    • Enter the passwords in the console.
      The passwords are encrypted and stored in /config/pwd.conf.
    • Provide the passwords via the command line.
      The passwords are stored locally and not in your lineage harvester folder.

Tip   If the lineage harvester log shows an error message or the harvesting process fails, you can use the technical lineage troubleshooting guide to fix your issue.

What's next?

If you prepared the physical data layer and have the required permissions, you can go to the asset page of a Table, Column, Power BI Column or Looker Look asset from the data source that you added in the configuration file and visualize the technical lineage. The technical lineage shows the data source information of data sources that have been successfully analyzed and processed.

The lineage harvester can also use scheduled jobs to synchronize the data sources at fixed times.

Tip   You can check the progress of the technical lineage creation in Activities. The Results field indicates how many relations were imported into Data Catalog. Go to the status page to see the log files of the SQL analysis.