The lineage harvester

You use the lineage harvester to collect source code from your data sources and create new relations between data elements from your data source and existing assets in Data Catalog.

The lineage harvester runs close to the data source and can harvest transformation logic like SQL scripts and ETL scripts from a specific location, for example a database table or a folder on a file system.

The lineage harvester connects to different Collibra Data Lineage service instances based on your geographical location and cloud provider. Ensure that you meet the system requirements before you run the lineage harvester. If your location or cloud provider changes, the lineage harvester re-harvests all your data sources.

Note Technical lineage is created by a cloud-based service. You only connect to the cloud via an API call that is triggered by the lineage harvester.

The lineage harvester configuration file

The lineage harvester uses a configuration file to connect to data sources, BI tools and ETL tools. The configuration file contains references to the data sources for which you want to create a technical lineage. You have to prepare the configuration file before you can create a technical lineage. Creating a technical lineage adds new relations of the type "Data Element targets / sources Data Element" between existing assets in Data Catalog, and of the type "Column is target of / is source of Data Attribute" between assets from ingested BI sources and assets in Data Catalog.

The lineage harvester components

The lineage harvester consists of components that harvest the metadata from the data sources specified in your configuration file and send that metadata to the Collibra Data Lineage service.

Using the lineage harvester

If you want to separately process data sources on different servers, you can use more than one lineage harvester connected to a single Collibra Data Intelligence Cloud instance. In this case, you can create a configuration file for the lineage harvester on each server and configure different data sources in each configuration file.

Note 
  • Use multiple configuration files for lineage harvesters on different servers only for testing purposes.
  • You can use different command options and arguments to perform various actions with the lineage harvester.
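
When each server has its own configuration file, it can help to confirm that no data source is configured on more than one server. The following is a minimal sketch of such a check, not part of the lineage harvester itself. The server names, the data source identifiers and the find_overlapping_sources helper are all hypothetical, and the sketch assumes you have already extracted the list of data sources from each configuration file.

```python
from collections import defaultdict


def find_overlapping_sources(configs):
    """Return data sources that are configured on more than one server.

    `configs` is a hypothetical mapping of server name to the list of
    data source identifiers configured for the harvester on that server.
    """
    owners = defaultdict(list)
    for server, sources in configs.items():
        # set() avoids counting a source twice if one file lists it twice.
        for source in set(sources):
            owners[source].append(server)
    # Keep only the sources that more than one server claims.
    return {
        source: sorted(servers)
        for source, servers in owners.items()
        if len(servers) > 1
    }


# Hypothetical example: "etl_scripts" is configured on both servers.
configs = {
    "server-a": ["warehouse_db", "etl_scripts"],
    "server-b": ["reporting_db", "etl_scripts"],
}
overlaps = find_overlapping_sources(configs)
print(overlaps)
```

An empty result means that each data source is configured on exactly one server.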

Permissions

You need a global role with the System Administration global permission, for example Sysadmin. This role must have access to all assets in the data sources in the configuration file and be able to create new relations between these assets.

Typical workflow

You use the lineage harvester to run the full-sync command, which triggers the following actions:

  1. The lineage harvester:
    • Harvests the metadata from the data sources that are specified in the configuration file.
    • Uploads metadata collected from all configured data sources to Collibra Data Lineage’s Metadata Ingest Pipeline.
    • Triggers the Sync Pipeline after all metadata has been completely processed.
  2. The Metadata Ingest Pipeline:
    • Parses the metadata for all lineage assets and relations.
    • Stores the assets and relations in the cloud storage.
  3. The Sync Pipeline:
    • Merges all partial lineages into a single data store.
    • Publishes discovered BI assets to Data Catalog.
    • Matches asset IDs from Data Catalog to the assets discovered from the metadata (stitching).
    • Stores the complete lineage in the cloud storage.
    • Publishes newly discovered relations to Data Catalog.
  4. The Lineage Service:
    • Upon request, creates HTML diagrams of the lineage.
  5. Data Catalog:
    • Connects to the lineage service to get the technical lineage to be shown in the technical lineage viewer.
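
The order of the first three steps above can be modeled as a small in-memory sketch. This is a conceptual illustration only, not the Collibra implementation or its API. The function names, the LineageStore class and the sample data are hypothetical, and the sketch assumes that stitching matches discovered assets to Data Catalog assets by name.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStore:
    """Hypothetical stand-in for the cloud storage used by the pipelines."""
    partial: list = field(default_factory=list)   # one lineage per data source
    complete: list = field(default_factory=list)  # merged and stitched relations


def harvest(sources):
    # Step 1: collect the metadata from each configured data source.
    return [{"source": name, "relations": rels} for name, rels in sources.items()]


def ingest(store, uploads):
    # Step 2: parse the uploaded metadata and store the partial lineages.
    for upload in uploads:
        store.partial.append(upload["relations"])


def sync(store, catalog_ids):
    # Step 3: merge the partial lineages, then stitch them. A relation is
    # kept only if both ends match an existing Data Catalog asset
    # (assumption: assets are matched by name).
    merged = [rel for part in store.partial for rel in part]
    for src, dst in merged:
        if src in catalog_ids and dst in catalog_ids:
            store.complete.append((catalog_ids[src], catalog_ids[dst]))
    return store.complete


# Hypothetical sample data: two data sources and three catalog assets.
sources = {
    "warehouse": [("db.raw.orders", "db.mart.orders")],
    "etl": [("db.mart.orders", "db.mart.report"), ("db.tmp.x", "db.mart.report")],
}
catalog_ids = {
    "db.raw.orders": "id-1",
    "db.mart.orders": "id-2",
    "db.mart.report": "id-3",
}

store = LineageStore()
ingest(store, harvest(sources))
relations = sync(store, catalog_ids)
print(relations)
```

In this sketch, the relation that starts at db.tmp.x is not published because that name has no matching asset ID. This mirrors the point that relations are only created between assets that exist in Data Catalog.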

Note The lineage harvester can create Power BI, Tableau, Looker and other BI tool-specific assets only if you included a reference to the specific BI tool in the configuration file. No other assets are created during the process. Only new relations are created: between existing and newly created BI assets (for example, between two Tableau Data Attribute assets), and between BI column and Column assets (for example, between Power BI Column and Column assets).