Technical lineage typical workflow

This topic describes the typical workflow for creating technical lineage, either by running the full-sync command with the lineage harvester or by synchronizing a technical lineage capability on Edge. It also describes the phases involved in that workflow.

Typical workflow

  1. The lineage harvester or technical lineage via Edge:
    • Harvests the metadata from the data sources specified in the lineage harvester configuration file (if you use the lineage harvester) or from the data source configured in the technical lineage capability on Edge.
    • Uploads metadata collected from all configured data sources to Collibra Data Lineage’s Metadata Ingest Pipeline.
    • Triggers the Sync Pipeline after all metadata has been completely processed.
  2. The Metadata Ingest Pipeline:
    • Parses the metadata for all lineage assets and relations.
    • Stores the assets and relations in the cloud storage.
  3. The Sync Pipeline:
    • Merges all partial lineages into a single data store.
    • Publishes discovered BI assets to Data Catalog.
    • Matches asset IDs from Data Catalog to the assets discovered from the metadata (stitching).
    • Stores the complete lineage in the cloud storage.
    • Publishes newly discovered relations to Data Catalog.
  4. The Lineage Service:
    • Upon request, creates HTML diagrams of the lineage.
  5. Data Catalog:
    • Connects to the lineage service to get the technical lineage to be shown in the technical lineage viewer.
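The stitching step of the Sync Pipeline can be pictured with a minimal sketch. It assumes, purely for illustration, that assets are matched on an identical full name; the source does not specify the actual matching rules. The function name stitch, the name format, and the asset IDs below are hypothetical.

```python
def stitch(catalog_ids, discovered):
    """Pair discovered asset names with Data Catalog asset IDs.

    Assumption for this sketch only: a discovered asset matches a catalog
    asset when their full names are identical.
    """
    # Discovered assets that have a counterpart in the catalog.
    stitched = {name: catalog_ids[name] for name in discovered if name in catalog_ids}
    # Discovered assets with no counterpart stay unstitched.
    unmatched = [name for name in discovered if name not in catalog_ids]
    return stitched, unmatched


# Hypothetical full names and asset IDs.
catalog_ids = {
    "db>sales>orders>id": "asset-001",
    "db>sales>orders>total": "asset-002",
}
discovered = ["db>sales>orders>id", "db>sales>staging>tmp"]

stitched, unmatched = stitch(catalog_ids, discovered)
```

In this sketch, only the first discovered asset is stitched, because the second has no asset with the same full name in the catalog.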

Typical workflow phases

This workflow includes three phases: loading, analyzing, and synchronizing. For example, if you have a data source with the source ID SourceA, the following phases occur when you enter full-sync -s SourceA or synchronize the capability for SourceA:

  1. Loading: The lineage harvester or technical lineage via Edge harvests the metadata from the SourceA data source.
  2. Analyzing (Metadata Ingest Pipeline): The lineage harvester or technical lineage via Edge uploads the metadata to a Collibra Data Lineage service instance, and the Collibra Data Lineage service instance parses and processes the metadata.
  3. Synchronizing (Sync Pipeline): The Collibra Data Lineage service instance merges the result from the analyzing phase with any other results from the analyzing phase of other data sources. The merged results are then synchronized with the assets in Data Catalog to create or update the technical lineage.

If step 3 fails, the result from the analyzing phase in step 2 remains saved on the Collibra Data Lineage service instance. However, the analysis result is not synchronized with Data Catalog, and the technical lineage for SourceA is not created or updated. If you then enter full-sync -s or synchronize the capability for another data source, for example, SourceB, and the synchronization succeeds, the analysis results from both SourceA and SourceB are synchronized with the assets in Data Catalog. As a result, the technical lineage for both SourceA and SourceB is created or updated.
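The failure behavior described above can be modeled with a small simulation. This is a toy sketch, not Collibra code: the class LineageService, the function full_sync, and the fail_sync flag are hypothetical names used only to show that analysis results persist across a failed synchronization and are picked up by the next successful one.

```python
class LineageService:
    """Toy model of a Collibra Data Lineage service instance (illustrative only)."""

    def __init__(self):
        self.analyzed = {}   # source ID -> analysis result saved after step 2
        self.synced = set()  # source IDs whose lineage is synchronized

    def analyze(self, source_id, metadata):
        # Step 2: the result is saved regardless of what happens in step 3.
        self.analyzed[source_id] = f"parsed({metadata})"

    def sync(self, fail=False):
        # Step 3: merge all saved analysis results, then synchronize them.
        if fail:
            raise RuntimeError("synchronization failed")
        self.synced = set(self.analyzed)


def full_sync(service, source_id, fail_sync=False):
    metadata = f"metadata-of-{source_id}"  # Step 1: loading
    service.analyze(source_id, metadata)   # Step 2: analyzing
    service.sync(fail=fail_sync)           # Step 3: synchronizing


service = LineageService()

# The first run for SourceA fails in step 3.
try:
    full_sync(service, "SourceA", fail_sync=True)
except RuntimeError:
    pass
after_failed_sync = set(service.synced)

# A later, successful run for SourceB also synchronizes SourceA.
full_sync(service, "SourceB")
after_second_sync = set(service.synced)
```

After the failed run, nothing is synchronized, but the SourceA analysis result is still stored. After the SourceB run, both sources are synchronized.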

You can also run all three phases independently by entering the load-sources, analyze ${name-of-zip-file}, and sync commands separately. For details, go to Lineage harvesting app command options and arguments.
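As a rough sketch, the three separate commands could be assembled for scripting as follows. The launcher path and the zip file name are placeholders that are not given in the source, and the function phase_commands is hypothetical. Each command may need further options, so check the command reference before running anything.

```python
def phase_commands(launcher, zip_file):
    """Build one argument list per phase, in the order the phases run.

    launcher: path to the lineage harvester executable (placeholder).
    zip_file: name of the zip file to pass to the analyze command.
    """
    return [
        [launcher, "load-sources"],       # phase 1: loading
        [launcher, "analyze", zip_file],  # phase 2: analyzing
        [launcher, "sync"],               # phase 3: synchronizing
    ]


# Placeholder values; substitute the real launcher path and zip file name.
commands = phase_commands("path/to/lineage-harvester", "name-of-zip-file.zip")
for command in commands:
    print(" ".join(command))
```

Each argument list could be passed to subprocess.run(command, check=True) so that a failing phase stops the sequence before the next phase starts.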

Note: Collibra Data Lineage can create Power BI, Tableau, Looker, and other BI tool-specific assets only if you included a reference to the specific BI tool in the configuration file. No other assets are created during the process. Beyond those assets, only new relations are created: relations between existing and newly created BI assets (for example, between two Tableau Data Attribute assets) and relations between BI column and Column assets (for example, between Power BI Column and Column assets).