Tips for successful lineage synchronization

Important The key takeaway from this topic: when you synchronize multiple data sources via Edge, it's critical that you understand how to use the Analyze option in the Processing Level setting, so that all data sources are synchronized in a single job. If you synchronize multiple data sources without selecting the Analyze option, synchronization will likely fail for all but the first and last data sources.

The process for synchronizing metadata and corresponding assets in Data Catalog varies depending on whether you use the CLI lineage harvester or Edge:

  • When using the CLI lineage harvester, synchronization is triggered via a single CLI command, and all data sources are synchronized as a single job.
  • On Edge, if you synchronize multiple data sources and don't select the Analyze option in the Edge capability, each data source will be synchronized as a separate job. This is highly inefficient and will likely lead to failed sync jobs.

Synchronization via Edge

In contrast to the CLI lineage harvester and its single configuration file, with Edge, each data source requires its own Edge connection and capability, and synchronization is triggered for each data source via the Integration Configuration tab in Data Catalog.

The mandatory Processing Level setting in the technical lineage Edge capabilities determines when synchronization begins.

Processing level option Description
Load

Harvest metadata from the data source and upload it to your Collibra environment. This allows you to inspect and, if necessary, edit the harvested metadata before uploading it to the Collibra Data Lineage service instance for analysis.

Neither analysis nor synchronization starts after the metadata is uploaded.

Analyze

Metadata is loaded and analyzed on the Collibra Data Lineage service instance.

Synchronization does not start after analysis; it starts only after either:

  • You trigger synchronization of another data source for which you specify "Sync" in the Processing Level drop-down list.
  • You configure the Technical Lineage Admin Edge capability, and trigger synchronization via the Sync option in the Integration Configuration tab in Data Catalog.

Important If you want to synchronize multiple data sources, we strongly recommend that you select this option in the respective Edge capabilities for each of your data sources. This allows you to synchronize all data sources in a single job, thereby maximizing efficiency and mitigating the risk of failed synchronization jobs.

Sync

Load, analyze, and synchronize metadata from all data sources. Synchronization starts – or is queued, if another synchronization job is running – immediately after analysis.

Important  If you want to synchronize multiple data sources and you select this option, each data source is processed as a separate job. This is highly inefficient and will likely lead to failed sync jobs.

Tip Keep in mind that the actual synchronization job involves merging all metadata batches, synchronizing with Data Catalog, and creating relations for stitching. This can take a long time. With this in mind, the most effective strategy for synchronizing multiple data sources is to analyze the metadata of all data sources concurrently. When analysis is complete, synchronize all data sources in a single job. There are two ways you can set this up. For examples, see Using the Analyze option to ensure a single sync job below.
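
As a rough mental model of the Processing Level options, the following Python sketch records which phases each level runs and lists the data sources whose run would start (or queue) its own sync job. The `ProcessingLevel` enum, the `PHASES` mapping, and both function names are illustrative assumptions; they are not part of any Collibra API.

```python
from enum import Enum


class ProcessingLevel(Enum):
    LOAD = "Load"
    ANALYZE = "Analyze"
    SYNC = "Sync"


# Phases each level runs, following the option descriptions (illustrative model only).
PHASES = {
    ProcessingLevel.LOAD: ("load",),
    ProcessingLevel.ANALYZE: ("load", "analyze"),
    ProcessingLevel.SYNC: ("load", "analyze", "synchronize"),
}


def sync_jobs_started(levels):
    """Return the data sources whose run would start or queue its own sync job."""
    return [name for name, level in levels.items()
            if "synchronize" in PHASES[level]]


def is_single_job_setup(levels):
    """True if at most one capability is set to Sync, so at most one sync job starts."""
    return len(sync_jobs_started(levels)) <= 1


# Every capability set to Sync: five separate sync jobs.
risky = {name: ProcessingLevel.SYNC for name in "ABCDE"}
# Four capabilities set to Analyze, one set to Sync: a single sync job.
recommended = {**{name: ProcessingLevel.ANALYZE for name in "ABCD"},
               "E": ProcessingLevel.SYNC}

print(sync_jobs_started(risky))        # ['A', 'B', 'C', 'D', 'E']
print(sync_jobs_started(recommended))  # ['E']
```

A setup in which every capability uses Analyze also passes `is_single_job_setup`; in that case no sync job starts until one is triggered separately.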

The synchronization jobs queue and processing order

Only one synchronization job can be processed at a time. This means that if you start synchronizations for multiple data sources without selecting the Analyze option, metadata from the relevant data sources is harvested and analyzed, but while the synchronization job for the first data source is in progress, the rest are put in a queue.

The synchronization queue is processed according to the Last-in, First-out (LIFO) method. Let's say the synchronization of data source A is in progress, and the synchronization jobs for data sources B, C, D, and E are in the queue. When the synchronization job for data source A is complete, synchronization of data source E will begin. The synchronization jobs for data sources B, C, and D are canceled.

Important The result for data source E – either success or failure – will also be shown for data sources B, C, and D, despite the fact that the sync jobs for those data sources were canceled.
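
The queue behavior described above can be modeled with a few lines of Python. This is only a simulation of the description in this topic, not Collibra code; `process_queue` and the `outcome_of` callback are hypothetical names introduced for illustration.

```python
def process_queue(running, queued, outcome_of):
    """Model the sync queue described above.

    running:    data source whose sync job is in progress.
    queued:     data sources waiting, in the order their jobs were queued.
    outcome_of: hypothetical callable returning "success" or "failure"
                for a job that actually runs.
    Returns {data source: (what happened to its job, result shown)}.
    """
    report = {running: ("ran", outcome_of(running))}
    if queued:
        # Only the most recently queued job runs; the others are canceled.
        *canceled, last = queued
        last_result = outcome_of(last)
        report[last] = ("ran", last_result)
        for name in canceled:
            # Canceled jobs display the result of the job that did run.
            report[name] = ("canceled", last_result)
    return report


report = process_queue("A", ["B", "C", "D", "E"], lambda name: "success")
for name in sorted(report):
    print(name, *report[name])
# A ran success
# B canceled success
# C canceled success
# D canceled success
# E ran success
```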

Using the Analyze option to ensure a single sync job

When synchronizing multiple data sources, use the Analyze option in the Processing Level setting of the individual capability templates, to ensure that the synchronization of the metadata for all data sources is done in a single job.

Two recommended ways to trigger synchronization of multiple data sources

Example 1: Trigger the sync job for the last in a succession of data sources

Let's say you want to synchronize 5 data sources.

  1. In the respective Edge capabilities for data sources A, B, C, and D, set the Processing Level to Analyze.
  2. Trigger synchronization for data sources A, B, C, and D. Let's say you do this on weekdays.
  3. In the Edge capability for data source E, set the Processing Level to Sync.
  4. On the weekend, trigger the synchronization of data source E.
    Warning It's critical that you only trigger the synchronization of data source E after the analysis of data sources A, B, C, and D is complete.

After the metadata of data source E is harvested and analyzed, all 5 data sources are synchronized with Data Catalog, in a single job.

Example 2: Trigger the sync job using the Technical Lineage Admin capability

Let's say you want to synchronize 5 data sources.

  1. In the respective Edge capabilities for all 5 data sources, set the Processing Level to Analyze.
  2. Create a Technical Lineage Admin Edge connection and add a Technical Lineage Admin Edge capability.
    For step-by-step instructions on how to do this, go to Data lineage admin options.
  3. In Data Catalog, in the Integration Configuration tab, select and run the Sync option.

All 5 data sources are synchronized with Data Catalog, in a single job.

Synchronization via the lineage harvester

If you create technical lineage via the lineage harvester, you use a single configuration file in which you list all of the data sources from which to ingest metadata. When it comes to synchronizing your data sources, you have a couple of options. The full-sync command starts a process in which metadata is harvested from all data sources and uploaded to the Collibra Data Lineage service instance. The metadata is analyzed and then synchronized with the corresponding assets in Data Catalog.

Note Running a full-sync via the lineage harvester does not trigger synchronization for any technical lineage capabilities configured for an Edge site. Synchronization of Edge capabilities is triggered manually or automated via a schedule.

Another option is to use the sync command. The sync command only synchronizes metadata that is already on the Collibra Data Lineage service instance with the corresponding assets in Data Catalog, meaning there is no new harvesting, uploading, or analyzing of metadata.

Regardless of whether you use sync or full-sync, the synchronization with Data Catalog happens in a single job. This is optimal because synchronization can take a long time.

For a list of the most commonly used command options and arguments, and descriptions of how they work, go to Lineage harvesting app command options and arguments.

Sync or full-sync?

Before running a full-sync, consider whether it makes more sense to use the sync command. The following table describes a few scenarios in which you can consider using the sync command.

Scenario Details
Add a new data source without re-harvesting from all data sources.

Let's say you run a full-sync to upload metadata from all data sources, process the metadata, and synchronize it with the corresponding assets in Data Catalog. You then decide that you want to add a new data source, but you don't want to harvest all data sources again.

  1. Add the new data source to the lineage harvester configuration file. Let's say that the new data source has the ID "MyNewSource".
  2. Run bin/lineage-harvester load-sources -s MyNewSource, to load the new data source and create the ZIP file.
  3. Run bin/lineage-harvester analyze ${zip_file_from_step_2}, to analyze the new data source on the Collibra Data Lineage service instance.
  4. Run bin/lineage-harvester sync, to synchronize all of the data sources referenced in your configuration file with Data Catalog.
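
The three commands in steps 2 to 4 can be scripted. The following Python sketch only assembles and prints them in order; it does not run the lineage harvester. The function name `add_source_commands` is hypothetical, and the ZIP path is a placeholder: use the file that load-sources actually creates.

```python
import shlex

HARVESTER = "bin/lineage-harvester"


def add_source_commands(source_id, zip_file):
    """Return the commands from steps 2 to 4 as argument lists, in order."""
    return [
        [HARVESTER, "load-sources", "-s", source_id],  # step 2: load the new source
        [HARVESTER, "analyze", zip_file],              # step 3: analyze the ZIP file
        [HARVESTER, "sync"],                           # step 4: synchronize in one job
    ]


# Placeholder path: replace it with the ZIP file produced in step 2.
commands = add_source_commands("MyNewSource", "path/to/zip_file_from_step_2.zip")
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps instead of printing them, each argument list could be passed to `subprocess.run(cmd, check=True)`, which raises an error and stops the sequence if a step fails.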

Resolve failed synchronization due to batch analysis failure.

If you run a full-sync and a batch of metadata fails analysis, then:

  • The entire synchronization job fails.
  • Lineage for the relevant data source is lost.
  • Assets in Data Catalog are marked as "Missing from Source".

In that case, if you run the sync command, Collibra Data Lineage uses the last successfully analyzed batch of metadata and completes the synchronization. The assets in Data Catalog might not exactly reflect the current metadata in the data source, but this approach allows you to maintain lineage for the data source while you investigate why the most recent batch of metadata failed analysis.
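
The fallback described above amounts to choosing the most recent batch that passed analysis. A minimal sketch, assuming a hypothetical batch history stored as (batch ID, analysis succeeded) pairs, oldest first; the real service does not necessarily store batches this way.

```python
def batch_used_by_sync(batches):
    """Return the most recent batch that passed analysis, or None if none did.

    batches: list of (batch_id, analysis_ok) tuples, oldest first
             (hypothetical data model, for illustration only).
    """
    for batch_id, analysis_ok in reversed(batches):
        if analysis_ok:
            return batch_id
    return None


# The newest batch failed analysis, so the previous one is used.
history = [("batch-1", True), ("batch-2", True), ("batch-3", False)]
print(batch_used_by_sync(history))  # batch-2
```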