Tips for successful lineage synchronization

Important: The key takeaway from this topic is this: when synchronizing multiple data sources via Edge, it's critical that you understand how the synchronization jobs queue works and how you can use the Analyze Only option to ensure that all data sources are synchronized in a single job. In short, if you synchronize multiple data sources without selecting the Analyze Only option in all but the last Edge capability, synchronization fails for all but the first and last data sources.

The process for synchronizing metadata and corresponding assets in Data Catalog varies depending on whether you use the lineage harvester or Edge:

  • When using the lineage harvester, synchronization is triggered via a single CLI command, and all data sources are synchronized as a single job.
  • On Edge, synchronization is triggered separately for each data source, via the Data Catalog UI. If you synchronize multiple data sources and don't use the Analyze Only option in the Edge capability, each data source is synchronized as a separate job. This can lead to failed sync jobs.

Synchronization via Edge

In contrast to the CLI lineage harvester, which uses a single configuration file, Edge requires a separate connection and capability for each data source. Synchronization is triggered for each data source individually, via the Integration Configuration tab in Data Catalog.

The Analyze Only option in the Edge capability determines whether synchronization with Data Catalog begins after metadata is successfully analyzed:

  • If the Analyze Only option is selected, synchronization does not start after analysis. This allows you to synchronize all data sources in a single job.
  • If the Analyze Only option is not selected, synchronization starts immediately after analysis, or is queued if another synchronization job is running.

Keep in mind that the actual synchronization, which starts after successful analysis of a batch of metadata, involves merging all metadata batches, synchronizing with Data Catalog, and creating relations for stitching. This can take a long time. Therefore, the most effective strategy for synchronizing multiple data sources is to analyze the metadata of all data sources concurrently and then, when analysis is complete, synchronize all data sources in a single job. For an example of how to configure this, see Using the Analyze Only option to ensure a single sync job below.

The synchronization jobs queue and processing order

Only one synchronization job can be processed at a time. This means that if you start synchronizations for multiple data sources without selecting the Analyze Only option, metadata from all of the relevant data sources is harvested and analyzed, but while the synchronization job for the first data source is in progress, the remaining jobs are put in a queue.

The synchronization queue is processed according to the Last-in, First-out (LIFO) method. Let's say the synchronization of data source A is in progress, and the synchronization jobs for data sources B, C, D, and E are in the queue. When the synchronization job for data source A is complete, synchronization of data source E begins. The synchronization jobs for data sources B, C, and D are canceled. The result for data source E (either success or failure) is also shown for data sources B, C, and D, despite the fact that the sync jobs for those data sources were canceled.
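The queue behavior described above can be sketched as a small Python model. This is only an illustration of the described outcome, not Collibra code: the function name `simulate_lifo_queue` and the `ran` and `shown` keys are hypothetical, and the outcome of each job is supplied by the caller.

```python
from __future__ import annotations


def simulate_lifo_queue(
    running: str, queued: list[str], outcomes: dict[str, str]
) -> dict[str, dict]:
    """Model which jobs run and which status each data source displays.

    running:  data source whose sync job is already in progress
    queued:   data sources waiting, in the order they were triggered
    outcomes: result of each job that actually runs (hypothetical input)
    """
    report = {running: {"ran": True, "shown": outcomes[running]}}
    if queued:
        # LIFO: the most recently queued job runs next.
        *canceled, last = queued
        report[last] = {"ran": True, "shown": outcomes[last]}
        # Older queued jobs are canceled but display the last job's result.
        for source in canceled:
            report[source] = {"ran": False, "shown": outcomes[last]}
    return report


report = simulate_lifo_queue(
    "A", ["B", "C", "D", "E"], {"A": "success", "E": "failed"}
)
for source in sorted(report):
    print(source, report[source])
```

In this model, only A and E actually run. B, C, and D are reported with E's result even though their jobs never ran, which is why a status shown for those data sources can be misleading.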

Using the Analyze Only option to ensure a single sync job

When synchronizing multiple data sources, use the Analyze Only option in the individual capability templates to ensure that the synchronization of the metadata for all data sources is done in a single job.

Example

Let's say you want to successfully synchronize 5 data sources in the shortest amount of time.

  1. Ensure that the Analyze Only option is selected for data sources A, B, C, and D.
  2. Trigger synchronization for data sources A, B, C, and D.
  3. Ensure that the Analyze Only option is not selected for data source E.
  4. Trigger synchronization for data source E.
    Warning: It's critical that you trigger the synchronization of data source E only after the analysis of data sources A, B, C, and D is complete.

After the metadata of data source E is harvested and analyzed, all 5 data sources are synchronized with Data Catalog in a single job.
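The rule behind this example (select Analyze Only for every data source except the last, and hold the last trigger until the others are analyzed) can be expressed as a short Python sketch. The helper names `plan_analyze_only` and `can_trigger_last` are hypothetical and do not correspond to any Edge or Data Catalog API; the option itself is still set in each capability in the UI.

```python
from __future__ import annotations


def plan_analyze_only(sources: list[str]) -> list[tuple[str, bool]]:
    """Return (source, analyze_only) pairs in trigger order.

    Every data source except the last gets Analyze Only selected, so that
    only the final trigger starts a synchronization job.
    """
    last_index = len(sources) - 1
    return [(source, index < last_index) for index, source in enumerate(sources)]


def can_trigger_last(plan: list[tuple[str, bool]], analyzed: set[str]) -> bool:
    """True once every Analyze Only data source has finished analysis."""
    return all(source in analyzed for source, analyze_only in plan if analyze_only)


plan = plan_analyze_only(["A", "B", "C", "D", "E"])
for source, analyze_only in plan:
    print(source, "Analyze Only selected" if analyze_only else "Analyze Only not selected")

# Data source D is still being analyzed, so E must wait.
print(can_trigger_last(plan, {"A", "B", "C"}))
# All four analyses are complete, so E can be triggered.
print(can_trigger_last(plan, {"A", "B", "C", "D"}))
```

The second helper mirrors the warning in step 4: triggering the last data source too early would start the synchronization job before all analyzed batches are available.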

Synchronization via the lineage harvester

If you create technical lineage via the lineage harvester, you use a single configuration file in which you list all of the data sources from which to ingest metadata. When it comes to synchronizing your data sources, you have a couple of options. The full-sync command starts a process in which metadata is harvested from all data sources and uploaded to the Collibra Data Lineage service instance. The metadata is analyzed and then synchronized with the corresponding assets in Data Catalog.

Note: Running a full-sync via the lineage harvester does not trigger synchronization for any technical lineage capabilities configured for an Edge site. Edge capabilities are synchronized manually or automatically via a schedule.

Another option is to use the sync command. The sync command only synchronizes metadata that is already on the Collibra Data Lineage service instance with the corresponding assets in Data Catalog, meaning there is no new harvesting, uploading, or analyzing of metadata.

Regardless of whether you use sync or full-sync, the synchronization with Data Catalog happens in a single job. This is optimal because synchronization can take a long time.

For a list of the most commonly used command options and arguments, and descriptions of how they work, go to Lineage harvesting app command options and arguments.

Sync or full-sync?

Before running a full-sync, consider whether it makes more sense to use the sync command. The following table describes a few scenarios in which you can consider using the sync command.

Scenario: Add a new data source without re-harvesting from all data sources.

Let's say you run a full-sync to upload metadata from all data sources, process the metadata, and synchronize it with the corresponding assets in Data Catalog. You then decide that you want to add a new data source, but you don't want to harvest all data sources again.

  1. Add the new data source to the lineage harvester configuration file. Let's say that the new data source has the ID "MyNewSource".
  2. Run bin/lineage-harvester load-sources -s MyNewSource, to load the new data source and create the ZIP file.
  3. Run bin/lineage-harvester analyze ${zip_file_from_step_2}, to analyze the new data source on the Collibra Data Lineage service instance.
  4. Run bin/lineage-harvester sync, to synchronize all of the data sources referenced in your configuration file with Data Catalog.
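If you want to script steps 2 through 4, the following Python sketch builds the three commands in order without running them. It uses only the commands shown in the steps above. The function name `add_source_commands` is hypothetical, and the ZIP file path in the usage example is a placeholder: substitute the file that step 2 actually creates.

```python
from __future__ import annotations

import shlex

HARVESTER = "bin/lineage-harvester"


def add_source_commands(source_id: str, zip_file: str) -> list[list[str]]:
    """Build the commands for adding one data source without a full re-harvest."""
    return [
        # Step 2: load only the new data source and create the ZIP file.
        [HARVESTER, "load-sources", "-s", source_id],
        # Step 3: analyze that ZIP file on the Collibra Data Lineage service instance.
        [HARVESTER, "analyze", zip_file],
        # Step 4: synchronize all configured data sources with Data Catalog.
        [HARVESTER, "sync"],
    ]


# "path/to/harvest.zip" is a hypothetical placeholder for the file from step 2.
commands = add_source_commands("MyNewSource", "path/to/harvest.zip")
for argv in commands:
    print(shlex.join(argv))
```

Keeping each command as an argument list makes it straightforward to pass to `subprocess.run` later, once you have confirmed the ZIP file name produced in your environment.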

Scenario: Resolve a failed synchronization caused by a batch analysis failure.

If you run a full-sync and a batch of metadata fails analysis, then:

  • The entire synchronization job fails.
  • Lineage for the relevant data source is lost.
  • Assets in Data Catalog are marked as "Missing from Source".

In that case, if you run the sync command, Collibra Data Lineage uses the last successfully analyzed batch of metadata and completes the synchronization. The assets in Data Catalog might not exactly reflect the current metadata in the data source, but this approach allows you to maintain lineage for the data source while you investigate why the most recent batch of metadata failed analysis.
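The recovery path described above could be automated along the following lines. This Python sketch rests on an assumption that the source does not confirm: that a failed full-sync exits with a non-zero status. Verify this in your environment before relying on it. The names `full_sync_with_fallback`, `run_cli`, and `fake_run` are hypothetical, and the usage example substitutes a fake runner so that it executes without the lineage harvester installed.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Sequence

HARVESTER = "bin/lineage-harvester"


def full_sync_with_fallback(run: Callable[[Sequence[str]], int]) -> str:
    """Try full-sync first; on failure, fall back to sync.

    ASSUMPTION: `run` returns 0 on success and non-zero on failure.
    Returns the name of the command that succeeded, or "failed".
    """
    if run([HARVESTER, "full-sync"]) == 0:
        return "full-sync"
    # Fall back to the last successfully analyzed batch of metadata.
    if run([HARVESTER, "sync"]) == 0:
        return "sync"
    return "failed"


def run_cli(argv: Sequence[str]) -> int:
    """Real runner: execute the command and return its exit status."""
    return subprocess.run(list(argv)).returncode


# Fake runner for illustration: full-sync fails, sync succeeds.
calls = []


def fake_run(argv: Sequence[str]) -> int:
    calls.append(list(argv))
    return 1 if argv[1] == "full-sync" else 0


outcome = full_sync_with_fallback(fake_run)
print(outcome)
```

To use it against the real CLI, pass `run_cli` instead of `fake_run`. A result of "sync" indicates that Data Catalog was synchronized from an older batch and that the failed analysis still needs investigation.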