Lineage harvesting app command options and arguments

After creating a configuration file, you can use the lineage harvester to perform specific actions with the data sources that are defined in your configuration file.

Tip If you run the lineage harvester from the command line, you see an overview of the command options and arguments that you can use. If the lineage harvester process fails, you can use the technical lineage troubleshooting guide to fix your issue.

Typical command options and arguments

The following table shows the most commonly used command options and arguments.

Command Description
full-sync

Uploads all of the metadata from the data sources mentioned in your configuration file to the Collibra Data Lineage service, where the metadata is then processed and uploaded to Data Catalog.

-s "<ID of data source>"

Uploads only the metadata from a specified data source. For example, full-sync -s "myOracleDataSource". The specified data source must be mentioned in your configuration file.

This argument allows you to process data from a newly added data source, or to refresh a data source that is already in the configuration file, without refreshing the other data sources. Because you upload only the specified data sources, the upload takes less time. To process multiple data sources, add -s "<ID of another data source>" to the command for each additional data source.

Note You can use this argument multiple times to include multiple data sources.
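As a minimal sketch of repeating the -s argument, the following shell function appends one -s pair per data source ID. The function name full_sync_sources, the DRY_RUN variable, and the ID mySnowflakeDataSource are illustrative assumptions, not part of the lineage harvester. By default the sketch only prints the command so that you can review it first.

```shell
# Sketch: run full-sync for a chosen set of data sources.
# DRY_RUN and full_sync_sources are hypothetical helpers, not harvester features.
DRY_RUN="${DRY_RUN:-1}"

full_sync_sources() (
  count=$#
  # Append "-s <id>" for every ID, then drop the original bare IDs.
  for id in "$@"; do
    set -- "$@" -s "$id"
  done
  shift "$count"
  if [ "$DRY_RUN" = 1 ]; then
    # Print the command for review instead of running it.
    printf '%s\n' "bin/lineage-harvester full-sync $*"
  else
    bin/lineage-harvester full-sync "$@"
  fi
)

full_sync_sources myOracleDataSource mySnowflakeDataSource
```

Setting DRY_RUN=0 runs the command instead of printing it. Each ID must match a data source in your configuration file.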

--no-matching

Uploads a technical lineage without stitching the data objects in your technical lineage to the corresponding Column and Table assets in Data Catalog.

Note As a result, you won't see the technical lineage of a specific Table or Column asset, but you can still see and browse the full technical lineage.

sync

Whereas full-sync ingests metadata onto the Collibra Data Lineage service, processes the metadata, and syncs it with assets in Data Catalog, the sync command performs only the last part: it syncs the metadata, as it exists on the Collibra Data Lineage service, with your assets in Data Catalog.

Tip See the following example for advice on how to use the sync command to add a new data source without re-harvesting all data sources.

Example

Let's say you've run bin/lineage-harvester full-sync to upload metadata from all data sources, process the metadata, and sync it with Data Catalog. You then decide to add a new data source, but you don't want to harvest all data sources again.

  1. Reference the new data source in the lineage harvester configuration file. Let's say that the new data source has the ID "MyNewSource".
  2. Run bin/lineage-harvester load-sources -s MyNewSource, to load the new data source and create the ZIP file.
  3. Run bin/lineage-harvester analyze ${zip_file_from_step_2}, to analyze the new data source on the Collibra Data Lineage service.
  4. Run bin/lineage-harvester sync, to sync the metadata of all data sources referenced in your configuration file with Data Catalog.
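
Steps 2 to 4 can be sketched as a small shell script. The helper names harvester, load_new_source, and analyze_and_sync, and the DRY_RUN variable, are hypothetical. The sketch also assumes that the lineage harvester returns a non-zero exit status when a command fails, which the text above does not confirm. By default the commands are only printed.

```shell
# Sketch of steps 2 to 4 for a newly added data source.
# All helper names and DRY_RUN are hypothetical, not harvester features.
DRY_RUN="${DRY_RUN:-1}"

harvester() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "bin/lineage-harvester $*"
  else
    bin/lineage-harvester "$@"
  fi
}

load_new_source() {
  # Step 2: load the new data source and create its ZIP file.
  harvester load-sources -s "$1"
}

analyze_and_sync() {
  # Step 3: analyze the ZIP file on the Collibra Data Lineage service.
  # Step 4: sync with Data Catalog, only if step 3 reported success
  # (assumes a non-zero exit status on failure).
  harvester analyze "$1" && harvester sync
}

load_new_source "MyNewSource"
# After step 2, pass the ZIP file it created in the output folder:
# analyze_and_sync "${zip_file_from_step_2}"
```

The two functions are kept separate because the name of the ZIP file is known only after step 2 has finished.
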
-s "<ID of data source>"

Syncs only the metadata from a specified data source, as it exists on the Collibra Data Lineage service. For example, sync -s "myOracleDataSource". The specified data source must be mentioned in your configuration file.

This argument allows you to sync data from one data source without refreshing the other data sources. You must have previously uploaded the metadata to the Collibra Data Lineage service.

Warning Only the sources you specify are synced. This means that any previously ingested metadata in Data Catalog from non-specified sources is deleted, along with its existing technical lineage. If this is not your intention, consider using full-sync -s. With full-sync -s, all sources are synced, regardless of which sources are specified by the -s argument. Therefore, any previously ingested metadata from non-specified data sources remains, as do the respective technical lineages.

Note You can use this argument multiple times to include multiple data sources.
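
Because sync -s removes the metadata of every source that you leave out, it can help to list those sources before you run the command. The following sketch compares the IDs you plan to pass with -s against the IDs in your configuration file. The function name omitted_sources and the example IDs are hypothetical. The sketch does not read the configuration file or call the lineage harvester, so you supply both lists yourself.

```shell
# Sketch: list configured source IDs that a "sync -s ..." run would leave out.
# omitted_sources is a hypothetical helper. Pass all configured IDs, then a
# literal "--", then the IDs you intend to pass with -s.
omitted_sources() (
  configured=""
  # Collect the configured IDs, one per line, up to the "--" separator.
  while [ $# -gt 0 ] && [ "$1" != "--" ]; do
    configured="$configured
$1"
    shift
  done
  [ $# -gt 0 ] && shift   # drop the "--" separator
  # The remaining positional parameters are the IDs selected with -s.
  printf '%s\n' "$configured" | while IFS= read -r id; do
    [ -n "$id" ] || continue
    keep=0
    for selected in "$@"; do
      [ "$id" = "$selected" ] && keep=1
    done
    [ "$keep" = 1 ] || printf '%s\n' "$id"
  done
)

omitted_sources myOracleDataSource MyNewSource -- MyNewSource
```

If the function prints any IDs that you want to keep in Data Catalog, full-sync -s is the safer choice, as the warning explains.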

analyze ${name-of-zip-file}

Analyzes a specified batch (ZIP file) of metadata on the Collibra Data Lineage service instance.

The Sources tab page shows the transformation details or source code that was analyzed and the results of the analysis.

load-sources

Downloads all of your data sources to the lineage harvester output folder, as a separate ZIP file per data source.

-s "<ID of data source>"

Downloads only the data source with a specific ID. For example, load-sources -s "myOracleDataSource".

Note You can use this argument multiple times to include multiple data sources.

list-sources

Lists all of the data sources that will be used to create a technical lineage. The list includes data sources that were ingested by the lineage harvester and by technical lineage via Edge.

ignore-source <source_id>

Excludes the specified data source from the list of data sources that will be used to create the technical lineage, where <source_id> is the ID of the data source that you want to ignore.

You can specify only one source ID, and the source ID must not contain any spaces. If your source ID includes spaces, you can use the lineage harvester pre-release version 2023.04-0-4 or newer as a workaround. With this version, you can enclose a source ID that contains spaces in double or single quotation marks, for example ignore-source "Source A".

When you synchronize the technical lineage again, the specified data source is ignored. For details, go to Delete the technical lineage of a data source if you use the lineage harvester and Delete the technical lineage of a data source on Edge for technical lineage via Edge.
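
The rules for the source ID can be sketched as a small check that runs before you call ignore-source. The function name ignore_source_command is hypothetical, and the function only prints the command instead of running it.

```shell
# Sketch: build an ignore-source command and flag IDs that contain spaces.
# ignore_source_command is a hypothetical helper; it only prints the command.
ignore_source_command() {
  if [ $# -ne 1 ]; then
    echo "ignore-source accepts exactly one source ID" >&2
    return 1
  fi
  case "$1" in
    *" "*)
      # IDs with spaces must be quoted, which requires a newer harvester.
      echo "note: IDs with spaces need lineage harvester pre-release 2023.04-0-4 or newer" >&2
      printf 'bin/lineage-harvester ignore-source "%s"\n' "$1"
      ;;
    *)
      printf 'bin/lineage-harvester ignore-source %s\n' "$1"
      ;;
  esac
}

ignore_source_command "Source A"
```

The note is written to standard error, so the printed command on standard output can be copied or captured on its own.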

cat passwords.json | ./bin/lineage-harvester <command-like-full-sync> --passwords-stdin

Provides the passwords for your Collibra Data Intelligence Cloud instance and for the data sources in your configuration file to the lineage harvester, without storing the passwords in the lineage harvester folder.

You can replace cat passwords.json with a command that outputs a string generated by your password manager.
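
As a rough sketch of this pattern, the following function pipes the output of any command into the lineage harvester. The names HARVESTER and run_with_passwords are hypothetical. The command that prints the passwords depends on your password manager and is not specified here.

```shell
# Sketch: feed passwords to the lineage harvester from another command's output.
# HARVESTER and run_with_passwords are hypothetical names, not harvester features.
HARVESTER="${HARVESTER:-./bin/lineage-harvester}"

run_with_passwords() {
  harvester_command="$1"   # for example, full-sync
  shift                    # the remaining words are the command that prints the passwords
  "$@" | "$HARVESTER" "$harvester_command" --passwords-stdin
}
```

For example, run_with_passwords full-sync cat passwords.json is equivalent to the pipeline shown above. Replacing cat passwords.json with your password manager's command keeps the passwords off disk.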

test-connection

Checks the connectivity to the Collibra Data Lineage service instance and to Data Catalog. The logs will also show the IP addresses of the Collibra Data Lineage service instances that you have to allow.

This command is mostly used for troubleshooting purposes.

--help

Shows an overview of all supported command options and arguments that you can use in the lineage harvester.

--version

Shows the version of the lineage harvester that you are using.

-Dlineage-harvester.log.dir=path/to/log/dir

Sets the path of the directory in which the log file is stored.