Lineage harvesting app command options and arguments

Warning: The lineage harvester is now deprecated and will officially reach its end-of-life on July 31, 2026. To ensure a smooth transition, we encourage you to begin creating technical lineage via Edge, if you haven't already.

After creating a configuration file, you can use the lineage harvester to perform specific actions with the data sources that are defined in your configuration file.

Tip: If you run the lineage harvester from the command line, you see an overview of the command options and arguments that you can use.

Note: Running `full-sync` via the lineage harvester (deprecated) does not trigger synchronization for any technical lineage capabilities configured for an Edge site. Edge capabilities are synchronized manually or automatically via a schedule.

Typical command options and arguments

The following table shows the most commonly used command options and arguments. To see the full list, run the lineage harvester with the `--help` option. Commands that are not listed in this table are intended for internal use.

Command	Description
`full-sync`	Harvests all of the metadata from the data sources mentioned in your configuration file and uploads it to the Collibra Data Lineage service instance, where the metadata is analyzed and then synchronized with the corresponding assets in Data Catalog. For more information, see The synchronization process for technical lineage. When you enter this command, the lineage harvester starts the synchronization process and shows the total number of ingested data sources. Synchronization fails if (a) the lineage harvester does not find any data sources, or (b) the `useSystemName` value is not the same for all data sources. The value of `useSystemName` is based on two settings: the `useCollibraSystemName` property in the lineage harvester configuration file for the different data sources, and the Collibra system name setting on Edge. Tip: Before running the `full-sync` command, consider whether it makes more sense to run the `sync` command. The `full-sync` command harvests metadata, uploads it, analyzes it, and synchronizes it with the corresponding assets in Data Catalog. The `sync` command only synchronizes metadata that is already on the Collibra Data Lineage service instance; it does not harvest, upload, or analyze any new metadata. Example scenario in which the `sync` command is helpful: Because `full-sync` includes analysis, synchronization also fails if a batch of metadata fails analysis. As a result, lineage for the relevant data source is lost and the assets in Data Catalog are marked as "Missing from Source". If you run the `sync` command instead, Collibra Data Lineage uses the last successfully analyzed batch of metadata and completes the synchronization. The assets in Data Catalog then don't perfectly reflect the current metadata in the data source, but you maintain lineage for the data source while you investigate why the most recent batch of metadata failed analysis.
`-s "<ID of data source>"`	When used with `full-sync`, harvests metadata from only the specified data source and uploads it to the Collibra Data Lineage service instance. For example, `full-sync -s "myOracleDataSource"`. The specified data source must be mentioned in your configuration file. Because the other data sources mentioned in your configuration file are not harvested, this reduces the time needed to upload metadata from the specified data source. You can use this argument multiple times to include multiple data sources. Important: When the metadata from the specified data source has been uploaded to the Collibra Data Lineage service instance and successfully analyzed, the last successful batches of metadata for all data sources listed in your configuration file are merged and synchronized with Data Catalog. Therefore, the amount of time required for a `full-sync` and a `full-sync -s "<ID of data source>"` is about the same. Tip: If you only want to harvest metadata from a specified data source and analyze it, without triggering a synchronization with Data Catalog: (1) Run `bin/lineage-harvester load-sources -s MySource` to load the data source and create the ZIP file. (2) Run `bin/lineage-harvester analyze ${zip_file_from_step_1}` to analyze the metadata on the Collibra Data Lineage service instance. However, this does not update the technical lineage graph with the latest information from the data source. The only way to reflect the current state of the data source in the technical lineage is to run a `full-sync`.
`--no-matching`	Uploads a technical lineage without stitching the data objects in your technical lineage to the corresponding Column and Table assets in Data Catalog. Note: As a result, you won't see the technical lineage of a specific Table or Column asset, but you can still see and browse the full technical lineage.
`sync`	Whereas `full-sync` ingests metadata onto the Collibra Data Lineage service, processes the metadata, and syncs it with assets in Data Catalog, the `sync` command performs only the last step: it syncs metadata that already exists on the Collibra Data Lineage service instance with the corresponding assets in Data Catalog. For more information, see The synchronization process for technical lineage. When you enter this command, the lineage harvester starts the synchronization process and shows the total number of ingested data sources. Synchronization fails if (a) the lineage harvester does not find any data sources, or (b) the `useSystemName` value is not the same for all data sources. The value of `useSystemName` is based on two settings: the `useCollibraSystemName` property in the lineage harvester configuration file for the different data sources, and the Collibra system name setting on Edge. Example: how to use the `sync` command to add a new data source without re-harvesting all data sources. Let's say you've run `bin/lineage-harvester full-sync` to upload metadata from all data sources, process it, and sync it with Data Catalog. You then decide to add a new data source without harvesting all data sources again. (1) Reference the new data source in the lineage harvester configuration file. Let's say that the new data source has the ID "MyNewSource". (2) Run `bin/lineage-harvester load-sources -s MyNewSource` to load the new data source and create the ZIP file. (3) Run `bin/lineage-harvester analyze ${zip_file_from_step_2}` to analyze the new data source on the Collibra Data Lineage service. (4) Run `bin/lineage-harvester sync` to sync all of the data sources referenced in your configuration file with Data Catalog.
`-s "<ID of data source>"`	When used with `sync`, syncs only the metadata from the specified data source that is already on the Collibra Data Lineage service. For example, `sync -s "myOracleDataSource"`. The specified data source must be mentioned in your configuration file, and you must have previously uploaded its metadata to the Collibra Data Lineage service. This allows you to sync data from one data source without refreshing the other data sources. Warning: Only the sources you specify are synced. Any previously ingested metadata in Data Catalog from non-specified sources is deleted, along with its existing technical lineage. If this is not your intention, consider using `full-sync -s`. With `full-sync -s`, all sources are synced, regardless of which sources are specified by the `-s` argument. Therefore, any previously ingested metadata from non-specified data sources remains, as do the respective technical lineages. Note: You can use this argument multiple times to include multiple data sources.
`analyze ${name-of-zip-file}`	Analyzes a specified batch (ZIP file) of metadata on the Collibra Data Lineage service instance. The Sources tab page shows the transformation details or source code that was analyzed and the results of the analysis.
`load-sources`	Downloads the SQL files of a data source that is stored locally and cannot be accessed via the network. The lineage harvester then stores the data source information in a ZIP file.
`-s <ID of data source>`	When used with `load-sources`, downloads only the SQL files of the data source with the specified ID. For example, `load-sources -s "myOracleDataSource"`. Note: You can use this argument multiple times to include multiple data sources.
`list-sources`	Lists all of the data sources that will be used to create a technical lineage. When you enter this command, up to 500 data sources are listed per page by default. Each data source is listed in the following format: `Source ID <ID of data source> (from edge: false|true) (useSystemName: false|true)`. `Source ID <ID of data source>`: The source ID of your data source. `from edge: false|true`: Indicates whether the data source is ingested by using technical lineage via Edge (`true`) or by using the lineage harvester (`false`). `useSystemName: false|true`: Indicates whether Collibra Data Lineage uses the system or server name of the data source to match the System asset in Data Catalog (`true`) or not (`false`). The value of `useSystemName` is based on two settings: the `useCollibraSystemName` property in the lineage harvester configuration file for the data source, and the Collibra system name setting for the data source on Edge. Example: `Source ID 1redshift (from edge: false) (useSystemName: false)` indicates that the data source with the `1redshift` source ID was ingested by using the lineage harvester, and that the system name of the data source is not used to match the System asset in Data Catalog.
`-p <page number>`	When used with `list-sources`, specifies the page to be displayed. The value of `<page number>` must be greater than 0. This option is not required. For example, if you enter `list-sources -p 2`, page 2 is displayed with the default page size of 500 data sources. If there are fewer than 500 data sources in total, an error message is issued. Note: To use the `-p`, `-s`, and `-all` options, you must have the lineage harvester version 2023.05 or newer.
`-s <number of data sources>`	When used with `list-sources`, specifies the number of data sources to be listed on one page. The value of `<number of data sources>` must be in the range 0 - 500. This option is not required. For example, if you enter `list-sources -s 40`, the default page 1 is displayed with 40 data sources listed. If there are 80 data sources in total, you see the "Displaying page 1 of 2" message and a list of 40 data sources. If you enter `list-sources -p 3 -s 20`, page 3 is displayed with 20 data sources listed. If there are 80 data sources, you see the "Displaying page 3 of 4" message and a list of 20 data sources. Note: To use the `-p`, `-s`, and `-all` options, you must have the lineage harvester version 2023.05 or newer.
`-all`	When used with `list-sources`, lists all data sources without dividing them into pages. If you enter this option together with the `-p` and `-s` options, it overrides them. For example, if you enter `list-sources -p 3 -s 20 --all`, all data sources are listed. Note: To use the `-p`, `-s`, and `-all` options, you must have the lineage harvester version 2023.05 or newer.
`ignore-source <source_id>`	Excludes the specified data source from the list of data sources that are used to create the technical lineage, where `<source_id>` is the ID of the data source that you want to ignore. The next time you create the technical lineage, by entering the `sync` command or by synchronizing a technical lineage capability via Edge, the specified data source is ignored. You can specify only one source ID at a time. If your source ID includes spaces, enclose it in double or single quotation marks, for example `ignore-source "Source A"`. You can use this command to delete the technical lineage of a data source by using the lineage harvester. For details, go to Delete the technical lineage of a data source. Note: To use the `ignore-source` command, you must have the lineage harvester version 2023.04 or newer.
`cat passwords.json | ./bin/lineage-harvester <command-like-full-sync> --passwords-stdin`	Provides the passwords of your Collibra Platform instance and of the data sources in your configuration file to the lineage harvester via standard input, without storing the passwords in the lineage harvester folder. You can replace `cat passwords.json` with a string generated by your password manager.
`test-connection`	Checks the connectivity to the Collibra Data Lineage service instance and to Data Catalog. The logs will also show the IP addresses of the Collibra Data Lineage service instances that you have to allow. This command is mostly used for troubleshooting purposes.
`--help`	Shows an overview of all supported command options and arguments that you can use in the lineage harvester.
`--version`	Shows the version of the lineage harvester that you are using.
`set JAVA_OPTS="-Dlineage-harvester.log.dir=path/to/log/dir"`	Determines the path of the directory for the log file.
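
The `sync` row above describes how to add a new data source without re-harvesting the others. The following sketch only prints the three harvester commands from that example in order, so that you can review them before running them yourself. The `HARVESTER` variable and the `print_add_source_plan` function are hypothetical names used for illustration, and the ZIP file argument is a placeholder for whatever file `load-sources` creates in your environment.

```shell
#!/usr/bin/env bash
# Sketch: print the commands for adding one data source without
# re-harvesting the others. HARVESTER and print_add_source_plan are
# hypothetical names; nothing is executed against the harvester here.

HARVESTER="${HARVESTER:-bin/lineage-harvester}"   # path to the harvester launcher

# print_add_source_plan <source ID> <ZIP file created by load-sources>
print_add_source_plan() {
  local source_id="$1" zip_file="$2"
  # 1. Load only the new data source and create the ZIP file.
  printf '%s load-sources -s "%s"\n' "$HARVESTER" "$source_id"
  # 2. Analyze that batch on the Collibra Data Lineage service.
  printf '%s analyze "%s"\n' "$HARVESTER" "$zip_file"
  # 3. Sync the data sources in the configuration file with Data Catalog.
  printf '%s sync\n' "$HARVESTER"
}

# Placeholder arguments: replace them with your source ID and ZIP file.
print_add_source_plan "MyNewSource" "<ZIP file from load-sources>"
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere. It also reflects that the ZIP file name is only known after `load-sources` has finished.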
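
The page counts quoted in the `list-sources` examples (80 data sources at 40 per page give 2 pages, and at 20 per page give 4 pages) are consistent with ceiling division. The following sketch reproduces that arithmetic. It is not the lineage harvester's own code: `page_count` and `page_label` are hypothetical helper names, and the handling of a page size of 0 is an assumption, because the table does not describe that case.

```shell
#!/usr/bin/env bash
# Sketch of the paging arithmetic implied by the list-sources examples.
# page_count and page_label are hypothetical helpers, not harvester commands.

# page_count <total sources> <sources per page>: ceiling division.
page_count() {
  local total="$1" size="$2"
  # Assumption: a page size below 1 is rejected to avoid dividing by zero.
  if [ "$size" -lt 1 ]; then
    echo "page size must be at least 1 in this sketch" >&2
    return 1
  fi
  echo $(( (total + size - 1) / size ))
}

# page_label <total sources> <page number> <sources per page>
# Prints the "Displaying page X of Y" text, or fails if the page does not exist.
page_label() {
  local total="$1" page="$2" size="$3" pages
  pages="$(page_count "$total" "$size")" || return 1
  if [ "$page" -lt 1 ] || [ "$page" -gt "$pages" ]; then
    echo "page $page does not exist (number of pages: $pages)" >&2
    return 1
  fi
  echo "Displaying page $page of $pages"
}

page_label 80 1 40   # corresponds to: list-sources -s 40
page_label 80 3 20   # corresponds to: list-sources -p 3 -s 20
```

With 80 data sources, the two calls print "Displaying page 1 of 2" and "Displaying page 3 of 4", matching the messages in the table. Requesting page 2 with the default page size of 500 when there are fewer than 500 data sources fails in this sketch, which mirrors the error described for `-p`.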