Prepare an external directory folder for the lineage harvester

If you want to create a technical lineage for Informatica PowerCenter, SQL Server Integration Services (SSIS), IBM InfoSphere DataStage, or dbt Core data sources, you have to prepare a folder that contains the data source files of the external directory.

Prerequisites

  • You have IBM InfoSphere Information Server version 11.5 or newer.
  • You have Informatica PowerCenter version 9.6 or newer.
  • You have SQL Server Integration Services 2012 or newer with package format version 6 or newer.
  • You have Microsoft Visual Studio version 2012 or newer.
  • You have downloaded the lineage harvester and your system meets the requirements to run it.
  • You have prepared the physical data layer in Data Catalog.
    Note To stitch the data objects in the source and target data sources in external directories with Data Catalog assets, you first have to register those data sources in Data Catalog.

Steps

dbt Core

  1. Set the profile of your dbt Core to the environment that you want to retrieve the lineage information from.
  2. Take one of the following actions to enable Collibra Data Lineage to access the necessary SQL files and manifest JSON file for creating technical lineage:
    • If the lineage harvester has access to the location where the dbt run command is run, the SQL files and manifest JSON file are in the target/ directory. Specify the path property with the path to the target/ directory in the lineage harvester configuration file.
    • If the lineage harvester does not have access to the location where the dbt run command is run, complete the following steps:
      1. Create a local folder.
      2. In the dbt Core environment, use the dbt compile command to generate SQL files. You can find the compiled SQL files and the manifest JSON file in the target/ directory of your dbt project. For details, go to About dbt compile command and Manifest JSON file in the dbt documentation.
      3. Copy the target/ directory into your local folder and maintain its folder structure. Your local folder must contain all files and subdirectories, such as target/manifest.json and target/models/.
      4. Specify the path property with the path to the target/ directory in the lineage harvester configuration file.
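
The copy in substeps 1 to 3 can be scripted. The following is a minimal sketch, not part of the lineage harvester: the function name copy_dbt_target and both folder arguments are hypothetical, and it assumes Python 3.8 or newer and that dbt compile has already been run in the project.

```python
import shutil
from pathlib import Path


def copy_dbt_target(project_dir, local_folder):
    """Copy <project_dir>/target to <local_folder>/target, keeping its structure."""
    source = Path(project_dir) / "target"
    destination = Path(local_folder) / "target"

    # The manifest JSON file is required next to the compiled SQL files.
    if not (source / "manifest.json").is_file():
        raise FileNotFoundError(f"No manifest.json in {source}; run dbt compile first")

    # copytree keeps the subdirectory layout; dirs_exist_ok allows repeated runs.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return destination
```

The returned directory is the one that the path property in the lineage harvester configuration file would point to. Because repeated runs merge into the existing copy, you may prefer to start from an empty local folder so that no outdated files remain.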

Informatica PowerCenter

  1. Create a local folder.
  2. Export the Informatica objects or repository for which you want to create a technical lineage to the local folder. Make sure to export all objects, parameter files, mappings and sessions at the same time.

    Note 
    • If your folder contains previous versions of the parameter files, objects might be duplicated across different file versions. The duplicated objects cause Collibra Data Lineage to ignore some transformations, resulting in missing lineage and error messages. For example, if a parameter file is exported after a column was added to a table, duplicated objects exist if the previous version of the parameter file remains in the folder. To avoid missing lineage, export all objects and parameter files at the same time.
    • All XML and parameter files (for example, PAR, TXT, or PRM files) in this folder and its subfolders are taken into account when you create a technical lineage. However, Collibra Data Lineage only shows a technical lineage for workflows that have mappings with sources, transformations, and targets. Collibra supports the most common Informatica PowerCenter transformations. For more information, see the Informatica PowerCenter documentation.
    • When you export a workflow, ensure that all dependencies – meaning referenced folders, mappings, shortcuts, and sessions – are included in the same export file. This applies whether you export the XML file manually or by using the command line. Collibra Data Lineage looks for a TASKINSTANCE in the workflows (and in worklets in workflows). The TASKINSTANCE points to the sessions, which are dependent on mappings. If a TASKINSTANCE can’t be found in the workflows or worklets, lineage cannot be extracted.
    • To create a technical lineage, the following tags must be present in your XML file:
      • <REPOSITORY>
      • <FOLDER>
      • <SOURCE> / <TARGET>
      • <SESSION>
      • <MAPPING> (that contains one or more <TRANSFORMATION> tags)
      • <WORKFLOW> (that contains one or more <TASK> tags)
    • If parameters are missing from the parameter files, an UNRESOLVED PARAMETERS analyze error is shown in the analysis results in the Sources tab page. For more information, go to Analyze errors and possible solutions in Technical lineage Sources tab page.

  3. In the local folder, create a folder named techlin-param and put the parameter files in the techlin-param folder.
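
Before you run the lineage harvester, you can check an exported XML file against the tag list in the note above. The following is a rough sketch and not a Collibra tool: the function name missing_tags is hypothetical, <SOURCE> / <TARGET> is interpreted as "at least one of the two" (an assumption), and the check only looks at tag names, not at attributes or at TASKINSTANCE references.

```python
import xml.etree.ElementTree as ET

# Tags from the note above. SOURCE / TARGET is handled separately as
# "at least one of the two", which is an assumption.
REQUIRED_TAGS = ["REPOSITORY", "FOLDER", "SESSION", "MAPPING", "WORKFLOW"]


def missing_tags(xml_path):
    """Return the required tags that do not occur in the exported XML file."""
    root = ET.parse(xml_path).getroot()
    found = {element.tag for element in root.iter()}

    missing = [tag for tag in REQUIRED_TAGS if tag not in found]
    if not ({"SOURCE", "TARGET"} & found):
        missing.append("SOURCE / TARGET")

    # A MAPPING must contain a TRANSFORMATION, and a WORKFLOW a TASK.
    for parent, child in (("MAPPING", "TRANSFORMATION"), ("WORKFLOW", "TASK")):
        parents = list(root.iter(parent))
        if parents and not any(p.find(f".//{child}") is not None for p in parents):
            missing.append(f"{child} inside {parent}")
    return missing
```

An empty list means that all listed tags are present. It does not guarantee that lineage can be extracted, because the dependencies between workflows, sessions, and mappings are not verified.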

SQL Server Integration Services

  1. Create a local folder.
  2. Export the SSIS files for which you want to create a technical lineage.

    Tip You can export them directly from the SQL Server Integration Services repository or via Microsoft Visual Studio. For more information, see the SQL Server Integration Services documentation.
  3. Store the SSIS files in your local folder. Typically, the folder contains the following files:

    • SSIS package files (DTSX), containing the SQL Server Integration Services source code.
    • Connection manager files (CONMGR), containing environment and connection information.
    • Parameter files (PARAMS), if applicable.
    Note 
    • All files in this folder and subfolders are taken into account when you create a technical lineage. The lineage harvester automatically detects data sources in the SSIS files.
    • Not all SSIS files are processed and shown in the technical lineage. The lineage harvester retrieves all of the SSIS package files from the server, but only the files that contain lineage information, meaning those that contain a data flow, or Pipeline, are processed.
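
Because all files in the folder and its subfolders are taken into account, it can help to review what the folder contains before you run the lineage harvester. The following sketch is an illustration only: the function name summarize_ssis_folder is hypothetical, and it classifies files purely by extension without inspecting their content.

```python
from collections import Counter
from pathlib import Path

# File types named in the list above, compared case-insensitively.
EXPECTED_SUFFIXES = {".dtsx", ".conmgr", ".params"}


def summarize_ssis_folder(folder):
    """Count files per lower-cased extension in the folder and its subfolders."""
    counts = Counter(
        path.suffix.lower() for path in Path(folder).rglob("*") if path.is_file()
    )
    # Extensions outside the typical SSIS set, for manual review.
    unexpected = sorted(suffix for suffix in counts if suffix not in EXPECTED_SUFFIXES)
    return counts, unexpected
```

The second return value lists extensions that are not DTSX, CONMGR, or PARAMS, so that you can decide whether those files belong in the folder.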

IBM InfoSphere DataStage

  1. Create a local folder.
  2. Export the DataStage project files (DSX) for which you want to create a technical lineage. If you want to include runtime parameters in parameter set files in the technical lineage, make sure that you export the DataStage files with executables.

    Tip You can export a DataStage project either manually or automatically via the command line.
  3. Store the DataStage files in your local folder.

  4. Optionally, if your DataStage project uses environment variables, manually export the environment files (ENV).

  5. Give each environment file the same name as its DataStage project file. For example, if your project file is named datastage-project-1.dsx, name your environment file datastage-project-1.env.

  6. Store the environment files in the same local folder.

    Important  
    • Collibra Data Lineage only supports DSX and ENV files.
    • You can have one DSX file per DataStage project.
    • You can have more than one DSX file in the local folder.
    • You can have at most one ENV file per DSX file.
    • The DSX file and its ENV file must have the same name.
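
The file rules above can be checked with a short script before you run the lineage harvester. This is a minimal sketch under stated assumptions: the function name check_datastage_folder is hypothetical, only the top level of the local folder is inspected, and extensions are compared case-insensitively.

```python
from pathlib import Path


def check_datastage_folder(folder):
    """Return a list of problems found with the DSX and ENV files in the folder."""
    files = [path for path in Path(folder).iterdir() if path.is_file()]
    dsx_names = {path.stem for path in files if path.suffix.lower() == ".dsx"}
    problems = []

    for path in sorted(files):
        suffix = path.suffix.lower()
        if suffix not in (".dsx", ".env"):
            problems.append(f"{path.name}: only DSX and ENV files are supported")
        elif suffix == ".env" and path.stem not in dsx_names:
            problems.append(f"{path.name}: no DSX file with the same name")
    return problems
```

An empty list means that every file is a DSX or ENV file and that each ENV file has a DSX file with the same name.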

What's next

For DataStage, Informatica PowerCenter, and SQL Server Integration Services, if the external directory files do not have the necessary information, such as a database and a schema, you can provide the connection definitions manually in a <source ID> configuration file. The information is required for stitching the data sources. If you have multiple connections, provide connection definitions for each connection, regardless of whether the useCollibraSystemName property in the lineage harvester configuration file is set to true or false.

If the external directory files have all necessary information, you can complete the rest of the lineage harvester configuration file and run the lineage harvester to create a technical lineage.

When you run the lineage harvester, the content of your local folder is sent to the Collibra Data Lineage service for processing.