Prepare an external directory folder

If you want to create a technical lineage for an external directory such as Informatica PowerCenter, SQL Server Integration Services (SSIS) or IBM InfoSphere DataStage, you must prepare a folder with the external directory's data source files.

If the external directory files do not have the necessary information, for example a database and a schema, to stitch the data sources, you can provide the connection definitions manually via a JSON configuration file. This file is optional, unless the useCollibraSystemName in the lineage harvester configuration file is set to true.

Tip 

Note You can also create and configure a JSON file to define a custom technical lineage.

Prerequisites

  • You have IBM InfoSphere Information Server version 11.5 or newer.
  • You have Informatica PowerCenter version 9.6 or newer.
  • You have SQL Server Integration Services 2012 or newer with package format version 6 or newer.
  • You have Microsoft Visual Studio version 2012 or newer.
  • You have downloaded the lineage harvester and you have the necessary system requirements to run it.
  • You have prepared the physical data layer in Data Catalog.
    Note To stitch the data objects in the source and target data sources in external directories with Data Catalog assets, you first have to register those data sources in Data Catalog.

Tip If you want to create a technical lineage for Informatica Intelligent Cloud Services Data Integration, you don't have to create a folder with data source files. You add your data source information directly to the lineage harvester configuration file.

Steps

Important If the useCollibraSystemName in the lineage harvester configuration file is set to true, you must provide an additional configuration file with connection definitions.
  1. Create a local folder.
  2. Export the Informatica objects or repository for which you want to create a technical lineage to the local folder.
  3. Note  All XML and parameter files, for example PAR, TXT or PRM files in this folder and its subfolders are taken into account when you create a technical lineage, but Collibra Data Lineage only shows a technical lineage for workflows that have mappings with sources, transformations and targets. Collibra supports the most common Informatica PowerCenter transformations. For more information, see the Informatica PowerCenter documentation.
  4. Put your parameter files in the right location.

    If...

    Then...

    all parameter files are PAR files

    No action required

    not all parameter files are PAR files

    1. Create a new folder in the local folder.
    2. Name the folder techlin-param.
    3. Move all parameter files that are used by the exported XML to the techlin-param folder.
    4. In the lineage harvester configuration file, set the recursive property to true.
    Note The lineage harvester only takes into account parameter files in the techlin-param folder.
  5. Optionally, create a source ID configuration file with connection definitions and system names:
    Tip If you previously created a technical lineage for Informatica PowerCenter with connection definitions, the connection_definitions.conf file will still be taken into account.
    1. Create a new JSON file in the lineage harvester config folder.
    2. Give the JSON file the same name as the value of the Id property in the lineage harvester configuration file.
      Example The value of the Id property in the lineage harvester configuration file is informatica-source-1. As a result, the name of your JSON file should be informatica-source-1.conf.
    3. For each data source, add the following content to the JSON file:

      Property

      Description

      connectionDefinitions

      This section contains the connection properties to a source in Informatica PowerCenter.

      <connectionName>

      The type of your source or target data source.

      This section contains the connection properties to a source or target in Informatica PowerCenter.

      dbname
      The name of your source or target database.
      schema

      The name of your source or target schema.

      dialect

      The dialect of the referenced database.

      collibraSystemNames

      This section contains the system or server name that is specified in your database and referenced in your connection.

      Note This section is only required when the useCollibraSystemName flag in the lineage harvester configuration file is set to true.

      databases

      This section contains the database information. This is required to connect directly to the system or server of the database.

      dbname
      The name of the database. The database name is the same as the name you entered in the <connectionName> section.
      collibraSystemName

      The system or server name of the database.

      connections

      This section contains the connection information. This is required to reference to the system or server of the connection.

      connectionName

      The name of the connection.

      collibraSystemName

      The system or server name of the connection.

  6. Important If you are using variables in Informatica PowerCenter, add the value of the variable instead of the name in the connection definitions JSON file. For exampple, if the parameter file contains $DBConnection_dwh=DWH_EXPORT then you add the following connection definitions to the JSON file:
    {
    	"DWH_EXPORT":
    
    		{ "dbname": "DWH", "schema": "DBO" }
    }
  7. Add a new section for Informatica PowerCenter to the lineage harvester configuration file.
  1. Create a local folder.
  2. Export the SSIS files for which you want to create a technical lineage.
    Tip You can export them directly from the SQL Server Integration Services repository or via Microsoft Visual Studio. For more information, see the SQL Server Integration Services documentation.
  3. Store the SSIS files to your local folder. Typically, the folder contains the following files:
    • SSIS package files (DTSX), containing the SQL Server Integration Services source code.
    • Connection manager files (CONMGR), containing environment and connection information.
    • Parameter files (PARAMS), if applicable.
    Note All files in this folder and subfolders are taken into account when you create a technical lineage. The lineage harvester automatically detects data sources in the SSIS files.
  4. Optionally, configure the connection definitions:
    Tip If the useCollibraSystemName in the lineage harvester configuration file is set to true, you must provide the connection_definitions.conf file.
    1. Create a new JSON file in the local folder.
    2. Name the JSON file connection_definitions.conf.
    3. For each supported data source, specify the relevant translations.

      Property

      Description

      ConnStringRegExTranslation

      The parent element that opens the connection definitions.

      <regular expression>

      A regular expression that must match one or more connection strings.

      Note 

      Important considerations:

      • By default, the regular expression is not case sensitive. As a consequence, a regular expression can match with connection strings containing uppercase characters or lowercase characters.
      • The connection string is part of the SSIS connection manager.
      • SSIS connection managers are included in an SSIS package files (DTSX) or in connection manager files (CONMGR).
      Example 

      Regular expression: Server=sb-dhub;User ID=SYB_USER2;Initial Catalog=STAGEDB;Port=6306.*
      Explanation: The first section, up to .*, is a literal, but not case-sensitive, match of the characters. The dot (.) can match any single character. The asterisk (*) means zero or more of the previous, in this case any character.
      Match: Any connection string that starts with Server=sb-dhub;User ID=SYB_USER2;Initial Catalog=STAGEDB;Port=6306.
      Example: Server=sb-dhub;User ID=SYB_USER2;Initial Catalog=STAGEDB;Port=6306;Persist Security Info=True;Auto Translate=False;.

      dbname
      The name of your database, to which the data source connection refers.
      schema

      The name of your schema, to which the regular expression refers.

      dialect

      The dialect of the referenced database.

      collibraSystemName

      The name of the referenced data source's system or server.

      This property is only required when you set the useCollibraSystemName property in the lineage harvester configuration file to true. If this property is set to false, you can remove the collibraSystemName property or enter an empty string.

      Note You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

      If the “useCollibraSystemName" property is:

      • false, system or server names in table references in analyzed SQL code are now ignored. This means that a table that exists in two different systems or servers is identified (either correctly or incorrectly) as a single data object, with a single asset full name.

      • true, system or server names in table references are considered to be represented by different System assets in Data Catalog. The value of the "collibraSystemName" field is used as the default system or server name.

  5. Add a section for SQL Server Integration Services to the lineage harvester configuration file.
  1. Create a local folder.
  2. Export the DataStage project files (DSX) for which you want to create a technical lineage.
    Tip You can either export a DataStage project manually or automatically via command line.
  3. Store the DataStage files in your local folder.
  4. Optionally, if your DataStage project uses environment variables, manually export the environment files (ENV).
  5. Give the environment files the same name as the DataStage project files. For example, if your project file is named datastage-project-1.dmx, you have you name your environment file datastage-project-1.env.
  6. Store the environment files in the same local folder.
    Important  
    • The lineage harvester only supports DSX and ENV files.
    • You can have one DSX file per DataStage project.
    • You can have one or none ENV file per DSX file.
    • The name of the DSX file and the ENV file has to be the same.
  7. Optionally, configure the connection definitions:
    1. Create a new JSON file in the local folder.
    2. Name the JSON file connection_definitions.conf.
    3. For each data source, specify the relevant translations:

      Property

      Description

      OdbcDataSources

      Open Database Connectivity data sources in IBM InfoSphere DataStage for which you want to create a technical lineage.

      <data-source-name>

      The ODBC data source name that you use in your DataStage projects.

      This section contains the properties to translate the database, schema and dialect.

      dbname
      The name of your database, to which the ODBC data source connection refers.
      schema

      The name of your schema, to which the ODBC data source connection refers.

      dialect

      The dialect of the referenced database.

      collibraSystemName

      The name of the data source's system or server.

      This property is only required when you set the useCollibraSystemName property in the lineage harvester configuration file to true. If this property is set to false, you can remove the collibraSystemName property or enter an empty string.

      Note You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.

      NonOdbcConnectors

      Other data source connectors in IBM InfoSphere DataStage for which you want to create a technical lineage. For example, DB2, Oracle or Netezza.

      Note This section is optional.

      <data-source-connector-ID>

      The data source username and database of the connector that you use in your DataStage projects. This usually looks like for example admin@database-name. The combination of the username and database name should be unique.

      The following section contains the properties to translate the database, schema and dialect.

      dbname
      The name of your database, to which the data source connection refers.
      schema

      The name of your schema, to which the data source connection refers.

      dialect

      The dialect of the referenced database.

      collibraSystemName

      The name of the data source's system or server.

      This property is only required when you set the useCollibraSystemName property in the lineage harvester configuration file to true. If this property is set to false, you can remove the collibraSystemName property or enter an empty string.

      You must use the same system name as the full name of the System asset that you create when you prepare the physical data layer in Data Catalog. If you don't prepare the physical data layer, Collibra Data Lineage cannot stitch the data objects in your technical lineage to the assets in Data Catalog.
  8. Add a section for IBM InfoSphere DataStage to the lineage harvester configuration file.

What's next

You can now prepare the rest lineage harvester configuration file and run it to create a technical lineage for Informatica PowerCenterSQL Server Integration ServicesIBM InfoSphere DataStage and, optionally, other data sources.

When you run the lineage harvester, the content in your local folder is sent to the Collibra Data Lineage server for processing.

Note For more information about the scope, see the overview of supported data sources.