Create a technical lineage via the lineage harvester

This topic describes the general steps on how to use the lineage harvester to create a technical lineage.

Note For information on the differences between creating a technical lineage via the lineage harvester and Edge, go to Collibra Data Lineage.

Select a data source, to show the relevant integration steps.

Currently, information is shown for:

Requirements and permissions

  • Collibra Data Intelligence Platform.
  • You have purchased Collibra Data Lineage.
  • A global role with the following global permissions:
    • Catalog, for example Catalog Author
    • Data Stewardship Manager
    • Manage all resources
    • System administration
    • Technical lineage
  • A resource role with the following resource permissions on the community level in which you created the domain:
    • Asset: add
    • Attribute: add
    • Domain: add
    • Attachment: add
Important  Amazon RedshiftAzure SQL serverAzure Synapse AnalyticsGreenplumHiveIBM Db2PostgreSQLMicrosoft SQL ServerMySQLNetezzaSAP HANATeradata requirements:
Ensure that you meet the Azure Data Factory-specific permissions described in Set up Azure Data Factory.
  • As a technical lineage user, ensure that your Catalog Author global role has the following global permissions. With these permissions, Collibra Data Lineage can process the lineage and synchronize the results to Data Catalog to create technical lineage.
    • Catalog > Advanced Data Type > Add
    • Catalog > Advanced Data Type > Remove
    • Catalog > Advanced Data Type > Update
    • Catalog > Technical lineage
  • As a Data Catalog user, ensure that your Edge integration engineer global role has the following global permissions. With these permissions, you can create connections and capabilities on Edge, configure the integration, and synchronize the integration.
    • Manage connections and capabilities
    • View Edge connections and capabilities
  • As a Google Dataplex user, ensure that you have the following permissions and roles. Use your service account when you create a GCP connection in step 1. You can choose to create table level lineage or column level lineage (beta) for Dataplex when you synchronize the capability in step 3.
    • To create table level lineage: 
      • Enable the Data Lineage API in Dataplex for the projects that you want to harvest lineage from.
        For more information, go to Data Lineage API in Google Cloud documentation.
      • The Data Lineage Viewer role.
      • The BigQuery Admin role if you want Collibra Data Lineage to collect lineage from stored procedures that you created and those created by other Dataplex users.
      • The bigquery.jobs.get permission.
        For more information, go to IAM basic and predefined roles reference in the Google Cloud documentation.
    • To create column level lineage (beta):
      • Enable the exportMetadata API, which is in Preview. Contact your Google Technical Account Manager to enable the API.
      • For the GCS bucket:
        • The storage.objects.create permission in the Storage Object Creator (roles/storage.objectCreator) role.
        • The storage.objects.list and storage.objects.get permissions in the Storage Object Viewer (roles/storage.objectViewer) role.
      • For creating Export: 
        • The datacatalog.entries.exportAll permission in the Data Catalog admin role (roles/datacatalog.admin) role.
        • Alternatively, grant roles/datacatalog.metadataExporter to your Service Account or the User.
      • The bigquery.jobs.get permission to get SQL transformation code.
        For more information, go to IAM basic and predefined roles reference in the Google Cloud documentation.
    • When you synchronize technical lineage for Google Dataplex in step 3, you can add Project IDs that you want to harvest lineage from. If you want to have Project IDs available for selection when you add Project IDs, ensure that the service account has the resourcemanager.projects.get permission to GCP Projects where Dataplex is enabled. If the service account does not have this permission, you can enter the Project IDs manually on the Synchronization configuration page.
  • You need read access on information_schema. Only views that you own are processed.
    You need read access on the SYS schema.
    If you are using the lineage harvester, you need read access on information_schema:
    • bigquery.datasets.get
    • bigquery.tables.get
    • bigquery.tables.list
    • bigquery.jobs.create
    • bigquery.routines.get
    • bigquery.routines.list

    If you are using Edge, you also need:

    • resourcemanager.projects.get
    • bigquery.readsessions.create
    • bigquery.readsessions.getData
    • SELECT, at table level. Grant this to every table for which you want to create a technical lineage.
    • SHOW VIEW, at table level. Grant this to every table for which you want to create a technical lineage.
    • Read access to the SYS schema or the tables in the schema.
    • VIEW DEFINITION on all relevant views and procedures.
    Ensure that your service account token has the Read-Only permission.
    Ensure that you have the permission to copy the target/ directory, which is generated by running the dbt compile command, to a local folder.

    You need Monitoring role permissions.

    To create technical lineage from calculated views in SAP HANA Classic on-premises, you need the following permissions: 

    • SELECT on the following views:
      • _SYS_REPO.ACTIVE_OBJECT
      • _SYS_REPO.ACTIVE_OBJECTCROSSREF
      • SYS.OBJECT_DEPENDENCIES
    • The CATALOG READ system privilege

    To create technical lineage from calculated views in SAP HANA Cloud/Advanced, you need the following permission: 

    • The CATALOG READ system privilege

    A role with the LOGIN option.
    SELECT WITH GRANT OPTION, at Table level.
    CONNECT ON DATABASE
    You need read access on the SYS schema and the View Definition Permission in your SQL Server.
    You need read access on definition_schema.
    • GRANT SELECT, at table level. Grant this to every table for which you want to create a technical lineage.
    • The role of the user that you specify in the username property in lineage harvester configuration file must be the owner of the views in PostgreSQL.
    You need read access on the DBC.
    You need read access to the following dictionary views:
    • all_tab_cols
    • all_col_comments
    • all_objects
    • ALL_DB_LINKS
    • all_mviews
    • all_source
    • all_synonyms
    • all_views
    Note By default, the lineage harvester queries the all_source table to retrieve Package bodies. However, this requires the EXECUTE privilege. As an alternative, you can direct the harvester to query the dba_source table, which requires the SELECT_CATALOG_ROLE role. To do so, you need to:
    • If via Edge: Replace all_source by dba_source in the Other Queries field in your Edge capability.
    • If via the CLI lineage harvester: Replace all_source by dba_source in the file ./sql/oracle/queries.sql, which is included in the ZIP file when you download the lineage harvester.
    • Your user role must have privileges to export assets.
    • You must have read permission on all assets that you want to export.
    The following permissions are required, regardless of the ingestion mode: SQL or SQL-API.
    • Ensure that the Snowflake user has the appropriate allowed host list. For details, go to Allowing Hostnames in Snowflake documentation.
    • You need a role that can access the Snowflake shared read-only database. To access the shared database, the account administrator the account administrator must grant the OBJECT_VIEWER and GOVERNANCE_VIEWER database role on the shared database to the user that runs the lineage harvester. The username of the user must be specified in the JDBC connection that you use to access Snowflake.
    Before you start the Power BI integration process, you have to perform a number of tasks in Power BI and Microsoft Azure. These tasks, which are performed outside of Collibra, are needed to enable the lineage harvester to reach your Power BI application and collect its metadata. For complete information, go to Set up Power BI.

    Collibra Data Lineage supports:

    • Power BI on the Microsoft Power Platform.
    • Power BI on Fabric.
    The configuration requirements and the integration are the same, regardless of your setup.

    Before you start the Tableau integration process, you have to perform a number of tasks in Tableau. For complete information, go to the following topics:

    You need the following roles, with user access to the server from which you want to ingest:

    • A system-level role that is at least a System user role.
    • An item-level role that is at least a Content Manager role.

    We recommend that you use SQL Server 2019 Reporting Services or newer. We can't guarantee that older versions will work.

    Before you start the Looker integration process, you need to set up Looker.

    • You need the following Admin API permissions:
      1. The first call we make to MicroStrategy is to authenticate. We connect to:
        <MSTR URL>:<Port>/MicroStrategyLibrary/api-docs/ and use GET api/auth/login.
        For complete information, see the MicroStrategy documentation.
        If this API call can be made successfully, you can ingest the metadata.
      2. The same connection:
        <MSTR URL>:<Port>/MicroStrategyLibrary/api-docs/, but with GET api/model/tables/<tableId>.
        For complete information, see the MicroStrategy documentation.
        This endpoint is needed to create lineage and stitching.
    • You need permissions to access the library server.
    • The lineage harvester uses port 443. If the port is not open, you also need permissions to access the repository.
    • You have to configure the MicroStrategy Modeling Service. For complete information, see the MicroStrategy documentation.
    Warning 

    Collibra Data Lineage uses the API 4.0 endpoints GET /queries/<query_id> and GET /running_queries. Due to a security update by Looker, the behavior of these endpoints has changed. Therefore, you must now:

    • Select the "Disallow Numeric Query IDs" option in Looker.
    • Ensure that your Looker user has the Admin role. The Admin role has the Administer permission, which is not available in the custom permission set.

    For complete information, see the Looker Query ID API Patch Notice.

    Note Column-level lineage is not generated for tables that are created by SQL statements, unless you provide the SQL statements by using the folder connection method.

    Steps

    Note Only Basic Authentication is supported. NTLM authentication, for example, is not.
    1. Optionally, connect to a proxy server.
    2. Ensure that you meet the Azure Data Factory prerequisites.
    3. Ensure that you have the correct Tableau versions and permissions, as described in the Set up Tableau topics.
    4. Complete the tasks in Power BI and Microsoft Azure, as described in the Set up Power BI topics.
    5. If you are a MicroStrategy on-premises customer, ensure that you have enabled Collibra to access your MicroStrategy data, as described in Set up MicroStrategy.
    6. Ensure that you have API3 credentials for authorization and access control. For complete information, go to Set up Looker.
    7. Prepare the Data Catalog physical data layer.
    8. Prepare an external directory folder for the lineage harvester.
    9. Prepare a domain for BI asset ingestion.
    10. Optionally, assign the attribute type State to the global assignment of the Power BI Workspace asset type. For complete information, go to Power BI workspaces.
    11. Download and install the lineage harvester.
    12. Prepare the lineage harvester configuration file.
      Note The project name in the configuration file must be the same as the full name of the Database asset.
    13. If necessary, prepare a <source ID> configuration file.
    14. Manually refresh your Power BI datasets.
      Important The first time you integrate Power BI, you need to make sure that the data in your Power BI datasets is up-to-date. Carry out this step only if this is the first time you're integrating Power BI in Data Catalog. After that, Microsoft automatically refreshes the datasets every 90 days. For complete information, see:
    15. Run the lineage harvester.

    You can define your Custom technical lineage via batch or single-file definition.

    1. Create a local folder.
    2. Create the following:
      • A single metadata file.
        Name the file metadata.json.
      • Optionally, one or more assets files. These are required if you want to achieve stitching.
        Name the files assets<something unique>.json.
      • One or more lineage files.
        Name the files lineage<something unique>.json.
      • Optionally, a folder for your source code files. This folder must be put in the same local folder.
      For guidance on creating these files, go to custom technical lineage JSON file examples.
    1. Create a local folder.
    2. Create a JSON file in the local folder and name the JSON file lineage.json.
      For guidance on creating the lineage.json, go to custom technical lineage JSON file examples.

      Note The JSON file must be named as lineage.json; otherwise, the process fails. You can have other types of files in this folder.

    3. If you want to create an advanced custom technical lineage, store all of the source code files that you want to reference in the JSON file in the same local folder. For more information about the simple and advanced custom technical lineage, go to Custom technical lineage.

    What's next?

    You can check the progress of the ingestion in Activities. The results field indicates how many relations were imported into Data Catalog.

    After the metadata is ingested in Data Catalog, you can go to the domain that you specified in your lineage harvester configuration file and view the newly created assets. These assets are automatically stitched to existing assets in Data Catalog.

    Warning We strongly recommend that you not edit the full names of any BI assets. Doing so will likely lead to errors during the synchronization process.

    Warning We highly recommend that you do not move the ingested assets to a different domain. If you do, the assets will be deleted and recreated in the initial BI Catalog domain when you synchronize. As a consequence, any manually added data of those assets is lost.