Technical lineage and stitching for BI tool integrations

BI tools, such as Power BI and Tableau, allow you to build reports that help you visualize and understand your data. To trust the data in your report, it's essential to know where the data came from. Collibra Data Lineage allows you to create a technical lineage, to trace the data from your data sources to your reports. Stitching then creates relations between the data objects in your technical lineage and the corresponding assets in Data Catalog, to give you a complete picture of your data landscape and all critical metadata.

In this topic, we examine the relationship between technical lineage and stitching.

Note This topic applies to MicroStrategy, Power BI, SSRS-PBRS and Tableau. Collibra Data Lineage currently does not offer stitching for Looker assets.

For a more technical perspective, see Technical overview of BI tool lineage and Missing stitching and possible solutions.

Stitching: The bridge between ingestion and technical lineage

Keep in mind that metadata ingestion (which results in the creation of assets in Collibra) and technical lineage are separate and independent concepts. The single, seamless process of integrating a BI tool for the purpose of technical lineage could lead one to think otherwise.

A technical lineage illustrates the flow of data in your external data sources. It does not inherently tell you anything about your assets in Collibra. The bridge between the metadata you ingest as assets in Data Catalog and the technical lineage, is stitching. As it concerns a technical lineage graph, stitching or the lack of stitching is reflected only in the color of the nodes in the technical lineage.

 

  • A yellow node indicates stitching. Specifically:
    • There is an asset in Collibra with a full name that exactly matches the data object in the technical lineage.
    • A relation of the type "Data Element targets / sources Data Element" is created between the asset and the data object, and shown on the asset page.
    • In the Stitching tab, the Found In column indicates that the database table was found in both Data Catalog and the technical lineage.
  • A gray node indicates a lack of stitching.
    • There is no asset in Data Catalog with a full name that exactly matches the name of the data object.
    • In the Stitching tab, the Found In column indicates that the database table was found only in the technical lineage.

If the database node is missing from the technical lineage graph, we refer to this as "missing stitching". This can happen if, for example, your BI tool has limited support for custom SQL, or if your integration includes a data source that is not yet supported by Collibra Data Lineage. In these situations, the relations required to recognize the database are not exposed.

Tip If you can't view a technical lineage because you lack the permissions, you can still identify stitching by viewing a diagram. A relation of the type "Data Element targets / sources Data Element" between, for example, a Tableau Data Attribute asset and Column asset in a diagram, indicates stitching.

Full path, full name matching

When you integrate your BI tool, the full names of the assets that are created in Data Catalog reflect the full paths (also considered the full names) of the corresponding data objects in the external data source. The full paths to data objects follow this hierarchy:

(system name) > database name > schema name > table name > column name

The system name is only relevant if you specify one as part of your pre-integration preparation. For complete information, go to Prepare the Data Catalog physical data layer.

To stitch assets in Data Catalog to data objects in the technical lineage, Collibra Data Lineage looks at the full names of assets in Data Catalog and the full names of data objects in your data source, which figure in the technical lineage. If there is an exact match in the full names, stitching is achieved.

Note The full path represents the full name of an asset, not the display name. As such, you can change the display name of an asset without breaking stitching, but if you change the full name of an asset, and it no longer exactly matches the full name of the corresponding data object, stitching will break.

If an ingestion job was successful, and it's true that the full names of the assets in Data Catalog are taken directly from the full names of the corresponding data objects, then how is it possible that the full names don't match? The possibilities are addressed in the following section.

What causes stitching to break?

The following scenarios result in a lack of stitching:

Scenario Why stitching breaks

During integration of your BI tool, the API returns a technical name, IP address, or hostname of the database, instead of the true name of the database.

The database name returned by the API doesn't match the name of the Database asset you created when you prepared the Data Catalog physical data layer.

You have registered a schema-less data source, for example HiveQL, MySQL or Teradata.

The full names of assets don't match because the full path hierarchy is altered because of the lack of a schema name.

You haven't prepared the Data Catalog physical data layer, or did so incompletely or erroneously.

  • The database name returned by the API doesn't match the name of the Database asset you created when you prepared the Data Catalog physical data layer.
  • The name of the System asset you created doesn't match the name of the system of the data source that you register, as specified in the configuration file.
  • You forgot to create the required relation between the Database asset and the Schema asset that was created when you registered your data source.

A database query includes a function or query that Collibra Data Lineage does not support.

The relations required to recognize the database are not exposed, resulting in "missing stitching".

You experience a rare exception, for example, SAP label names v. technical names.

When connecting to an SAP HANA data source, some BI tools use the label name instead of the technical name. This can result in a mismatch between the name of the data source in the technical lineage and the Database asset in Collibra.

Note During the ingestion process, relations of the type "Data Element targets / sources Data Element" are automatically created between certain assets. Any relations of this type that you manually create between assets will be deleted during the synchronization process.

Creating the technical lineage

Let's start with a lifehack: create the technical lineage without giving any thought to stitching. Specifically, prepare your source ID configuration file as you want, for filtering or to specify a system name, but don't worry about database mapping. Run the lineage harvester and analyze the technical lineage, to see what the APIs return for the database names. You can then set up database mapping in your source ID file and run the harvester again.

What you've done so far

Important 

In Collibra 2024.05, we've launched a new user interface (UI) for Collibra Data Intelligence Platform! You can learn more about this latest UI in the UI overview.

Use the following options to see the documentation in the latest UI or in the previous, classic UI:

  • You've pulled in data from a data source to your BI tool, and with that data set, you've created a report.
  • You've either:
    • Prepared a lineage harvester configuration file and run the lineage harvester.
    • Added the relevant Edge capability and run the Edge job.

You now have:

  • A technical lineage that shows the flow of data from the data source to your BI tool.
  • Assets in Data Catalog that represent the data objects in your data source. Among these assets are:
    • Assets that represent the data set you used to create the report in your BI tool. These are assets of child asset types of the BI Data Model asset type, for example Power BI Data Model and Tableau Data Model assets.
    • Assets that represent the report in your BI tool. These are assets of child asset types of the BI Report asset type, for example Power BI Report and Tableau Report assets.
       

Analyze the technical lineage

Go to the asset page of your BI Data Model asset and click the Technical Lineage tab. As shown in the following image, there will most likely be three nodes or groupings of nodes:

  • The external database.
  • The BI data model.
  • The BI report.

 

The first thing we notice is that the database node has a gray background and the other two have a yellow background. The yellow nodes represent BI assets and data objects. As such, we say that this part of the technical lineage graph depicts BI lineage.

Ultimately, what we want is for all three nodes to have the yellow background. Technically speaking, that means:

  • Lineage is confirmed upstream of the BI lineage.
  • The data sources that feed into the database node are shown.

Let's examine more closely these three nodes.

Note If you are integrating MicroStrategy, there will be four groups of nodes. In reference to the previous image, the BI data model node will consist of two groups nodes, one representing MicroStrategy Attributes and one representing MicroStrategy Facts. There could also be four groups if you are integrating SQL Server Reporting Services (SSRS) or Power BI Report Server (PBRS) and have a shared data set.

The external database

 

This node represents the table from the database you used to create the report in your BI tool. It is returned by the API and is shown in the technical lineage, as long as:

  • You have the required roles and permissions in your BI tool, to access the data in your data sources. For example, in Tableau, you need certain roles and permissions to access external data objects.
  • There are no unsupported custom SQL transformations or functions.
  • No errors have caused the integration to fail.

By the fact of its presence in the technical lineage, we know that the lineage harvester collected the source code from the BI tool and identified the flow of data from this data source to the BI data model. This node is a prerequisite for stitching. If it is not shown in the technical lineage, stitching is not possible.

Note The node might be yellow if you previously ingested metadata in Data Catalog that matches the database tables used in your dataset.

The gray background indicates that there might be a Table asset in Collibra that corresponds with this database table, but their full names do not exactly match.

Tip Look closely at the names of these nodes, to correctly identify if the nodes represent data objects from the data source or from your BI tool. In this example, you can tell by the names that the two yellow nodes are the BI data set and BI report nodes. When you view a technical lineage, it could be that the database and BI data set are stitched, and the BI report node does not appear in the technical lineage. This could be the case if you're viewing the lineage at the column level, and the attribute that the column represents is not used in the report. At first glance, one might incorrectly think that the database node, which is essential for stitching, is not shown.

The BI data model

 

This node represents the data set that you used to create the report in your BI tool. It will always be shown in the technical lineage, because it is the target of the database table and the source of the BI report.

The yellow background indicates that the name of the BI Data Model asset in Data Catalog matches exactly the name of this data set in the technical lineage.

Note This node is always stitched because Collibra Data Lineage knows the full name of the data set in your BI tool, and it creates the corresponding BI Data Model asset with the exact same name. This is referred to as BI stitching.

The BI report

 

This node (or grouping of nodes) represents the report you created in your BI tool. It is always part of the technical lineage.

Like the BI data model node, this node is always stitched because Collibra Data Lineage knows the full name of the report in your BI tool, and it creates the corresponding BI Report asset with the exact same name.

Tip While the BI report node is always part of the technical lineage, it might not initially be visible when you view the technical lineage. If, for example, you're viewing the lineage at the column level, and the attribute that the column represents is not used in the report, there will be no arrow leading to the report node in the technical lineage. In this case, right-click on the data model node and click Table lineage to pull back and view the table-level lineage. The BI report node will appear and you will see which columns/data attributes are used in the report.

How to achieve stitching

Let's have a look at a typical database mismatching scenario.

First, let's look at the name of an unstitched database table in the technical lineage graph:

MODEL.PRODUCT CATEGORY [ADVENTUREWORKS::database]

We can identify the following:

  • The database name: ADVENTUREWORKS
  • The schema name: MODEL
  • The table name: PRODUCT CATEGORY

Now let's find the table in the Stitching tab:

  1. Click the Settings tab.
  2. Click Show status.
  3. Click the Stitching tab.
    The Stitching tab shows a list of all tables that exist in Data Catalog and on the Collibra Data Lineage service instance.

Use the Search field to find the unstitched database table PRODUCT CATEGORY.

In the Found in column, the value "Technical Lineage" confirms what we already know: the table was found only in the technical lineage. An exactly matching asset was not identified in Collibra.

Now try to find a likely match. Look for a table that has the same name and the value "Catalog" in the Found in column.

The table shown in the following image looks like a match. The schema and table names match exactly; only the database names differ.

To achieve stitching, you need to create a source ID configuration file and configure database mapping.

For more missing stitching scenarios and suggestions for resolving the issue, go to Missing stitching and possible solutions.