Custom technical lineage JSON file details

This topic describes the properties that you need to include in your JSON files, for both the single-file and batch definition options.

If you opt for the batch definition option, you need to create a folder with all of your JSON files and specify the folder in your lineage harvester configuration file. The harvester then accesses the folder, zips the content and ingests it for processing.

Which files do you need in your batch folder?

Let's say that you create a folder and name it custom-lineage. In this folder, you need the following:

  • Exactly one metadata file, to provide the JSON architecture version, the data source type, and asset type UUIDs of the assets you want to include in the technical lineage.
  • Optionally, one or more asset files, to provide a list of data objects you want to include in the technical lineage and define the data object hierarchy to achieve stitching.
  • One or more lineage files, to define the lineage relation between two or more data objects.
  • Optionally, a subfolder of source code files that contain the transformation code.
Example 
custom-lineage
    ├── assets-domain1.json
    ├── assets1.json
    ├── lineage.json
    ├── lineage-extra.json
    ├── metadata.json
    └── source_codes
        ├── sc1.sql
        └── sc2.py
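
The folder requirements above can be checked before you run the harvester. The following is a minimal sketch, not part of the lineage harvester: the function name check_batch_folder is hypothetical, and it assumes that file kinds can be recognized by the assets and lineage name prefixes described in this topic.

```python
from pathlib import Path


def check_batch_folder(folder: Path) -> list[str]:
    """Return a list of problems found in a batch folder (empty list = no problems found)."""
    json_names = sorted(p.name for p in folder.glob("*.json"))
    problems = []

    # Exactly one metadata file, and it has to be named metadata.json.
    if "metadata.json" not in json_names:
        problems.append("expected exactly one metadata.json")

    # One or more lineage files; assets files are optional.
    if not any(name.startswith("lineage") for name in json_names):
        problems.append("expected at least one lineage*.json file")

    # Assumption: any other JSON file is not one of the three documented kinds.
    for name in json_names:
        if name != "metadata.json" and not name.startswith(("assets", "lineage")):
            problems.append(f"unrecognized JSON file: {name}")
    return problems
```

For the example folder above, check_batch_folder(Path("custom-lineage")) would return an empty list.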

Metadata file

Your metadata file has to be named metadata.json. Format the file as shown in the following example:

Example 
{
  "version": 3, 
  "application_name": "databricks",
  "asset_types":{
    "Column":{"uuid": "00000000-0000-0000-0000-000000031008"},
    "Table":{"uuid": "00000000-0000-0000-0000-000000031007"},
    "Database":{"uuid": "00000000-0000-0000-0000-000000031006"},
    "Schema":{"uuid": "00000000-0000-0000-0001-000400000002"}
  }
}
Section Description

version

The version of the JSON architecture. For the batch definition option, the value must be 3.

application_name

The type of data source for which you are creating a technical lineage.

This helps us to better understand your needs and make more informed decisions concerning future integrations.

asset_types

The asset types and UUIDs of the asset types you want to include in the technical lineage.

Important If you choose to include asset files in your batch definition, the values (meaning the asset types) that you specify in this property must match the values that you specify in the type properties in your asset files. Likewise, the values that you specify in this property must match the asset types that you mention in your lineage files.
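
The matching rule above lends itself to a quick pre-check. The following is a sketch under stated assumptions: the function name undeclared_asset_types is hypothetical, each entry is assumed to be an object with optional nodes, parent and leaf properties as in the examples in this topic, and the comparison is case-sensitive.

```python
def undeclared_asset_types(metadata: dict, entries: list[dict]) -> set[str]:
    """Return the asset types used in `entries` that metadata["asset_types"] does not declare."""
    declared = set(metadata.get("asset_types", {}))
    used = set()
    for entry in entries:
        # Collect the type of every node, plus the parent and leaf types if present.
        for node in entry.get("nodes", []):
            used.add(node["type"])
        for key in ("parent", "leaf"):
            if key in entry:
                used.add(entry[key]["type"])
    return used - declared
```

An empty result means that every asset type used in the entries is declared in the metadata file.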

Assets files

Optionally, you can include one or more assets files. You use asset files to provide the list of data objects you want to include in the technical lineage and define the data object hierarchy. The props property allows you to specify the full names and domain IDs of the assets.

Tip 

Don't use asset files in the following scenarios:

  • Your data source consists of the traditional (System) > Database > Schema > Table > Column asset types and hierarchy. In that case, full names are automatically, correctly constructed.
  • You are working with assets that are not part of that traditional asset hierarchy (in which case, you need to use the props property to achieve stitching) and you define props in one or more lineage files.

The names of your assets files have to follow the format assets<something-unique>.json.

Asset files can contain nodes, parent, and leaf kinds of assets. You use the nodes property to specify the highest levels of the data object hierarchy that you want to view in the technical lineage, for example Database and Schema. You then use the parent and leaf properties to build out the lower levels of the data object hierarchy, for example Table and Column, respectively.

parent assets represent what we traditionally refer to as the table-level lineage. leaf assets represent what we traditionally refer to as the column-level lineage.

Keep in mind that the property names nodes, parent, and leaf are designed to be non-restrictive, so you can define a hierarchy to reflect the hierarchy of any asset types (similar to the database > schema > table > column hierarchy), including your custom asset types.

Tip For examples of how to configure the props property, as shown in the following code examples, see Using the props property.


Property

Description

nodes

A JSON element in which you specify the highest levels of the hierarchy. In the example code, the nodes specify the hierarchy of GCS File System > GCS Bucket.

Example 
{
	"nodes": [
		{
			"name": "GCS1",
			"type": "GCS File System"
		}, 
		{
			"name": "GCS-B1",
			"type": "GCS Bucket"
		}
	],
	"props": {
		"fullname": "<full name of the GCS Bucket asset>",
		"domain_id": "<domain of the GCS Bucket asset>"
	}
}
name

The name of the node data object. The value is case-sensitive.

type

The type of data object of the specified node, for example: System, Database, Dashboard, or Report. The value is case-sensitive.

Important The values (meaning the asset types) that you specify for this property must match the values that you specify in the asset_types property in your metadata file.

parent

A lower-level data object in a hierarchy for which the highest levels are specified in the nodes section. The parent property represents what we traditionally refer to as the table-level lineage.

When specifying parent data objects, you also have to include the nodes information, as shown in the following example code.

Example 
{
	"nodes": [
		{
			"name": "GCS1",
			"type": "GCS File System"
		},
		{
			"name": "GCS-B1",
			"type": "GCS Bucket"
		}
	],
	"parent": {
		"name": "DIR1",
		"type": "Directory"
	},
	"props": {
		"fullname": "<full name of the Directory asset>",
		"domain_id": "<domain of the Directory asset>"
	}
}

Tip Each parent object can contain leaf data objects. For example, you can use the parent property to specify a table, and use the leaf properties to specify the columns in the table.

name

The name of the parent data object. The value is case-sensitive.

type

The asset type of the parent data object, for example: Table, Directory, Dashboard, or Report. The value is case-sensitive.

Important The values (meaning the asset types) that you specify for this property must match the values that you specify in the asset_types property in your metadata file.

leaf

The lowest level data object in your hierarchy. The leaf property represents what we traditionally refer to as the column-level lineage.

When specifying leaf data objects, you also have to include the nodes and parent information, as shown in the following example code.

Parent and leaf data objects can have identical names, as long as the data objects with the same name are sub-objects of different nodes data objects.

Example 
{
	"nodes": [
		{
			"name": "GCS1",
			"type": "GCS File System"
		}, 
		{
			"name": "GCS-B1",
			"type": "GCS Bucket"
		}
	],
	"parent": {
		"name": "DIR1",
		"type": "Directory"
	},
	"leaf": {
		"name": "data.xls",
		"type": "File"
	},
	"props": {
		"fullname": "<full name of the File asset>",
		"domain_id": "<domain of the File asset>"
	}
}
name

The name of the leaf data object. The value is case-sensitive.

type

The asset type of the leaf data object, for example: Column, Dashboard, or Report. The value is case-sensitive.

Important The values (meaning the asset types) that you specify for this property must match the values that you specify in the asset_types property in your metadata file.

props

This property allows you to specify the full name and domain ID of an asset for the purpose of stitching, regardless of asset type hierarchy.

When you add the props property to define the full name of an asset, it applies to the last asset in the array.

Tip For examples of how to configure the props property and how to use it for a custom hierarchy, see Using the props property.

Important considerations
  • You don't need to use this property for the traditional (System) > Database > Schema > Table > Column asset types and hierarchy. In fact, assets files are not needed or recommended for those asset types, as the full name is automatically, correctly constructed for that hierarchy. Instead, use this property to specify the full names of assets that are not part of that traditional asset hierarchy.
  • You must specify in your metadata file the asset types and UUIDs of all the asset types used.
  • If the useCollibraSystemName property in your lineage harvester configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester configuration file, do not specify the system data object in this section, or else stitching will fail.
A word about file processing order and inadvertently specifying the same asset more than once

Assets files and lineage files are processed in the following order: first, all assets files in alphabetical order, followed by all lineage files in alphabetical order. If you choose to specify props for an asset, we recommend that you do so in either an assets file or a lineage file; not both. For any asset that is inadvertently defined more than once, the first occurrence, with respect to the processing order, is the occurrence that is used.

In other words:

  • If you inadvertently define a single asset, with props, in both an assets file and a lineage file, the props values in the assets file are used.
  • If you inadvertently define a single asset, with props, more than once in a single assets file, or in multiple assets files, the first occurrence of the asset, with respect to the processing order, is used along with the props values defined for that occurrence of the asset.
fullname

The full name of the asset in Collibra. Names are case-sensitive.

domain_id

The reference ID of the domain in which the asset exists in Collibra.
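
The processing order and the "first occurrence wins" rule described above can be illustrated as follows. This is a sketch, not the harvester's implementation: the function names and the asset_key identifier are hypothetical, and Python's default string sort is assumed to stand in for "alphabetical order".

```python
def processing_order(file_names: list[str]) -> list[str]:
    """All assets files in alphabetical order, followed by all lineage files in alphabetical order."""
    assets = sorted(n for n in file_names if n.startswith("assets"))
    lineage = sorted(n for n in file_names if n.startswith("lineage"))
    return assets + lineage


def effective_props(definitions: list[tuple[str, str, dict]]) -> dict[str, dict]:
    """Pick the props that are used for each asset.

    Each definition is (file_name, asset_key, props). For an asset that is
    defined more than once, the first occurrence in processing order is kept.
    """
    order = processing_order(sorted({d[0] for d in definitions}))
    rank = {name: i for i, name in enumerate(order)}
    result: dict[str, dict] = {}
    # sorted() is stable, so occurrences within one file keep their original order.
    for _file_name, asset_key, props in sorted(definitions, key=lambda d: rank[d[0]]):
        result.setdefault(asset_key, props)
    return result
```

With these assumptions, an asset defined with props in both assets1.json and lineage.json ends up with the props values from assets1.json.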

Using the props property

The following examples offer some guidance as to when to use the props property and how to configure it.
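
As one illustration, the asset that a props block applies to can be worked out from the entry itself. This sketch is a reading of the placeholders in the asset file examples in this topic (leaf if present, otherwise parent, otherwise the last node), not a documented algorithm; the function name props_target is hypothetical.

```python
from typing import Optional


def props_target(entry: dict) -> Optional[dict]:
    """Return the data object that the entry's props block applies to, or None."""
    if "props" not in entry:
        return None
    # The props apply to the lowest-level data object that the entry specifies.
    if "leaf" in entry:
        return entry["leaf"]
    if "parent" in entry:
        return entry["parent"]
    nodes = entry.get("nodes", [])
    return nodes[-1] if nodes else None
```

For an entry with the nodes GCS1 and GCS-B1 and the parent DIR1, the props are taken to describe the Directory asset DIR1.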

Lineage files

You can have one or more lineage files in the folder. The names of your lineage files have to follow the format lineage<something-unique>.json.

You use the lineage file to define the lineage relation between two or more data objects. The lineage relations are shown as edges in the technical lineage graph. The edges represent the data flow from a source to a target.

This section contains the path from a source to a target and defines the transformation code or transformation references to be processed by the Collibra Data Lineage service.

Note If the useCollibraSystemName property in your lineage harvester configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester configuration file, do not specify the system data object in these files, or else stitching will fail.
Example 
[
  {
    "src": {
      "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
      "parent": {"name": "TB1", "type": "Table"},
      "leaf": {"name": "COL1", "type": "Column"},
      "props": {
        "fullname": "<full name of the leaf asset>",
        "domain_id": "<domain of the leaf asset>"
      }
    },
    "trg": {
      "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
      "parent": {"name": "TB2", "type": "Table"},
      "props": {
        "fullname": "<full name of the parent asset>",
        "domain_id": "<domain of the parent asset>"
      }
    },
    "source_code": {
      "path": "<folder name>/sc1.sql",
      "highlights": [{"start": 71, "len": 69}, ...],
      "transformation_display_name": "middle bubble"
    }
  }
]

Properties

Description
src

The hierarchical path to the source data object. This property represents where the data comes from for a transformation.

Important The source of a lineage can only be a parent or a leaf.

Example 
{
    "src": {
      "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
      "parent": {"name": "TB1", "type": "Table"},
      "leaf": {"name": "COL1", "type": "Column"}
    }
}
trg

The hierarchical path to the target data object. This property represents where the data flows to.

Important The target can be a parent or a leaf; however, if the source is a parent, the target must be a parent.

Tip If the target asset is a parent asset and the source asset is a leaf asset, we refer to the lineage as "indirect lineage". If the target asset is a parent asset and the source asset is a parent asset, we refer to the lineage as "table-level lineage".

Example 
{
    "trg": {
      "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
      "parent": {"name": "TB2", "type": "Table"}
    }
}
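
The rules for src and trg can be summarized in a small check. This is a sketch only: the function name classify_lineage is hypothetical, and the label "column-level lineage" for a leaf-to-leaf relation is an assumption based on how this topic describes the leaf property.

```python
def classify_lineage(src: dict, trg: dict) -> str:
    """Classify a lineage relation from its src and trg objects, or raise ValueError."""

    def level(obj: dict) -> str:
        # A source or target can only be a parent or a leaf.
        if "leaf" in obj:
            return "leaf"
        if "parent" in obj:
            return "parent"
        raise ValueError("the source and target must each be a parent or a leaf")

    src_level, trg_level = level(src), level(trg)
    if src_level == "parent" and trg_level == "leaf":
        raise ValueError("if the source is a parent, the target must be a parent")
    if src_level == "parent":
        return "table-level lineage"
    if trg_level == "parent":
        return "indirect lineage"
    return "column-level lineage"
```

For the src and trg examples above (leaf COL1 of TB1 flowing to parent TB2), the result is "indirect lineage".
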
props

An optional property that allows you to specify the full name and domain of an asset, for the purpose of stitching.

This property is not required for Database, Schema, Table and Column asset types.

A word about file processing order and inadvertently specifying the same asset more than once

Assets files and lineage files are processed in the following order: first, all assets files in alphabetical order, followed by all lineage files in alphabetical order. If you choose to specify props for an asset, we recommend that you do so in either an assets file or a lineage file; not both. For any asset that is inadvertently defined more than once, the first occurrence, with respect to the processing order, is the occurrence that is used.

In other words:

  • If you inadvertently define a single asset, with props, in both an assets file and a lineage file, the props values in the assets file are used.
  • If you inadvertently define a single asset, with props, more than once in a single assets file, or in multiple assets files, the first occurrence of the asset, with respect to the processing order, is used along with the props values defined for that occurrence of the asset.

source_code

The transformation code that determines how the technical lineage is constructed. This can be a descriptive string or a SQL statement that manipulates data.

This section is optional.

path
 

The path and name of the source code file that contains the transformation code. The path is relative to the source_codes folder, which is in the same folder as the lineage JSON files.

highlights

This optional property identifies a string of transformation code in a source code file to be highlighted in the source code pane at the bottom part of the technical lineage graph. The entire lines that include the transformation code are highlighted.

The string must be a subset of the string of transformation code that is defined by the start and len properties.

start

The start position of the string of the transformation code to be highlighted. The start position is in characters, not bytes.

len

The length of the string of the transformation code to be highlighted. The length is in characters, not bytes.

transformation_display_name

The name of the transformation, as shown in the transformations view of the technical lineage viewer.
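
To see which lines a highlights entry would cover, you can map the start and len values onto the source code text. The following is a minimal sketch: the function name highlighted_lines is hypothetical, and start is assumed to be zero-based, which this topic does not state for batch files. Python strings are indexed in characters, which matches the "characters, not bytes" rule.

```python
def highlighted_lines(source: str, start: int, length: int) -> list[str]:
    """Return the entire lines touched by the character span [start, start + length)."""
    if start < 0 or length < 1 or start + length > len(source):
        raise ValueError("the span falls outside the source code text")
    # The number of line breaks before a position gives its zero-based line index.
    first = source.count("\n", 0, start)
    last = source.count("\n", 0, start + length - 1)
    return source.split("\n")[first:last + 1]
```

Read the source code file as text, for example with Path.read_text(encoding="utf-8"), so that positions count characters and not bytes.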

Source codes subfolder and files

You can provide a subfolder of source code files that define the transformation details. The source code folder must be in the custom-lineage folder, along with the JSON files. If it's not, an error occurs indicating that the lineage harvester cannot find the source code files.

The source code paths are relative to the custom-lineage folder.

Example 
  • source_codes/sc1.sql
  • source_codes/another-subfolder/sc2.sql
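
Because a wrong path leads to an error, it can help to resolve the paths yourself first. This sketch assumes that each lineage entry carries its source_code object at the top level and that paths are relative to the batch folder; the function name missing_source_files is hypothetical.

```python
from pathlib import Path


def missing_source_files(batch_folder: Path, lineage_entries: list[dict]) -> list[str]:
    """Return the source_code paths that do not resolve to a file in the batch folder."""
    missing = []
    for entry in lineage_entries:
        # source_code is optional, so entries without it are skipped.
        relative_path = entry.get("source_code", {}).get("path")
        if not relative_path or relative_path in missing:
            continue
        if not (batch_folder / relative_path).is_file():
            missing.append(relative_path)
    return missing
```

An empty list means that every referenced source code file was found.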

What happens if I choose not to provide source code files?

If you are using the lineage harvester and there are no source code files to analyze, the batch stats are empty, as shown below. This is expected, because batch stats are directly linked to the source code files. The lineage relations are still created.

Batch stats:
	Parsing errors: 0
	Analysis errors: 0
	Done: 1

The Done: 1 result is a dummy entry, so that the source appears in the Sources tab page.

Example JSON files

For some example JSON files, go to Custom technical lineage JSON file examples.

If you opt for the single-file definition option, you use a lineage.json file to define the lineage between two or more data objects, and optionally include transformations details to create the custom technical lineage.

The following sections in the JSON file define different parts in the resulting Collibra technical lineage graph:

  • tree, which defines the data object hierarchy. The data objects are shown as nodes in the technical lineage graph.
  • lineages, which defines the lineage relation. The lineage relations are shown as edges in the technical lineage graph. The edges represent the data flow from a source to a target.
  • codebase_files, which points to the source code files that include transformation details.

To create a simple custom technical lineage, you need to include the tree and lineages sections in your JSON file. You can add the transformation code in the lineages section.

To create an advanced custom technical lineage, you need to include the tree, lineages and codebase_files sections in your JSON file. You add references to the transformation code in source code files in the codebase_files section.

Transformation code in both simple and advanced custom technical lineages is shown in the source code pane at the bottom part of the technical lineage graph.

Requirements and restrictions

The source code files must be in the same directory as the lineage.json file. Otherwise, an error occurs indicating that the lineage harvester cannot find the source code files.

Sections

Sections

Description

version

The version of the JSON architecture. Specify the value of 1.0, which is the only supported version.

tree

This section contains tree definitions of data objects between which lineages can be defined. The data objects are systems, databases, schemas, tables, views, columns, dashboards and reports.

Each node of a tree contains the name, type and optionally children or leaves properties which form a hierarchy of data objects. You must define a node only once in this section. With the nested tree format, you can reuse the properties of one node for multiple children. For example, you can define a database once and use the children array to define multiple tables in the database.

Tip Usually, the structure you map is the following: system > database > schema > table > column. The system is optional, unless the useCollibraSystemName property is set to true in your lineage harvester configuration file. Collibra Data Lineage can stitch these data objects to assets in Data Catalog. However, you can also map custom objects, for example dashboards and reports. Custom objects cannot be stitched to assets in Data Catalog.

Important If the useCollibraSystemName property is set to false in your lineage harvester configuration file, do not specify the system data object in this section, or else stitching will fail.

lineages

This section contains the path from a source to a target and defines the transformation code or transformation references to be processed by the Collibra Data Lineage service.

Important If the useCollibraSystemName property is set to false in your lineage harvester configuration file, do not specify the system data object in this section, or else stitching will fail.

codebase_files

This optional section defines the reference to source code files. Store the source code files that contain the transformation code in the same directory as the lineage.json file.

Include this section only when you create an advanced custom technical lineage.

tree section properties

Properties

Description
name

The name of your data object. Specify this property with the system name, database name, schema name, table name, view name or column name.

The following rules apply when you specify this property:

  • The names are case-sensitive.
  • The names of children and leaves can be identical if the children and leaves with the same names are in different parent nodes.
type

The type of your data object. You can specify one of the following options: system, database, schema, table, view, column, dashboard or report.

If the useCollibraSystemName property in your lineage harvester configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester configuration file, do not specify the system data object in this section, or else stitching will fail.
children

The sub-objects that have a hierarchical relation to the defined data object.

Each child can contain children properties, except for the penultimate child. The penultimate children property must contain the leaves property. The leaves property cannot contain a children property.

For example, you can use the children property to define a table and use the leaves properties to define columns that have a relation to the table node.

Each child and leaf has the name and type properties and the optional catalog_fullname, catalog_domain_id, catalog_asset_type_name and catalog_asset_type_uuid properties.

leaves

The sub-objects of an object that is defined in a children property, but cannot have sub-objects of their own.

A technical lineage is defined as relations between leaf nodes of the tree.

The value of the type property of the leaves property must be column or report. Indirect and table-level technical lineages are not supported. For the workarounds to create a table-level or indirect technical lineage, see Programming considerations.
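
Because lineages are defined between leaf nodes, it can be useful to list every path from the top of a tree down to its leaves. The following is a sketch under assumptions: the function name leaf_paths is hypothetical, and children and leaves are assumed to be arrays of objects that each have a name property.

```python
def leaf_paths(node: dict, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Return the name path from the top of the tree to every leaf under `node`."""
    path = prefix + (node["name"],)
    # Leaves end a path; they cannot have sub-objects of their own.
    paths = [path + (leaf["name"],) for leaf in node.get("leaves", [])]
    for child in node.get("children", []):
        paths.extend(leaf_paths(child, path))
    return paths
```

Comparing len(paths) with len(set(paths)) is one way to spot a data object that was defined more than once.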

lineage section properties

Properties

Required Description
src_path
Yes

The hierarchical path to the source data object. This data object is defined as a leaf in the tree section.

This property represents where the data comes from for a transformation.

trg_path
Yes

The hierarchical path to the target data object. This data object is defined as a leaf in the tree section.

This property represents where the data flows to.

<data objects>
Yes

An ordered array of data object names. This array is required to define the sub-objects of the src_path and trg_path properties.

Specify the array with the data object names that start from the top of the tree section and finish at a leaf node.

This example shows data objects that can be stitched: system > database > schema > table > column.

This example shows data objects that cannot be stitched: dashboard > report > column.

If the useCollibraSystemName property in your lineage harvester configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester configuration file, do not specify the system data object in this section, or else stitching will fail.
mapping

Yes

Simple custom technical lineage only

The mapping name. This property specifies a name for the transformation code.

source_code

Yes

Simple custom technical lineage only

The transformation code, which determines how the technical lineage is constructed.

The transformation code can be a descriptive string or a SQL statement that manipulates data.

mapping_ref

No

Advanced custom technical lineage only

This property contains the name of the mapping reference to the transformation code in source code files. This property also contains the position and length of the transformation code to be highlighted in the technical lineage graph.

source_code

No

Advanced custom technical lineage only

The name of the source code file that contains the transformation code. The transformation code can be a SQL statement, code that manipulates data or a descriptive string.

The source code file must be in the same folder as the lineage.json file.

mapping

No

Advanced custom technical lineage only

The unique descriptor of a part of transformation code in a source code file that is in the same directory as the lineage.json file.

A source code file can contain different parts of transformation code that represent different data flows. This property indicates the referenced data flow.

The value of this property is the same as the value of the mapping_refs property in the codebase_files section.

codebase_pos

No

Advanced custom technical lineage only

The positions indicate a string of the transformation code in a source code file to be highlighted in the bottom part of the Collibra technical lineage graph. The whole lines that include the transformation code are highlighted.

The string must be a subset of the string of the transformation code that is defined by the pos_start and pos_len properties of the mapping_refs property in the codebase_files section.

pos_start

No

Advanced custom technical lineage only

The start position of the string of the transformation code to be highlighted. The start position is in characters, not bytes.

The value must be equal to or greater than the value of the pos_start property of the mapping_refs property in the codebase_files section.

pos_len

No

Advanced custom technical lineage only

The length of the string of the transformation code to be highlighted. The length is in characters, not bytes.

Specify a value in the following range:

  • Equal to or greater than 1.
  • Less than or equal to the length of the string that is defined by the pos_len property of the mapping_refs property in the codebase_files section.

For example, if you specify "pos_start": 10 and "pos_len": 160 in the codebase_files section, specify a value for this property in the range of 1 - 160.

codebase_files section properties

Properties

Description
<source code path>

The file path to source code files that contain the transformation code. The transformation code can be a SQL statement or code that manipulates data.

The source code file must be in the same directory as the lineage.json file.

mapping_refs

The mapping of the transformation code and the position of the transformation code that is shown in the bottom part of the technical lineage graph.

This property defines a string of the transformation code in the source code file to be shown in the technical lineage graph. The string must include the string that is defined by the pos_start and pos_len properties of the mapping property in the lineage section.

<mapping>

The unique descriptor of a part of transformation code in a source code file that is in the same directory as the lineage.json file.

A source code file can contain different parts of transformation code that represent different data flows. This property indicates the referenced data flow.

The value must match the value of the mapping property in the lineage section.

pos_start

The start position of the string of the transformation code. The start position is in characters, not bytes.

Specify a value in the following range:

  • Equal to or greater than 0.
  • Less than or equal to the value of the pos_start property in the mapping property in the lineage section.
pos_len

The length of the string of the transformation code. The length is in characters, not bytes.

Specify a value in the following range:

  • Greater than or equal to 1.
  • Less than or equal to the length of the source code file minus the start position.

For example, if you specify "pos_start": 10 and the file length is 160 characters, specify a value for this property in the range of 1 - 150.
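
The position rules for the codebase_files section and the lineage section depend on each other, so it can help to check them together. This is a sketch only: the function and parameter names are hypothetical, where ref_start and ref_len stand for pos_start and pos_len in mapping_refs, and hl_start and hl_len stand for pos_start and pos_len in codebase_pos.

```python
def check_positions(file_len: int, ref_start: int, ref_len: int,
                    hl_start: int, hl_len: int) -> list[str]:
    """Return the position rules that are violated (empty list = all rules hold)."""
    problems = []
    if ref_start < 0:
        problems.append("mapping_refs pos_start must be 0 or greater")
    if ref_len < 1 or ref_len > file_len - ref_start:
        problems.append("mapping_refs pos_len must be between 1 and the file length minus pos_start")
    if hl_start < ref_start:
        problems.append("codebase_pos pos_start must not be less than mapping_refs pos_start")
    if hl_len < 1 or hl_len > ref_len:
        problems.append("codebase_pos pos_len must be between 1 and mapping_refs pos_len")
    # The highlighted string must be a subset of the mapping_refs string.
    if hl_start + hl_len > ref_start + ref_len:
        problems.append("the highlighted string must lie inside the mapping_refs string")
    return problems
```

All positions and lengths are counted in characters, so measure the file length on the decoded text and not on the raw bytes.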

Programming considerations

Currently, there is no native support for indirect and table-level lineages. As a workaround, you can specify "type": "column" and "name": "*" for the leaves property to create a table-level or indirect technical lineage. With this specification, the indirect technical lineage is shown as a solid line instead of a dashed line in the Collibra technical lineage graph, and is always shown, regardless of whether the Show indirect dependencies option is enabled or disabled.
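
If you generate the tree section from a script, the workaround can be wrapped in a small helper. This is a sketch: the function name table_with_wildcard_leaf is hypothetical, and leaves is assumed to be an array of objects.

```python
import json


def table_with_wildcard_leaf(table_name: str) -> dict:
    """Build a table node whose only leaf is the "*" column described in the workaround."""
    return {
        "name": table_name,
        "type": "table",
        # A single column named "*" stands in for the table as a whole.
        "leaves": [{"name": "*", "type": "column"}],
    }


# Render the node as JSON for a hypothetical table named TB2.
print(json.dumps(table_with_wildcard_leaf("TB2"), indent=2))
```

A lineage that points to the "*" leaf of a table then represents the table as a whole.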

Example

For some example JSON files, go to Custom technical lineage JSON file examples.