Custom technical lineage JSON file details
Warning: The “single-file definition” option for custom technical lineage is deprecated and reaches its end-of-life on July 31, 2026. If you haven't already, we encourage you to transition to the “batch definition” option.
This topic describes the properties that you need to include in your JSON files, for both the single-file and batch definition options.
If you opt for the batch definition option, you need to create a folder with all of your JSON files and specify the folder in your lineage harvester configuration file. The harvester then accesses the folder, zips the content and ingests it for processing.
Which files you need in your batch folder
Let's say that you create a folder and name it custom-lineage. In this folder, you need the following:
- Exactly one metadata file, to provide the JSON architecture version, the data source type, and asset type UUIDs of the assets you want to include in the technical lineage.
- Optionally, one or more asset files, to provide a list of data objects you want to include in the technical lineage and define the data object hierarchy to achieve stitching.
- One or more lineage files, to define the lineage relation between two or more data objects.
- Optionally, a subfolder of source code files that contain the transformation code.
CUSTOM-LINEAGE
├── assets-domain1.json
├── assets1.json
├── lineage.json
├── lineage-extra.json
├── metadata.json
└── source_codes
    ├── sc1.sql
    └── sc2.py
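Before running the harvester, the folder requirements above can be checked with a few lines of Python. This is a minimal sketch, not part of any Collibra tooling: the function name check_batch_folder and the two file-name patterns are assumptions derived from the naming rules in this topic.

```python
import re

# Assumed patterns, based on the file-name formats described in this topic.
ASSETS_RE = re.compile(r"assets.*\.json")
LINEAGE_RE = re.compile(r"lineage.*\.json")


def check_batch_folder(file_names):
    """Group the top-level names of a batch folder and report missing files."""
    assets = [n for n in file_names if ASSETS_RE.fullmatch(n)]
    lineage = [n for n in file_names if LINEAGE_RE.fullmatch(n)]
    problems = []
    if "metadata.json" not in file_names:
        problems.append("metadata.json is missing")
    if not lineage:
        problems.append("no lineage file found")
    return {"assets": assets, "lineage": lineage, "problems": problems}


listing = ["assets-domain1.json", "assets1.json", "lineage.json",
           "lineage-extra.json", "metadata.json", "source_codes"]
print(check_batch_folder(listing)["problems"])
```

To check a real folder, pass the result of os.listdir for that folder as file_names.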
Metadata file
Your metadata file has to be named metadata.json. Format the file as shown in the following example:
{
"version": 3,
"application_name": "databricks",
"asset_types":{
"Column":{"uuid": "00000000-0000-0000-0000-000000031008"},
"Table":{"uuid": "00000000-0000-0000-0000-000000031007"},
"Database":{"uuid": "00000000-0000-0000-0000-000000031006"},
"Schema":{"uuid": "00000000-0000-0000-0001-000400000002"}
}
}
The following JSON schema describes the structure of the metadata file:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"version": {
"anyOf": [
{
"type": "integer"
},
{
"type": "string",
"pattern": "^\\d+$"
}
]
},
"application_name": {
"type": "string"
},
"asset_types": {
"type": "object",
"additionalProperties": {
"type": "object",
"properties": {
"uuid": {
"type": "string"
}
}
}
}
},
"required": [
"application_name",
"version"
]
}
| Section | Description |
|---|---|
| version | The version of the JSON architecture. For the batch definition option, the value must be 3, as in the example above. |
| application_name | The type of data source for which you are creating a technical lineage. This helps us to better understand your needs and make more informed decisions concerning future integrations. |
| asset_types | The asset types, and their UUIDs, that you want to include in the technical lineage. Important: If you choose to include asset files in your batch definition, the values (meaning the asset types) that you specify in this property must match the values that you specify in the type properties in your asset files. |
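The rules in the metadata schema can be approximated in plain Python, which is handy for catching mistakes before the folder is ingested. This is a hedged sketch: the function name metadata_problems and the wording of the messages are hypothetical, and it only mirrors the schema shown in this topic rather than replacing a real JSON Schema validator.

```python
import re


def metadata_problems(metadata):
    """Return a list of problems found in a parsed metadata.json object."""
    problems = []
    version = metadata.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_integer = isinstance(version, int) and not isinstance(version, bool)
    is_digit_string = (isinstance(version, str)
                       and re.fullmatch(r"[0-9]+", version) is not None)
    if not (is_integer or is_digit_string):
        problems.append("version must be an integer or a string of digits")
    if not isinstance(metadata.get("application_name"), str):
        problems.append("application_name must be a string")
    asset_types = metadata.get("asset_types", {})
    if not isinstance(asset_types, dict):
        problems.append("asset_types must be an object")
        return problems
    for name, definition in asset_types.items():
        # The schema allows uuid to be absent, but it must be a string if present.
        if not isinstance(definition, dict) or not isinstance(definition.get("uuid", ""), str):
            problems.append(f"asset type {name!r} must be an object with a string uuid")
    return problems


example = {
    "version": 3,
    "application_name": "databricks",
    "asset_types": {"Table": {"uuid": "00000000-0000-0000-0000-000000031007"}},
}
print(metadata_problems(example))
```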
Assets files
Optionally, you can include one or more assets files. You use asset files to provide the list of data objects you want to include in the technical lineage and define the data object hierarchy. The props property allows you to specify the full names and domain IDs of the assets.
Don't use asset files in the following scenarios:
- Your data source consists of the traditional (System) > Database > Schema > Table > Column asset types and hierarchy. In that case, full names are constructed automatically and correctly.
- You are working with assets that are not part of that traditional asset hierarchy (in which case, you need to use the props property to achieve stitching) and you define props in one or more lineage files.
The names of your assets files have to follow the format assets<something-unique>.json.
Asset files can consist of nodes, parent, and leaf kinds of assets. In the following example code, we used the nodes property to specify the highest levels of the data object hierarchy that we want to view in the technical lineage: Database and Schema. We then used the parent and leaf properties to build out the lower levels of the data object hierarchy: Table and Column, respectively.
parent assets represent what we traditionally refer to as table-level lineage. leaf assets represent what we traditionally refer to as column-level lineage.
Keep in mind that the property names nodes, parent, and leaf are designed to be non-restrictive, so you can define a hierarchy to reflect the hierarchy of any asset types (similar to the database > schema > table > column hierarchy), including your custom asset types.
Tip: For examples of how to configure the props property, see Using the props property.
The following JSON schema describes the asset definitions in an assets file:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$defs": {
"assetData": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"type": {
"type": "string"
}
},
"required": [
"name",
"type"
]
},
"props": {
"type": ["object", "null"],
"properties": {
"fullname": {
"type": "string"
},
"domain_id": {
"type": "string"
}
},
"required": [
"fullname"
]
}
},
"anyOf": [
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"props": {
"$ref": "#/$defs/props"
},
"parent": {
"$ref": "#/$defs/assetData"
},
"leaf": {
"$ref": "#/$defs/assetData"
}
},
"required": [
"nodes",
"parent",
"leaf"
]
},
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"props": {
"$ref": "#/$defs/props"
},
"parent": {
"$ref": "#/$defs/assetData"
}
},
"required": [
"nodes",
"parent"
]
},
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"props": {
"$ref": "#/$defs/props"
}
},
"required": [
"nodes"
]
}
]
}
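Because the asset types named in an assets file have to match the asset types declared in the asset_types property of the metadata file, a small cross-check can catch mismatches early. The sketch below assumes that an assets file holds a JSON array of entries shaped like the schema above; the function names types_used and undeclared_types are hypothetical.

```python
def types_used(entries):
    """Collect every asset type named in nodes, parent, and leaf."""
    used = set()
    for entry in entries:
        for node in entry.get("nodes", []):
            used.add(node["type"])
        for key in ("parent", "leaf"):
            if key in entry:
                used.add(entry[key]["type"])
    return used


def undeclared_types(entries, metadata):
    """Return the asset types that are used but not declared in the metadata."""
    declared = set(metadata.get("asset_types", {}))
    return sorted(types_used(entries) - declared)


entries = [{
    "nodes": [{"name": "DB1", "type": "Database"},
              {"name": "SCH1", "type": "Schema"}],
    "parent": {"name": "TB1", "type": "Table"},
    "leaf": {"name": "COL1", "type": "Column"},
}]
metadata = {"asset_types": {"Database": {}, "Table": {}, "Column": {}}}
print(undeclared_types(entries, metadata))
```

An empty result means every type used in the entries is declared in the metadata.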
| Property | Description |
|---|---|
| nodes | A JSON element in which you specify the highest levels of the hierarchy. For example, nodes can specify the hierarchy GCS File System > GCS Bucket. |
| name (in nodes) | The name of the node data object. The value is case-sensitive. Case-sensitivity exception: The value of the [...] |
| type (in nodes) | The type of data object of the specified node. Important: The values (meaning the asset types) that you specify for this property must match the values that you specify in the asset_types property in the metadata file. |
| parent | A lower-level data object in a hierarchy for which the highest levels are specified in the nodes property. When specifying parent data objects, you also have to include the nodes information. Important: If the [...] Tip: Each parent object can contain [...] |
| name (in parent) | The name of the parent data object. The value is case-sensitive. Case-sensitivity exception: The value of the [...] |
| type (in parent) | The asset type of the parent data object. Important: The values (meaning the asset types) that you specify for this property must match the values that you specify in the asset_types property in the metadata file. |
| leaf | The lowest-level data object in your hierarchy. When specifying leaf data objects, you also have to include the nodes and parent information. The names of parent and leaf data objects can be identical if the data objects with the same names are sub-objects of different [...] |
| name (in leaf) | The name of the leaf data object. The value is case-sensitive. Case-sensitivity exception: The value of the [...] |
| type (in leaf) | The asset type of the leaf data object. Important: The values (meaning the asset types) that you specify for this property must match the values that you specify in the asset_types property in the metadata file. |
| props | This property allows you to specify the full name and domain ID of an asset for the purpose of stitching, regardless of asset type hierarchy. When you add the props property to define the full name of an asset, it applies to the last asset in the array. Tip: For examples of how to configure the props property, see Using the props property. Important considerations: A word about file processing order and inadvertently specifying the same asset more than once. Assets files and lineage files are processed in the following order: first, all assets files in alphabetical order, followed by all lineage files in alphabetical order. If you choose to specify [...] |
| fullname (in props) | The full name of the asset in Collibra. The value is case-sensitive. |
| domain_id (in props) | The reference ID of the domain in which the asset exists in Collibra. |
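The processing order described for the props property (all assets files in alphabetical order, then all lineage files in alphabetical order) can be previewed with a short sketch. The function name processing_order is hypothetical, and Python's sorted compares by code point, which is an assumption about what "alphabetical" means here; the service may order mixed-case or special characters differently.

```python
def processing_order(file_names):
    """Order files as described in this topic: assets files first, then lineage files."""
    def pick(prefix):
        return sorted(name for name in file_names
                      if name.startswith(prefix) and name.endswith(".json"))
    return pick("assets") + pick("lineage")


names = ["lineage.json", "metadata.json", "assets1.json",
         "lineage-extra.json", "assets-domain1.json"]
for name in processing_order(names):
    print(name)
```

Knowing the order helps you see which file defines an asset first when the same asset appears in more than one file.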
Using the props property
The following examples offer some guidance as to when to use the props property and how to configure it.
First off, let's examine why you don't need to use the props property for the traditional Database > Schema > Table > Column hierarchy. Let's say you have an assets file in which you define a leaf kind of asset:
{
"nodes": [{
"name": "Snowflake",
"type": "System"
}, {
"name": "DB1",
"type": "Database"
}, {
"name": "PUBLIC",
"type": "Schema"
}],
"parent": {
"name": "T1",
"type": "Table"
},
"leaf": {
"name": "COL1",
"type": "Column"
}
}
Here, the full name of the leaf asset (a Column asset) is automatically and correctly constructed as: "snowflake>DB1>PUBLIC>T1>COL1". In fact, for the traditional database type of hierarchy, you don't even need to use asset files, much less the props property.
However, for the following custom hierarchy, you can use the props property to specify the correct full name of the leaf asset, in this case a File asset.
{
"nodes": [{
"name": "gcs",
"type": "GCS File System"
}, {
"name": "bucket1",
"type": "GCS Bucket"
}, {
"name": "/",
"type": "Directory"
}],
"parent": {
"name": "examples",
"type": "Directory"
},
"leaf": {
"name": "data.xls",
"type": "File"
},
"props": {
"fullname": "gcs > bucket1/examples/data.xls",
"domain_id": "<domain in which the file asset resides>"
}
}
If you don't provide the full name of the leaf asset, it will be constructed using the default traditional formatting (system) > database > schema > table > column. The result would be the full name: "gcs > bucket1 > / > examples > data.xls". However, this is not the correct construction for File assets. The full name provided in the example above ensures the correct construction, so that stitching is achieved.
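The difference between the default full name and an explicit props value can be illustrated as follows. This is only a sketch of the behavior described above, not Collibra's implementation: the function names are hypothetical, and the " > " separator is an assumption taken from the example result in the preceding paragraph.

```python
def default_full_name(entry, separator=" > "):
    """Join the names of nodes, parent, and leaf in hierarchy order."""
    names = [node["name"] for node in entry["nodes"]]
    for key in ("parent", "leaf"):
        if key in entry:
            names.append(entry[key]["name"])
    return separator.join(names)


def effective_full_name(entry):
    """Prefer props.fullname when it is provided, else use the default."""
    props = entry.get("props") or {}
    return props.get("fullname") or default_full_name(entry)


entry = {
    "nodes": [{"name": "gcs", "type": "GCS File System"},
              {"name": "bucket1", "type": "GCS Bucket"},
              {"name": "/", "type": "Directory"}],
    "parent": {"name": "examples", "type": "Directory"},
    "leaf": {"name": "data.xls", "type": "File"},
}
print(default_full_name(entry))
entry["props"] = {"fullname": "gcs > bucket1/examples/data.xls"}
print(effective_full_name(entry))
```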
- A props property in a nodes section describes the last asset in the array. In this example, you are specifying the full name and domain ID of the Schema asset "SCH1", not the Database asset "DB1".
  {
    "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
    "props": {
      "fullname": "<full name of the Schema asset>",
      "domain_id": "<domain of the Schema asset>"
    }
  }
- If you also want to specify the full name and domain ID of the Database asset, you need a second entry, as follows:
  {
    "nodes": [{"name": "DB1", "type": "Database"}],
    "props": {
      "fullname": "<full name of the Database asset>",
      "domain_id": "<domain of the Database asset>"
    }
  }
- To specify the full name and domain ID of a parent asset, use the props property as follows:
  {
    "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
    "parent": {"name": "TB1", "type": "Table"},
    "props": {
      "fullname": "<full name of the Table asset>",
      "domain_id": "<domain of the Table asset>"
    }
  }
- To specify the full name and domain ID of a leaf asset, use the props property as follows:
  {
    "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
    "parent": {"name": "TB1", "type": "Table"},
    "leaf": {"name": "COL1", "type": "Column"},
    "props": {
      "fullname": "<full name of the Column asset>",
      "domain_id": "<domain of the Column asset>"
    }
  }
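The rule that runs through the examples above is that props always describes the deepest asset in the entry. A minimal sketch of that rule, with a hypothetical function name:

```python
def asset_described_by_props(entry):
    """Return the asset that a props property in this entry applies to."""
    # The deepest level wins: leaf, then parent, then the last node.
    for key in ("leaf", "parent"):
        if key in entry:
            return entry[key]
    return entry["nodes"][-1]


nodes = [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}]
print(asset_described_by_props({"nodes": nodes}))
print(asset_described_by_props({"nodes": nodes,
                                "parent": {"name": "TB1", "type": "Table"}}))
```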
There are two ways to find the full name and domain ID of an asset.
Via API
You can use the Find assets API, as documented in the Collibra Developer Portal. In the body of the response, look for the following details in the results array:
- name: This is the full name of the asset.
- domain > id: This is the unique reference ID of the domain in which the asset is located.
Via Data Catalog
- To find the full name, open the relevant asset page, and then click Actions > Edit. The Edit dialog box appears. The full name is shown in the Name field.
- To find the domain ID, open the relevant domain in Collibra. The reference ID is the part of the URL that follows /domain/. In the following example URL, the reference ID is 22258f64-40b6-4b16-9c08-c95f8ec0da26:
https://<yourcollibrainstance>/domain/22258f64-40b6-4b16-9c08-c95f8ec0da26?view=00000000-0000-0000-0000-000000040001
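If you collect domain URLs from the browser, the reference ID can be extracted programmatically instead of by hand. The sketch below uses only the Python standard library; the function name domain_id_from_url and the host collibra.example.com are hypothetical placeholders.

```python
import uuid
from urllib.parse import urlparse


def domain_id_from_url(url):
    """Return the domain reference ID found after /domain/ in a Collibra URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "domain" not in segments:
        raise ValueError("not a domain URL")
    index = segments.index("domain")
    if index + 1 >= len(segments):
        raise ValueError("no ID after /domain/")
    # uuid.UUID raises ValueError if the segment is not a valid UUID.
    return str(uuid.UUID(segments[index + 1]))


url = ("https://collibra.example.com/domain/"
       "22258f64-40b6-4b16-9c08-c95f8ec0da26"
       "?view=00000000-0000-0000-0000-000000040001")
print(domain_id_from_url(url))
```

The returned string can be used as the domain_id value in a props property.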
Lineage files
You can have one or more lineage files in the folder. The names of your lineage files have to follow the format lineage<something-unique>.json.
You use the lineage file to define the lineage relation between two or more data objects. The lineage relations are shown as edges in the technical lineage graph. The edges represent the data flow from a source to a target.
This section contains the path from a source to a target and defines the transformation code or transformation references to be processed by the Collibra Data Lineage service.
Important: If the useCollibraSystemName property in your lineage harvester (deprecated) configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester (deprecated) configuration file, do not specify the system data object in these files, or else stitching will fail.
The following JSON schema describes each lineage relation in a lineage file:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$defs": {
"assetData": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"type": {
"type": "string"
}
},
"additionalProperties": false,
"required": [
"name",
"type"
]
},
"props": {
"type": [
"object",
"null"
],
"properties": {
"fullname": {
"type": "string"
},
"domain_id": {
"type": "string",
"format": "uuid"
}
},
"additionalProperties": false,
"required": [
"fullname"
]
}
},
"type": "object",
"properties": {
"src": {
"anyOf": [
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"parent": {
"$ref": "#/$defs/assetData"
},
"leaf": {
"$ref": "#/$defs/assetData"
},
"props": {
"$ref": "#/$defs/props"
}
},
"additionalProperties": false,
"required": [
"nodes",
"parent",
"leaf"
]
},
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"parent": {
"$ref": "#/$defs/assetData"
},
"props": {
"$ref": "#/$defs/props"
}
},
"additionalProperties": false,
"required": [
"nodes",
"parent"
]
}
]
},
"trg": {
"anyOf": [
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"parent": {
"$ref": "#/$defs/assetData"
},
"leaf": {
"$ref": "#/$defs/assetData"
},
"props": {
"$ref": "#/$defs/props"
}
},
"additionalProperties": false,
"required": [
"nodes",
"parent",
"leaf"
]
},
{
"type": "object",
"properties": {
"nodes": {
"type": "array",
"items": {
"$ref": "#/$defs/assetData"
}
},
"parent": {
"$ref": "#/$defs/assetData"
},
"props": {
"$ref": "#/$defs/props"
}
},
"additionalProperties": false,
"required": [
"nodes",
"parent"
]
}
]
},
"source_code": {
"type": "object",
"properties": {
"path": {
"type": "string"
},
"highlights": {
"type": [
"array",
"null"
],
"items": {
"type": "object",
"properties": {
"start": {
"type": "integer"
},
"len": {
"type": "integer"
}
},
"additionalProperties": false,
"required": [
"len",
"start"
]
}
},
"transformation_display_name": {
"anyOf": [
{
"type": "integer"
},
{
"type": "string"
}
]
}
},
"additionalProperties": false,
"required": [
"path"
]
}
},
"additionalProperties": false,
"required": [
"src",
"trg"
]
}
The following example shows a lineage file with one lineage relation:
[
  {
    "src": {
      "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
      "parent": {"name": "TB1", "type": "Table"},
      "leaf": {"name": "COL1", "type": "Column"},
      "props": {
        "fullname": "<full name of the leaf asset>",
        "domain_id": "<domain of the leaf asset>"
      }
    },
    "trg": {
      "nodes": [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}],
      "parent": {"name": "TB2", "type": "Table"},
      "props": {
        "fullname": "<full name of the parent asset>",
        "domain_id": "<domain of the parent asset>"
      }
    },
    "source_code": {
      "path": "<folder name>/sc1.sql",
      "highlights": [{"start": 71, "len": 69}, ...],
      "transformation_display_name": "middle bubble"
    }
  }
]
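The start and len values of a highlights item are measured in characters, not bytes, so they are easy to compute from the decoded text of a source code file. The following sketch shows one way to do that. The function name highlight_for is hypothetical, and the sketch assumes that start is a zero-based offset, which this topic does not state; verify the result in the source code pane.

```python
def highlight_for(source_text, snippet):
    """Build a highlights item for the first occurrence of snippet."""
    start = source_text.find(snippet)
    if start == -1:
        raise ValueError("snippet not found in source text")
    # Python strings are indexed by character, so a non-ASCII character
    # before the snippet counts as one position, not as several bytes.
    return {"start": start, "len": len(snippet)}


source_text = "-- é\nINSERT INTO t2 SELECT c FROM t1;"
print(highlight_for(source_text, "SELECT c FROM t1"))
```

Read the file with the correct encoding (for example, open(path, encoding="utf-8")) before computing offsets, otherwise the character positions may be wrong.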
| Properties | Description |
|---|---|
| src | The hierarchical path to the source data object. This property represents where the data comes from for a transformation. Important: The source of a lineage can only be a parent or a leaf. |
| trg | The hierarchical path to the target data object. This property represents where the data flows to. Important: The target can be a parent or a leaf; however, if the source is a parent, the target must be a parent. Tip: If the target asset is a parent asset and the source asset is a leaf asset, we refer to the lineage as "indirect lineage". If the target asset is a parent asset and the source asset is a parent asset, we refer to the lineage as "table-level lineage". |
| props (in src and trg) | An optional property that allows you to specify the full name and domain of an asset, for the purpose of stitching. This property is not required for Database, Schema, Table and Column asset types. A word about file processing order and inadvertently specifying the same asset more than once: Assets files and lineage files are processed in the following order: first, all assets files in alphabetical order, followed by all lineage files in alphabetical order. If you choose to specify [...] |
| source_code | The transformation code that determines how the technical lineage is constructed. This can be a descriptive string or a SQL statement that manipulates data. This section is optional. |
| path (in source_code) | The path and name of the source code file that contains the transformation code. The path is relative to the source_codes folder, which is in the same folder as the lineage JSON files. |
| highlights (in source_code) | This optional property identifies a string of transformation code in a source code file to be highlighted in the source code pane at the bottom part of the technical lineage graph. The entire lines that include the transformation code are highlighted. The string must be a subset of the string of transformation code that is defined by the [...] |
| start (in highlights) | The start position of the string of the transformation code to be highlighted. The start position is in characters, not bytes. |
| len (in highlights) | The length of the string of the transformation code to be highlighted. The length is in characters, not bytes. |
| transformation_display_name (in source_code) | The name of the transformation as shown in the transformations view of the technical lineage viewer. |
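The src and trg rules can be summarized in a small classifier. This is a sketch under stated assumptions: the function name lineage_kind is hypothetical, and "column-level" is used here as a label for leaf-to-leaf lineage, following the description of leaf assets earlier in this topic.

```python
def lineage_kind(entry):
    """Classify a lineage entry by whether src and trg end in a leaf or a parent."""
    src_is_leaf = "leaf" in entry["src"]
    trg_is_leaf = "leaf" in entry["trg"]
    if src_is_leaf and trg_is_leaf:
        return "column-level"
    if src_is_leaf:
        return "indirect"       # leaf source, parent target
    if not trg_is_leaf:
        return "table-level"    # parent source, parent target
    # A parent source with a leaf target is not allowed.
    raise ValueError("if the source is a parent, the target must be a parent")


nodes = [{"name": "DB1", "type": "Database"}, {"name": "SCH1", "type": "Schema"}]
column = {"nodes": nodes, "parent": {"name": "TB1", "type": "Table"},
          "leaf": {"name": "COL1", "type": "Column"}}
table = {"nodes": nodes, "parent": {"name": "TB2", "type": "Table"}}
print(lineage_kind({"src": column, "trg": table}))
```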
Source codes subfolder and files
You can provide a subfolder of source code files that define the transformation details. The source code subfolder must be in the CUSTOM_LINEAGE folder, along with the JSON files. Otherwise, an error occurs indicating that the lineage harvester cannot find the source code files.
The source code paths are relative to the CUSTOM_LINEAGE folder. For example:
- source_codes/sc1.sql
- source_codes/another-subfolder/sc2.sql
Important: The paths must not contain ./. The following will fail:
- source_codes/./sc1.sql
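A path that contains a ./ segment is easy to overlook when paths are generated by a script. The check below is a minimal sketch with a hypothetical function name; it only detects the ./ case described above and does not verify that the file exists.

```python
def has_dot_segment(path):
    """Return True if any segment of a forward-slash path is exactly '.'."""
    return "." in path.split("/")


for candidate in ("source_codes/sc1.sql",
                  "source_codes/another-subfolder/sc2.sql",
                  "source_codes/./sc1.sql"):
    status = "will fail" if has_dot_segment(candidate) else "ok"
    print(candidate, status)
```

File names that merely contain dots, such as sc1.v2.sql, are not flagged, because only a segment that is exactly "." is rejected.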
What happens if you choose not to provide source code files
If you are using the lineage harvester and there are no source code files to analyze, the batch stats are empty, as shown below. This is expected, because batch stats are directly linked to the source code files. The lineage relations are still created.
Batch stats: Parsing errors: 0 Analysis errors: 0 Done: 1
The Done: 1 result is a dummy entry, so that the source appears in the Sources tab page.
Example JSON files
For some example JSON files, go to Custom technical lineage JSON file examples.
If you opt for the single-file definition option, you use a lineage.json file to define the lineage between two or more data objects, and optionally include transformations details to create the custom technical lineage.
The following sections in the JSON file define different parts in the resulting Collibra technical lineage graph:
- tree, which defines the data object hierarchy. The data objects are shown as nodes in the technical lineage graph.
- lineages, which defines the lineage relation. The lineage relations are shown as edges in the technical lineage graph. The edges represent the data flow from a source to a target.
- codebase_files, which points to the source code files that include transformation details.
To create a simple custom technical lineage, you need to include tree and lineages sections in your JSON file. You can add the transformation code in the lineages section.
To create an advanced custom technical lineage, you need to include tree, lineages and codebase_files sections in your JSON file. You add references to the transformation code in source code files in the codebase_files section.
Transformation code in both simple and advanced custom technical lineages is shown in the source code pane at the bottom part of the technical lineage graph.
Requirements and restrictions
The source code files must be in the same directory as the lineage.json file. Otherwise, an error occurs indicating that the lineage harvester (deprecated) cannot find the source code files.
| Sections | Description |
|---|---|
| version | The version of the JSON architecture. Specify the value of [...] |
| tree | This section contains tree definitions of data objects between which lineages can be defined. The data objects are systems, databases, schemas, tables, views, columns, dashboards and reports. Each node of a tree contains the name, type and optionally children or leaves properties, which form a hierarchy of data objects. You must define a node only once in this section. With the nested tree format, you can reuse the properties of one node for multiple children. For example, you can define a database once and use the [...] Tip: Usually, the structure you map is the following: system > database > schema > table > column. The system is optional, unless the [...] Important: If the [...] |
| lineages | This section contains the path from a source to a target and defines the transformation code or transformation references to be processed by the Collibra Data Lineage service. Important: If the useCollibraSystemName property in your lineage harvester (deprecated) configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the property is set to false, do not specify the system data object in this section, or else stitching will fail. |
| codebase_files | This optional section defines the reference to source code files. Store the source code files that contain the transformation code in the same directory as the lineage.json file. Include this section only when you create an advanced custom technical lineage. |
| Properties | Description |
|---|---|
| name | The name of your data object. Specify this property with the system name, database name, schema name, table name, view name or column name. The following rules apply when you specify this property: [...] |
| type | The type of your data object. You can specify one of the following options: [...] If the useCollibraSystemName property in your lineage harvester (deprecated) configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester (deprecated) configuration file, do not specify the system data object in this section, or else stitching will fail. |
| children | The sub-objects that have a hierarchical relation to the defined data object. Each child can contain [...] For example, you can use the [...] Each child and leaf has the [...] |
| leaves | The sub-objects of an object that is defined in a [...] A technical lineage is defined as relations between leaf nodes of the tree. The value of the [...] |
| Properties | Required | Description |
|---|---|---|
| src_path | Yes | The hierarchical path to the source data object. This data object is defined as a leaf in the tree section. This property represents where the data comes from for a transformation. |
| trg_path | Yes | The hierarchical path to the target data object. This data object is defined as a leaf in the tree section. This property represents where the data flows to. |
| <data objects> | Yes | An ordered array of data object names. This array is required to define the sub-objects of the [...] Specify the array with the data object names that start from the top of the [...] This example shows data objects that can be stitched: system > database > schema > table > column. This example shows data objects that cannot be stitched: dashboard > report > column. If the useCollibraSystemName property in your lineage harvester (deprecated) configuration file is set to true, the system data object is used to stitch to the System asset in Data Catalog. If the useCollibraSystemName property is set to false in your lineage harvester (deprecated) configuration file, do not specify the system data object in this section, or else stitching will fail. |
| mapping | Yes (simple custom technical lineage only) | The mapping name. This property specifies a name for the transformation code. |
| source_code | Yes (simple custom technical lineage only) | The transformation code, which determines how the technical lineage is constructed. The transformation code can be a descriptive string or a SQL statement that manipulates data. |
| mapping_ref | No (advanced custom technical lineage only) | This property contains the name of the mapping reference to the transformation code in source code files. This property also contains the position and length of the transformation code to be highlighted in the technical lineage graph. |
| source_code | No (advanced custom technical lineage only) | The name of the source code file that contains the transformation code. The transformation code can be a SQL statement, code that manipulates data or a descriptive string. |
| mapping | No (advanced custom technical lineage only) | The unique descriptor of a part of transformation code in a source code file that is in the same directory as the lineage.json file. A source code file can contain different parts of transformation code that represent different data flows. This property indicates the referenced data flow. The value of this property is the same as the value of the [...] |
| codebase_pos | No (advanced custom technical lineage only) | The positions indicate a string of the transformation code in a source code file to be highlighted in the bottom part of the Collibra technical lineage graph. The whole lines that include the transformation code are highlighted. The string must be a subset of the string of the transformation code that is defined by the [...] |
| pos_start | No (advanced custom technical lineage only) | The start position of the string of the transformation code to be highlighted. The start position is in characters, not bytes. The value must be equal to or greater than the value of the [...] |
| pos_len | No (advanced custom technical lineage only) | The length of the string of the transformation code to be highlighted. The length is in characters, not bytes. Specify a value in the following range: [...] For example, if you specify [...] |
| Properties | Description |
|---|---|
| <source code path> | The file path to source code files that contain the transformation code. The transformation code can be a SQL statement or code that manipulates data. |
| [...] | The mapping of the transformation code and the position of the transformation code that is shown in the bottom part of the technical lineage graph. This property defines a string of the transformation code in the source code file to be shown in the technical lineage graph. The string must include the string that is defined by the [...] |
| <mapping> | The unique descriptor of a part of transformation code in a source code file that is in the same directory as the lineage.json file. A source code file can contain different parts of transformation code that represent different data flows. This property indicates the referenced data flow. The value must match the value of the [...] |
| pos_start | The start position of the string of the transformation code. The start position is in characters, not bytes. Specify a value in the following range: [...] |
| pos_len | The length of the string of the transformation code. The length is in characters, not bytes. Specify a value in the following range: [...] For example, if you specify [...] |
Programming considerations
| Consideration | Details |
|---|---|
| Indirect and table-level lineage | Currently, there is no native support for indirect and table-level lineages. As a workaround, you can specify [...] |
| Many-to-one column lineage | The fundamental unit of definition within the custom technical lineage JSON file is a single source-to-target mapping. Collibra Data Lineage requires a one-to-one relationship definition between the source asset and the target asset for each entry. To represent a many-to-one relationship, such as when multiple source columns contribute to a single target column (for example: Source A, Source B, Source C → Target Column D), you must define individual lineage entries for each source-to-target pair. |
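Expanding a many-to-one relationship into one entry per source-to-target pair is mechanical, so it can be scripted. The sketch below is illustrative only: the function name expand_many_to_one is hypothetical, and the default key names src and trg are those of the batch definition lineage files. For the single-file definition, pass src_path and trg_path as the key names instead.

```python
def expand_many_to_one(sources, target, source_key="src", target_key="trg"):
    """Create one lineage entry per source, all pointing at the same target."""
    return [{source_key: source, target_key: target} for source in sources]


entries = expand_many_to_one(["Source A", "Source B", "Source C"],
                             "Target Column D")
for entry in entries:
    print(entry)
```

In a real file, each source and the target would be a full data object path rather than a plain string.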
Example
For some example JSON files, go to Custom technical lineage JSON file examples.