Sharing database models across data sources
Sharing database models allows you to provide table-definition details from an independent data source to a data source that is dependent on those details. This is needed to avoid analysis errors and to have a complete lineage that includes lineage from the SQL statements from dependent data sources.
You can use Collibra Data Lineage to create a technical lineage for several data sources. In each data source, a data model defines the structure of tables in a database. Data Definition Language (DDL) is the language used to create the structure. When metadata is harvested from a data source, we also extract the DDL file.
DDL files allow you to create and modify the structure of objects in a database. In the context of data lineage, we are interested in the structure of tables. Commonly used DDL in SQL querying are CREATE, ALTER, DROP, and TRUNCATE. The CREATE command, for example, creates a new table.
Dependent data sources
What are dependent data sources? Let's consider an example. Let's say that you want to create a technical lineage for two data sources:
- Database1 contains the DDL that specifies that the database has a table named "Table1", which has three columns: Col1, Col2, Col3, and Col4.
-
Database2 contains an SQL statement:
SELECT * from Database1.Schema1.Table1
.
The SQL statement in Database2 refers to the table in Database1. Therefore, to get lineage from the statement in Database2, the table definition from Database1 must be known. In this case, we say that Database2 is dependent on Database1. Database1 is considered the independent data source.
To have complete lineage and avoid analysis errors, you can specify the dependency in Edge. This allows the database model of Database1 to be shared with Database2.
- An analyze error is raised, prompting you to provide the DDL file.
- The only workaround is to consolidate your SQL statements and DDL file in a single data source.
Example 1
In this example, you specify Source 1 and Source 2 as dependent data sources. However, as shown below, you have a database named Database 1 in two different systems. This will fail because when analyzing an SQL statement, there is no way of knowing to which system Database 1 should be attributed.
Source 1
System A --> Database 1
System A --> Database 2
Source 2
System B --> Database 1
Example 2
Again you specify Source 1 and Source 2 as dependent data sources. This scenario is fine because Database 1 exists only in System A.
Source 1
System A --> Database 1
System A --> Database 2
Source 2
System A --> Database 1
How to specify dependent sources
Database model-sharing is available via both Edge and the lineage harvester.
Specify dependent sources in your Edge capability
You can specify data source dependencies in the Dependent On Sources field in your Edge capability. In our example scenario above, in which Database2 is dependent on Database1, you specify the Source ID of Database1 in the Dependent On Sources field when you configure the Edge capability for Database2.
If, for example, Database2 is dependent on more than one data source, click Add property to add another field, and then specify the Source ID of the next data source.
Specify dependent sources in your lineage harvester configuration file
Note Database model-sharing is available via lineage harvester 2023.11 and newer.
You can specify data source dependencies by including the dependentSourceIds
property in your lineage harvester configuration file.
If Database2 is dependent on Database1, include the dependentSourceIds
property and specify the Source ID of Database1:
"dependentSourceIds": ["<source ID of Database1>"]
If, for example, Database2 is dependent on more than one data source, specify the independent sources, as follows:
"dependentSourceIds": ["<source ID of an independent source>", "<source ID of another independent source>"]
{ "general": { "catalog" : { "url" : "https://companydomain.collibra.com", "username" : "my-Collibra-username" }, "useCollibraSystemName" : false }, "sources" : [ { "id": "oracle-id", "type": "DatabaseOracle", "hostname": "host_url", "username": "user1", "collibraSystemName": "automation_csn", "port": 1521, "serviceNames": ["sn1", "sn2"], "databaseNames": ["db1", "db2"], "dependentSourceIds": ["source ID of Database1"], "deleteRawMetadataAfterProcessing": false } ] }
Synchronize independent sources first
Note The information in this section is only applicable if you're using Edge. If you use the lineage harvester and all relevant sources are mentioned in the same lineage harvester configuration file, then Collibra Data Lineage automatically coordinates the correct order of analysis and synchronization.
If you have a dependent data source, it's important that the analysis of each independent source is complete before you synchronize the dependent source. That's because the analysis of each independent source is needed to analyze the dependent source.
If the analysis of an independent source is not complete when you synchronize the dependent source, then:
- If the dependent source is SQL data source, analysis errors will be produced and the lineage will be incomplete.
- If the dependent source is a BI or ETL tool, a dummy column (*) is shown in the technical lineage graph, as the actual columns of the table are not known.
Examples
The following examples show what happens:
- If a particular database model is not shared with a data source that is dependent on it.
- After specifying the dependency in the Edge capacity of the dependent data source.
Data source type | Details |
---|---|
SQL |
If the database model is not shared, analysis errors occur:
After specifying the dependency, the analysis errors are resolved.
|
BI or ETL tool |
If the database model is not shared, a dummy column (*) is shown in the technical lineage graph.
After specifying the dependency, the columns are shown in the technical lineage graph.
|