Transformation logic and common errors for DataStage
Use this topic to understand how Collibra Data Lineage extracts lineage and resolves parameters from DataStage, based on whether jobs are specified in the source configuration. This topic also lists the common errors you might encounter when creating technical lineage for DataStage and provides possible solutions to each error.
Transformation logic and parameter resolution
In DataStage, you can have multiple jobs, and jobs can call other jobs. Collibra Data Lineage extracts lineage based on whether jobs are specified in the source configuration. Additionally, parameter resolution depends on whether parameter names and values are specified for the jobs in the source configuration.
If you use technical lineage via Edge, specify jobs, parameter names, and values in the Source Configuration field in the technical lineage for DataStage capability. If you use the lineage harvester, specify them in the source ID file.
Consider the following sequence of jobs as an example. In this example, only Job1 and Job4 are ran as parent jobs in DataStage.
For jobs not configured:
If you do not specify jobs in the source configuration, Collibra Data Lineage considers all jobs are ran as parent jobs in DataStage, including the jobs that are called. As a result, Collibra Data Lineage extracts lineage with the following sequences. When combined with the parameter resolution process, this approach might lead to incorrect or duplicate technical lineage.
For jobs configured:
If you specify jobs in the source configuration, Collibra Data Lineage extracts lineage only from the specified jobs and the jobs that are called by the specified jobs. Take the following source configuration as an example:
"jobs": [ "job1", "job4" ]
In this case, Collibra Data Lineage extracts lineage for Job1, Job4, and the jobs that Job1 and Job4 called:
Parameter resolution
Input parameters in jobs can directly impact the extracted lineage. For example, a parameter that influences an SQL statement can affect how lineage is determined.
Collibra Data Lineage resolves parameters in a specific order, depending on whether the job is a parent job or a job called by another job.
In the following example, job5 is called by parent job Job4.
For jobs that are called by another job:
Collibra Data Lineage resolves the value for j5_param1
in Job5 in the following order:
- Use the value from the parent job, if available. If a DataStage expression is detected,
DATASTAGE EXPRESSION
is used as the value. - Use the value specified in the
perJobParameters
property for this specific job in the source configuration, if available. - Use the value specified in the
jobParameters
property in the source configuration, if available. - Use the default value.
For parent jobs:
Collibra Data Lineage resolves the values for j4_param1
and j4_param2
in parent job job4 in the following order:
- Use the value specified in the
perJobParameters
property for this specific job in the source configuration, if available. - Use the value specified in the
jobParameters
property in the source configuration, if available. - Use the default value.
In the following example, j4_param1
resolves to value for j4_param1
, and j4_param2
resolves to specific job value for j4_param2
, because the perJobParameters
property takes precedence over the jobParameters
property.
"jobParameters": [ { "name": "j4_param1", "value": "value for j4_param1" }, { "name": "j4_param2", "value": "general value for j4_param2" } ], "perJobParameters": { "Job4": [ { "name": "j4_param2", "value": "specific job value for j4_param2" } ] }
Recursive lookup
Collibra Data Lineage supports recursive lookup, so you can specify a parameter reference in the perJobParameters
or jobParameters
property.
For example, if you specify the jobParameters
property as follows, Collibra Data Lineage looks up the value for the #another_param
parameter recursively until an actual value is resolved. In this case, the actual value is final value
.
"jobParameters": [ { "name": "j5_param1" "value": "#another_param" } { "name": "another_param", "value": "final value" } ]
Parameter set resolution
If a parameter set is used in DataStage, ensure that you specify the value
property in perJobParameters
or jobParameters
property with the value file name.
The following graphic shows an example of the value file names.
The following example shows that the valueFileName1
value file name is specified for the parameterSet1
parameter set.
"jobParameters": [ { "name": "parameterSet1", "value": "valueFileName1" } ]
When you specify the value
property with a value file name, the value is resolved in the following order:
- Use the value from the parent job, if available.
- Use the value specified in the
perJobParameters
property for this specific job in the source configuration, if available. - Use the value specified in the
jobParameters
property in the source configuration, if available. - When the Default Value is set to
(As pre-defined)
, use the pre-defined value from the parent job. - Use the default value.
User variable resolution
User variables can contain DataStage expressions. When Collibra Data Lineage detects a DataStage expression, the value is resolved in the following order:
- Use the value specified in the
perJobParameters
property for this specific job in the source configuration, if available. - Use the value specified in the
jobParameters
property in the source configuration, if available. - Use
DATASTAGE EXPRESSION
as the value.
The following code example shows how to use the user variable.
"jobParameters": [ { "name": "UserVariable_Activity_5.sql_query", "value": "SELECT COL1, COL2 FROM TABLE1;" } ]
You can specify parameter set references with user variables.
"jobParameters": [ { "name": "UserVariable_Activity_5.sql_query", "value": "#Parameter_Set_SQL_Statements.SrcQuery#" } ]
Analyze errors and possible solutions
After you create technical lineage for DataStage, you might encounter the following errors in the analysis results in the technical lineage Sources tab page.
When you review the analysis results of DataStage, you might encounter the following error message: Please define 'UNDEFINED' values and fill 'UNRESOLVED' parameters in your configuration file. This error occurs in the following scenarios:
- If you use technical lineage via Edge to create technical lineage, certain properties were not specified in the Source Configuration field in the Technical Lineage for DataStage capability.
- If you use the lineage harvester to create technical lineage, certain properties were not specified in the source ID file.
The unspecified properties are marked as UNDEFINED
in the transformation details. To resolve this issue, specify the properties.
When you review the analysis results of DataStage, you might encounter a DATASTAGE_EXPRESSION substitution even when there is no error.
CollibraData Lineage can parse and create technical lineage for complex formats with advanced functions and statements, such as Trim, Field, and If...Then...Else. When such a complex expression is parsed, it is substituted with the string "DATASTAGE_EXPRESSION".
No action is required for this message.
For more information, go to Expression format in IBM InfoSphere Information Server documentation.