Prepare the lineage harvester configuration file for AWS Glue
You have to prepare a technical lineage configuration file before you run the lineage harvester. The lineage harvester collects your AWS Glue script annotations and sends them to the Collibra Data Lineage server, where they are processed and analyzed.
Warning You cannot create a technical lineage for AWS Glue script annotations by synchronizing Amazon S3. You have to prepare the lineage harvester configuration file and run the lineage harvester to see the technical lineage for AWS Glue script annotations.
Prerequisites
- You have the lineage harvester 1.4.0 or newer.
- You have a global role that has the Manage all resources global permission.
- You have a global role with the Catalog global permission, for example Catalog Author.
- You have a global role with the Technical lineage global permission.
- You have downloaded the lineage harvester and you have the necessary system requirements to run it.
Steps
- Run the following command line to start the lineage harvester:
  - for Windows: .\bin\lineage-harvester.bat
  - for other operating systems: chmod +x bin/lineage-harvester and then bin/lineage-harvester
  An empty configuration file is created in the lineage harvester config folder.
- Open the lineage-harvester.conf file and enter the values for each property.
  Properties and descriptions:
  - general: This section describes the connection information between the lineage harvester and Data Catalog.
    - catalog: This section contains information that is necessary to connect to Data Catalog.
      - url: The URL of your Collibra Data Intelligence Cloud environment.
        Note You have to enter the public URL of your Collibra DGC environment. Other URLs will not work.
      - username: The username that you use to sign in to Collibra.
    - useCollibraSystemName: Indication whether you want to use the system or server name of a JDBC data source to match to the System asset you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.
      By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.
      - If you keep the property set to false, the lineage harvester ignores the collibraSystemName property in the rest of the configuration file.
      - If you set the useCollibraSystemName property to true, the lineage harvester reads the value in the collibraSystemName property in all sections of the configuration file and in the AWS Glue <source ID> configuration file.
      Warning Unless you have multiple databases with the same name, we highly recommend that you keep the default value.
  - sources: This section contains the required information for AWS Glue.
    - id: The unique ID that is used to identify the data source on the Collibra Data Lineage server. For example, my_aws_glue.
    - type: The kind of data source. In this case, the value has to be AwsGlue.
    - region: The AWS region the lineage harvester connects to. For example, eu-west-3.
      Note See the AWS documentation for a list of all AWS locations.
    - awsAccessKeyId: The access key ID of the programmatic AWS user.
    - collibraSystemName: The system name of AWS Glue.
- Save the configuration file.
- Start the lineage harvester again in the console and run the following command:
  - for Windows: .\bin\lineage-harvester.bat full-sync
  - for other operating systems: ./bin/lineage-harvester full-sync
- When prompted, enter the password or secret access key to connect to your Collibra Data Intelligence Cloud and AWS environment. The passwords are encrypted and stored in /config/pwd.conf.
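The two launch steps differ only in the script that is called on each platform. As a minimal sketch, assuming the lineage harvester is unpacked in the current working directory with the bin folder shown in the steps, a small Python helper could build the matching command line. The name harvester_command is hypothetical and is not part of the lineage harvester.

```python
import platform


def harvester_command(action=None, system=None):
    """Build the lineage harvester command line for a platform.

    `action` is an optional argument such as "full-sync". `system`
    defaults to the value reported by platform.system().
    Hypothetical helper, for illustration only.
    """
    system = system or platform.system()
    if system == "Windows":
        script = r".\bin\lineage-harvester.bat"
    else:
        # On other operating systems the script has to be made
        # executable first: chmod +x bin/lineage-harvester
        script = "./bin/lineage-harvester"
    return [script] + ([action] if action else [])


if __name__ == "__main__":
    # The first run (no action) creates the empty configuration file.
    print(harvester_command())
    # The second run, after the file is filled in, uses full-sync.
    print(harvester_command("full-sync"))
```

The returned list could then be passed to subprocess.run from the folder that contains bin. The harvester still prompts for the password and secret access key in the console.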
Example
{
  "general": {
    "catalog": {
      "url": "https://<organization>.collibra.com",
      "username": "<your-collibra-username>"
    },
    "useCollibraSystemName": false
  },
  "sources": {
    "type": "AwsGlue",
    "id": "aws-glue_source",
    "region": "us-east-2",
    "awsAccessKeyId": "my-AWS-Glue-access-key",
    "collibraSystemName": "AWS-Glue-system"
  }
}
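A misspelled property name is easy to miss in a hand-edited file, so it can help to check the configuration before you run the full-sync command. The following is a minimal sketch, not a Collibra tool: check_config is a hypothetical helper, and the lists of required properties are an assumption derived from the property descriptions in the steps. It also applies the useCollibraSystemName rule: collibraSystemName is only expected when that property is set to true.

```python
import json

# Assumption: these properties are treated as required, based on the
# property descriptions in this page.
REQUIRED_CATALOG = ("url", "username")
REQUIRED_SOURCE = ("id", "type", "region", "awsAccessKeyId")


def check_config(cfg):
    """Return a list of problems found in a parsed configuration."""
    problems = []
    general = cfg.get("general", {})
    catalog = general.get("catalog", {})
    for key in REQUIRED_CATALOG:
        if key not in catalog:
            problems.append(f"general.catalog.{key} is missing")

    # The default value of useCollibraSystemName is false.
    use_system_name = general.get("useCollibraSystemName", False)

    sources = cfg.get("sources", [])
    # Assumption: a single source object is handled like a one-item list.
    if isinstance(sources, dict):
        sources = [sources]
    for i, src in enumerate(sources):
        for key in REQUIRED_SOURCE:
            if key not in src:
                problems.append(f"sources[{i}].{key} is missing")
        if "type" in src and src["type"] != "AwsGlue":
            problems.append(f"sources[{i}].type must be AwsGlue")
        # collibraSystemName is ignored unless useCollibraSystemName is true.
        if use_system_name and "collibraSystemName" not in src:
            problems.append(
                f"sources[{i}].collibraSystemName is missing "
                "although useCollibraSystemName is true"
            )
    return problems


if __name__ == "__main__":
    sample = json.loads("""
    {
      "general": {
        "catalog": {"url": "https://<organization>.collibra.com",
                    "username": "<your-collibra-username>"},
        "useCollibraSystemName": false
      },
      "sources": {"type": "AwsGlue", "id": "my_aws_glue",
                  "region": "eu-west-3", "awsAccessKeyId": "<access-key-id>"}
    }
    """)
    for problem in check_config(sample) or ["no problems found"]:
        print(problem)
```

To check a real file, load it with json.load and pass the result to check_config. The location of the file inside the config folder depends on where you unpacked the lineage harvester.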
What's next?
The lineage harvester sends the AWS Glue script annotations to the Collibra Data Lineage server. Data Catalog then imports the technical lineage.