Prepare the lineage harvester configuration file for AWS Glue

You have to prepare a technical lineage configuration file before you run the lineage harvester. The lineage harvester collects your AWS Glue script annotations and sends them to the Collibra Data Lineage server, where they are processed and analyzed.

Warning: You cannot create a technical lineage for AWS Glue script annotations by synchronizing Amazon S3. You have to prepare the lineage harvester configuration file and run the lineage harvester to see the technical lineage for AWS Glue script annotations.

Prerequisites

You have the lineage harvester 1.4.0 or newer.
You have a global role that has the Manage all resources global permission.
You have a global role with the Catalog global permission, for example Catalog Author.
You have a global role with the Technical lineage global permission.
You have downloaded the lineage harvester and your system meets the requirements to run it.

Steps

Run the following command to start the lineage harvester:
- For Windows: .\bin\lineage-harvester.bat
- For other operating systems: chmod +x bin/lineage-harvester and then bin/lineage-harvester
An empty configuration file is created in the lineage harvester config folder.

Open the lineage-harvester.conf file and enter the values for each property.

Properties	Description
general	This section describes the connection information between the lineage harvester and Data Catalog.
catalog	This section contains information that is necessary to connect to Data Catalog.
url	The URL of your Collibra Data Intelligence Cloud environment. Note: You have to enter the public URL of your Collibra DGC environment. Other URLs will not work.
userName	The username that you use to sign in to Collibra.
useCollibraSystemName	Indicates whether to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name. By default, the `useCollibraSystemName` property is set to `false`. If you keep it set to `false`, the lineage harvester ignores the `collibraSystemName` property in the rest of the configuration file. If you set it to `true`, the lineage harvester reads the value of the `collibraSystemName` property in all sections of the configuration file and in the AWS Glue <source ID> configuration file. Warning: Unless you have multiple databases with the same name, we highly recommend that you keep the default value.
sources	This section contains the required information for AWS Glue.
id	The unique ID that is used to identify the data source on the Collibra Data Lineage server. For example, my_aws_glue.
type	The kind of data source. In this case, the value has to be AwsGlue.
region	The AWS region that the lineage harvester connects to. For example, eu-west-3. Note: See the AWS documentation for a list of all AWS regions.
awsAccessKeyId	The access key ID of the programmatic AWS user.
collibraSystemName	The system name of AWS Glue.
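
The interaction between `useCollibraSystemName` and `collibraSystemName` described in the table can be sketched in a few lines of Python. This only illustrates the documented rule and is not the harvester's own code; the function name `effective_system_name` and the dictionary layout are assumptions made for the example.

```python
from typing import Optional


def effective_system_name(general: dict, source: dict) -> Optional[str]:
    """Return the system name that would be taken into account.

    Hypothetical helper that mirrors the documented rule only.
    """
    # The property defaults to false when it is absent.
    if not general.get("useCollibraSystemName", False):
        # With false, collibraSystemName is ignored everywhere.
        return None
    return source.get("collibraSystemName")


general = {"useCollibraSystemName": False}
source = {
    "id": "my_aws_glue",
    "type": "AwsGlue",
    "collibraSystemName": "AWS-Glue-system",
}

print(effective_system_name(general, source))  # None
general["useCollibraSystemName"] = True
print(effective_system_name(general, source))  # AWS-Glue-system
```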

Save the configuration file.
In the console, start the lineage harvester again with the following command:
- For Windows: .\bin\lineage-harvester.bat full-sync
- For other operating systems: ./bin/lineage-harvester full-sync
When prompted, enter the password for your Collibra Data Intelligence Cloud environment and the secret access key for your AWS environment.
The passwords are encrypted and stored in /config/pwd.conf.
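
If you start the harvester from a script rather than by hand, the choice between the two start commands can be sketched as follows. This is a minimal illustration, assuming it runs from the lineage harvester folder; `harvester_command` is a hypothetical helper and not part of the lineage harvester.

```python
import platform


def harvester_command(action: str = "", system: str = "") -> list:
    """Build the argument list for the lineage harvester start script."""
    # Fall back to the operating system that Python is running on.
    system = system or platform.system()
    # Script paths as given in the steps, relative to the harvester folder.
    if system == "Windows":
        script = r".\bin\lineage-harvester.bat"
    else:
        script = "./bin/lineage-harvester"
    return [script, action] if action else [script]


print(harvester_command("full-sync", system="Windows"))
print(harvester_command("full-sync", system="Linux"))
```

The returned list could be passed to `subprocess.run`. Because the harvester prompts for the password and secret access key, run it in an interactive console.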

Example

{
 "general": {
  "catalog": {
   "url": "https://<organization>.collibra.com",
   "userName": "<your-collibra-username>"
  },
  "useCollibraSystemName": false
 },
 "sources": {
  "type": "AwsGlue",
  "id": "aws-glue_source",
  "region": "us-east-2",
  "awsAccessKeyId": "my-AWS-Glue-access-key",
  "collibraSystemName": "AWS-Glue-system"
 }
}
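
Property names are easy to misspell, so it can help to check the sources section before you run the harvester. The following Python sketch flags unknown property names and suggests the closest documented one. It assumes the configuration is strict JSON; `check_source` and the sample values are made up for the illustration and are not part of the lineage harvester.

```python
import difflib
import json

# Property names taken from the table in this topic.
SOURCE_KEYS = ["id", "type", "region", "awsAccessKeyId", "collibraSystemName"]


def check_source(source: dict) -> list:
    """Return human-readable problems found in one AWS Glue source section."""
    problems = []
    for key in source:
        if key not in SOURCE_KEYS:
            # Suggest the closest documented property name, if there is one.
            guess = difflib.get_close_matches(key, SOURCE_KEYS, n=1)
            hint = f" (did you mean '{guess[0]}'?)" if guess else ""
            problems.append(f"unknown property '{key}'{hint}")
    if source.get("type") != "AwsGlue":
        problems.append("type has to be 'AwsGlue'")
    return problems


# Sample input with a deliberately misspelled property name.
config = json.loads("""
{
  "sources": {
    "type": "AwsGlue",
    "id": "my_aws_glue",
    "region": "eu-west-3",
    "awsAccesKeyId": "<access-key-id>"
  }
}
""")

for problem in check_source(config["sources"]):
    print(problem)
```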

What's next?

The lineage harvester sends the AWS Glue script annotations to the Collibra Data Lineage server. Data Catalog then imports the technical lineage.