Prepare the lineage harvester configuration file for AWS Glue

You have to prepare a technical lineage configuration file before you run the lineage harvester. The lineage harvester collects your AWS Glue script annotations and sends them to the Collibra Data Lineage server, where they are processed and analyzed.

Warning: You cannot create a technical lineage for AWS Glue script annotations by synchronizing Amazon S3. To see the technical lineage for AWS Glue script annotations, you have to prepare the lineage harvester configuration file and run the lineage harvester.

Prerequisites

Steps

  1. Run the following command to start the lineage harvester:
    • for Windows: .\bin\lineage-harvester.bat
    • for other operating systems: chmod +x bin/lineage-harvester and then ./bin/lineage-harvester
    An empty configuration file is created in the lineage harvester config folder.
  2. Open the lineage-harvester.conf file and enter the values for each property.
    Properties and their descriptions:
    general

    This section describes the connection information between the lineage harvester and Data Catalog.

    catalog

    This section contains information that is necessary to connect to Data Catalog.

    url

    The URL of your Collibra Data Intelligence Cloud environment.

    Note: You have to enter the public URL of your Collibra DGC environment. Other URLs will not work.

    username

    The username that you use to sign in to Collibra.

    useCollibraSystemName

    Indicates whether to use the system or server name of a JDBC data source to match the System asset that you created when you prepared the physical data layer. This is useful when you have multiple databases with the same name.

    By default, the useCollibraSystemName property is set to false. If you want to use it, set it to true.

    • If you keep the property set to false, the lineage harvester ignores the collibraSystemName property in the rest of the configuration file.
    • If you set the useCollibraSystemName property to true, the lineage harvester reads the value in the collibraSystemName property in all sections of the configuration file and in the AWS Glue <source ID> configuration file.

    Warning: Unless you have multiple databases with the same name, we highly recommend that you keep the default value.

    sources

    This section contains the required information for AWS Glue.

    id

    The unique ID that is used to identify the data source on the Collibra Data Lineage server. For example, my_aws_glue.

    type

    The kind of data source. In this case, the value has to be AwsGlue.

    region

    The AWS region the lineage harvester connects to. For example, eu-west-3.

    Note: See the AWS documentation for a list of all AWS regions.

    awsAccessKeyId

    The access key ID of the programmatic AWS user.

    collibraSystemName

    The system name of AWS Glue.

  3. Save the configuration file.
  4. In the console, start the lineage harvester again by running the following command:
    • for Windows: .\bin\lineage-harvester.bat full-sync
    • for other operating systems: ./bin/lineage-harvester full-sync
  5. When prompted, enter the password for your Collibra Data Intelligence Cloud environment and the secret access key for your AWS environment.
    The passwords are encrypted and stored in /config/pwd.conf.
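
If you want to script steps 1 and 4, the following Python sketch shows one possible way to pick the launcher for the current operating system and start a full synchronization. It is an illustration, not part of the lineage harvester: the function names harvester_command and run_full_sync and the install_dir parameter are hypothetical, and the sketch assumes that the launcher is already executable on non-Windows systems (the chmod +x step above).

```python
import platform
import subprocess
from pathlib import Path


def harvester_command(system: str, install_dir: Path, *args: str) -> list[str]:
    """Build the launcher command for the given OS name (hypothetical helper)."""
    # Launcher file names are the ones quoted in the steps above.
    name = "lineage-harvester.bat" if system == "Windows" else "lineage-harvester"
    return [str(install_dir / "bin" / name), *args]


def run_full_sync(install_dir: Path) -> int:
    """Run 'full-sync' from the harvester folder and return its exit code."""
    folder = install_dir.resolve()
    command = harvester_command(platform.system(), folder, "full-sync")
    # stdin and stdout are inherited, so any prompt from the harvester
    # appears in the same console.
    return subprocess.run(command, cwd=folder, check=False).returncode
```

To use it, call run_full_sync with the folder that contains the bin and config folders of your lineage harvester.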

Example

{
 "general": {
  "catalog": {
   "url": "https://<organization>.collibra.com",
   "username": "<your-collibra-username>"
  },
  "useCollibraSystemName": false
 },
 "sources": {
  "type": "AwsGlue",
  "id": "aws-glue_source",
  "region": "us-east-2",
  "awsAccessKeyId": "my-AWS-Glue-access-key",
  "collibraSystemName": "AWS-Glue-system"
 }
}
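
Because property names are case-sensitive and easy to mistype, it can help to check the configuration file before you run the lineage harvester. The following Python sketch compares a configuration against the property names in the table in step 2. It is an illustration only, not a Collibra tool, and it makes several assumptions that are not stated on this page: that url, username, id, type, region and awsAccessKeyId are all required, that collibraSystemName is needed only when useCollibraSystemName is true, and that sources can be either a single object or a list of objects. The function name check_config is hypothetical.

```python
import json

# Property names taken from the table in step 2.
CATALOG_KEYS = {"url", "username"}
SOURCE_REQUIRED = {"id", "type", "region", "awsAccessKeyId"}  # assumed required
SOURCE_OPTIONAL = {"collibraSystemName"}


def _diff(section: str, found: dict, required: set, optional: set = frozenset()) -> list[str]:
    """Report required keys that are missing and keys that are not recognized."""
    keys = set(found)
    problems = [f"{section}: missing '{key}'" for key in sorted(required - keys)]
    problems += [f"{section}: unknown '{key}'" for key in sorted(keys - required - optional)]
    return problems


def check_config(text: str) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    config = json.loads(text)
    general = config.get("general", {})
    problems = _diff("general", general, {"catalog"}, {"useCollibraSystemName"})
    problems += _diff("general.catalog", general.get("catalog", {}), CATALOG_KEYS)

    sources = config.get("sources", [])
    if isinstance(sources, dict):  # assumption: tolerate a single source object
        sources = [sources]
    if not sources:
        problems.append("sources: missing")
    use_system_name = general.get("useCollibraSystemName", False)
    for index, source in enumerate(sources):
        section = f"sources[{index}]"
        problems += _diff(section, source, SOURCE_REQUIRED, SOURCE_OPTIONAL)
        if source.get("type") not in (None, "AwsGlue"):
            problems.append(f"{section}: type must be 'AwsGlue'")
        if use_system_name and "collibraSystemName" not in source:
            problems.append(f"{section}: collibraSystemName expected")
    return problems
```

A misspelled key is reported twice, once as a missing property and once as an unknown one, which makes the typo easy to spot.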

What's next?

The lineage harvester sends the AWS Glue script annotations to the Collibra Data Lineage server. Data Catalog then imports the technical lineage.