AWS Glue: Set up OpenLineage Apache Spark integration for Cloud Storage connections

Use this procedure to set up the Apache Spark integration to emit OpenLineage messages to Fluentd and save the resulting files to a location accessible by Collibra. For more information, go to Apache Spark in the OpenLineage documentation.

  1. Download and install the OpenLineage Spark integration:

    1. From Maven Central, download the JAR file for the latest available version of the openlineage-spark package.
    2. Install the JAR file by adding it to the list of dependent JARs in your AWS Glue job.
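    For example, assuming the JAR has been uploaded to a hypothetical S3 location, you could attach it through the `--extra-jars` special job parameter. The job name, role, script location, and bucket paths below are placeholders; replace them with your own values.

```shell
# Placeholder values throughout; substitute your own job name, role,
# script location, and JAR path.
aws glue update-job \
  --job-name my-openlineage-job \
  --job-update '{
    "Role": "MyGlueJobRole",
    "Command": {
      "Name": "glueetl",
      "ScriptLocation": "s3://my-bucket/scripts/my_job.py"
    },
    "DefaultArguments": {
      "--extra-jars": "s3://my-bucket/jars/openlineage-spark-1.8.0.jar"
    }
  }'
```

    You can achieve the same result in the AWS Glue console by adding the S3 path of the JAR to the job's Dependent JARs path field.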
  2. Set the properties in the SparkSession. You can set the properties in different ways. For details, go to Apache Spark Configuration Usage and Spark Config Parameters.
    Note Avoid using the file transport type to save data to an S3 bucket. S3 buckets store immutable objects, not files, which means that Java file operations will not function.
  3. Verify the integration. To ensure that the Apache Spark integration works, you can send events to the console and view them in CloudWatch.

    The following example shows the configuration to send the data to the console. When the configuration is set, AWS Glue jobs are mapped to a log group under the CloudWatch service. Each job execution generates a job run ID. The logs associated with the job run can be found in this log group. If the integration works, the logs will contain JSON with OpenLineage information.

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.conf import SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Register the OpenLineage listener before the SparkContext is created,
    # and send the events to the console transport.
    conf = SparkConf()
    conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")\
        .set("spark.jars.packages", "io.openlineage:openlineage-spark:1.8.0")\
        .set("spark.openlineage.version", "v1")\
        .set("spark.openlineage.namespace", "OL_EXAMPLE")\
        .set("spark.openlineage.transport.type", "console")

    sc = SparkContext.getOrCreate(conf=conf)
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session

    # Standard AWS Glue job initialization.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
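    To confirm the events programmatically, you can filter the job run's CloudWatch log lines for OpenLineage run events. The sketch below assumes each event is logged as a single JSON line carrying the standard eventType, run, and job fields; the helper name and sample records are illustrative only, and the exact log layout in your log group may differ.

```python
import json

def extract_openlineage_events(log_lines):
    """Return parsed OpenLineage run events found among raw log lines.

    A line is treated as an OpenLineage event if it is valid JSON and
    carries the run-event fields ("eventType", "run", "job"). The exact
    CloudWatch log layout is an assumption; adjust the filter to match
    your log group's format.
    """
    events = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except (ValueError, TypeError):
            continue  # ordinary log lines are not JSON; skip them
        if all(key in record for key in ("eventType", "run", "job")):
            events.append(record)
    return events

# Example lines: one ordinary Glue log line and one OpenLineage event.
sample = [
    "INFO GlueContext: job started",
    json.dumps({
        "eventType": "START",
        "eventTime": "2024-01-01T00:00:00Z",
        "run": {"runId": "0176a8c2-0000-0000-0000-000000000000"},  # hypothetical run ID
        "job": {"namespace": "OL_EXAMPLE", "name": "my_glue_job"},  # hypothetical job
    }),
]
events = extract_openlineage_events(sample)
print(len(events))                     # 1
print(events[0]["job"]["namespace"])   # OL_EXAMPLE
```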
  4. Integrate with Fluentd. After you verify that the lineage events are generated in the console, update the configuration to direct the metadata to your Fluentd collector.
    For additional configuration patterns, go to Examples in the Transport topic in the OpenLineage documentation. Use the Spark Config examples and specify the URL and port you defined when setting up Fluentd.
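    For example, if your Fluentd instance exposes an HTTP source (Fluentd's in_http input listens on port 9880 by default), the console transport from the previous step could be replaced with an HTTP transport pointing at it. The host, port, and property set below are illustrative; confirm the exact transport properties for your openlineage-spark version in the Transport topic.

```python
from pyspark.conf import SparkConf

# Illustrative only: send OpenLineage events over HTTP to Fluentd.
# Replace the placeholder host and port with the values you defined
# when setting up Fluentd.
conf = SparkConf()
conf.set("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")\
    .set("spark.jars.packages", "io.openlineage:openlineage-spark:1.8.0")\
    .set("spark.openlineage.namespace", "OL_EXAMPLE")\
    .set("spark.openlineage.transport.type", "http")\
    .set("spark.openlineage.transport.url", "http://my-fluentd-host:9880")
```

    This is a Spark configuration fragment; the remainder of the job setup is unchanged from the console example above.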

  5. After Fluentd saves the lineage messages to files, copy the files, in OpenLineage format, to the relevant directory in your cloud-based storage system. The files must be stored in one of the following locations:
    • An Amazon S3 bucket.
    • An Azure Data Lake Storage container.
    • A Google Cloud Storage bucket.
    Note Whenever you synchronize lineage, you must upload all source files you want to include in the technical lineage graph.
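    Because every synchronization must include all source files, it can help to gather the full set of Fluentd output files before uploading. The sketch below assumes Fluentd writes one JSON file per buffer chunk into a local directory; the directory layout, file pattern, and helper name are placeholders for your own setup.

```python
import tempfile
from pathlib import Path

def lineage_files_to_upload(local_dir, pattern="*.json"):
    """List the Fluentd output files to copy to cloud storage.

    Assumes Fluentd writes one JSON file per chunk into `local_dir`;
    the pattern is a placeholder for your setup. Returns every matching
    file, since each synchronization must include all source files for
    the technical lineage graph.
    """
    return sorted(p for p in Path(local_dir).glob(pattern) if p.is_file())

# Usage: simulate a Fluentd buffer directory with two lineage files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("lineage.0.json", "lineage.1.json"):
        (Path(tmp) / name).write_text("{}")
    files = lineage_files_to_upload(tmp)
    print([p.name for p in files])  # ['lineage.0.json', 'lineage.1.json']
```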

What's next

You can now set up Fluentd and prepare the data source files.