Airflow: Set up Fluentd and prepare files for cloud storage

After you configure your software to emit OpenLineage messages, you can use Fluentd to collect these messages. Fluentd is the data collector that the OpenLineage community prefers.

Steps

  1. Complete the following steps to install Fluentd:
      1. Determine the host and port where the Fluentd collector will run. This daemon needs to be up and running continuously to collect metadata.
      2. Open a port for HTTP so that the data source can send REST API events to this location. Ensure that the port is open for listening, and document the port number and host.

        You can use HTTPS for improved security. To do so, generate an SSL certificate and configure Fluentd to use it. For more information, go to http in the Fluentd documentation.

      3. Download the Fluent package by following the guide for your system. For example, if you use Red Hat Linux, go to Install by RPM Package (Red Hat Linux) in the Fluentd documentation.
      4. Start the service and check the status. For example, you can enter the following command:
        sudo systemctl start fluentd.service; sudo systemctl status fluentd.service
      5. Check the logs to ensure the service is running correctly. For example, if you use Red Hat Linux, you can find the log files at /var/log/fluent/fluentd.log.
      6. Test basic functionality by completing the following steps. Red Hat Linux is used as an example.
        1. Review the default configuration, which receives logs at an HTTP endpoint on port 8888 and routes them to stdout. On Red Hat Linux, the default configuration file is /etc/fluent/fluentd.conf.
        2. Enter the following command to test:
          curl -X POST -d 'json={"json":"Here is my message"}' http://localhost:8888/debug.test
        3. Check the outcome by viewing the latest log entry:
          tail -n 1 /var/log/fluent/fluentd.log
      7. Configure Fluentd for OpenLineage.

        The following sample configuration shows how to collect messages on port 8888 and save them to files grouped by 5-minute windows. For details, go to Configuration in the Fluentd documentation.

        <source>
          @type http
          port 8888
          json_array true
        </source>
        <match openlineage>
          @type file
          path /fluentd/logs/openlineage
          path_suffix .jsonl
          json_array true
          <buffer>
            timekey 5m
            timekey_use_utc true
            timekey_wait 5m
          </buffer>
          append true
          <format>
            @type json
          </format>
        </match>
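
        The timing implied by the buffer settings can be worked out with a little arithmetic. The following sketch assumes that Fluentd assigns each event to the 5-minute window that contains its timestamp (timekey 5m) and flushes that window no earlier than 5 minutes after it closes (timekey_wait 5m). The function names are hypothetical, the %Y%m%d%H%M label format is an assumption about how the window appears in file names, and window_label relies on GNU date.

        ```shell
        #!/usr/bin/env bash
        # Sketch only: mirrors the buffer settings from the sample configuration.
        TIMEKEY=300        # timekey 5m, in seconds
        TIMEKEY_WAIT=300   # timekey_wait 5m, in seconds

        # Start (epoch seconds) of the window that contains the given event time.
        window_start() {
          local epoch="$1"
          echo $(( epoch - epoch % TIMEKEY ))
        }

        # Earliest time (epoch seconds) at which that window would be flushed:
        # the window must close, and then timekey_wait must pass.
        flush_time() {
          local epoch="$1"
          echo $(( $(window_start "$epoch") + TIMEKEY + TIMEKEY_WAIT ))
        }

        # Window label in UTC, matching timekey_use_utc true.
        # The label format is an assumption; requires GNU date.
        window_label() {
          date -u -d "@$(window_start "$1")" +%Y%m%d%H%M
        }
        ```

        For example, an event received at 22:15:23 UTC falls in the window that starts at 22:15:00, so its file might not appear until about 22:25:00. Allow for this delay before you look for output files.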

        You can test the result after configuration by entering the following command:
        curl -X POST -d 'json={"json":"Here is my message"}' http://localhost:8888/openlineage

        Then check the outcome by entering the following command:
        ls -l /fluentd/logs/
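
        To confirm that events reach the output files, you can count the lines that Fluentd has written. The following is a minimal sketch, assuming the sample configuration above, so that flushed files start with /fluentd/logs/openlineage and end with .jsonl, with one JSON event per line. The function names send_test_event and count_events are hypothetical.

        ```shell
        #!/usr/bin/env bash
        # Post the same test message used above and print the HTTP status code.
        # The default URL assumes that Fluentd listens on localhost:8888.
        send_test_event() {
          local url="${1:-http://localhost:8888/openlineage}"
          curl -sS -o /dev/null -w '%{http_code}\n' -X POST \
            -d 'json={"json":"Here is my message"}' "$url"
        }

        # Count the events (lines) in all flushed files that share a path prefix.
        # Prints 0 if no matching file exists yet.
        count_events() {
          local prefix="$1" total=0 n f
          for f in "$prefix"*.jsonl; do
            [ -f "$f" ] || continue      # skip the unexpanded pattern
            n=$(wc -l < "$f")
            total=$(( total + n ))
          done
          echo "$total"
        }
        ```

        Run send_test_event, wait for the buffer to flush, and then run count_events /fluentd/logs/openlineage. A count greater than zero suggests that events are being written.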

  2. Copy the files in OpenLineage format to the relevant directory in your cloud-based storage system. The files must be stored in one of the following locations:
    • An AWS S3 bucket.
    • An Azure Data Lake Storage container.
    • A Google Cloud Storage bucket.
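
How you copy the files depends on your storage provider. The following is a minimal sketch for an AWS S3 bucket, assuming that the AWS CLI is installed and configured. The bucket name, the prefix, and the function names are hypothetical, and the 10-minute age threshold is an assumption intended to skip files that Fluentd might still be appending to.

```shell
#!/usr/bin/env bash
# List the .jsonl files in a directory that were last modified more than
# the given number of minutes ago (default: 10).
ready_files() {
  local dir="$1" min_age="${2:-10}"
  find "$dir" -maxdepth 1 -type f -name '*.jsonl' -mmin "+$min_age" | sort
}

# Copy each ready file to an S3 bucket. The bucket and prefix are placeholders.
upload_ready_files() {
  local dir="$1" bucket="$2" prefix="$3" f
  ready_files "$dir" | while IFS= read -r f; do
    aws s3 cp "$f" "s3://$bucket/$prefix/"
  done
}
```

For example, upload_ready_files /fluentd/logs my-lineage-bucket openlineage would copy the finished files to s3://my-lineage-bucket/openlineage/. For Azure Data Lake Storage or Google Cloud Storage, replace the aws s3 cp call with the equivalent command from that provider's CLI.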

What's next

You can now:

  • Create an AWS connection to an Edge or Collibra Cloud site
  • Create an Azure Data Lake Storage connection to an Edge or Collibra Cloud site
  • Create a Google Cloud Platform connection to an Edge or Collibra Cloud site