EMR / Dataproc / HDI

Running Apache Spark on Kubernetes differs from running it on virtual machine-based Hadoop clusters, which is the mechanism currently provided by Google Cloud Dataproc and by comparable offerings such as Amazon Web Services (AWS) Elastic MapReduce (EMR) and Microsoft Azure HDInsight (HDI).

Each cloud provider has its own steps and configuration options. For more details on enabling agents for this deployment option, refer to the Hadoop Integration section.

A detailed guide for EMR is provided below.

Collibra Data Quality & Observability on EMR Architecture

Collibra DQ can use EMR as the compute space for data quality jobs (DQ Jobs). Although you can operate a long-running EMR cluster, EMR is designed to be used as ephemeral infrastructure, and using it that way is the most cost-effective approach in terms of both operational effort and infrastructure costs. Collibra DQ supports this operating model. When no EMR cluster is available, users can still browse datasets and DQ results in DQ Web. However, a user who attempts to deploy a DQ Job sees a red light icon next to the target agent. If the user submits the DQ Job anyway, it waits in the queue until the target agent comes back online, which happens the next time an EMR cluster is available.

Prepare for Deployment

Note Before allowing Collibra DQ to use EMR as the compute space, make sure that DQ Web and the DQ Metastore are already deployed. For more installation details, see Standalone Install.

  1. Create a "bootstrap bucket" location in S3 where the Collibra DQ binaries and the bootstrap script (install-agent-emr.sh) will be staged. EMR cluster instances need an attached role with access to this location in order to bootstrap the cluster. This location should not contain any data or sensitive material, so it should not require special permissions. It only needs to be accessible by EMR clusters for bootstrap purposes.
  2. Create or modify an instance profile role that will be attached to EMR clusters so that it grants read access to the bootstrap bucket location. This role is separate from the EMR service role that EMR uses to deploy the infrastructure.
  3. Stage the bootstrap script and the Collibra DQ binary package in the bootstrap location created above.
  4. Make sure that the VPC where the Collibra DQ Metastore is deployed is accessible from the VPC where EMR clusters will be deployed.
  5. Make sure that Security Groups applied to the Collibra DQ Metastore are configured to allow access from EMR master and worker Security Groups.
  6. Decide whether to use EMR 5.x or EMR 6.x. This is important because EMR 6 introduces Spark 3 and Scala 2.12. If EMR 6 is chosen, make sure Collibra DQ binaries were compiled for Spark 3 and Scala 2.12.
  7. (OPTIONAL) Create and store a private key to access EMR instances.
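
The staging in steps 1 and 3 can be sketched with the AWS CLI. The function below only prints the copy commands so that they can be reviewed before anything is uploaded. The bucket and prefix in the sample call are hypothetical placeholders, and the package file name assumes the dq-<version>-package-full.tar.gz pattern that appears later in this guide.

```shell
#!/usr/bin/env bash
# Sketch: print the AWS CLI commands that would stage the bootstrap files.
# Arguments (illustrative): <bucket/prefix without s3://> <Collibra DQ version>
print_stage_commands() {
  local location="$1" version="$2"
  # Bootstrap script that the EMR bootstrap action will run.
  echo "aws s3 cp install-agent-emr.sh s3://${location}/install-agent-emr.sh"
  # Binary package; the file name pattern is an assumption taken from this guide.
  echo "aws s3 cp dq-${version}-package-full.tar.gz s3://${location}/dq-${version}-package-full.tar.gz"
}

# Hypothetical location and version, shown only to illustrate the output.
print_stage_commands "my-bootstrap-bucket/dq" "2023.09"
```

After reviewing the printed commands, they can be run by hand or piped to sh from the directory that holds both files.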

Deploy an EMR Cluster

There are several ways to deploy EMR, but for DevOps purposes, the best path is usually the AWS CLI. The example below deploys an EMR cluster bootstrapped with Collibra DQ binaries and a functioning agent that can deploy DQ Jobs.

Note When defining the Bootstrap Location argument, do not include "s3://". For example, if the Bootstrap Location is s3://bucket/prefix, set BOOTSTRAP_LOCATION="bucket/prefix".
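
One way to avoid passing the scheme by accident is to normalize the value before it is used. This is a minimal sketch; the helper name strip_s3_scheme is made up for illustration and is not part of Collibra DQ or the AWS CLI.

```shell
# Sketch: normalize a bootstrap location so it never carries the s3:// scheme.
strip_s3_scheme() {
  local location="${1#s3://}"   # drop a leading "s3://" if present
  location="${location%/}"      # drop one trailing slash, if any
  printf '%s\n' "$location"
}

BOOTSTRAP_LOCATION="$(strip_s3_scheme "s3://bucket/prefix")"
echo "${BOOTSTRAP_LOCATION}"   # prints: bucket/prefix
```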

Warning 
Before you deploy EMR, edit install-agent-emr.sh and add the following export variables to it for the setup script to run correctly:

export DQ_ADMIN_USER_PASSWORD
export DQ_ADMIN_USER_EMAIL
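
A quick check that both export lines are present before deploying might look like the sketch below. The function name has_required_exports is hypothetical, and the check only looks for the export statements; it does not validate any values assigned to them.

```shell
# Sketch: verify that a bootstrap script exports both required variables.
# Returns 0 when both export lines are present, 1 otherwise.
has_required_exports() {
  local script="$1" var status=0
  for var in DQ_ADMIN_USER_PASSWORD DQ_ADMIN_USER_EMAIL; do
    # Match lines such as "export VAR" or "export VAR=value".
    if ! grep -Eq "^[[:space:]]*export[[:space:]]+${var}([[:space:]=]|\$)" "$script"; then
      echo "missing: export ${var}" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, calling has_required_exports install-agent-emr.sh before the deployment command would stop early if either line is missing.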

Important 
As of Collibra DQ version 2023.09, package names changed from owl-* to dq-*. Therefore, in your EMR file, update the lines that reference owl so that they reference dq, as follows:
aws s3 cp s3://${OWL_PACKAGE_LOCATION}/dq-$OWL_VERSION-package-full.tar.gz ./
tar -xvf dq-$OWL_VERSION-package-full.tar.gz
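
If one script has to work across versions, the prefix could be derived from the version string. The sketch below assumes versions follow a YYYY.MM pattern with a two-digit month, and that packages before 2023.09 were named owl-<version>-package-full.tar.gz; both are assumptions inferred from the rename described above, and the function names are hypothetical.

```shell
# Sketch: pick the package prefix for a Collibra DQ version.
# Assumes versions look like YYYY.MM (two-digit month); any further
# dot-separated parts are ignored.
package_prefix() {
  local version="$1" year month
  year="${version%%.*}"    # text before the first dot
  month="${version#*.}"    # text after the first dot
  month="${month%%.*}"     # keep only the month part
  # Compare as one integer, for example 2023.09 becomes 202309.
  if [ "${year}${month}" -ge 202309 ]; then
    echo "dq"
  else
    echo "owl"
  fi
}

# Build the full package file name for a version.
package_file() {
  echo "$(package_prefix "$1")-$1-package-full.tar.gz"
}

package_file "2023.09"   # prints: dq-2023.09-package-full.tar.gz
```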

aws emr create-cluster \
                --auto-scaling-role EMR_AutoScaling_DefaultRole \
                --applications Name=Hadoop Name=Spark Name=Hive Name=Tez \
                --name owl-emr \
                --release-label emr-6.2.0 \
                --region ${EMR_REGION} \
                --ebs-root-volume-size 10 \
                --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
                --enable-debugging \
                --bootstrap-actions \
                "[{\"Path\":\"s3://${BOOTSTRAP_LOCATION}/install-agent-emr.sh\", \
                \"Args\":[ \
                \"${OWL_VERSION}\", \
                \"${OWL_AGENT_ID}\", \
                \"${METASTORE_HOST}:${METASTORE_PORT}/${METASTORE_DB}?currentSchema=owlhub\", \
                \"${METASTORE_USER}\", \
                \"${METASTORE_PASSWORD}\", \
                \"${BOOTSTRAP_LOCATION}\", \
                \"${LICENSE_KEY}\", \
                \"native\"], \"Name\":\"install-owl-agent\"}]" \
                --ec2-attributes "{ \
                \"KeyName\":\"${EMR_INSTANCE_PRIVATE_KEY_NAME}\", \
                \"InstanceProfile\":\"${BOOTSTRAP_ACCESS_ROLE}\", \
                \"SubnetId\":\"${EMR_SUBNET}\", \
                \"EmrManagedSlaveSecurityGroup\":\"${EMR_WORKER_SECURITY_GROUP}\", \
                \"EmrManagedMasterSecurityGroup\":\"${EMR_MASTER_SECURITY_GROUP}\" \
                }" \
                --service-role ${EMR_SERVICE_ROLE} \
                --log-uri s3n://${EMR_LOG_LOCATION} \
                --instance-groups "[ \
                {\"InstanceCount\":1,\"InstanceGroupType\":\"MASTER\",\"InstanceType\":\"${EMR_MASTER_INSTANCE_TYPE}\",\"Name\":\"Master - 1\"}, \
                {\"InstanceCount\":3,\"InstanceGroupType\":\"CORE\",\"InstanceType\":\"${EMR_CORE_INSTANCE_TYPE}\",\"Name\":\"Core - 2\"} \
            ]" 
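
Because the command depends on many shell variables, it can help to confirm that each one is set before running it. The sketch below assumes the command has been saved to a file (deploy-emr.sh is a hypothetical name); it scans that file for ${NAME} references and reports any that are unset or empty in the current shell. The function names are illustrative, and grep -o is assumed to be available, as it is in GNU and BSD grep.

```shell
# Sketch: report variables that a script references as ${NAME} but that are
# unset or empty in the current shell.
referenced_vars() {
  # Extract every ${UPPER_CASE_NAME} reference and de-duplicate the names.
  grep -o '\${[A-Z_][A-Z0-9_]*}' "$1" | tr -d '${}' | sort -u
}

check_vars_set() {
  local name value missing=0
  for name in $(referenced_vars "$1"); do
    # Indirect lookup; names are limited to [A-Z0-9_] by the grep above.
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_vars_set deploy-emr.sh && sh deploy-emr.sh would run the deployment only when every referenced variable has a value. Variables that are set but not exported pass this check, so export them before running the file in a separate shell.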

Configure Agent

Once the EMR cluster and the DQ Agent are deployed, the agent needs to be configured in DQ Web.

  1. Sign in to your web instance of Collibra DQ.
  2. Hover your cursor over the icon and click Admin Console.
    The Admin Console opens.
  3. Click the Remote Agent tile.
    The Agent Management page opens.
    In the following example, the newly created agent has a green indicator to show that it is healthy and active.

    Healthy DQ Agent example

  4. Click the pencil icon to the right of the Connections column.
    The Edit Agent modal appears.
  5. From the Default Deploy Mode drop-down list, select Cluster.
  6. From the Default Master drop-down list, select Yarn.

    Configuring a yarn agent example

  7. Click Save.
  8. On the Agent Management page, click the chain link icon next to the Connections column to map the DQ Agent to Connections.
    The Edit Agent modal appears.

    Agent to Connection Management example

    Note Any data sources that are not listed in the right hand pane are not visible to the agent. Refer to the Agent section for more details.

  9. Click Update.

Run DQ Jobs

You can now use EMR to run DQ Jobs on your data. Refer to the Explorer (no-code) section for more details.