Configure Spark scratch disk space to handle large datasets
When you run DQ jobs against large datasets, your Spark cluster's resources must be sized appropriately to handle the computation of DQ rules. This section shows you how to configure on-demand Spark scratch disk space to handle large Spark jobs.
Prerequisites
Before you configure Spark scratch disk space, ensure you have:
- A Collibra DQ deployment on a cloud-native Kubernetes platform, such as EKS, GKE, and so on.
- A Kubernetes service account for Collibra DQ that meets the permission requirements to provision dynamic Persistent Volume Claims (PVCs) with the ReadWriteOnce (RWO) access mode. (See the RBAC sketch after the note below.)
- Execute permissions on the /tmp directory, or on whichever directory is used for temporary storage, so that Spark jobs launch successfully.
Note This capability is only available in Collibra DQ versions 2023.02 and newer.
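The exact permissions depend on your platform, but a minimal sketch of a namespaced Role and RoleBinding that lets the DQ service account provision PVCs could look like the following. The names dq-pvc-provisioner and collibra-dq (namespace and service account) are placeholders, not values shipped with the product.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dq-pvc-provisioner      # hypothetical name
  namespace: collibra-dq        # replace with your DQ namespace
rules:
  # Dynamic PVC provisioning needs create/delete on persistentvolumeclaims.
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dq-pvc-provisioner-binding
  namespace: collibra-dq
subjects:
  - kind: ServiceAccount
    name: collibra-dq           # replace with your DQ service account
    namespace: collibra-dq
roleRef:
  kind: Role
  name: dq-pvc-provisioner
  apiGroup: rbac.authorization.k8s.io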
Steps
- Open a terminal session.
- Set the following values to deploy or update the Collibra DQ app with the provided Helm charts:

  --set global.configMap.data.spark_scratch_type="persistentVolumeClaim" \
  --set global.configMap.data.spark_scratch_storage_class="standard" \
  --set global.configMap.data.spark_scratch_storage_size="20Gi" \
  --set global.configMap.data.spark_scratch_local_path="/tmp/scratch"

  Note The values in this example are settings for GKE-based deployments.
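For context, a complete command might look like the following sketch; the release name collibra-dq, the chart reference collibra/dq, and the namespace are assumptions you should replace with your own values.

  # Hypothetical release, chart, and namespace names; the --set flags are the ones above.
  helm upgrade --install collibra-dq collibra/dq \
    --namespace collibra-dq \
    --set global.configMap.data.spark_scratch_type="persistentVolumeClaim" \
    --set global.configMap.data.spark_scratch_storage_class="standard" \
    --set global.configMap.data.spark_scratch_storage_size="20Gi" \
    --set global.configMap.data.spark_scratch_local_path="/tmp/scratch"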
- If your Collibra DQ app is managed by an alternative Kubernetes package manager, ensure you update the owl-agent-configmap with the following variables:

  KUBERNETES_SCRATCH_TYPE: persistentVolumeClaim
  KUBERNETES_SCRATCH_STORAGE_CLASS: standard
  KUBERNETES_SCRATCH_STORAGE_SIZE: 20Gi
  KUBERNETES_SCRATCH_LOCAL_PATH: /tmp/scratch

  Note Pick a storage class, storage size, and mount path specific to your requirements.
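If you apply the change directly with kubectl rather than through a package manager, a sketch like the following merges the variables into the ConfigMap; the collibra-dq namespace is an assumption.

  # Merge the scratch-space variables into the existing owl-agent-configmap.
  kubectl patch configmap owl-agent-configmap -n collibra-dq --type merge -p \
    '{"data":{"KUBERNETES_SCRATCH_TYPE":"persistentVolumeClaim",
              "KUBERNETES_SCRATCH_STORAGE_CLASS":"standard",
              "KUBERNETES_SCRATCH_STORAGE_SIZE":"20Gi",
              "KUBERNETES_SCRATCH_LOCAL_PATH":"/tmp/scratch"}}'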
- Restart the DQ agent pod to complete the updates.
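Assuming the agent runs as a standard Kubernetes Deployment, one way to restart it is a rollout restart; the deployment name owl-agent and the namespace are placeholders.

  # Restart the agent so it picks up the new ConfigMap values.
  kubectl rollout restart deployment/owl-agent -n collibra-dq
  kubectl rollout status deployment/owl-agent -n collibra-dq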
- Sign in to Collibra DQ.
- Hover your cursor over the icon and click Admin Console.
The Admin Console opens.
- Click Agent Configuration.
The Agent Configuration page opens.
- Click Actions, then click Edit.
The Edit Agent modal opens.
- In the Free form (Appended) text box, enter the following string value:

  -conf spark.kubernetes.driver.podTemplateFile=local:///opt/owl/config/k8s-driver-template.yml,spark.kubernetes.executor.podTemplateFile=local:///opt/owl/config/k8s-executor-template.yml,spark.kubernetes.driver.reusePersistentVolumeClaim=false

  Tip If you add any additional configuration properties based on your specific requirements, separate them with commas. For background on the Spark settings these pod templates correspond to, see the sketch at the end of this section.
- Click Submit.
- Run a sample DQ job to process DQ rules against a dataset of your choice. The corresponding Spark job dynamically provisions PVCs for the driver and each executor, handling large datasets by spilling between memory and disk at the mount path of the PVC.
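To confirm that the PVCs are provisioned on demand, you can watch the claims in the DQ namespace while the job runs; the namespace is a placeholder.

  # PVCs should appear for the driver and each executor while the job runs.
  kubectl get pvc -n collibra-dq --watch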
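For comparison, Spark itself exposes on-demand PVC provisioning through documented configuration properties. The following sketch shows the standalone spark-submit equivalent of the behavior configured above; it is provided as background, not as a Collibra DQ setting. The volume name spark-local-dir-1 is Spark's convention for treating a volume as scratch space, and the values mirror the earlier examples.

  # Spark's built-in on-demand PVC options (see the Spark on Kubernetes docs).
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=standard
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=20Gi
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/tmp/scratch
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false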
What's next?
For more information, go to the official Spark documentation on running Spark on Kubernetes and its persistent volume claim configuration options.