Configure Spark scratch disk space to handle large datasets
When you run DQ jobs against large datasets, your Spark cluster's resources must be sized appropriately to handle the computation of DQ rules. This section shows you how to configure on-demand Spark scratch disk space to handle large Spark jobs.
Prerequisites
Before you configure Spark scratch disk space, ensure you have:
- A Collibra DQ deployment on a cloud-native Kubernetes platform, such as EKS, GKE, and so on.
- A Kubernetes service account for Collibra DQ that meets the permission requirements to provision dynamic Persistent Volume Claims (PVCs) with the ReadWriteOnce (RWO) access mode. (See the RBAC sketch after the note below.)
- Execute permissions on the /tmp directory, or on whichever directory is used for temporary storage, so that Spark jobs launch successfully.
Note This capability is only available in Collibra DQ versions 2023.02 and newer.
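The exact permissions depend on your platform, but a minimal sketch of a namespaced Role and RoleBinding that lets the DQ service account provision PVCs could look like the following. The names dq-pvc-provisioner and collibra-dq (namespace and service account) are placeholders, not values shipped with the product.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dq-pvc-provisioner      # hypothetical name
  namespace: collibra-dq        # replace with your DQ namespace
rules:
  # Dynamic PVC provisioning needs create/delete on persistentvolumeclaims.
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dq-pvc-provisioner-binding
  namespace: collibra-dq
subjects:
  - kind: ServiceAccount
    name: collibra-dq           # replace with your DQ service account
    namespace: collibra-dq
roleRef:
  kind: Role
  name: dq-pvc-provisioner
  apiGroup: rbac.authorization.k8s.io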
Steps
- Open a terminal session.
- Set the following values to deploy or update the Collibra DQ app with the provided Helm charts:

  --set global.configMap.data.spark_scratch_type="persistentVolumeClaim" \
  --set global.configMap.data.spark_scratch_storage_class="standard" \
  --set global.configMap.data.spark_scratch_storage_size="20Gi" \
  --set global.configMap.data.spark_scratch_local_path="/tmp/scratch"

  Note The values in this example are settings for GKE-based deployments.
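For context, a complete command might look like the following sketch; the release name collibra-dq, the chart reference collibra/dq, and the namespace are assumptions you should replace with your own values.

  # Hypothetical release, chart, and namespace names; the --set flags are the ones above.
  helm upgrade --install collibra-dq collibra/dq \
    --namespace collibra-dq \
    --set global.configMap.data.spark_scratch_type="persistentVolumeClaim" \
    --set global.configMap.data.spark_scratch_storage_class="standard" \
    --set global.configMap.data.spark_scratch_storage_size="20Gi" \
    --set global.configMap.data.spark_scratch_local_path="/tmp/scratch"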
- If your Collibra DQ app is managed by an alternative Kubernetes package manager, ensure you update the owl-agent-configmap with the following variables:

  KUBERNETES_SCRATCH_TYPE: persistentVolumeClaim
  KUBERNETES_SCRATCH_STORAGE_CLASS: standard
  KUBERNETES_SCRATCH_STORAGE_SIZE: 20Gi
  KUBERNETES_SCRATCH_LOCAL_PATH: /tmp/scratch

  Note Pick a storage class, storage size, and mount path specific to your requirements.
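If you apply the change directly with kubectl rather than through a package manager, a sketch like the following merges the variables into the ConfigMap; the collibra-dq namespace is an assumption.

  # Merge the scratch-space variables into the existing owl-agent-configmap.
  kubectl patch configmap owl-agent-configmap -n collibra-dq --type merge -p \
    '{"data":{"KUBERNETES_SCRATCH_TYPE":"persistentVolumeClaim",
              "KUBERNETES_SCRATCH_STORAGE_CLASS":"standard",
              "KUBERNETES_SCRATCH_STORAGE_SIZE":"20Gi",
              "KUBERNETES_SCRATCH_LOCAL_PATH":"/tmp/scratch"}}'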
- Restart the DQ agent pod to complete the updates.
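Assuming the agent runs as a standard Kubernetes Deployment, one way to restart it is a rollout restart; the deployment name owl-agent and the namespace are placeholders.

  # Restart the agent so it picks up the new ConfigMap values.
  kubectl rollout restart deployment/owl-agent -n collibra-dq
  kubectl rollout status deployment/owl-agent -n collibra-dq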
- Sign in to Collibra DQ.
- Hover your cursor over the icon and click Admin Console.
The Admin Console opens.
- Click Agent Configuration.
The Agent Configuration page opens.
- Click Actions, then click Edit.
The Edit Agent modal opens.
- In the Free form (Appended) text box, enter the following string value:

  -conf spark.kubernetes.driver.podTemplateFile=local:///opt/owl/config/k8s-driver-template.yml,spark.kubernetes.executor.podTemplateFile=local:///opt/owl/config/k8s-executor-template.yml,spark.kubernetes.driver.reusePersistentVolumeClaim=false

  Tip If you add any additional configuration properties based on your specific requirements, separate them with commas. For background on the Spark settings these pod templates correspond to, see the sketch at the end of this section.
- Click Submit.
- Run a sample DQ job to process DQ rules against a dataset of your choice. The corresponding Spark job dynamically provisions PVCs for the driver and each executor, handling large datasets by spilling between memory and disk at the mount path of the PVC.
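To confirm that the PVCs are provisioned on demand, you can watch the claims in the DQ namespace while the job runs; the namespace is a placeholder.

  # PVCs should appear for the driver and each executor while the job runs.
  kubectl get pvc -n collibra-dq --watch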
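For comparison, Spark itself exposes on-demand PVC provisioning through documented configuration properties. The following sketch shows the standalone spark-submit equivalent of the behavior configured above; it is provided as background, not as a Collibra DQ setting. The volume name spark-local-dir-1 is Spark's convention for treating a volume as scratch space, and the values mirror the earlier examples.

  # Spark's built-in on-demand PVC options (see the Spark on Kubernetes docs).
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=standard
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=20Gi
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/tmp/scratch
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false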
What's next?
For more information, go to the official Spark documentation on running Spark on Kubernetes and its persistent volume claim configuration options.