Configure Spark scratch disk space to handle large datasets

When you run DQ jobs on large datasets, your Spark cluster needs enough resources to compute the DQ rules. This section shows you how to configure on-demand Spark scratch disk space to handle large Spark jobs.


Before you configure Spark scratch disk space, ensure you have:

  • A Collibra DQ deployment on a cloud native platform, such as EKS or GKE.
  • A Kubernetes service account for Collibra DQ that has the permissions required to dynamically provision Persistent Volume Claims (PVCs) with the ReadWriteOnce (RWO) access mode.

Note This capability is only available for Collibra DQ versions 2023.02 or newer.
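
One way to check the service account prerequisite above before you start is to ask the cluster directly. The following is a minimal sketch, not an official Collibra procedure: the namespace `collibra-dq`, the service account name `collibra-dq-sa`, and the list of verbs are assumptions for illustration, so substitute the values from your own deployment. It relies only on the standard `kubectl auth can-i` command.

```shell
# Hypothetical defaults: replace with your own namespace and service account.
NAMESPACE="${NAMESPACE:-collibra-dq}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-collibra-dq-sa}"

# Succeeds when the service account may perform the given verb on PVCs.
# kubectl prints "yes" when the action is allowed.
can_manage_pvc() {
  answer="$(kubectl auth can-i "$1" persistentvolumeclaims \
    --as="system:serviceaccount:${NAMESPACE}:${SERVICE_ACCOUNT}" \
    --namespace "$NAMESPACE" || true)"
  [ "$answer" = "yes" ]
}

# Reports each verb and returns non-zero if any permission is missing.
# The verb list is an assumption; adjust it to your RBAC policy.
check_pvc_permissions() {
  missing=0
  for verb in create list delete; do
    if can_manage_pvc "$verb"; then
      echo "ok: $verb"
    else
      echo "missing: $verb"
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_pvc_permissions` from a terminal that has access to the cluster. Any line that starts with `missing:` points to a permission to grant before you continue.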


  1. Open a terminal session.
  2. Set the following values to deploy or update the Collibra DQ app with the provided Helm charts:
    --set"persistentVolumeClaim" --set"standard" --set"20Gi" --set"/tmp/scratch"

    Note The values in the previous example are settings for GKE-based deployments.

    1. If your Collibra DQ app is managed by an alternative Kubernetes package manager, update the owl-agent-configmap with the following variables:
      KUBERNETES_SCRATCH_TYPE: persistentVolumeClaim

      Note Choose the Storage Class, Storage Size, and mountPath values that match your requirements.

    2. Restart the DQ agent pod to complete the updates.
  3. Sign in to Collibra DQ.
  4. Hover your cursor over the icon and click Admin Console.
    The Admin Console opens.
  5. Click Remote Agent.
    The Agent Management page opens.
  6. Click the pencil icon.
    The Edit Agent modal opens.
  7. In the Free Form (Appended) option, enter the following string value:
    -conf spark.kubernetes.driver.podTemplateFile=local:///opt/owl/config/k8s-driver-template.yml,spark.kubernetes.executor.podTemplateFile=local:///opt/owl/config/k8s-executor-template.yml

    Tip If you add other configuration properties for your specific requirements, separate them with commas.

  8. Run a sample DQ job to process DQ rules against a dataset of your choice. The corresponding Spark job dynamically provisions a PVC for the driver and for each executor, so it can handle large datasets by spilling data from memory to disk at the mount path of the PVC.
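
To confirm that the sample job provisioned its PVCs, you can list the claims in the namespace while the job is running. The sketch below is an illustration only: the namespace `collibra-dq` is a hypothetical default, and the count includes every bound PVC in that namespace, not only the ones Spark created. It uses the standard `kubectl get pvc` command, where the second output column is the claim status.

```shell
# Hypothetical default: replace with the namespace of your DQ deployment.
NAMESPACE="${NAMESPACE:-collibra-dq}"

# Lists all PVCs in the namespace, one per line, without the header row.
list_pvcs() {
  kubectl get pvc --namespace "$NAMESPACE" --no-headers
}

# Counts the PVCs whose STATUS column (second field) is "Bound".
count_bound_pvcs() {
  list_pvcs | awk '$2 == "Bound" { n++ } END { print n + 0 }'
}
```

Run `count_bound_pvcs` during the job. If no other workloads in the namespace use PVCs, the result should be one claim for the driver plus one for each executor.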

What's next?

For more information, go to the official Spark documentation on: