Connecting to Network File Storage (NFS)

This section contains an overview of Network File Storage (NFS).

General information

Data source: Network File Storage (NFS)
Supported versions: N/A
Connection string: nfs:///
Packaged? Yes
Certified? Yes

Supported features
Analyze data: Yes
Archive breaking records: No
Estimate job: Yes

Note Estimate job is only available for NFS connections on Standalone deployments of Collibra DQ.

Pushdown: No

Processing capabilities
Spark agent: Yes

Note 
  • Spark agent is available for NFS connections on Standalone deployments of Collibra DQ. Additional configurations are required for Kubernetes and Hadoop deployments.
  • To mount the NFS directory path and run NFS remote connections in Kubernetes, a Spark configuration must be added to the command line, as shown in the example after the configuration list below. Refer to the Review page for more information on appending configurations using the command line.

    The following Spark configurations are related to mounting NFS volume in Kubernetes:

    • spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-pv-dq-dev.mount.path=/opt/owl/nfs-storage

    • spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-pv-dq-dev.options.claimName=nfs-pvc-dq-dev

    • spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-pv-dq-dev.mount.path=/opt/owl/nfs-storage

    • spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-pv-dq-dev.options.claimName=nfs-pvc-dq-dev
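
    For example, on a Kubernetes deployment these configurations can be appended to the DQ Job command line with the -conf flag. This is a minimal sketch: it assumes the persistent volume claim nfs-pvc-dq-dev shown above and follows the comma-separated -conf syntax used in the command example at the end of this section.

    -conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-pv-dq-dev.mount.path=/opt/owl/nfs-storage,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-pv-dq-dev.options.claimName=nfs-pvc-dq-dev,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-pv-dq-dev.mount.path=/opt/owl/nfs-storage,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-pv-dq-dev.options.claimName=nfs-pvc-dq-dev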

Yarn agent: Yes

Note Yarn agent is available for NFS connections on Standalone deployments of Collibra DQ. Additional configurations are required for Kubernetes and Hadoop deployments.

Minimum user permissions

In order for Collibra DQ to access your local file system, you need the following permissions.

  • Ensure that the NFS share can be mounted locally on the server that runs your Collibra DQ service.
  • Ensure that the Linux user that runs the Collibra DQ services has read permissions on the mounted path.
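
For example, on a standalone Linux host you can check both requirements as follows. This is a minimal sketch; the NFS server address, export path, mount point, and service user shown here are placeholders for your own values.

  # Mount the NFS export locally (requires the nfs-utils / nfs-common package)
  sudo mount -t nfs nfs-server.example.com:/exports/dq-data /mnt/dq-nfs

  # Confirm that the Linux user that runs the Collibra DQ services can read the path
  sudo -u dq-service-user ls -l /mnt/dq-nfs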

Recommended and required connection properties

  • Name (Text, required): The unique name used for your connection.
  • Connection URL (String, required): The connection string value of your NFS connection, for example nfs:///your/directory/path/
  • Target Agent (Text, optional): The Agent used to submit your DQ Job.
  • Auth Type (Option, required): The method to authenticate your connection.

    Note Auth Type is always NFS.

  • Driver Properties (String, optional): The configurable driver properties for your connection. Multiple properties must be semicolon delimited. For example, abc=123;test=true

Authentication

Auth Type must be set to NFS.

Set up shared storage on self-hosted Kubernetes deployments

On self-hosted Kubernetes deployments, to successfully run DQ Jobs from an NFS connection, the storage volume must be shared across all DQ pods (DQ Web, DQ jobs, and Livy). The following steps describe how to set up shared storage.

Important Make sure that Hadoop is not configured in the DQ deployment. Otherwise, the DQ job will refer to the HDFS location instead of the local filesystem location of the shared storage.
  1. In AWS, provision EFS and set up the EFS CSI driver at the AWS EKS cluster where DQ is deployed. For more information, refer to the Amazon EKS document, Store an elastic file system with Amazon EFS.
  2. Using static provisioning, create a persistent volume claim against the EFS storage class at the DQ deployed namespace. For more information, refer to the Amazon EKS document, Store an elastic file system with Amazon EFS.
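
    For example, a statically provisioned PersistentVolume and PersistentVolumeClaim for EFS might look like the following sketch. The file system ID (fs-12345678), the efs-sc storage class name, and the storage size are placeholders; replace them with the values from your own EFS setup and create the claim in the namespace where DQ is deployed.

    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: efs-pv
    spec:
      capacity:
        storage: 5Gi
      volumeMode: Filesystem
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      storageClassName: efs-sc
      csi:
        driver: efs.csi.aws.com
        volumeHandle: fs-12345678   # placeholder EFS file system ID
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: efs-claim
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: efs-sc
      resources:
        requests:
          storage: 5Gi
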
  3. Update the DQ Web and DQ Agent stateful sets to include the extra mount point based on the previously created PVC. For example:
    ---
    volumes:
      - name: owldq-efs-storage
        persistentVolumeClaim:
          claimName: efs-claim
    ...
    volumeMounts:
      - name: owldq-efs-storage
        mountPath: /opt/efs-storage
  4. In DQ, create an NFS connection. In the Connection URL field, point to the /opt/efs-storage location.
  5. Create a DQ dataset from a file on the NFS connection created above.

    Optionally, to place the dataset files on the shared storage, spin up a temporary pod and copy the files to that location. Refer to the following pod template:

    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: temporary-copy
    spec:
      containers:
        - name: temporary-copy
          image: nginx
          volumeMounts:
            - name: external
              mountPath: /opt/efs-storage
      volumes:
        - name: external
          persistentVolumeClaim:
            claimName: efs-claim

    Once the above pod is deployed in the DQ namespace, copy the files from your local machine to the shared storage location using the following command:

    kubectl cp customer_transactions.csv temporary-copy:/opt/efs-storage/
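
    Optionally, confirm that the file is now on the shared volume by listing the mount path inside the temporary pod:

    kubectl exec temporary-copy -- ls -l /opt/efs-storage/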
  6. Add the following extra configuration options to the owl command line. Use the following example for guidance:

    Note These options can be added to the Remote Agent freeform append configurations to set these values by default for DQ datasets.

    -numexecutors 1 -executormemory 1g -f "/opt/efs-storage/customer_transactions.csv" -h owlpostgres.chzid9w0hpyi.us-east-1.rds.amazonaws.com:5432/dev?currentSchema=validation -ds customer_transactions_csv_2 -master k8s:// -drivermemory 1g -deploymode cluster -rd "2023-10-11" -bhlb 10 -fullfile -loglevel INFO -cxn nfs-dataset 
    -conf spark.kubernetes.driver.podTemplateFile=local:///opt/owl/config/k8s-driver-template.yml,spark.kubernetes.executor.podTemplateFile=local:///opt/owl/config/k8s-executor-template.yml,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.options.claimName=nfs-efs-pv-claim,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.options.sizeLimit=5Gi,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.mount.path=/opt/efs-storage,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.mount.readOnly=false,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.options.claimName=nfs-efs-pv-claim,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.options.sizeLimit=5Gi,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.mount.path=/opt/efs-storage,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.mount.readOnly=false