Connecting to Network File Storage (NFS)
This section contains an overview of the Network File Storage (NFS) connection in Collibra DQ.
General information

Field | Description |
---|---|
Data source | Network File Storage (NFS) |
Supported versions | N/A |
Connection string | `nfs:///` |
Packaged? | |
Certified? | |
Supported features | Analyze data, Archive breaking records, Estimate job, Pushdown |
Processing capabilities | Spark agent, Yarn agent |

Note Estimate job is only available for NFS connections on Standalone deployments of Collibra DQ.

Note Yarn agent is available for NFS connections on Standalone deployments of Collibra DQ. Additional configurations are required for Kubernetes and Hadoop deployments.
Minimum user permissions
For Collibra DQ to access your local file system, you need the following permissions.
- Ensure that the NFS can be mounted locally on the host that runs your Collibra DQ service.
- Ensure that the Linux user that runs the Collibra DQ services has read permissions on the path.
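As a quick check of both requirements, you can run something like the following on the host. The NFS server, export path, mount point, and service user are hypothetical examples, not values required by Collibra DQ:

```
# Mount the NFS export locally (server and paths are example values)
sudo mkdir -p /opt/nfs-storage
sudo mount -t nfs nfs-server.example.com:/export/dq /opt/nfs-storage

# Confirm that the Linux user running the Collibra DQ services
# (here assumed to be "owldq") has read access to the path
sudo -u owldq ls -l /opt/nfs-storage
```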
Recommended and required connection properties
Required | Connection Property | Type | Value |
---|---|---|---|
Yes | Name | Text | The unique name used for your connection. |
Yes | Connection URL | String | The connection string value of your NFS connection, for example, `nfs:///your/directory/path/` |
Yes | Target Agent | Text | The Agent used to submit your DQ Job. |
Yes | Auth Type | Option | The method used to authenticate your connection. Note Auth Type is always NFS. |
No | Driver Properties | String | The configurable driver properties for your connection. Multiple properties must be semicolon-delimited, for example, `abc=123;test=true` |
Authentication
Auth Type must be set to NFS.
Set up shared storage on self-hosted Kubernetes deployments
On self-hosted Kubernetes deployments, volumes must be shared across all DQ pods (DQ Web, DQ Jobs, and Livy) in order for DQ Jobs from an NFS connection to run successfully. The following steps describe how to set up shared storage.
- In AWS, provision EFS and set up the EFS CSI driver on the AWS EKS cluster where DQ is deployed. For more information, refer to the Amazon EKS document, Store an elastic file system with Amazon EFS.
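One common way to set up the driver is through its Helm chart; the commands below follow the EFS CSI driver project's published install instructions and are a sketch, not a Collibra-specific requirement:

```
# Add the AWS EFS CSI driver chart repository and install the driver
# into kube-system (the driver's usual install target)
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system
```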
- Using static provisioning, create a persistent volume claim against the EFS storage class in the namespace where DQ is deployed. For more information, refer to the Amazon EKS document, Store an elastic file system with Amazon EFS.
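For reference, a minimal sketch of a statically provisioned, EFS-backed PV and PVC; the file system ID, namespace, storage class name, and size are example values, and the claim name matches the `efs-claim` used in the snippets below:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi                         # required field; EFS itself is elastic
  accessModes:
    - ReadWriteMany                      # must be shareable across all DQ pods
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-0123456789abcdef0   # your EFS file system ID (example)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  namespace: collibra-dq                 # namespace where DQ is deployed (example)
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```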
- Update the DQ Web and DQ Agent stateful sets to include the extra mount point based on the previously created PVC. For example:

```yaml
volumes:
  - name: owldq-efs-storage
    persistentVolumeClaim:
      claimName: efs-claim
...
volumeMounts:
  - name: owldq-efs-storage
    mountPath: /opt/efs-storage
```

- In DQ, create an NFS connection. In the Connection URL field, point to the /opt/efs-storage location.
- Create the DQ dataset from a file on the above NFS connection.
  Optionally, to place the dataset or files on the shared storage, spin up a temporary pod to copy the files to that location. Refer to the following pod template:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: temporary-copy
spec:
  containers:
    - name: temporary-copy
      image: nginx
      volumeMounts:
        - name: external
          mountPath: /opt/efs-storage
  volumes:
    - name: external
      persistentVolumeClaim:
        claimName: efs-claim
```

  Once the above pod is deployed in the DQ namespace, copy the contents from your local machine to the shared storage location using the following command:

```
kubectl cp customer_transactions.csv temporary-copy:/opt/efs-storage/
```

- Add the following extra configuration options to the owl command line. Use the following example for guidance:

```
-numexecutors 1 -executormemory 1g -f "/opt/efs-storage/customer_transactions.csv" -h owlpostgres.chzid9w0hpyi.us-east-1.rds.amazonaws.com:5432/dev?currentSchema=validation -ds customer_transactions_csv_2 -master k8s:// -drivermemory 1g -deploymode cluster -rd "2023-10-11" -bhlb 10 -fullfile -loglevel INFO -cxn nfs-dataset
-conf spark.kubernetes.driver.podTemplateFile=local:///opt/owl/config/k8s-driver-template.yml,spark.kubernetes.executor.podTemplateFile=local:///opt/owl/config/k8s-executor-template.yml,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.options.claimName=nfs-efs-pv-claim,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.options.sizeLimit=5Gi,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.mount.path=/opt/efs-storage,spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs-efs-pv.mount.readOnly=false,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.options.claimName=nfs-efs-pv-claim,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.options.sizeLimit=5Gi,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.mount.path=/opt/efs-storage,spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs-efs-pv.mount.readOnly=false
```
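As a final sanity check, you can confirm that the copied file is visible from inside the DQ pods before running the job; the namespace and pod name below are placeholders:

```
# List the DQ pods in the deployment namespace (namespace is an example)
kubectl -n collibra-dq get pods

# Verify the shared file is visible from a DQ pod (substitute a real pod name)
kubectl -n collibra-dq exec dq-web-0 -- ls -l /opt/efs-storage/
```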