Cloud native install

Install Collibra DQ on Kubernetes with Docker Containers

Collibra DQ provides the following Core Docker containers:

  • dq-agent: Launches Apache Spark jobs
  • dq-web: The Collibra DQ web application itself
  • Apache Spark: The runtime analytics engine
  • Postgres (persistent volume needed): The Collibra DQ metastore
  • Apache Livy: The session manager through which Collibra DQ browses HDFS, S3, GCS, or Azure Data Lake (ADL) storage. Livy interacts with object stores much as the Explorer does with JDBC sources (estimating Jobs, getting days with data, Filtergrams, and so on).

Prerequisites

  • Kubernetes cluster -- EKS, GKE, AKS, OpenShift, Rancher
  • Helm (v3)
  • kubectl
  • Cloud command line SDK, such as the gcloud CLI or AWS CLI
  • External PostgreSQL DB version 11.9 or above, with 100 GB of storage, 4 to 8 cores, and 4 to 8 GB of memory
  • Private container registry -- to store images
  • LoadBalancer -- IngressController -- Ingress
  • Egress networking access
  • Helm Chart
  • Images, image access key
  • Minimum pod requirement -- 2 cores, 2GB RAM
  • If you bring in your own Spark executor pod launch template, ensure that the service account used to launch Spark executor pods has the permission to do so. Refer to the executor launch template for more information.
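Before starting, it can help to confirm that the client tooling is in place. A quick sanity check (version output will vary by environment):

kubectl version --client
helm version
docker --version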

Steps

To install Collibra DQ on Kubernetes with Docker containers, follow these steps.

Sign in to the Kubernetes cluster

  1. Sign in to the Kubernetes cluster from a Linux-compatible terminal.
  2. Create a namespace in the cluster using the following code snippet:
    kubectl create namespace <owldq>
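Optionally, verify the namespace and make it the default for subsequent kubectl commands (using owldq as an example name):

kubectl get namespace owldq
kubectl config set-context --current --namespace=owldq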

Pull images from the Collibra registry

Collibra DQ containers are hosted in the Google Container Registry (gcr.io). Collibra provides a repo-key in a .json file to access the Collibra images; store it locally and use it to log in.

  1. Download the Docker .json repo-key.
  2. Run the following command:

    docker login -u _json_key -p "$(cat repo-key.json)" https://gcr.io

    Note Image names with their versions are provided by Collibra.

  3. To pull the images, run the following docker pull commands:
    docker pull gcr.io/owl-hadoop-cdh/dq-agent:<version and build tag provided by Collibra>
    docker pull gcr.io/owl-hadoop-cdh/dq-web:<version and build tag provided by Collibra>
    docker pull gcr.io/owl-hadoop-cdh/dq-spark:<version and build tag provided by Collibra>
    docker pull gcr.io/owl-hadoop-cdh/dq-livy:<version and build tag provided by Collibra>
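After the pulls complete, you can confirm that the images are present locally:

docker images | grep -E 'dq-agent|dq-web|dq-spark|dq-livy'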

Push images into your private registry

  1. Sign in to your private Docker container registry.
  2. Tag and push the images from Collibra to your private registry, using the following commands:

docker tag gcr.io/owl-hadoop-cdh/dq-web:2023.11 <registryURL>/dq-web:2023.11
docker push <registryURL>/dq-web:2023.11

Syntax:

docker tag [OPTIONS] IMAGE[:TAG] [REGISTRYHOST/][USERNAME/]NAME[:TAG]
docker push [OPTIONS] NAME[:TAG]
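To tag and push all four images in one pass, a small shell loop can help. This is a sketch; <registryURL> is your private registry and the tag is the version provided by Collibra (note that the Spark image may carry a different tag, for example 3.4.1-2023.11):

VERSION=2023.11           # version and build tag provided by Collibra
REGISTRY=<registryURL>    # your private registry URL

for image in dq-agent dq-web dq-spark dq-livy; do
  docker tag gcr.io/owl-hadoop-cdh/${image}:${VERSION} ${REGISTRY}/${image}:${VERSION}
  docker push ${REGISTRY}/${image}:${VERSION}
done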

Create an SSL keystore secret

If you plan to enable SSL for the web application, create a secret from your keystore file:

kubectl create secret generic dq-ssl-secret \
--from-file /path/to/keystore.jks \
--namespace <namespace>

Warning The file name that you use in the --from-file argument should be keystore.jks. If the file name is anything else, you must include an additional argument specifying the keystore file name in the Helm command.
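If you only need a keystore for testing, you can generate a self-signed one with keytool (non-production use only; the alias placeholder and CN are illustrative):

keytool -genkeypair -alias <key-alias> -keyalg RSA -keysize 2048 \
  -keystore keystore.jks -storepass <keystore-pass> \
  -validity 365 -dname "CN=dq.example.com"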

Create a pull secret

Note  Deploying containers directly from the Collibra image repository is not recommended. Access the Collibra image registry only for the initial download and validation of Docker images. After that, upload and store the images in your private registry, giving you control over when the images are updated and eliminating any operational dependency on Collibra's repository.

  1. To create a pull secret, use the following code snippet:

kubectl create secret docker-registry dq-pull-secret \
--docker-server=https://gcr.io \
--docker-username=_json_key \
--docker-email=<email of customer> \
--docker-password="$(cat repo-key.json)" \
--namespace <namespace>

Note If you use your private registry for images and it is accessible from within the Kubernetes cluster, you do not need to create this secret. If credentials are required to access your private registry, create this secret, modifying the docker-server URL and docker-password accordingly.
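You can confirm that the secret exists and decode its contents to verify the registry URL (the jsonpath filter below assumes the standard docker-registry secret layout):

kubectl get secret dq-pull-secret --namespace <namespace>
kubectl get secret dq-pull-secret --namespace <namespace> \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 --decode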

Helm chart

For more detailed information about the Helm Chart, see Cloud native.

Unzip and store the Helm charts provided by Collibra in a Linux-compatible deployment location.

Note Once you have your Collibra DQ license, you will receive an email from Collibra that includes the Helm Charts as zip files.

There should be two folders and two files:

  • drwxrwxr-x -- templates
  • drwxrwxr-x -- charts
  • -rw-rw-r-- Chart.yaml
  • -rw-rw-r-- values.yaml

There are two ways to pass parameter values while deploying:

  1. Using the values.yaml file, or
  2. Using Helm --set arguments.

Note The --set arguments take precedence over the values.yaml file.
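For example, the same parameter can be supplied either way. The two commands below are equivalent when values.yaml contains the corresponding key (this is a sketch of the pattern; check the chart's values.yaml for the exact key names):

# Pass the value on the command line:
helm upgrade --install --set global.version.dq=<cdq-version> <deployment-name> /path/to/chart/dq

# Or edit the key in values.yaml and pass the file explicitly:
helm upgrade --install -f values.yaml <deployment-name> /path/to/chart/dq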

Deploy Collibra Data Quality & Observability

Once you have created a namespace and added all of the required secrets, you can begin the deployment of Collibra DQ.

Minimal install

Install Web, Agent, and metastore. Collibra DQ is inaccessible until you manually add an Ingress or another type of externally accessible service.

Warning  All of the following examples pull containers directly from the Collibra DQ secured container registry. In most cases, InfoSec policies require that containers are sourced from a private container repository controlled by the local Cloud Ops team. Make sure to add --set global.image.repo=</url/of/private-repo> so that you use only approved containers.

Note  The metastore container must start first as the other containers use it to write data. On your initial deployment, the other containers might start before the metastore and fail.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=ClusterIP \
--set global.image.repo=<pathToImageRepo> \
<deployment-name> \
</path/to/helm/chart/root/folder>
  • <namespace>: Enter the namespace that you created in the Sign in to the Kubernetes cluster step.
  • <cdq-version>: Enter the version from the web image suffix. For example, 2023.11 from the image dq-web:2023.11.
  • <cdq-spark-version>: Enter the Spark version from the Spark image suffix. For example, 3.4.1-2023.11 from the image spark:3.4.1-2023.11.
  • <cdq-license-key>: Enter the license key provided to you by Collibra.
  • <your-license-name>: Enter the license name provided to you by Collibra.
  • ${email}: Enter the default admin user email associated with the admin account.
  • ${password}: Enter the default admin user password for the admin account. The password must adhere to the following password policy:
    • A minimum of 8 characters.
    • A maximum of 72 characters.
    • At least one upper-case character.
    • At least one numeric character.
    • At least one supported special character (!@#%$^&*?_~).
    • Cannot contain the user ID (admin).
    Note If a password that does not meet the password policy is entered, the install process proceeds as though the password is accepted, but the admin user becomes locked out. If this occurs, rerun the Helm command with a password that meets the password policy and restart the web pod.
  • <pathToImageRepo>: The path to your private registry, where the Collibra images are available. If this is not provided, the images are pulled from the Collibra image registry, for which you should create a pull secret with the repo key provided by Collibra. See Create a pull secret for more details.
  • <deployment-name>: Any name of your choice for this deployment.

Note If you optionally pass credentials for the Postgres Metastore, ensure that you do not use the $ symbol in the global.metastore.pass variable, as it is an unsupported special character for Postgres Metastore passwords.
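For illustration, a fully filled-in minimal install might look like the following (every value here is hypothetical; substitute your own):

helm upgrade --install --namespace owldq \
--set global.version.dq=2023.11 \
--set global.version.spark=3.4.1-2023.11 \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=admin@example.com \
--set global.web.admin.password='Example_Pass1' \
--set global.web.service.type=ClusterIP \
--set global.image.repo=registry.example.com/collibra-dq \
dq /path/to/chart/dq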

Externally Accessible Service

Perform the Minimal install and add a preconfigured NodePort or LoadBalancer service to provide access to the web application.

Warning  A LoadBalancer service type requires that the Kubernetes platform is integrated with a Software Defined Network solution. This will generally be true for the Kubernetes services offered by major cloud vendors. Private cloud platforms more commonly use Ingress controllers. Check with the infrastructure team before attempting to use LoadBalancer service type.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
<deployment-name> \
/path/to/chart/dq
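Once the release is deployed, you can retrieve the externally reachable endpoint from the service list (the service name depends on the chart):

### List services; for a LoadBalancer, use the EXTERNAL-IP column,
### for a NodePort, combine any node's IP with the mapped port

kubectl get services --namespace <namespace> -o wide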

Externally Accessible with SSL Enabled

Perform the install with external service but with SSL enabled.

Note  Ensure you have already deployed a keystore containing a key to the target namespace with a secret name that matches the global.web.tls.key.secretName argument (dq-ssl-secret by default). Also, ensure that the secret's key name matches the global.web.tls.key.store.name argument (dqkeystore.jks by default).

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.web.tls.enabled=true \
--set global.web.tls.key.secretName=dq-ssl-secret \
--set global.web.tls.key.alias=<key-alias> \
--set global.web.tls.key.type=<JKS || PKCS12> \
--set global.web.tls.key.pass=<keystore-pass> \
--set global.web.tls.key.store.name=keystore.jks \
<deployment-name> \
/path/to/chart/dq
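After the web pod restarts with TLS enabled, you can verify the certificate presented by the service (the host and port are placeholders for your external endpoint):

openssl s_client -connect <external-host>:<port> -showcerts </dev/null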

Externally Accessible with SSL Enabled on OpenShift 4.x

Install Collibra DQ with an external service on an OpenShift project with RunAsAny enforced and SSL enabled.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.web.tls.enabled=true \
--set global.web.tls.key.secretName=dq-ssl-secret \
--set global.web.tls.key.alias=<key-alias> \
--set global.web.tls.key.type=<JKS || PKCS12> \
--set global.web.tls.key.pass=<keystore-pass> \
--set global.web.tls.key.store.name=keystore.jks \
--set global.security.securityContextConstraint.runAsAny=true \
<deployment-name> \
/path/to/chart/dq
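If pods fail admission on OpenShift with a security context error, the oc CLI can show which security context constraints (SCCs) are in play. A diagnostic sketch, assuming sufficient privileges:

### List the SCCs available in the cluster

oc get scc

### Inspect which SCC was applied to a running pod

oc get pod <pod-name> -o yaml | grep openshift.io/scc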

Externally Accessible and History Server for GCS Log Storage

Perform the install with external service and Spark History Server enabled. In the following example, the target log storage system is GCS.

Note For Collibra DQ to be able to write Spark logs to GCS, create a secret from the JSON key file of a service account that has access to the log bucket. For more detailed information, see Cloud storage credentials.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.spark_history.enabled=true \
--set global.spark_history.logDirectory=gs://logs/spark-history/ \
--set global.spark_history.service.type=<NodePort || LoadBalancer> \
--set global.cloudStorage.gcs.enableGCS=true \
<deployment-name> \
/path/to/chart/dq
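As the note above mentions, writing Spark logs to GCS requires a secret built from a service account JSON key. A minimal sketch, assuming a secret named dq-gcs-secret (check the chart's values for the exact secret name and key it expects):

kubectl create secret generic dq-gcs-secret \
  --from-file=key.json=/path/to/service-account-key.json \
  --namespace <namespace>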

Externally Accessible and History Server for S3 Log Storage

Perform the install with external service and Spark History Server enabled. In this example, the target log storage system is S3.

Note  For Collibra DQ to be able to write Spark logs to S3, make sure that an Instance Profile IAM Role with access to the log bucket is attached to all nodes serving the target namespace. For more detailed information, see Cloud storage credentials.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.spark_history.enabled=true \
--set global.spark_history.logDirectory=s3a://logs/spark-history/ \
--set global.spark_history.service.type=<NodePort || LoadBalancer> \
--set global.cloudStorage.s3.enableS3=true \
<deployment-name> \
/path/to/chart/dq
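To sanity-check that the Instance Profile grants access to the log bucket, you can run the AWS CLI from one of the nodes (the bucket path matches the logDirectory above):

aws sts get-caller-identity
aws s3 ls s3://logs/spark-history/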

Externally Accessible with External Metastore

Perform the install with external service and an external metastore, for example AWS RDS, Google Cloud SQL, or PostgreSQL on its own instance.

Warning Collibra DQ currently supports PostgreSQL 9.6 and newer.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.metastore.enabled=false \
--set global.configMap.data.metastore_url=jdbc:postgresql://<host>:<port>/<database> \
--set global.configMap.data.metastore_user=<user> \
--set global.configMap.data.metastore_pass=<password> \
<deployment-name> \
/path/to/chart/dq

Warning The $ symbol is not a supported special character in your Postgres Metastore password.
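If the web pod cannot reach the external metastore, a quick connectivity test from inside the cluster can rule out networking issues. A sketch using a throwaway Postgres client pod (the image tag and connection values are illustrative):

kubectl run psql-test --rm -it --restart=Never \
  --image=postgres:13 --namespace <namespace> -- \
  psql "postgresql://<user>:<password>@<host>:<port>/<database>" -c "SELECT 1;"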

Externally Accessible and History Server for Windows Azure Storage Blob (WASB) Log Storage

Perform the install with external service and Spark History Server enabled.

Note In this example, the target log storage system is Windows Azure Storage Blob (WASB). However, wherever you see azureblobstorage, you can insert your own storage solution (e.g., ADLS, S3, or GCS).

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.spark_history.enabled=true \
--set global.spark_history.logDirectory=wasbs://spark-history-logs@azureblobstorage.blob.core.windows.net/ \
--set global.spark_history.service.type=<NodePort || LoadBalancer> \
--set global.cloudStorage.wasbs.enableWASBS=true \
--set global.cloudStorage.wasbs.storageContainerName=spark-history-logs \
--set global.cloudStorage.wasbs.storageAccountName=azureblobstorage \
--set global.cloudStorage.wasbs.storageAccountKey=XXXXXXXXXXXXXXXXXXXXXXXXXXX \
<deployment-name> \
/path/to/chart/dq

Note To access Azure Blob Storage, you must have the correct permissions to the Storage Account Key. For more information, go to Connecting to Azure Blob Storage.

Troubleshooting and Helpful Commands

This section provides the most common commands to run when troubleshooting a Collibra DQ environment that is deployed on Kubernetes. For a basic overview of Kubernetes and other relevant background, see the official Kubernetes documentation.

### Provide documentation on syntax and flags in the terminal

kubectl help

### To see how to use Kubernetes resources

kubectl api-resources -o wide

Viewing Kubernetes Resources

### Get Pods, their names & details in all Namespaces

kubectl get pods -A -o wide

### Get all Namespaces in a cluster

kubectl get namespaces

### Get Services in all Namespaces

kubectl get services -A -o wide

### List all deployments in all namespaces:

kubectl get deployments -A -o wide

Logs & Events

### List Events sorted by timestamp in all namespaces

kubectl get events -A --sort-by=.metadata.creationTimestamp

### Get logs from a specific pod:

kubectl logs [my-pod-name]
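
### Follow logs in real time

kubectl logs -f [my-pod-name]

### Get logs from the previous container instance after a restart

kubectl logs [my-pod-name] --previous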

Resource Allocation

### If the Kubernetes Metrics Server is installed,
### the top command allows you to see the resource consumption for nodes or pods

kubectl top node
kubectl top pod

### If the Kubernetes Metrics Server is NOT installed, use

kubectl describe nodes | grep Allocated -A 10 

Configuration

### Get current-context

kubectl config current-context

### See all configs for the entire cluster

kubectl config view

Authorization Issues

### Check to see if I can read pod logs for current user & context

kubectl auth can-i get pods --subresource=log

### Check to see if I can do everything in my current namespace ("*" means all)

kubectl auth can-i '*' '*'