Deploy on Self-hosted Kubernetes

After you have performed the steps in Install on self-hosted Kubernetes, you can begin the deployment of Collibra DQ.

Download Helm Chart Files

After you have your Collibra DQ license, you will receive an email from Collibra that includes the Helm Charts as zip files. Unzip the files on a Linux-compatible deployment location.

Note For background information on how Collibra DQ uses Helm charts, go to Installing Collibra Data Quality & Observability on Self-hosted Kubernetes.

The root directory, dq, contains the following directories and files:

  • drwxrwxr-x -- templates. Directory that contains a set of YAML files, including:
    • k8s-enpoint.secret.yaml
    • _helpers.tpl
    • hadoop-conf.yaml
    • serviceaccount.yaml
    • krb5-conf.yaml
    • rbac.yaml
    • dq-secret.yaml
    • keytab-secret.yaml
  • drwxrwxr-x -- charts. Directory that contains details for the core containers, including:
    • metastore
    • owl-livy
    • owl-agent
    • owl-web
    • spark-history-server
  • -rw-rw-r-- Chart.yaml. File that contains the DQ Helm chart meta-information.
  • -rw-rw-r-- values.yaml. File that contains the DQ configuration values.

You may pass parameters using one of two methods:

  • Using the values.yaml file
  • Using the Helm --set commands

Note The Helm --set commands take precedence over the values.yaml file.
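For example, a value defined in a values file can be overridden on the command line. The sketch below is illustrative: the key paths mirror the --set flags used in the examples in this section, but you should verify them against the values.yaml shipped with your chart version.

```shell
# Hypothetical override file; key paths assumed to mirror the --set flags
# used elsewhere in this section.
cat > my-values.yaml <<'EOF'
global:
  web:
    service:
      type: ClusterIP
EOF

# --set takes precedence over the file, so the deployed service type
# is NodePort, not ClusterIP.
helm upgrade --install --namespace <namespace> \
  -f my-values.yaml \
  --set global.web.service.type=NodePort \
  <deployment-name> /path/to/chart/dq
```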

Minimal Install Settings

The example provided in this section installs the DQ Web, DQ Agent, and DQ Metastore. Collibra DQ is inaccessible until you manually add an Ingress or another type of externally accessible service.

Warning  All of the following examples pull containers directly from the Collibra DQ secured container registry. In most cases, InfoSec policies require that containers are sourced from a private container repository controlled by the local Cloud Ops team. Make sure to add --set global.image.repo=</url/of/private-repo> so that you use only approved containers.

Note  The DQ Metastore container must start first as the other containers use it to write data. On your initial deployment, the other containers might start before the metastore and fail.
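If pods fail on the first rollout because the metastore was not yet ready, you can watch the startup order and restart the failed pods. The commands below are a generic sketch; pod names vary by chart version and deployment name.

```shell
# Watch pod startup; the metastore pod should reach Running/Ready first.
kubectl get pods --namespace <namespace> --watch

# If the web or agent pods failed before the metastore came up, delete them
# so their controllers recreate them against the now-ready metastore.
kubectl delete pod <failed-pod-name> --namespace <namespace>
```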

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=ClusterIP \
--set global.image.repo=<pathToImageRepo> \
<deployment-name> \
/path/to/chart/dq
The following list describes the values in the command:

  • <namespace>: Enter the namespace that you created for this deployment.
  • <cdq-version>: Enter the version from the web image suffix. For example, 2023.11 from the image dq-web:2023.11.
  • <cdq-spark-version>: Enter the Spark version from the Spark image suffix. For example, 3.4.1-2023.11 from the image spark:3.4.1-2023.11.
  • <cdq-license-key>: Enter the license key provided to you by Collibra.
  • <your-license-name>: Enter the license name provided to you by Collibra.
  • ${email}: Enter the default admin user email associated with the admin account.
  • ${password}: Enter the default admin user password for the admin account. The password must adhere to the following password policy:
    • A minimum of 8 characters.
    • A maximum of 72 characters.
    • At least one upper-case character.
    • At least one numeric character.
    • At least one supported special character (!@#%$^&*?_~).
    • Cannot contain the user ID (admin).

    Note If you enter a password that does not meet the password policy, the install process proceeds as though the password is accepted, but the admin user becomes locked out. If this occurs, rerun the Helm command with a password that meets the password policy and restart the web pod.

  • <pathToImageRepo>: The URL of your private registry where the Collibra images are available. If you do not provide this value, the images are pulled from the Collibra image registry, which requires a pull secret created with the repo key provided by Collibra. See Install on Self-hosted Kubernetes for more details.
  • <deployment-name>: Any name of your choice for this deployment.
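If the admin user becomes locked out because of a non-compliant password, the recovery described in the note above can be sketched as follows. The web deployment name is an assumption here; list the deployments first to find the real name.

```shell
# Rerun the Helm command with a policy-compliant password, keeping all other
# flags from the original command unchanged.
helm upgrade --install --namespace <namespace> \
  --set global.web.admin.password='<policy-compliant-password>' \
  <deployment-name> /path/to/chart/dq

# Restart the web pod so it picks up the new credentials. The deployment
# name below is a placeholder; confirm it with the first command.
kubectl get deployments --namespace <namespace>
kubectl rollout restart deployment/<web-deployment-name> --namespace <namespace>
```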

Note If you optionally pass credentials for the Postgres Metastore, ensure that you do not use the $ symbol in the global.metastore.pass variable, as it is an unsupported special character for Postgres Metastore passwords.

The number of possible customizations is extensive and provides a great deal of flexibility across a wide variety of platforms. However, when deploying on a known platform (such as EKS, GKE, or AKS), the number of required inputs is limited. In common cases, you can run a single CLI command with basic parameters, such as disabling the history server, configuring the storage bucket for logs, and specifying the image repository.

Including an Externally Accessible Service

The following examples install Collibra DQ with an externally accessible service.

Minimal Install with an Externally Accessible Service

The following example performs the minimal install and adds a preconfigured NodePort or LoadBalancer service to provide access to the DQ Web application.

Warning  A LoadBalancer service type requires that the Kubernetes platform is integrated with a Software Defined Network solution. This is generally true for the Kubernetes services offered by major cloud vendors. Private cloud platforms more commonly use Ingress controllers. Check with your infrastructure team before attempting to use the LoadBalancer service type.
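After the install, you can confirm how the web service is exposed. This is a generic check; the service name varies by chart version and deployment name.

```shell
# List services in the namespace. For NodePort, note the mapped node port;
# for LoadBalancer, wait for EXTERNAL-IP to be assigned by the SDN integration.
kubectl get svc --namespace <namespace>
```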

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
<deployment-name> \
/path/to/chart/dq

Install with SSL Enabled

The following example performs the install with an externally accessible service, but with SSL enabled.

Note  Ensure you have already deployed a keystore containing a key to the target namespace with a secret name that matches the global.web.tls.key.secretName argument (dq-ssl-secret by default). Also, ensure that the secret's key name matches the global.web.tls.key.store.name argument (dqkeystore.jks by default).
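Creating the keystore secret described in the note might look like the following; the file path is a placeholder.

```shell
# Create the keystore secret in the target namespace. The secret name must
# match global.web.tls.key.secretName and the key name must match
# global.web.tls.key.store.name.
kubectl create secret generic dq-ssl-secret \
  --namespace <namespace> \
  --from-file=dqkeystore.jks=/path/to/dqkeystore.jks
```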

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.web.tls.enabled=true \
--set global.web.tls.key.secretName=dq-ssl-secret \
--set global.web.tls.key.alias=<key-alias> \
--set global.web.tls.key.type=<JKS || PKCS12> \
--set global.web.tls.key.pass=<keystore-pass> \
--set global.web.tls.key.store.name=keystore.jks \
<deployment-name> \
/path/to/chart/dq

Install on OpenShift 4.x with SSL Enabled

The following example performs the install with an externally accessible service on an OpenShift project with RunAsAny enforced, with SSL enabled.

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.web.tls.enabled=true \
--set global.web.tls.key.secretName=dq-ssl-secret \
--set global.web.tls.key.alias=<key-alias> \
--set global.web.tls.key.type=<JKS || PKCS12> \
--set global.web.tls.key.pass=<keystore-pass> \
--set global.web.tls.key.store.name=keystore.jks \
--set global.security.securityContextConstraint.runAsAny=true \
<deployment-name> \
/path/to/chart/dq

Install with History Server for GCS Log Storage

The following example performs the install with an externally accessible service and Spark History Server enabled. In this example, the target log storage system is GCS.

Note For Collibra DQ to be able to write Spark logs to GCS, create a secret from the JSON key file of a service account that has access to the log bucket. For more detailed information, see System Requirements.
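Creating that secret from the service account's JSON key file might look like the following sketch. The secret name and key name here are assumptions; check the chart's values.yaml for the names it expects.

```shell
# Assumed secret and key names; verify against the values.yaml for your
# chart version before running.
kubectl create secret generic dq-gcs-log-secret \
  --namespace <namespace> \
  --from-file=key.json=/path/to/service-account-key.json
```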

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.spark_history.enabled=true \
--set global.spark_history.logDirectory=gs://logs/spark-history/ \
--set global.spark_history.service.type=<NodePort || LoadBalancer> \
--set global.cloudStorage.gcs.enableGCS=true \
<deployment-name> \
/path/to/chart/dq

Install with History Server for S3 Log Storage

The following example performs the install with an externally accessible service and Spark History Server enabled. In this example, the target log storage system is S3.

Note  For Collibra DQ to be able to write Spark logs to S3, make sure that an Instance Profile IAM Role with access to the log bucket is attached to all nodes serving the target namespace. For more detailed information, see System Requirements.

Important We currently support only one set of S3 credentials configured in the Spark History Server at a time within a single DQ Job. If you use both a Spark History Server bucket and an S3 connection, the S3 credentials must also have access to the S3 Spark History Server bucket.
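From a node serving the target namespace, you can verify that the attached instance profile grants access to the log bucket. The bucket path matches the logDirectory value used in the example below.

```shell
# Confirm which IAM identity the node assumes via its instance profile.
aws sts get-caller-identity

# Confirm that the role can list the Spark History Server log prefix.
aws s3 ls s3://logs/spark-history/
```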

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.spark_history.enabled=true \
--set global.spark_history.logDirectory=s3a://logs/spark-history/ \
--set global.spark_history.service.type=<NodePort || LoadBalancer> \
--set global.cloudStorage.s3.enableS3=true \
<deployment-name> \
/path/to/chart/dq

Install with External DQ Metastore

The following example performs the install with an externally accessible service and an external metastore, for example AWS RDS, Google Cloud SQL, or a standalone PostgreSQL instance.

Warning Collibra DQ currently supports PostgreSQL 9.6 and newer.
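Before running the install, you can verify connectivity and the server version from any host that can reach the metastore; the connection values below are placeholders.

```shell
# Check that the external metastore is reachable and meets the 9.6+ requirement.
psql "postgresql://<user>:<password>@<host>:<port>/<database>" -c "SELECT version();"
```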

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.metastore.enabled=false \
--set global.configMap.data.metastore_url=jdbc:postgresql://<host>:<port>/<database> \
--set global.configMap.data.metastore_user=<user> \
--set global.configMap.data.metastore_pass=<password> \
<deployment-name> \
/path/to/chart/dq

Warning The $ symbol is not a supported special character in your Postgres Metastore password.

Install with History Server for Windows Azure Storage Blob (WASB) Log Storage

The following example performs the install with an externally accessible service and Spark History Server enabled. In this example, the target log storage system is Windows Azure Storage Blob (WASB). However, wherever you see azureblobstorage, you can insert your own storage solution (for example, ADLS, S3, or GCS).

Note To access Azure Blob Storage, you must have the correct permissions to the Storage Account Key. For more information, go to Connecting to Azure Blob Storage.
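The storage account key passed to global.cloudStorage.wasbs.storageAccountKey can be retrieved with the Azure CLI; the resource group name below is a placeholder.

```shell
# Retrieve the primary key for the storage account used in the example below.
az storage account keys list \
  --resource-group <resource-group> \
  --account-name azureblobstorage \
  --query "[0].value" --output tsv
```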

helm upgrade --install --namespace <namespace> \
--set global.version.dq=<cdq-version> \
--set global.version.spark=<cdq-spark-version> \
--set global.configMap.data.license_key=<cdq-license-key> \
--set global.configMap.data.license_name=<your-license-name> \
--set global.web.admin.email=${email} \
--set global.web.admin.password=${password} \
--set global.web.service.type=<NodePort || LoadBalancer> \
--set global.spark_history.enabled=true \
--set global.spark_history.logDirectory=wasbs://spark-history-logs@azureblobstorage.blob.core.windows.net/ \
--set global.spark_history.service.type=<NodePort || LoadBalancer> \
--set global.cloudStorage.wasbs.enableWASBS=true \
--set global.cloudStorage.wasbs.storageContainerName=spark-history-logs \
--set global.cloudStorage.wasbs.storageAccountName=azureblobstorage \
--set global.cloudStorage.wasbs.storageAccountKey=XXXXXXXXXXXXXXXXXXXXXXXXXXX \
<deployment-name> \
/path/to/chart/dq

Troubleshooting

Troubleshooting self-hosted Kubernetes Install

What's next?