Cloud

Collibra DQ Cloud is delivered as a Software-as-a-Service offering. This section describes the installation process and details the Edge component for Collibra DQ Cloud.

Requirements

To use Collibra DQ Cloud, you must ensure that the following system requirements are met.

Resource            | Notes                                             | Provisioned by
Collibra DQ         | Version 2022.02 or newer with Edge mode enabled.  | Collibra
Collibra Edge Site  | Version 2022.02 or newer.                         | Customer
Postgres            | Version 11 or newer.                              | Customer

Diagram

The following diagram shows the role of Edge and how it connects to Collibra DQ. Edge connects to your data sources and then provides data from them back to Collibra DQ, where you can scan and profile your data for quality issues.

Future-state visuals can be found on the diagrams page.

Prerequisites

Your environment must meet the following minimum hardware and software requirements:

VM

This is where your Edge site is installed. You need:

Note For medium to large workloads of more than 100M rows by 100 columns, we recommend that your VM has a minimum of 32 cores, 128 GB memory, and 500 GB of free storage.

For the full Edge installation requirements, refer to the Edge documentation.
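As a rough preflight check, the sizing recommended in the note above can be compared against what the VM reports. This is a minimal sketch, not part of the Edge installer: the function names meets_min and report are hypothetical, and it assumes a Linux VM with nproc and /proc/meminfo available.

```shell
#!/usr/bin/env bash
# Sketch of a preflight check for the recommended VM sizing (32 cores, 128 GB).
# meets_min and report are hypothetical helper names, not Collibra tooling.

# meets_min ACTUAL REQUIRED: succeeds when ACTUAL >= REQUIRED (integers).
meets_min() {
  [ "$1" -ge "$2" ]
}

# report LABEL ACTUAL REQUIRED: prints one OK/LOW line per resource.
report() {
  if meets_min "$2" "$3"; then
    echo "OK   $1: $2 (recommended >= $3)"
  else
    echo "LOW  $1: $2 (recommended >= $3)"
  fi
}

cores=$(nproc)
# MemTotal is reported in kB; convert to whole GB. The kernel reserves some
# memory, so a nominal 128 GB VM may report slightly less than 128 here.
mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null)

report "cores" "$cores" 32
report "memory_gb" "${mem_gb:-0}" 128
```

A LOW line does not block installation; it only flags that the VM is below the sizing recommended for medium to large workloads.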

Postgres

This is where your DQ Job results are stored. You need:

  • Version 11 or newer
  • A minimum of 100 GB of free storage
  • A minimum of 4 cores
  • Network access to and from the VM where Edge is installed
  • User with ownership rights over the target database
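The version requirement in the list above can be checked from the command line before you continue. The sketch below assumes the standard psql client is installed; pg_major and pg_version_ok are hypothetical helper names, and the connection values are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a Postgres server is version 11 or newer.
# pg_major and pg_version_ok are hypothetical helpers for illustration.

# pg_major VERSION_STRING: prints the major version, e.g. "13.4" -> 13.
pg_major() {
  printf '%s\n' "${1%%.*}"
}

# pg_version_ok VERSION_STRING: succeeds when the major version is 11 or newer.
pg_version_ok() {
  [ "$(pg_major "$1")" -ge 11 ]
}

# Example usage (commented out because it needs a reachable server).
# psql prompts for the password or reads it from PGPASSWORD.
# version=$(psql -h "<postgres-ip>" -p 5432 -U "<postgres-user>" -d postgres -tAc 'SHOW server_version;')
# pg_version_ok "$version" && echo "Postgres $version meets the requirement"
```

Running the query from the VM where Edge will be installed also confirms that the VM can reach the database.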

Steps

1. Obtain a Secure Collibra DQ Web URL

This is provisioned by Collibra. Along with the URL, credentials will be provided to access your instance.

2. Install Postgres

This is provisioned by the customer. There are several ways to install Postgres. You should follow your existing company process to provision a Postgres instance, such as RDS, Azure SQL, Cloud SQL, or standard install with a package manager. Ensure your version is 11 or newer.

Important Record your Postgres IP and login credentials. You need them when deploying the Edge site.

3. Install Edge

Refer to Edge documentation for system requirements.

Navigate to the Edge Site Management panel in the Admin Console.

Add an Edge Site and provide a name and description.

Using the Actions drop-down, download the Edge installer package locally.

Warning Because connections must have an exact relationship between the Edge site and the datasource hostname, do not delete your Edge Site from the Edge Site Management page.

Upload the Edge installer package to a VM that meets the Edge system requirements above. The following scp command is one example, but any file transfer method works.

scp -i ~/Downloads/vm-key.pem ~/Downloads/<installer>.tgz user@<host-or-ip>:/home/user/<installer>.tgz

SSH to your VM after uploading the installer package, then extract the .tgz archive.

tar -xvf <installer>.tgz

Install prerequisite Edge packages.

sudo yum install -y container-selinux selinux-policy-base

sudo yum install -y https://rpm.rancher.io/k3s/stable/common/centos/8/noarch/k3s-selinux-1.2.2.el8.noarch.rpm

sudo firewall-cmd --zone=trusted --add-interface=lo --permanent

sudo firewall-cmd --zone=trusted --add-interface=cni0 --permanent

sudo firewall-cmd --reload

Confirm you have the correct Collibra DQ version pointer from your Cloud instance, for example 2023.01-ABDGCSHILM-2095.


Remember your Postgres IP and credentials from the previous step.

Install Edge with DQ enabled, using the correct parameters.

sudo /home/centos/install-master.sh --storage-path /var/edge properties.yaml -r registries.yaml --set collibra_edge.collibra.dq.enabled=true,collibra_edge.collibra.dq.targetRevision=2023.01-ABDGCSHILM-2095,collibra_edge.collibra.dq.sparkVersion=3.4.1,collibra_edge.collibra.dq.metastoreUrl=jdbc:postgresql://<your-postgres-ip>:5432/postgres,collibra_edge.collibra.dq.metastoreUser=<your-postgres-user>,collibra_edge.collibra.dq.metastorePass=<your-postgres-password>

The snippet below is the same command as above, with the values you need to edit shown as <placeholders>.

sudo /home/<your-directory>/install-master.sh --storage-path /var/edge properties.yaml -r registries.yaml --set collibra_edge.collibra.dq.enabled=true,collibra_edge.collibra.dq.targetRevision=<dq-version>,collibra_edge.collibra.dq.sparkVersion=<spark-version>,collibra_edge.collibra.dq.metastoreUrl=jdbc:postgresql://<postgres-ip>:<postgres-port>/<postgres_database_name>,collibra_edge.collibra.dq.metastoreUser=<postgres-user>,collibra_edge.collibra.dq.metastorePass=<postgres-password>

Warning The $ symbol is not a supported special character in your Postgres Metastore password.

Note Because Collibra DQ automatically derives its version from the Collibra Data Intelligence Cloud version, you do not need to specify the version of Collibra DQ for successful automatic upgrades.
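Because the --set value is long and easy to mistype, one option is to assemble it from variables and reject unsupported passwords before running the installer. The following is only a sketch: build_dq_set_args is a hypothetical helper, not part of the Edge installer, and it simply reproduces the parameter names from the command above.

```shell
#!/usr/bin/env bash
# Sketch: build the --set string for install-master.sh from individual values.
# build_dq_set_args is a hypothetical helper, not shipped with Edge.

# build_dq_set_args VERSION SPARK_VERSION PG_HOST PG_PORT PG_DB PG_USER PG_PASS
build_dq_set_args() {
  local version="$1" spark="$2" host="$3" port="$4" db="$5" user="$6" pass="$7"

  # The $ symbol is not supported in the metastore password, so fail early.
  case "$pass" in
    *'$'*)
      echo "error: metastore password must not contain \$" >&2
      return 1
      ;;
  esac

  local p="collibra_edge.collibra.dq"
  printf '%s' "${p}.enabled=true,${p}.targetRevision=${version},${p}.sparkVersion=${spark},${p}.metastoreUrl=jdbc:postgresql://${host}:${port}/${db},${p}.metastoreUser=${user},${p}.metastorePass=${pass}"
}

# Example usage (commented out; replace every placeholder first):
# set_args=$(build_dq_set_args "<dq-version>" "<spark-version>" "<postgres-ip>" \
#   "<postgres-port>" "<postgres_database_name>" "<postgres-user>" "<postgres-password>") || exit 1
# sudo /home/<your-directory>/install-master.sh --storage-path /var/edge \
#   properties.yaml -r registries.yaml --set "$set_args"
```

Catching a bad password before the install starts avoids an uninstall and reinstall cycle.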

Check that all pods are running or completed.

sudo /usr/local/bin/kubectl get pods --all-namespaces

Your Edge site appears as HEALTHY upon successful installation.
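On a busy cluster the pod listing can be long. A small filter, sketched below, prints only the pods that are neither Running nor Completed. The helper name unhealthy_pods is hypothetical; it relies on the --no-headers option of kubectl get and on STATUS being the fourth column when --all-namespaces is used.

```shell
#!/usr/bin/env bash
# Sketch: list pods whose STATUS is neither Running nor Completed.
# unhealthy_pods is a hypothetical helper name.

# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage (commented out because it needs a running Edge site):
# sudo /usr/local/bin/kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

An empty result means every pod is in one of the two expected states.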

Uninstall Edge if you made mistakes or typos during the process.

sudo /usr/local/bin/uninstall-edge.sh --force

Note Do not delete an Edge Site from the UI. You can safely uninstall an Edge Site with this command.

If you uninstall Edge, reinstall the prerequisites before installing again.

sudo yum localinstall --skip-broken -y https://rpm.rancher.io/k3s/stable/common/centos/8/noarch/k3s-selinux-1.2.2.el8.noarch.rpm

Warning To avoid orphaned records, do not delete an Edge Site using the UI.

4. Configure an Agent

Go to the Remote Agent panel in the Admin Console.

Upon completion of the Edge installation, you'll find an agent available from each respective Edge Site. Click the pencil icon to configure the agent.

Change the Default Deploy Mode to Cluster and the Default Masters to K8s, and enter default values for resource assignment. Also add the following Spark confs as a freeform append.

-conf spark.kubernetes.executor.limit.cores=1,spark.kubernetes.driver.limit.cores=1

Note The DQ Job (Spark) compute takes place locally on Edge K3s. To scale vertically, increase the size of your VM (for example, to 32 cores and more RAM). This is the preferred option in beta. Hadoop compute is supported if you choose that path and use your own Dataproc or EMR cluster.

Note Make note of the agent name that was created. In the following step, you create a connection and link the agent to it.

Warning To avoid orphaned records, do not delete an Agent from the UI.

5. Set Job Limits

Set max cores to 1 in the job limit settings.

Refer to the job limits documentation for configuration details.

6. Add a Connection

This is the same process described in Adding Connections, with one difference: you map the connection to your agent when you create the connection. This is different from the self-hosted application, where you map a connection and an agent separately.

Select your target agent using the Target Agent drop-down, which is populated with existing agents. Select the agent name you noted in the previous step.

Afterward, you do not need to assign the connection to the agent. It will be automatically mapped.

Note To map a connection to another agent, you need to re-save the connection and select another agent from the drop-down list.

7. Run a DQ Job

Run a DQ Job to validate the installation. Use the Explorer to onboard a table and check the Jobs page as normal to see the status.

Note If the DQ Job does not succeed, check your Agent settings and system prerequisites.

Notes

Edge Capability Resource Requirements: If resources are insufficient, your capabilities will not perform properly.

Installer: Downloading a new installer invalidates the previous installer.

Volume: The /var/lib/rancher/k3s path must have 50 GB available.

Root access: Root access is needed, though future revisions will follow least-privilege user access policies.
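The volume requirement can be checked before installation, when /var/lib/rancher/k3s may not exist yet. The sketch below checks the filesystem of the closest existing parent directory instead. The helper names are hypothetical, and the check assumes a POSIX df.

```shell
#!/usr/bin/env bash
# Sketch: check that the filesystem backing a path has enough free space.
# nearest_existing_dir, free_gb, and has_free_gb are hypothetical helpers.

# nearest_existing_dir PATH: prints PATH or its closest existing ancestor.
nearest_existing_dir() {
  dir="$1"
  while [ ! -d "$dir" ] && [ "$dir" != "/" ]; do
    dir=$(dirname "$dir")
  done
  printf '%s\n' "$dir"
}

# free_gb PATH: available space in whole GB on the filesystem holding PATH.
free_gb() {
  # df -Pk prints one line per filesystem; column 4 is available space in KB.
  df -Pk "$(nearest_existing_dir "$1")" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_free_gb PATH MIN_GB: succeeds when at least MIN_GB is available.
has_free_gb() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

if has_free_gb /var/lib/rancher/k3s 50; then
  echo "OK: at least 50 GB available for /var/lib/rancher/k3s"
else
  echo "LOW: less than 50 GB available for /var/lib/rancher/k3s"
fi
```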

The private beta is designed to let customers 1) complete the installation, 2) confirm that DQ Jobs run successfully, and 3) validate their security requirements, whereby no sensitive data is stored outside their custody.

Helpful Commands

# Get all pods running
sudo /usr/local/bin/kubectl get pods --all-namespaces

# Get shell access to the DQ web pod
sudo /usr/local/bin/kubectl exec -it <dq-web-pod> -n collibra-edge -- bash

# Get shell access to the Edge controller pod
sudo /usr/local/bin/kubectl exec -it collibra-edge-controller-<pod-name> -n collibra-edge -- sh

# Check network connectivity to the database
curl telnet://<rds-host>:<port>

# Delete a job pod
sudo /usr/local/bin/kubectl delete pod <pod-name> -n collibra-edge

Known limitations

  • DQ Cloud does not currently support SAML configuration.
  • When reviewing the Completeness Report, new data only displays correctly after you upgrade your Collibra DQ Cloud instance to version 2023.05.2 or later.
  • If an Edge site needs to be reinstalled, you must use the original PostgreSQL metastore database or metastore corruption may occur in Collibra DQ. If necessary, restore the metastore database from a backup before reinstalling. Ensure that the installation command line parameter collibra_edge.collibra.dq.metastoreUrl points to the correct database.
  • Instance Profile is not supported for S3 connections.
  • When using the Completeness Report, data only appears after upgrading to 2023.06 or later.
  • When using the Findings page, you currently cannot drill into a rule break record. While there is no workaround for this limitation, a fix is planned for the 2023.06 release.
  • When using the Findings page, you currently cannot tag job runs as off-peak. This will be fixed in the 2023.07 release.
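For the reinstallation limitation above, taking a backup of the metastore database before you uninstall gives you something to restore from. The sketch below uses the standard pg_dump and pg_restore clients. The function names and the dump file naming scheme are hypothetical, and your own backup process may differ.

```shell
#!/usr/bin/env bash
# Sketch: back up and restore the DQ metastore database with pg_dump/pg_restore.
# backup_file, backup_metastore, and restore_metastore are hypothetical helpers.

# backup_file DB DATE: hypothetical naming scheme for the dump file.
backup_file() {
  printf '%s-metastore-%s.dump\n' "$1" "$2"
}

# backup_metastore HOST PORT USER DB: writes a custom-format (-Fc) dump.
backup_metastore() {
  pg_dump -h "$1" -p "$2" -U "$3" -Fc -f "$(backup_file "$4" "$(date +%Y%m%d)")" "$4"
}

# restore_metastore HOST PORT USER DB FILE: restores a custom-format dump,
# dropping existing objects first (--clean --if-exists).
restore_metastore() {
  pg_restore -h "$1" -p "$2" -U "$3" -d "$4" --clean --if-exists "$5"
}

# Example usage (commented out because it needs a reachable server):
# backup_metastore "<postgres-ip>" 5432 "<postgres-user>" "<postgres_database_name>"
```

After a restore, confirm that collibra_edge.collibra.dq.metastoreUrl in the installation command still points to the restored database.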

FAQ

What network access is needed?

  • The Edge Site and Postgres need to communicate with each other.
  • Additionally, logging and heartbeat requires outbound access to several services. Please refer to Edge documentation for specific services that are used.
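To confirm that the Edge VM can reach Postgres, a plain TCP check is often enough. The sketch below is an alternative to the curl telnet:// command and assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available on the VM. port_open is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port is reachable from this machine.
# port_open is a hypothetical helper name.

# port_open HOST PORT [TIMEOUT_SECONDS]: succeeds when a TCP connection opens.
port_open() {
  # Inside the inner bash, $0 is HOST and $1 is PORT.
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage (commented out; replace the placeholders first):
# if port_open "<postgres-ip>" 5432; then
#   echo "Postgres port is reachable"
# else
#   echo "Cannot reach Postgres; check firewalls and security groups"
# fi
```

A successful connection only shows that the port is open. It does not validate credentials or database permissions.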

How can a user check the install?

  • Time: The install should complete in about 5 minutes; if it takes longer, there is likely an issue.
  • Check that the pods are running or completed:
  • sudo /usr/local/bin/kubectl get pods --all-namespaces

Is there a way to get more checks / more logs?

  • sudo /usr/local/bin/kubectl describe pod <pod-name> -n collibra-edge

How do I verify a successful install?

  • In your Collibra DQ instance, navigate to the Edge Site Management panel in the Admin Console and confirm a HEALTHY status.
  • Support can also confirm via Datadog, because the Edge site sends heartbeats.

How do I locate my Edge site in Datadog?

  • Send your Edge Site ID to Support to check the health status.

Do customers have access to Datadog?

  • Only Collibra has access to Datadog logging.

Can all my Collibra DQ and other capabilities run on the same Edge Site?

  • There are no technical reasons preventing other capabilities and Collibra DQ from running on the same Edge Site.
  • The guidance for the beta is to have DQ Edge separate from DGC Edge capabilities and simply use two Edge sites.

Are there any limitations with Collibra DQ Cloud in terms of features or functionality?

  • Remote files are supported, but local files and uploaded files are not supported due to security restrictions.
  • Some drivers are not available in the beta, though the most common data sources are available.

What are the benefits of installing with Edge vs. a stand-alone, self-hosted application?

  • The primary benefits are managed upgrades, maintenance, and reducing the ownership costs of an entirely self-hosted set of components.
  • In addition, this design allows customers to take advantage of containers and cloud technologies without deep technical skillset requirements.
  • This installation pattern was intentionally developed to not compromise any security requirements and give the customer complete custody of their data.
  • Lastly, this aligns with the Collibra architecture standards, so support and services teams benefit from normalized deployment models, particularly for installation, configuration, and troubleshooting.

Troubleshooting

When upgrades of DQ Edge sites are required, you can leverage a utility script to update the Edge DQ version without reinstalling the Edge site.

Important You must run the override-dq.sh script from a machine that is configured to manage the Kubernetes cluster where the DQ Edge site is running, for example, as the same user on the same VM where DQ Edge is installed.