Cloud
Collibra DQ Cloud is delivered as a Software-as-a-Service offering. This section describes the installation process and details the Edge component for Collibra DQ Cloud.
Requirements
To use Collibra DQ Cloud, you must ensure that the following system requirements are met.
Resource | Notes | Provisioned by |
---|---|---|
Collibra DQ | Version 2022.02 or newer with Edge mode enabled. | Collibra |
Collibra Edge Site | Version 2022.02 or newer. | Customer |
Postgres | Version 11 or newer. | Customer |
Diagram
The following diagram shows the role of Edge and how it connects to Collibra DQ. Edge connects to your data sources and then provides data from them back to Collibra DQ, where you can scan and profile your data for quality issues.
Prerequisites
Your environment must meet the following minimum hardware and software requirements:
VM
This is where your Edge site is installed. You need:
- RedHat 8 or CentOS 8
- SSH access
- 55 GB of free storage
- 64 GB memory
- 16 cores
- Egress (outbound) network access on port 443
- Network access to Postgres installed in step 2
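Run the following command to verify free storage:
df -B G
Run the following command to verify memory:
free -m
Run the following command to verify the number of cores:
cat /proc/cpuinfo
Run the following command to verify egress network access on port 443:
wget https://google.com
Run the following command to verify network access to Postgres, replacing the example IP address with your Postgres IP:
telnet 10.64.2.3 5432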
Note For medium to large workloads of more than 100M rows by 100 columns, we recommend that your VM has a minimum of 32 cores, 128 GB memory, and 500 GB of free storage.
For the full Edge installation requirements, refer to the Edge documentation.
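The checks above can be combined into one pass/fail report. The following is a minimal sketch, not an official Collibra tool: the function names `meets_min`, `report`, and `check_vm` are hypothetical, and the thresholds are the minimums listed on this page. It assumes a Linux VM with GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, thresholds come from this page.

# meets_min ACTUAL REQUIRED -> exit status 0 when ACTUAL >= REQUIRED
meets_min() {
  [ "$1" -ge "$2" ] 2>/dev/null
}

# report LABEL ACTUAL REQUIRED -> prints one PASS or FAIL line
report() {
  if meets_min "$2" "$3"; then
    echo "PASS $1: $2 (minimum $3)"
  else
    echo "FAIL $1: $2 (minimum $3)"
  fi
}

# check_vm [PATH] -> measures the VM and reports against the minimums.
# PATH is the filesystem to check for free space (assumed default: /).
check_vm() {
  local path="${1:-/}" disk_gb mem_gb cores
  # Available space in GB on the filesystem that holds PATH.
  disk_gb=$(df -BG --output=avail "$path" | tail -n 1 | tr -dc '0-9')
  # Total memory in whole GB. The kernel reports slightly less than the
  # installed RAM, so review a near miss manually.
  mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
  # Number of CPU cores available to this shell.
  cores=$(nproc)
  report "free storage (GB)" "$disk_gb" 55
  report "memory (GB)" "$mem_gb" 64
  report "CPU cores" "$cores" 16
}
```

To run the checks, call `check_vm` with the path you plan to use for Edge storage, for example `check_vm /var/edge`.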
Postgres
This is where your DQ Job results are stored. You need:
- Version 11 or newer
- A minimum of 100 GB of free storage
- A minimum of 4 cores
- Network access to and from the VM where Edge is installed
- User with ownership rights over the target database
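To confirm the version requirement before you continue, you can compare the server version string against the minimum. This is a minimal sketch: `pg_version_ok` is a hypothetical helper name, and the commented `psql` call assumes that the `psql` client is installed on the machine you run it from.

```shell
#!/usr/bin/env bash
# pg_version_ok VERSION -> exit status 0 when the major version is 11 or newer.
# VERSION is the string returned by "SHOW server_version", for example "14.9".
pg_version_ok() {
  # Keep only the leading digits: "14.9 (Ubuntu ...)" -> "14", "9.6.24" -> "9".
  local major="${1%%[!0-9]*}"
  [ "$major" -ge 11 ] 2>/dev/null
}

# Example (requires psql and network access to the database):
#   version=$(psql -h <postgres-ip> -p 5432 -U <postgres-user> -d postgres -tAc 'SHOW server_version')
#   pg_version_ok "$version" && echo "Postgres version OK"
```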
Steps
1. Obtain a Secure Collibra DQ Web URL
This is provisioned by Collibra. Along with the URL, credentials will be provided to access your instance.
2. Install Postgres
This is provisioned by the customer. There are several ways to install Postgres. Follow your existing company process to provision a Postgres instance, such as Amazon RDS, Azure Database for PostgreSQL, Cloud SQL, or a standard install with a package manager. Ensure your version is 11 or newer.
Important Remember your Postgres IP and login credentials. They are required when deploying the Edge site.
3. Install Edge
Refer to Edge documentation for system requirements.
Navigate to the Edge Site Management panel in the Admin Console
Add an Edge Site and provide a name and description
Using the Actions drop-down, download the Edge installer package locally
Warning Because connections must have an exact relationship between the Edge site and the datasource hostname, do not delete your Edge Site from the Edge Site Management page.
Upload the Edge installer package to a VM that meets the Edge system requirements above. The following scp command is one example; other transfer methods work as well.
scp -i ~/Downloads/vm-key.pem ~/Downloads/<installer>.tgz user@<host-or-ip>:/home/user/<installer>.tgz
SSH to your VM after uploading the installer package, then extract the .tgz archive.
tar -xvf <installer>.tgz
Install prerequisite Edge packages.
sudo yum install -y container-selinux selinux-policy-base
sudo yum install -y https://rpm.rancher.io/k3s/stable/common/centos/8/noarch/k3s-selinux-1.2.2.el8.noarch.rpm
sudo firewall-cmd --zone=trusted --add-interface=lo --permanent
sudo firewall-cmd --zone=trusted --add-interface=cni0 --permanent
sudo firewall-cmd --reload
Confirm you have the correct Collibra DQ version pointer from your Cloud instance, for example 2023.01-ABDGCSHILM-2095.
Remember your Postgres IP and credentials from the previous step.
Install Edge with DQ using the correct parameters.
sudo /home/centos/install-master.sh --storage-path /var/edge properties.yaml -r registries.yaml --set collibra_edge.collibra.dq.enabled=true,collibra_edge.collibra.dq.targetRevision=2023.01-ABDGCSHILM-2095,collibra_edge.collibra.dq.sparkVersion=3.4.1,collibra_edge.collibra.dq.metastoreUrl=jdbc:postgresql://<your-postgres-ip>:5432/postgres,collibra_edge.collibra.dq.metastoreUser=<your-postgres-user>,collibra_edge.collibra.dq.metastorePass=<your-postgres-password>
The following command is the same as the one above, with the values you need to edit shown as placeholders in angle brackets.
sudo /home/<your-directory>/install-master.sh --storage-path /var/edge properties.yaml -r registries.yaml --set collibra_edge.collibra.dq.enabled=true,collibra_edge.collibra.dq.targetRevision=<dq-version>,collibra_edge.collibra.dq.sparkVersion=<spark-version>,collibra_edge.collibra.dq.metastoreUrl=jdbc:postgresql://<postgres-ip>:<postgres-port>/<postgres_database_name>,collibra_edge.collibra.dq.metastoreUser=<postgres-user>,collibra_edge.collibra.dq.metastorePass=<postgres-password>
Warning The $ symbol is not a supported special character in your Postgres Metastore password.
Note Because Collibra DQ automatically derives its version from the Collibra Data Intelligence Platform version, you do not need to specify the version of Collibra DQ for successful automatic upgrades.
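Because the --set value is one long comma-separated string, it can help to assemble it from variables and to reject a password that contains the unsupported $ character before you run the installer. This is a minimal sketch: `build_dq_set_args` is a hypothetical helper, and the keys are copied from the install command above.

```shell
#!/usr/bin/env bash
# build_dq_set_args REVISION SPARK_VERSION METASTORE_URL USER PASSWORD
# Prints the comma-separated value for the installer's --set option.
# Returns 1 if the password contains "$", which is not supported.
build_dq_set_args() {
  local revision="$1" spark="$2" url="$3" user="$4" pass="$5"
  case "$pass" in
    *'$'*)
      echo "error: the metastore password must not contain the \$ character" >&2
      return 1
      ;;
  esac
  local p="collibra_edge.collibra.dq"
  printf '%s' "${p}.enabled=true,${p}.targetRevision=${revision},${p}.sparkVersion=${spark},${p}.metastoreUrl=${url},${p}.metastoreUser=${user},${p}.metastorePass=${pass}"
}

# Example (placeholders as in the command above):
#   args=$(build_dq_set_args '<dq-version>' '<spark-version>' \
#     'jdbc:postgresql://<postgres-ip>:<postgres-port>/<postgres_database_name>' \
#     '<postgres-user>' '<postgres-password>') &&
#   sudo /home/<your-directory>/install-master.sh --storage-path /var/edge properties.yaml -r registries.yaml --set "$args"
```

Single-quote the password when you pass it to the helper so that the shell does not expand any special characters in it.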
Check that all pods are running or completed.
sudo /usr/local/bin/kubectl get pods --all-namespaces
Your Edge site will appear as HEALTHY upon successful installation
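On a busy node the pod list can be long. The following sketch filters it down to the pods that are neither Running nor Completed. `unhealthy_pods` is a hypothetical helper, and it assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash
# unhealthy_pods: reads "kubectl get pods --all-namespaces --no-headers"
# output on stdin and prints "namespace/name status" for every pod whose
# STATUS column (field 4) is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example:
#   sudo /usr/local/bin/kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

An empty result means that every pod reports Running or Completed. A pod in Running status can still have containers that are not ready, so also review the READY column if the Edge site does not become HEALTHY.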
Uninstall Edge if there were mistakes or typos in the process
sudo /usr/local/bin/uninstall-edge.sh --force
Note Do not delete an Edge Site from the UI. You can safely uninstall an Edge Site with this command.
Reinstall the prerequisites if you perform the uninstall
sudo yum localinstall --skip-broken -y https://rpm.rancher.io/k3s/stable/common/centos/8/noarch/k3s-selinux-1.2.2.el8.noarch.rpm
Warning To avoid orphaned records, do not delete an Edge using the UI.
4. Configure an Agent
Go to the Remote Agent panel in the admin console
Upon completion of the Edge installation, you'll find an agent available from each respective Edge Site. Click the pencil icon to configure the agent.
Change the Default Deploy Mode to Cluster and the Default Masters to K8s, and enter defaults for resource assignment. Also add the following Spark confs as a freeform append:
-conf spark.kubernetes.executor.limit.cores=1,spark.kubernetes.driver.limit.cores=1
Note The DQ Job (Spark) compute takes place locally on Edge K3s. To scale vertically, increase the size of your VM (for example, 32 cores and more RAM). This is the preferred option in beta. Hadoop compute is supported if you choose that path and use your own Dataproc or EMR cluster.
Note Make a note of the agent name that was created. In a later step, you will create a connection and select (link) the agent to your connection.
Warning To avoid orphaned records, do not delete an Agent from the UI.
5. Set Job Limits
Set max cores to 1 in the job limit settings.
Refer to the job limits documentation for configuration details.
6. Add a Connection
This is the same process as the one described in Adding Connections, with one difference: you map the connection to your agent when you establish the connection. This differs from mapping a connection and an agent in the self-hosted application.
Select your target agent using the Target Agent drop-down. This drop-down will populate with existing agents. Here is where you will select the agent name from the previous step.
Afterward, you do not need to assign the connection to the agent. It will be automatically mapped.
Note To map a connection to another agent, you need to re-save the connection and select another agent from the drop-down list.
7. Run a DQ Job
Run a DQ Job to validate the installation. Use the Explorer to onboard a table and check the Jobs page as normal to see the status.
Note If the DQ Job does not succeed, check your Agent settings and system prerequisites.
Notes
Edge Capability Resource Requirements: If resources are insufficient, your capabilities will not perform properly.
Installer: Downloading a new installer invalidates the previous installer.
Volume: The /var/lib/rancher/k3s path must have 50 GB available.
Root access: Root access is needed, though future revisions will follow least-privilege user access policies.
The private beta is designed to let customers 1) complete the installation 2) confirm successful DQ jobs can be run and 3) validate their security requirements whereby no sensitive data is stored outside their custody.
Helpful Commands
# Get all pods running
sudo /usr/local/bin/kubectl get pods --all-namespaces
# Get shell access to the DQ web pod
sudo /usr/local/bin/kubectl exec -it <dq-web-pod> -n collibra-edge -- bash
# Get shell access to the Edge controller pod
sudo /usr/local/bin/kubectl exec -it collibra-edge-controller-<pod-name> -n collibra-edge -- sh
# Check network connectivity to database
curl telnet://<rds-host>:<port>
# Delete a job pod
sudo /usr/local/bin/kubectl delete pod <pod-name> -n collibra-edge
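If neither telnet nor curl is installed on the VM, the connectivity check can be sketched with Bash alone. `port_open` is a hypothetical helper name. It relies on the /dev/tcp redirection built into Bash and on the `timeout` command from GNU coreutils.

```shell
#!/usr/bin/env bash
# port_open HOST PORT -> exit status 0 when a TCP connection to HOST:PORT
# succeeds within 5 seconds, non-zero otherwise.
port_open() {
  # A child bash opens the connection on file descriptor 3. The connection
  # closes when the child exits. $0 and $1 inside the child are HOST and PORT.
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example:
#   port_open <rds-host> <port> && echo "database reachable" || echo "database NOT reachable"
```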
Known limitations
- DQ Cloud does not currently support SAML configuration.
- When reviewing the Completeness Report, new data only displays correctly after you upgrade your Collibra DQ Cloud instance to version 2023.05.2 or later.
- If an Edge site needs to be reinstalled, you must use the original PostgreSQL metastore database or metastore corruption may occur in Collibra DQ. If necessary, restore the metastore database from a backup before reinstalling. Ensure that the installation command line parameter collibra_edge.collibra.dq.metastoreUrl points to the correct database.
- Instance Profile is not supported for S3 connections.
- When using the Completeness Report, data only appears after upgrading to 2023.06 or later.
- When using the Findings page, you currently cannot drill into a rule break record. While there is no workaround for this limitation, a fix is planned for the 2023.06 release.
- When using the Findings page, you currently cannot tag job runs as off-peak. This will be fixed in the 2023.07 release.
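For the reinstall limitation above, one way to have a backup available is to dump the metastore with `pg_dump` before you uninstall. The following sketch derives the `pg_dump` command from the same JDBC URL that you pass as collibra_edge.collibra.dq.metastoreUrl, so the backup targets the database that Edge actually uses. `jdbc_to_pg_dump` is a hypothetical helper. It only prints the command and assumes a URL of the form jdbc:postgresql://host[:port]/database.

```shell
#!/usr/bin/env bash
# jdbc_to_pg_dump JDBC_URL USER OUTFILE
# Prints (does not run) a pg_dump command for the database in JDBC_URL.
jdbc_to_pg_dump() {
  local rest="${1#jdbc:postgresql://}"   # host[:port]/database[?params]
  local hostport="${rest%%/*}"           # host[:port]
  local db="${rest#*/}"
  db="${db%%\?*}"                        # drop any ?key=value parameters
  local host="${hostport%%:*}"
  local port="${hostport##*:}"
  # No colon in host[:port] means no port was given; use the Postgres default.
  if [ "$port" = "$hostport" ]; then
    port=5432
  fi
  # -F c selects the custom archive format, which pg_restore can read.
  echo "pg_dump -h $host -p $port -U $2 -d $db -F c -f $3"
}

# Example:
#   jdbc_to_pg_dump 'jdbc:postgresql://<postgres-ip>:5432/postgres' '<postgres-user>' metastore.dump
```

Review the printed command, then run it from a machine that has the Postgres client tools installed and network access to the database.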
FAQ
What network access is needed?
- The Edge Site and Postgres need to communicate with each other.
- Additionally, logging and heartbeat require outbound access to several services. Refer to the Edge documentation for the specific services that are used.
How can a user check the install?
- Time: The install should complete in about 5 minutes; if not, there is likely an issue.
- Check that the pods are running or completed:
sudo /usr/local/bin/kubectl get pods --all-namespaces
Is there a way to get more checks / more logs?
sudo /usr/local/bin/kubectl describe pod <pod-name> -n collibra-edge
How to verify successful install?
- In your Collibra DQ instance, navigate to the Edge Site Management panel in the Admin Console and confirm a HEALTHY status
- Support can confirm via Datadog, because the Edge site sends heartbeats.
How to locate my Edge site in Datadog?
- Send your Edge Site ID to Support to check the health status.
Do customers have access to Datadog?
- Only Collibra has access to Datadog logging.
Can all my Collibra DQ and other capabilities run on the same Edge Site?
- There are no technical reasons preventing other capabilities and Collibra DQ from running on the same Edge Site.
- The guidance for the beta is to have DQ Edge separate from DGC Edge capabilities and simply use two Edge sites.
Are there any limitations with Collibra DQ Cloud in terms of features or functionality?
- While remote files are supported, local files and uploaded files are not supported due to security restrictions
- Specific drivers are not available in the beta, though the most common data sources are available.
What are the benefits of installing with Edge vs. a stand-alone, self-hosted application?
- The primary benefits are managed upgrades, maintenance, and reducing the ownership costs of an entirely self-hosted set of components.
- In addition, this design allows customers to take advantage of containers and cloud technologies without deep technical skillset requirements.
- This installation pattern was intentionally developed to not compromise any security requirements and give the customer complete custody of their data.
- Lastly, this aligns with the Collibra architecture standards, so support and services teams benefit from normalized deployment models, particularly for installation, configuration, and troubleshooting.
Troubleshooting
When upgrades of DQ Edge sites are required, you can leverage a utility script to update the Edge DQ version without reinstalling the Edge site.
Important You must run the override-dq.sh script from a machine that is configured to manage the Kubernetes cluster where the DQ Edge site is running, for example, as the same user and on the same VM where DQ Edge is installed.
- Download the override-dq.sh script.
- Open a terminal session and execute the following shell commands:
# Give your .sh file execute permission
chmod +x override-dq.sh
# Run the script
./override-dq.sh
- Choose one of the following options:
  - (Recommended) Remove the version override so that your Edge site updates according to the DQ Cloud version.
# Run the script and unset the DQ service version override at the Edge site
./override-dq.sh --unset
  - Set the version to a specific Collibra DQ version.
    - Click in the upper right corner of the application, then copy the App Version, for example, 2023.08-ABDGCSHILM-3280.
    - Paste the copied App Version into the version.
# Run the script and update the target DQ version to set at the Edge site
./override-dq.sh --dq-target-version <20XX.XX-ABDGCSHILM-XXX>
Tip You can use either the --help or -h option to display the following message:
Note Either --dq-target-version or --unset should be set.
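A mistyped version string is an easy mistake when you copy the App Version by hand. The following sketch checks the value against the shape of the examples on this page (such as 2023.08-ABDGCSHILM-3280) before you pass it to the script. The pattern is an assumption inferred from those examples, not a documented format, and `looks_like_dq_version` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# looks_like_dq_version VERSION -> exit status 0 when VERSION has the shape
# YYYY.MM-LETTERS-NUMBER, as in the examples on this page.
# Assumption: the pattern is inferred from examples and may not cover every
# valid Collibra DQ version string.
looks_like_dq_version() {
  printf '%s\n' "$1" | grep -Eq '^20[0-9]{2}\.[0-9]{2}-[A-Z]+-[0-9]+$'
}

# Example:
#   version='2023.08-ABDGCSHILM-3280'
#   looks_like_dq_version "$version" && ./override-dq.sh --dq-target-version "$version"
```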