System Requirements
Before you install Collibra Data Quality & Observability, review all of the following requirements to ensure a smooth, successful installation. This section covers only the requirements of Collibra DQ itself. It does not cover the connections to the data sources from which you ingest data.
Supported Web Browsers
Browser | Version |
---|---|
Google Chrome (recommended) | 70.0.3538.102 or newer |
Mozilla Firefox | 52.8.0 or newer |
Safari | 12.0.1 or newer |
Encryption
Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files
PostgreSQL version
- Collibra Data Quality & Observability comes prepackaged with version 11.4 of PostgreSQL.
- Versions 9.6.5 and later are supported if you want to use an external metastore.
Collibra recommends installing the DQ Metastore in an external PostgreSQL metastore.
Installation packages/files (BOM)
- demoscripts.tar.gz
- log4j*
- owlcheck
- owl-core-2.1.0-jar-with-dependencies.jar
- owl-webapp-2.1.0.jar
- owl-agent-2.1.0.jar
- setup.sh
- owl-postgres.tar.gz
- notebooks.tar.gz
- owlmanage.sh
Installation-specific requirements
- Standalone
- Cloud Native
User permissions
Log in with a user account that has privileges to:
- Create directories
- Launch scripts
- Start Java, Spark, and Collibra DQ processes
SUDO is required if you are including the PostgreSQL metastore in the installation. SUDO is not required if you are using an external PostgreSQL metastore (recommended).
In addition, configure ULIMIT settings to 4096 or higher. DQ services typically consume approximately 428 threads, and each DQ job consumes an additional 400 threads. Setting ULIMIT to 4096 allows for approximately nine concurrent DQ jobs on a Standalone install.
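The arithmetic behind the nine-job figure can be sketched as follows. This is a minimal illustration that uses only the approximate thread counts quoted above; the function name `max_concurrent_jobs` is a hypothetical helper, not part of Collibra DQ.

```python
# Approximate thread figures quoted in the text above.
BASE_SERVICE_THREADS = 428  # threads consumed by the DQ services
THREADS_PER_JOB = 400       # additional threads per running DQ job


def max_concurrent_jobs(ulimit: int) -> int:
    """Estimate how many DQ jobs fit under a given limit (hypothetical helper)."""
    available = ulimit - BASE_SERVICE_THREADS
    # Floor division: a job only counts if all of its threads fit.
    return max(available // THREADS_PER_JOB, 0)


print(max_concurrent_jobs(4096))  # (4096 - 428) // 400 = 9
```

Because both thread counts are approximations, treat the result as a planning estimate rather than a hard guarantee.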
System requirements
Supported operating systems
- Red Hat Enterprise Linux 8.x
- Red Hat Enterprise Linux 9.x
Hardware requirements
Small Tier - 16 Core, 128G RAM (r5.4xlarge / E16s v3)
Component | RAM | Cores |
---|---|---|
Web | 2g | 2 |
Postgres | 2g | 2 |
Spark | 100g | 10 |
Overhead | 10g | 2 |
Medium Tier - 32 Core, 256G RAM (r5.8xlarge / E32s v3)
Component | RAM | Cores |
---|---|---|
Web | 2g | 2 |
Postgres | 2g | 2 |
Spark | 250g | 26 |
Overhead | 10g | 2 |
Large Tier - 64 Core, 512G RAM (r5.16xlarge / E64s v3)
Component | RAM | Cores |
---|---|---|
Web | 4g | 3 |
Postgres | 4g | 3 |
Spark | 486g | 54 |
Overhead | 18g | 4 |
Important Collibra DQ has a limit of 2 TB for Large Tier jobs. For DQ jobs that exceed 2 TB, you must filter the job down to fewer columns or rows.
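One way to use the tier tables is to compare an estimated Spark memory need against the Spark RAM allotment of each tier. The sketch below assumes that this comparison is a reasonable first pass; the helper `smallest_fitting_tier` and the selection rule are illustrative assumptions, not a Collibra sizing tool.

```python
from typing import Optional

# Spark RAM allotments in GB, copied from the tier tables above.
TIER_SPARK_RAM_GB = {"Small": 100, "Medium": 250, "Large": 486}


def smallest_fitting_tier(spark_gb_needed: float) -> Optional[str]:
    """Return the smallest tier whose Spark RAM covers the need, else None."""
    # Walk the tiers from the smallest Spark allotment to the largest.
    for name, spark_ram in sorted(TIER_SPARK_RAM_GB.items(), key=lambda kv: kv[1]):
        if spark_gb_needed <= spark_ram:
            return name
    return None  # exceeds the Large Tier allotment


print(smallest_fitting_tier(120))  # Medium
print(smallest_fitting_tier(960))  # None
```

A `None` result means the need exceeds the Large Tier allotment, in which case reducing the job's scope or moving to a cluster is worth considering.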
Estimates
Sizing should allow headroom and be based on peak concurrency and peak volume requirements. If concurrency is not a requirement, you only need to size for peak volume (your largest tables). To scan efficiently, scope the job by selecting only critical columns. See Scaling your DQ Job for more information.
Bytes per Cell | Rows | Columns | Gigabytes | Gigabytes for Spark (3x) |
---|---|---|---|---|
16 | 1,000,000 | 25 | 0.4 | 1.2 |
16 | 10,000,000 | 25 | 4 | 12 |
16 | 100,000,000 | 25 | 40 | 120 |
16 | 1,000,000 | 50 | 0.8 | 2.4 |
16 | 10,000,000 | 50 | 8 | 24 |
16 | 100,000,000 | 50 | 80 | 240 |
16 | 1,000,000 | 100 | 1.6 | 4.8 |
16 | 10,000,000 | 100 | 16 | 48 |
16 | 100,000,000 | 100 | 160 | 480 |
16 | 1,000,000,000 | 100 | 1600 | 4800 |
16 | 1,000,000 | 200 | 3.2 | 9.6 |
16 | 10,000,000 | 200 | 32 | 96 |
16 | 100,000,000 | 200 | 320 | 960 |
16 | 1,000,000,000 | 200 | 3200 | 9600 |
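The table values follow from multiplying bytes per cell by rows and columns, then applying the 3x factor for Spark. The sketch below reproduces that arithmetic so you can plug in your own table dimensions. The function name is hypothetical, and it assumes decimal gigabytes (1 GB = 10^9 bytes), which matches the figures in the table.

```python
from typing import Tuple


def estimate_gigabytes(rows: int, columns: int, bytes_per_cell: int = 16,
                       spark_factor: int = 3) -> Tuple[float, float]:
    """Return (raw GB, GB for Spark), using 1 GB = 1e9 bytes."""
    raw_gb = rows * columns * bytes_per_cell / 1e9
    return raw_gb, raw_gb * spark_factor


raw, spark = estimate_gigabytes(rows=10_000_000, columns=50)
print(f"{raw:g} GB raw, {spark:g} GB for Spark")  # 8 GB raw, 24 GB for Spark
```

The 16 bytes per cell is the average used in the table; wide text columns may need a larger value.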
Cluster
If your program requires more processing power or more Spark workers than the example tiers above provide, which is fairly common in Fortune 500 companies, consider the horizontal and ephemeral scale of a cluster. Common examples include Amazon EMR and Cloudera CDP. Collibra DQ is built to scale horizontally and can scale to hundreds of nodes.
Network requirements
Default Ports used by Collibra DQ
- 5432 – PostgreSQL
- 9000 – DQ Web
- 9101 – Exposes the Health Check API to check that the DQ Agent is running and stable.
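A quick way to confirm that the default ports are reachable from another machine is a plain TCP connection test. The following is a minimal sketch using only the Python standard library. It checks that something is listening on each port and does not call the Health Check API itself; the helper names and the `localhost` target are assumptions for illustration.

```python
import socket

# Default ports listed above; adjust them if your installation overrides the defaults.
DEFAULT_PORTS = {"PostgreSQL": 5432, "DQ Web": 9000, "DQ Agent health check": 9101}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_default_ports(host: str) -> dict:
    """Map each service name to whether its default port accepts connections."""
    return {name: port_is_open(host, port) for name, port in DEFAULT_PORTS.items()}


if __name__ == "__main__":
    for name, is_open in check_default_ports("localhost").items():
        print(f"{name}: {'reachable' if is_open else 'not reachable'}")
```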
Other
- If your current Spark version is 3.2.2 or older, Collibra strongly recommends upgrading to Spark 3.4.1 to address various critical vulnerabilities present in the Spark core library, including Log4J. To determine which Spark version you are using, sign in to your Collibra DQ instance and click the icon in the upper-right corner of any page. The Spark Version value shows your current Spark version.
- If you are not already using Spark 3.4.1, follow the steps outlined in Upgrading Spark versions.
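The "3.2.2 or older" rule above is a numeric version comparison, which is easy to get wrong with plain string comparison. The sketch below shows one way to express it, assuming the version is a simple dotted numeric string such as `3.2.2`; both function names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Turn '3.2.2' into (3, 2, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def should_upgrade_spark(current: str, threshold: str = "3.2.2") -> bool:
    """Return True when the current version is at or below the threshold."""
    return parse_version(current) <= parse_version(threshold)


print(should_upgrade_spark("3.0.1"))  # True
print(should_upgrade_spark("3.4.1"))  # False
```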
User permissions
Prerequisites
- Kubernetes cluster -- EKS, GKE, AKS, OpenShift, Rancher
- Helm(v3)
- kubectl
- Cloud command line SDK, such as gcloud CLI, AWS CLI or similar
- External PostgreSQL DB version 11.9 and above, storage size 100 GB, 4 to 8 cores, 4 to 8 GB memory
- Private container registry -- to store images
- LoadBalancer -- IngressController -- Ingress
- Egress networking access
- Helm Chart
- Images, image access key
- Minimum pod requirement -- 2 cores, 2GB RAM
- If you bring in your own Spark executor pod launch template, ensure that the service account used to launch Spark executor pods has the permission to do so. Refer to the executor launch template for more information.
System requirements
Supported Kubernetes versions
Collibra Data Quality & Observability supports Kubernetes versions 1.29 through 1.31.
Note As of February 2025, we recommend upgrading to Kubernetes version 1.29 or newer, as version 1.28 has reached its end of life.
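To check a cluster against the supported range, you can compare the major and minor parts of its version string. The sketch below assumes a version string in the usual Kubernetes form, such as `v1.30.4`, optionally followed by a vendor suffix. The function name and the constants are illustrative and should be updated as the supported range changes.

```python
import re

# Supported range stated above: Kubernetes 1.29 through 1.31.
SUPPORTED_MIN = (1, 29)
SUPPORTED_MAX = (1, 31)


def is_supported_kubernetes(version: str) -> bool:
    """Return True if the major.minor of the version is within the range."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


print(is_supported_kubernetes("v1.30.4"))  # True
print(is_supported_kubernetes("1.28.9"))   # False
```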
Application system requirements
Component | Processor | Memory | Storage |
---|---|---|---|
Collibra DQ Web | 1 core | 2 GB | 10 MB PVC |
DQ Agent | 1 core | 1 GB | 100 MB PVC |
DQ Metastore | 1 core | 2 GB | 10 GB PVC |
Spark* | 2 cores | 2 GB | - |
Note * This is the minimum quantity of resources required to run a Spark job in Kubernetes. This amount of resources only allows you to scan a few megabytes of data, with no more than a single job running at a given time. Proper sizing of the compute space must take into account the largest dataset that may be scanned, as well as the desired concurrency.
Network service considerations
DQ Web is the only required component that needs to be directly accessed from outside of Kubernetes. History Server is the only other component that users can access directly; however, it is optional.
If the target Kubernetes platform supports a LoadBalancer service type, you can configure the Helm Chart to directly deploy the externally accessible endpoint.
Note For testing purposes, you can also configure the Helm chart to deploy a NodePort service type.
For the Ingress service type, deploy Collibra DQ without an externally accessible service and then attach the Ingress service separately. This applies when you use a third-party Ingress controller such as NGINX, Contour, etc.
Note The Helm Chart can deploy an Ingress on GKE and EKS platforms; however, there is a wide variety of possible Ingress configurations that have not been tested.