Architecture Diagram

Collibra DQ Architecture

CDQ architectural concept diagram depicting the flow from the control plane to the compute plane to the data plane

High-Level Diagram

Collibra DQ Architecture

  1. Connect to data sources.
  2. Build the DQ scan algorithm and submit the job.
  3. Execute the Spark job.
  4. Write the DQ results in the Metastore.
  5. Browse the results of the DQ Scan with the management console.
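
The five steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not Collibra DQ code: in-memory SQLite databases stand in for both the data source and the metastore, a plain SQL null-count stands in for the Spark job, and the names `run_scan`, `browse_results`, and `dq_results` are hypothetical.

```python
import sqlite3

def run_scan(source, metastore, dataset, columns):
    """Steps 2-4: build a null-count scan, execute it, and store the results."""
    metastore.execute(
        "CREATE TABLE IF NOT EXISTS dq_results "
        "(dataset TEXT, column_name TEXT, null_count INTEGER, row_count INTEGER)"
    )
    # Step 2: build the scan definition (one null-count query per column).
    checks = [
        (col, f'SELECT SUM("{col}" IS NULL), COUNT(*) FROM "{dataset}"')
        for col in columns
    ]
    # Step 3: execute the job (plain SQL here; a Spark job in a real deployment).
    for col, query in checks:
        null_count, row_count = source.execute(query).fetchone()
        # Step 4: write the result to the metastore (SUM is NULL on empty tables).
        metastore.execute(
            "INSERT INTO dq_results VALUES (?, ?, ?, ?)",
            (dataset, col, null_count or 0, row_count),
        )
    metastore.commit()

def browse_results(metastore, dataset):
    """Step 5: read the stored results back, as a management console would."""
    return metastore.execute(
        "SELECT column_name, null_count, row_count FROM dq_results "
        "WHERE dataset = ? ORDER BY column_name",
        (dataset,),
    ).fetchall()

# Step 1: connect to a data source (in-memory SQLite as a stand-in).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None), (3, 4.0)]
)
metastore = sqlite3.connect(":memory:")

run_scan(source, metastore, "orders", ["id", "amount"])
print(browse_results(metastore, "orders"))
```

The point of the sketch is the separation of concerns: the scan reads from the source, while the results live in a separate store that the console queries independently.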

Collibra DQ Hadoop Deployment Diagram

Collibra DQ High-Level Diagram


Collibra DQ Kubernetes Deployment Diagram

Collibra DQ Kubernetes Diagram

Note: Kubernetes deployments of Collibra DQ should use Auto Scaling and Spot instances to further increase efficiency and reduce cost.

Collibra DQ Standalone

Collibra DQ Standalone Architecture

The image above depicts owl-web, owl-core, Postgres, and orient all deployed on the same server. This can be an edge node of a Hadoop cluster or any server that can run spark-submit jobs against the Hadoop cluster. The server can also have JDBC access to other database engines whose data is to be scanned by Collibra Data Quality & Observability.

Reading the diagram from left to right, the client uses a browser to connect to the Collibra DQ Web Application, which runs on the default port 9000. The Web Application communicates with the metastore. A DQ check can be run locally from the Web Application, or the DQ check script can be launched natively from the CLI. The DQ check launches a job using Collibra DQ's built-in local Spark DQ engine. Depending on the options supplied to the DQ check command, Collibra DQ can scan a file or a database with JDBC connectivity.

Collibra DQ Distributed

Collibra DQ Distributed Architecture

The image above depicts owl-web and owl-core deployed on different servers. In this example, owl-web is not deployed on the edge node. Instead, owl-core is installed on the edge node and writes DQ check results back to the metastore that the DQ Web Application points to.

Note: In this scenario, the metastore and the Web Application run on the same host.

The other change is that the DQ check distributes its work across the Hadoop cluster, using Spark to process the data in parallel.
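
At the Spark level, the difference between the standalone and distributed layouts largely comes down to which master the job is submitted to. The following is a minimal sketch, assuming the DQ check is ultimately launched through `spark-submit`. The `--master` and `--class` options are standard `spark-submit` options, but the jar path, main class, and application arguments are hypothetical placeholders, not actual Collibra DQ values.

```python
import shlex

def build_spark_submit(mode, jar, main_class, app_args):
    """Return a spark-submit argument list for a standalone or distributed run."""
    if mode == "standalone":
        # Run Spark inside a single JVM on this server, using all local cores.
        master_args = ["--master", "local[*]"]
    elif mode == "distributed":
        # Submit to YARN so that executors run across the Hadoop cluster.
        master_args = ["--master", "yarn"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["spark-submit", *master_args, "--class", main_class, jar, *app_args]

# Hypothetical jar, class, and arguments, used for illustration only.
cmd = build_spark_submit(
    "distributed",
    "/opt/dq/dq-core.jar",
    "com.example.DQCheck",
    ["--dataset", "orders"],
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed to `subprocess.run` without shell quoting issues; `shlex.join` is used only to print it.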