CDQ + Databricks
Introduction
This document provides how-to guidance for uploading and adding CDQ jars to a Databricks cluster and for running a CDQ job by invoking CDQ APIs (also called activities).
Design
The GIF above shows the workflow of running CDQ jobs from Scala and PySpark notebooks.
CDQ Environment Setup
In this section, we explain the steps involved in setting up your CDQ environment in Databricks. This is the first step towards invoking CDQ APIs.
Step 1: Upload CDQ Core jar to Databricks
Extract the core jar from the owl package archive.
The first step is to get the CDQ jar files. Once you have the CDQ package file, you can extract the jars by running the following command:
tar -xvf package.tar.gz
Example: tar -xvf owl-2022.04-RC1-default-package-base.tar.gz
Running this command instructs tar to extract the files from the archive. From the list of extracted files, you need to upload owl-core-xxxx-jar-with-dependencies.jar to your Databricks file system, which is explained in the next section.
The image above shows how to extract the owl jar files from the owl package archive.
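The extraction step can be sketched as follows. This example builds a mock package first so it is self-contained; with the real package, skip the first three lines and use your actual file name (the version in the names below is only an example).

```shell
# Build a mock package so the example runs anywhere (skip with a real package)
mkdir -p owl-package
touch owl-package/owl-core-2022.04-jar-with-dependencies.jar
tar -czf owl-2022.04-RC1-default-package-base.tar.gz owl-package

# Extract the CDQ package
tar -xvf owl-2022.04-RC1-default-package-base.tar.gz

# Locate the core jar that needs to be uploaded to DBFS
find . -name 'owl-core-*-jar-with-dependencies.jar'
```

The find command prints the path of the core jar regardless of where the archive unpacks it, which is handy when the package layout changes between releases.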
Step 2: Upload the file to Databricks file system using UI
The jar must be uploaded to the Databricks file system (DBFS) manually. Below is a quick summary of the steps; you can find more details on uploading files in the Databricks documentation:
https://docs.databricks.com/data/databricks-file-system.html
https://docs.databricks.com/data/databricks-file-system.html#access-dbfs
- Login to your Databricks account.
- Click Data in the sidebar.
- Click the DBFS button at the top of the page.
- Upload the owl-core-xxxx-jar-with-dependencies.jar to your desired path.
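If you prefer the command line over the UI, the upload can also be done with the Databricks CLI. This is a sketch: it assumes the legacy CLI (pip install databricks-cli) is installed and configured with your workspace host and token, and the jar name and DBFS path are examples.

```shell
# Example jar name and target path -- substitute your own values
JAR="owl-core-2022.04-jar-with-dependencies.jar"
if command -v databricks >/dev/null 2>&1; then
  # Copy the local jar into DBFS
  databricks fs cp "$JAR" "dbfs:/FileStore/jars/$JAR"
else
  echo "Databricks CLI not found; upload $JAR through the DBFS UI instead."
fi
```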
Step 3: Install CDQ library in your Databricks cluster
Install the owl-core jar you uploaded as a cluster library: on the cluster's Libraries tab, choose Install new and point to the jar's DBFS path. Once this step is completed, you can create a workspace and start using CDQ APIs.
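The library installation can also be scripted. The sketch below assumes the legacy Databricks CLI is installed and configured; the cluster id and DBFS path are placeholders, so substitute your own values.

```shell
# Placeholder cluster id and jar path -- replace with your own
CLUSTER_ID="0000-000000-example0"
JAR_PATH="dbfs:/FileStore/jars/owl-core-2022.04-jar-with-dependencies.jar"
if command -v databricks >/dev/null 2>&1; then
  # Attach the jar to the cluster as a library
  databricks libraries install --cluster-id "$CLUSTER_ID" --jar "$JAR_PATH"
else
  echo "Databricks CLI not found; install the jar from the cluster's Libraries tab."
fi
```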
Step 4 (Optional): Update datasource pool size
This step is only necessary if you get a PoolExhaustedException when you call CDQ APIs.
To resolve the issue, increase the connection pool size in the Spark environment:
SPRING_DATASOURCE_POOL_MAX_WAIT=500
SPRING_DATASOURCE_POOL_MAX_SIZE=30
SPRING_DATASOURCE_POOL_INITIAL_SIZE=5
Here is the Databricks documentation on how to set environment variables:
https://docs.databricks.com/clusters/configure.html#environment-variables