CDQ + Databricks

Introduction

This document provides how-to guidance on uploading CDQ jars to a Databricks cluster and running a CDQ job by invoking CDQ APIs (also known as activities).

Design

[Animation: the workflow of running CDQ jobs from Scala and PySpark notebooks]

CDQ Environment Setup

In this section, we explain the steps involved in setting up your CDQ environment in Databricks. This is the first step towards invoking CDQ APIs in Databricks.

Step 1: Upload CDQ Core jar to Databricks

Extract the core jar from the owl package archive.

The first step is to get the CDQ jar file. Once you have the CDQ package file, you can extract the jars by running the following command:

tar -xvf package.tar.gz

Example: tar -xvf owl-2022.04-RC1-default-package-base.tar.gz

Running this command instructs tar to extract the files from the archive. From the list of extracted files, you need to upload owl-core-xxxx-jar-with-dependencies.jar to the Databricks file system, which is explained in the next section.
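The extraction step can be sketched as follows (the package name is the example used in this guide; substitute your own version, and run the commands in the directory that contains the package):

```shell
# Extract the owl package and locate the core jar to upload.
# The package name below is an example; replace it with your version.
PKG=owl-2022.04-RC1-default-package-base.tar.gz
if [ -f "$PKG" ]; then
  tar -xvf "$PKG"
  # This is the file you will upload to DBFS in the next step:
  find . -name "owl-core-*-jar-with-dependencies.jar"
else
  echo "package $PKG not found in the current directory"
fi
```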

[Image: extracting the owl jar files from the owl package archive]

Step 2: Upload the file to the Databricks file system using the UI

The jar must be uploaded manually to the Databricks file system (DBFS). Below is a quick summary of the steps; you can find more details on the Databricks file system pages:

https://docs.databricks.com/data/databricks-file-system.html

https://docs.databricks.com/data/databricks-file-system.html#access-dbfs

  1. Login to your Databricks account.
  2. Click Data in the sidebar.
  3. Click the DBFS button at the top of the page.
  4. Upload the owl-core-xxxx-jar-with-dependencies.jar to your desired path.
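If you prefer the command line over the UI, the same upload can be done with the Databricks CLI. This is a sketch: it assumes the CLI is installed and already configured with your workspace credentials, and the jar name and target DBFS path are examples.

```shell
# Copy the core jar to DBFS with the Databricks CLI.
# Assumes `databricks configure` has already been run; the jar name and
# destination path below are examples, not fixed values.
JAR=owl-core-2022.04-jar-with-dependencies.jar
if command -v databricks >/dev/null 2>&1; then
  databricks fs cp "$JAR" "dbfs:/FileStore/jars/$JAR"
else
  echo "databricks CLI not installed; upload via the UI instead"
fi
```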


Step 3: Install CDQ library in your Databricks cluster

Install the uploaded jar on your cluster as a library:

  1. Open your cluster's page and select the Libraries tab.
  2. Click Install new.
  3. Choose Jar as the library type and DBFS as the source.
  4. Enter the DBFS path of owl-core-xxxx-jar-with-dependencies.jar and click Install.

Once this step is completed, you can create a workspace and start using CDQ APIs.

Step 4 (Optional): Update datasource pool size

This step is only necessary if you get a PoolExhaustedException when you call CDQ APIs.

To resolve the issue, update the connection pool size in the Spark environment:

SPRING_DATASOURCE_POOL_MAX_WAIT=500

SPRING_DATASOURCE_POOL_MAX_SIZE=30

SPRING_DATASOURCE_POOL_INITIAL_SIZE=5
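If your clusters are created through the Databricks Clusters API rather than the UI, the same variables can be supplied under spark_env_vars in the cluster specification. This is a sketch: the cluster fields shown are illustrative and abbreviated, and the pool values are the ones listed above.

```json
{
  "cluster_name": "cdq-cluster",
  "spark_env_vars": {
    "SPRING_DATASOURCE_POOL_MAX_WAIT": "500",
    "SPRING_DATASOURCE_POOL_MAX_SIZE": "30",
    "SPRING_DATASOURCE_POOL_INITIAL_SIZE": "5"
  }
}
```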

Here is the Databricks documentation on how to set environment variables:

https://docs.databricks.com/clusters/configure.html#environment-variables