About Parallel JDBC

Parallel JDBC is a Spark feature that accelerates data processing for Pullup jobs by distributing workloads across multiple worker nodes in a Spark cluster.

You can use Parallel JDBC to reduce the processing time of large jobs by splitting the work across multiple JDBC connections that run concurrently.

How Parallel JDBC works

When you enable Parallel JDBC, Spark automatically creates partitions based on a column in your dataset. By default, the AUTO_COLUMN setting creates 2 partitions, but you can configure between 2 and 20 partitions depending on your data volume and processing needs.

Each partition runs independently across different worker nodes in your Spark cluster. This parallel execution means that instead of processing your entire job sequentially through a single connection, multiple threads work simultaneously on different portions of your data.

Under the hood, Spark issues one JDBC connection per partition, and those connections execute concurrently. Because the partitions are processed at the same time rather than one after another, overall processing time drops significantly.
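As a rough illustration of this splitting step, the sketch below divides a numeric partition column's range into per-partition WHERE clauses in plain Python. This is a simplified model of the idea, not Spark's actual implementation, and the column name and bounds are hypothetical:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split the range [lower, upper) into per-partition WHERE clauses.

    Simplified model of range-based JDBC partitioning: each predicate
    becomes one concurrent query, e.g. SELECT ... WHERE order_id < 250000.
    """
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition is open on the low end.
            predicates.append(f"{column} < {lo + stride}")
        elif i == num_partitions - 1:
            # Last partition is open on the high end.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return predicates

# Hypothetical example: 1,000,000 ids split across 4 partitions.
preds = partition_predicates("order_id", 0, 1_000_000, 4)
```

Each predicate is attached to its own JDBC query, so the four queries can run simultaneously on different worker nodes.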

When to use Parallel JDBC

Consider the total data volume when deciding whether to use Parallel JDBC. Calculate your data volume by multiplying total rows by total columns. Larger jobs typically benefit more from parallel processing.
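To make the guideline concrete, here is the volume calculation from the paragraph above, with hypothetical row and column counts:

```python
# Estimate data volume as total rows x total columns (hypothetical job).
total_rows = 5_000_000
total_columns = 40
data_volume = total_rows * total_columns  # 200,000,000 cell values
```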

Parallel JDBC is most effective when you have:

- A large total data volume (total rows multiplied by total columns)
- A partition column of a supported data type whose values distribute evenly
- A database with enough spare capacity to serve several concurrent JDBC connections

As a general guideline, keep the number of partitions at or below the database's maximum connection count divided by the number of other consumers that share those connections.
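A sketch of that guideline with hypothetical connection numbers:

```python
# Hypothetical database limits.
max_connections = 100   # database's maximum concurrent connections
other_consumers = 10    # other applications sharing those connections

# Guideline: partitions <= max connections / other consumers.
max_partitions = max_connections // other_consumers  # 10

# Also stay within Parallel JDBC's supported 2-20 partition range.
num_partitions = max(2, min(20, max_partitions))
```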

Conversely, it's best to avoid using Parallel JDBC when your:

- Data volume is small enough that a single JDBC connection completes quickly
- Database is already close to its maximum number of concurrent connections

Partition column selection

By default, Parallel JDBC uses AUTO_COLUMN to automatically select an appropriate partition column. However, you can manually specify a column for partitioning to optimize performance for your specific use case.

When manually selecting a partition column, choose columns with these supported data types:

- Numeric types, such as integer, long, and decimal
- Date
- Timestamp

These data types work well for partitioning because they typically distribute data evenly across partitions and allow Spark to create logical boundaries for parallel processing.
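When reading through Spark's standard JDBC data source, a manually chosen partition column of a supported type is supplied together with range bounds and a partition count. The option names below (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) are Spark's JDBC reader options; the connection URL, table name, and values are hypothetical:

```python
# Hypothetical Spark JDBC read options; a DATE column drives the partitioning.
jdbc_options = {
    "url": "jdbc:postgresql://db-host:5432/sales",  # hypothetical URL
    "dbtable": "transactions",                      # hypothetical table
    "partitionColumn": "transaction_date",          # supported type: DATE
    "lowerBound": "2023-01-01",
    "upperBound": "2023-12-31",
    "numPartitions": "4",
}

# In PySpark this would be passed as, for example:
# spark.read.format("jdbc").options(**jdbc_options).load()
```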

Avoid using unsupported data types, such as:

- String
- Boolean

Example: Processing customer transaction data

A data analyst needs to process 5 million customer transaction records from the past year. Without Parallel JDBC, this job runs through a single JDBC connection, taking 45 minutes to complete.

By enabling Parallel JDBC with 4 partitions based on the transaction date column, Spark distributes the workload across 4 worker nodes. Each node processes approximately 1.25 million records simultaneously. The same job now completes in 12 minutes, reducing processing time by over 70%.
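The four-way split on the transaction date column and the reported time savings can be sketched as follows; the exact dates are hypothetical, while the job sizes and timings come from the example above:

```python
from datetime import date, timedelta

# Split one year of transaction dates into 4 contiguous ranges; each range
# becomes one concurrent JDBC query handled by a different partition.
start, end = date(2023, 1, 1), date(2024, 1, 1)
num_partitions = 4
stride_days = (end - start).days // num_partitions  # 91

ranges = []
for i in range(num_partitions):
    lo = start + timedelta(days=i * stride_days)
    hi = end if i == num_partitions - 1 else start + timedelta(days=(i + 1) * stride_days)
    ranges.append((lo, hi))

# Reported speedup: 45 minutes sequentially vs 12 minutes in parallel.
reduction = 1 - 12 / 45  # roughly 0.73, i.e. just over 70%
```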