About Parallel JDBC
Parallel JDBC is a Spark feature that accelerates data processing for Pullup jobs by distributing workloads across multiple worker nodes in a Spark cluster.
You can use Parallel JDBC to:
- Reduce processing time for large data volumes
- Optimize resource utilization across your Spark cluster
- Improve job performance through concurrent data processing
- Scale data operations based on your specific workload requirements
How Parallel JDBC works
When you enable Parallel JDBC, Spark automatically creates partitions based on a column in your dataset. By default, the AUTO_COLUMN setting creates 2 partitions, but you can configure between 2 and 20 partitions depending on your data volume and processing needs.
Each partition runs independently across different worker nodes in your Spark cluster. This parallel execution means that instead of processing your entire job sequentially through a single connection, multiple threads work simultaneously on different portions of your data.
The system splits your job into multiple JDBC connections that execute concurrently. Because the tasks for every partition run at the same time, the overall processing time is significantly reduced.
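The range splitting described above can be sketched in plain Python. This is an illustrative simplification, not Spark's exact internal partitioning code; `partition_predicates` is a hypothetical helper that shows how a numeric column range becomes one WHERE clause per JDBC connection:

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on a numeric column into non-overlapping
    WHERE clauses, one per concurrent JDBC connection.
    (Simplified sketch; Spark's stride handling differs in detail.)"""
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        if i == 0:
            # First partition is open-ended on the low side.
            predicates.append(f"{column} < {start + stride}")
        elif i == num_partitions - 1:
            # Last partition is open-ended on the high side.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {start + stride}")
    return predicates

# Four partitions over IDs 0 to 1,000,000:
for p in partition_predicates("id", 0, 1_000_000, 4):
    print(p)
```

Each predicate becomes an independent query, so every worker reads a disjoint slice of the table.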
When to use Parallel JDBC
Consider the total data volume when deciding whether to use Parallel JDBC. Calculate your data volume by multiplying total rows by total columns. Larger jobs typically benefit more from parallel processing.
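For example, the volume estimate for a hypothetical job with 5 million rows and 40 columns works out as follows (the row and column counts here are illustrative, not from any real workload):

```python
# Estimate data volume as total rows x total columns (cell count).
rows = 5_000_000
columns = 40
cells = rows * columns
print(f"{cells:,} cells")  # prints "200,000,000 cells"
```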
Parallel JDBC is most effective when you have:
- Large jobs that take significant time to process sequentially, such as reading large tables with millions of rows where a single partition constrains performance.
- Available worker nodes in your Spark cluster to distribute the load across executors.
- Data that can be logically partitioned by a specific column that is indexed, roughly uniform, and monotonically increasing, such as an ID or timestamp.
- Performance requirements that justify the additional resource overhead.
- A source database that can handle concurrent connections and has available capacity in its connection pool, CPU, and I/O.
As a general guideline, ensure the number of partitions is less than or equal to the connections the database can spare, that is, its maximum connection limit minus the connections already used by other consumers.
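That guideline can be expressed as a small helper. This is a hypothetical sketch, assuming the 2-20 partition range described earlier; `recommended_partitions` is not part of any real API:

```python
def recommended_partitions(max_connections, other_consumers, cap=20):
    """Cap the partition count by the database's spare connection
    headroom and by the feature's configurable maximum (default 20)."""
    headroom = max_connections - other_consumers
    if headroom < 2:
        return 1  # not enough spare connections; run single-partition
    return min(cap, headroom)

print(recommended_partitions(100, 90))  # 10 spare connections -> 10 partitions
```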
Conversely, it's best to avoid using Parallel JDBC when your:
- Job is small (less than a few million rows), as a single partition is sufficient and the overhead of parallelization is unnecessary.
- Job does not have a suitable partition column, which can force range splits on non-indexed columns and cause full table scans per partition.
- Partition column is highly skewed, meaning some partitions receive millions of rows while others receive none.
- Data source is already under heavy load, meaning extra connections will negatively impact performance or cause timeouts.
- Query involves reading aggregated or filtered results; instead, push the query down and read the smaller result set with one partition.
Partition column selection
By default, Parallel JDBC uses AUTO_COLUMN to automatically select an appropriate partition column. However, you can manually specify a column for partitioning to optimize performance for your specific use case.
When manually selecting a partition column, choose columns with these supported data types:
- Numeric (INT, BIGINT, DOUBLE, etc.): Best choice because uniform distribution is easy to assess and range math is straightforward.
- Date (DATE): Works well for time-series data because Spark splits by date range.
- Timestamp (TIMESTAMP): Good for event or log tables with time-based partitioning.
These data types work well for partitioning because they typically distribute data evenly across partitions and allow Spark to create logical boundaries for parallel processing.
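The date-range splitting mentioned above can be sketched as follows. This is an illustrative example of how a year of data divides into contiguous partition boundaries, not Spark's internal implementation:

```python
from datetime import date, timedelta

def date_ranges(start, end, num_partitions):
    """Split [start, end) into contiguous, roughly equal date ranges,
    one per partition. The last range absorbs any remainder days."""
    total_days = (end - start).days
    step = total_days // num_partitions
    bounds = [start + timedelta(days=i * step) for i in range(num_partitions)]
    bounds.append(end)
    return list(zip(bounds[:-1], bounds[1:]))

# One year of data split across four partitions:
for lo, hi in date_ranges(date(2024, 1, 1), date(2025, 1, 1), 4):
    print(lo, "to", hi)
```

Because dates order naturally and distribute predictably, each range maps cleanly to one partition's query.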
Avoid using unsupported data types, such as:
- String, VARCHAR, or TEXT
- Boolean
- Binary or BLOB
- Array, JSON, or Struct
- UUID
Example: Processing customer transaction data
A data analyst needs to process 5 million customer transaction records from the past year. Without Parallel JDBC, this job runs through a single JDBC connection, taking 45 minutes to complete.
By enabling Parallel JDBC with 4 partitions based on the transaction date column, Spark distributes the workload across 4 worker nodes. Each node processes approximately 1.25 million records simultaneously. The same job now completes in 12 minutes, reducing processing time by over 70%.
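The figures in this example check out with simple arithmetic (all values taken from the scenario above):

```python
# Per-partition load: 5 million records spread over 4 partitions.
total_records = 5_000_000
partitions = 4
records_per_partition = total_records / partitions
print(f"{records_per_partition:,.0f} records per partition")  # prints "1,250,000 records per partition"

# Time saved: 45 minutes sequential vs 12 minutes parallel.
sequential_minutes = 45
parallel_minutes = 12
reduction = 1 - parallel_minutes / sequential_minutes
print(f"{reduction:.0%} reduction in processing time")  # prints "73% reduction in processing time"
```

Note that the speedup (3.75x) is less than the partition count (4): coordination overhead and per-partition variance mean parallelization rarely scales perfectly linearly.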