Parallel JDBC
Parallel JDBC is a Spark activity in which Spark uses a specified column in your dataset to evenly partition data workloads across the worker nodes in a Spark cluster. With Parallel JDBC enabled, the default OWLAUTOJDBC setting instructs Spark to create two partitions. You can optionally override the default partition column and set between 2 and 20 partitions on a specific column. When you run a job with Parallel JDBC, the tasks in your workload execute concurrently and in parallel across the partitions.
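For context, the sketch below shows what a partitioned JDBC read looks like in plain Spark (Scala). It is not Collibra DQ's internal code; the connection URL, table, and column names are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-jdbc-sketch").getOrCreate()

// Spark splits the read into numPartitions ranges of the partition column;
// each range is fetched by its own task, concurrently across the cluster.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://example-host:5432/sales") // hypothetical connection
  .option("dbtable", "public.orders")                         // hypothetical table
  .option("partitionColumn", "order_id")                      // evenly distributed numeric column
  .option("lowerBound", "1")                                  // below the smallest order_id
  .option("upperBound", "10000000")                           // above the largest order_id
  .option("numPartitions", "10")                              // Collibra DQ allows 2 to 20
  .load()

println(s"Partitions read in parallel: ${df.rdd.getNumPartitions}")
```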
If the size of your dataset (total rows multiplied by total columns) exceeds maxcellcountforparalleljdbc, we recommend enabling Parallel JDBC to evenly split the workload for optimal performance. It's important to consider the volume of data when you set the number of partitions. For instance, larger datasets may benefit from more partitions to parallelize the data reading process more efficiently.
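As a rough worked example of this sizing rule, the check amounts to comparing total rows multiplied by total columns against the configured limit. The figures below are hypothetical.

```scala
// Hypothetical figures: a 10,000,000-row, 60-column dataset against an example limit.
val maxCellCountForParallelJdbc = 500000000L // example Admin Limits value
val rowCount = 10000000L
val columnCount = 60L

val cellCount = rowCount * columnCount       // 600,000,000 cells
val recommendParallelJdbc = cellCount > maxCellCountForParallelJdbc
println(s"cell count = $cellCount, enable Parallel JDBC: $recommendParallelJdbc") // true
```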
The following data sources support Parallel JDBC for Pullup jobs.
- Amazon Athena
- Amazon Redshift
- Db2
- Dremio
- Hive
- Impala
- MySQL
- Oracle
- PostgreSQL
- Presto
- Qubole
- Snowflake
- SQL Server
- Teradata
Prerequisites
Before you use Parallel JDBC, an Admin (ROLE_ADMIN) user needs to set maxcellcountforparalleljdbc in the Admin Console under Admin Limits. We recommend setting this to a value likely to exceed your organization's maximum dataset size.
Using Parallel JDBC
Steps
- From Explorer, connect to a data source, then select a scanning method.
- Click Settings below the Run button. The Settings modal appears.
- Click the switch next to Parallel JDBC to enable it. The Override Partition Column and No. of Partitions settings appear.
- Optionally select the Override Partition Column option, then adjust the following settings.
- Click Save.
Tip Once you select a scanning method, you can follow steps 3 through 5 at any time during the Explorer process before you run your job.
Setting | Description |
---|---|
Override Partition Column | Splits the column you select into even segments for Parallel JDBC Spark load execution. Select the checkbox option, then select a column from the dropdown menu. Note The default option when Override Partition Column is selected is OWLAUTOJDBC. When identifying a column to set as the partition column, it's important to choose one with a good distribution of values to avoid data skew. After you choose a partition column, you should also set the No. of Partitions. |
No. of Partitions | A partition is an even split of the total number of rows in your record. For large DQ jobs, increasing the number of partitions can improve performance and processing efficiency. Additionally, if your data is unevenly distributed, you may need to increase the number of partitions to avoid data skew. Drag the slider or enter a value between 2 and 20 in the input field. The default value is 10. Example If the row count of your table is 10 million, set the number of partitions to 10 to divide the record evenly into 10 partitioned blocks of 1 million rows each. The job then executes the 10 blocks concurrently in parallel. |
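To make the 10-million-row example concrete, the sketch below divides the partition column's range into equal strides, one per partition. This is an illustration of the idea of an even split, not the exact boundary logic Spark or Collibra DQ uses.

```scala
// Illustrative only: split ids 1 to 10,000,000 into 10 equal ranges.
val lowerBound = 1L
val upperBound = 10000000L
val numPartitions = 10

val stride = (upperBound - lowerBound + 1) / numPartitions // 1,000,000 ids per partition
(0 until numPartitions).foreach { i =>
  val start = lowerBound + i * stride
  val end = if (i == numPartitions - 1) upperBound else start + stride - 1
  println(s"partition $i: id between $start and $end")
}
// If most ids cluster in a few ranges (data skew), some partitions end up much
// larger than others, which is why an evenly distributed column is recommended.
```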
Note While your settings update when you click Save, the command line might not automatically reflect your settings until you navigate to another step in the Explorer workflow and then return to the Review step.
Command line options
Option | Description |
---|---|
-numpartitions | Overrides the number of partitions in the Spark loading configuration and sets the number of partitions for Parallel JDBC. The No. of Partitions setting in the Collibra DQ application controls this flag, which can only be used when Parallel JDBC is enabled. If you don't explicitly set this option, the Collibra DQ core internal logic automatically calculates the number of partitions. |
-columnname | Specifies which column to partition on for Parallel JDBC loading. Set this to a column with a good distribution of values to avoid data skew. |
-upperbound | Overrides the upper bound of the Spark loading configuration. This value should be greater than the largest value in the partition column (columnname). |
-lowerbound | Overrides the lower bound of the Spark loading configuration. This value should be less than the smallest value in the partition column (columnname). |
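Before setting -lowerbound and -upperbound, one way to find suitable values is to query the minimum and maximum of the partition column. The sketch below does this with a plain Spark JDBC read; the connection URL, table, and column are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bounds-lookup").getOrCreate()

// Hypothetical source: look up the smallest and largest values of the partition column.
val row = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://example-host:5432/sales")
  .option("query", "select min(order_id) as lo, max(order_id) as hi from public.orders")
  .load()
  .first()

// Choose -lowerbound below the observed minimum and -upperbound above the observed maximum.
println(s"observed min = ${row.get(0)}, max = ${row.get(1)}")
```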
Known limitations
- Parallel JDBC is not available for jobs created in Pushdown mode.
- Parallel JDBC does not work when there are no rows in your dataset.
- Parallel JDBC does not support dataset aliases in SQL queries.