Parallel JDBC

Parallel JDBC is a Spark activity in which Spark uses a specified column in your dataset to evenly partition data workloads across the worker nodes in a Spark cluster. With Parallel JDBC enabled, the default OWLAUTOJDBC setting instructs Spark to create two partitions. However, you can override the default partition column and set between 2 and 20 partitions on a specific column. When you run a job with Parallel JDBC, the tasks in your workload execute concurrently across the partitions.
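
For context, the Spark mechanism underneath looks roughly like the following. This is a minimal PySpark sketch of a partitioned JDBC read, not Collibra DQ's actual implementation; the connection URL, table, and column names are hypothetical.

    # Illustrative PySpark partitioned JDBC read (URL, table, and column
    # names are hypothetical; this is not Collibra DQ's internal code).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-jdbc-demo").getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sales")  # hypothetical source
        .option("dbtable", "public.orders")                     # hypothetical table
        .option("partitionColumn", "order_id")                  # column with well-distributed values
        .option("lowerBound", "1")                              # smallest expected order_id
        .option("upperBound", "10000000")                       # largest expected order_id
        .option("numPartitions", "10")                          # 10 concurrent read tasks
        .load()
    )

Each partition issues its own range query against the source, so the ten reads run in parallel across the cluster's executors.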

If the size of your dataset (total rows multiplied by total columns) exceeds the maxcellcountforparalleljdbc limit, we recommend enabling Parallel JDBC to evenly split the workload for optimal performance. Consider the volume of data when you set the number of partitions; larger datasets may benefit from more partitions to parallelize the data reading process efficiently.
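
As a quick sanity check, you can compare your dataset's cell count against the limit before enabling the setting. A minimal sketch with hypothetical values:

    # Hypothetical values; substitute your table's dimensions and your
    # organization's maxcellcountforparalleljdbc Admin Limit.
    row_count = 10_000_000
    column_count = 25
    max_cell_count_for_parallel_jdbc = 100_000_000

    cell_count = row_count * column_count  # 250,000,000 cells
    if cell_count > max_cell_count_for_parallel_jdbc:
        print("Enable Parallel JDBC for this job")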

The following data sources support Parallel JDBC for Pullup jobs:

  • Amazon Athena
  • Amazon Redshift
  • Db2
  • Dremio
  • Hive
  • Impala
  • MySQL
  • Oracle
  • PostgreSQL
  • Presto
  • Qubole
  • Snowflake
  • SQL Server
  • Teradata

Prerequisites

Before you use Parallel JDBC, an Admin (ROLE_ADMIN) user must set the maxcellcountforparalleljdbc limit on the Admin Limits page of the Admin Console. We recommend setting this to a value likely to exceed your organization's maximum dataset size.

Using Parallel JDBC

Steps

  1. From Explorer, connect to a data source, then select a scanning method.

     Tip Once you select a scanning method, you can follow steps 2 through 4 at any time during the Explorer process before you run your job.

  2. Click Settings below the Run button to open the Settings modal.

     The Settings modal appears.

  3. Click the switch next to Parallel JDBC to enable Parallel JDBC.

     Override Partition Column and No. of Partitions appear.

  4. Optionally select the Override Partition Column option, then adjust the following settings.

     Override Partition Column
       Splits the column you select into even segments for Parallel JDBC Spark load execution. Select the checkbox option, then select a column from the dropdown menu.

       Note When Override Partition Column is selected, the default option is OWLAUTOJDBC.

       When identifying a column to set as the partition column, choose one with a good distribution of values to avoid data skew. After you choose a partition column, also set the lowerbound and upperbound command line options according to the range of values within the column (see the PySpark sketch after these steps).

     No. of Partitions
       A partition is an even split of the total number of rows in your record. For large DQ jobs, increasing the number of partitions can improve performance and processing efficiency. Additionally, if your data is unevenly distributed, you may need to increase the number of partitions to avoid data skew. Drag the slider or enter a value between 2 and 20 in the input field. The default value is 10.

       Example If the row count of your table is 10 million, set the number of partitions to 10 to divide the record evenly into 10 partitioned blocks of 1 million rows each. The job then executes the 10 blocks concurrently in parallel.

  5. Click Save.

Note While your settings update when you click Save, the command line might not automatically reflect your settings until you navigate to another step in the Explorer workflow and then return to the Review step.
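
Before setting the lowerbound and upperbound options (described in the next section), it helps to look up the partition column's actual range. A minimal PySpark sketch, reusing the hypothetical connection details from the overview:

    # Look up the range of the hypothetical order_id partition column
    # before configuring the partitioned read.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-bounds").getOrCreate()

    bounds = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sales")
        .option("query", "SELECT MIN(order_id) AS lo, MAX(order_id) AS hi FROM public.orders")
        .load()
        .first()
    )

    lower, upper = bounds["lo"], bounds["hi"]
    num_partitions = 10

    # Spark splits the key range [lower, upper] into equal strides, one per
    # partition; a skewed column makes some strides far heavier than others.
    stride = (upper - lower) / num_partitions
    print(lower, upper, stride)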

Command line options

-numpartitions
  Overrides the number of partitions in the Spark loading configuration and sets the number of partitions for Parallel JDBC. The No. of Partitions setting in the Collibra DQ application controls this flag, which can only be used when -columnname is set to a specific column, not to the default OWLAUTOJDBC.

  If you don't explicitly set this option, the Collibra DQ core internal logic automatically calculates the number of partitions.

-columnname
  Specifies which column to partition for Parallel JDBC loading. Set this to a column with a good distribution of values to avoid data skew.

-upperbound
  Overrides the upper bound of the Spark loading configuration. This value should be greater than the largest value in the partition column (-columnname).

-lowerbound
  Overrides the lower bound of the Spark loading configuration. This value should be less than the smallest value in the partition column (-columnname).
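
Put together, a job that partitions on a hypothetical order_id column whose values range from 1 to 10,000,000 across 10 partitions would include flags like the following (the column name and bound values are illustrative):

    -columnname order_id -numpartitions 10 -lowerbound 1 -upperbound 10000000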

Known limitations

  • Parallel JDBC is not available for jobs created in Pushdown mode.
  • Parallel JDBC does not work when there are no rows in your dataset.
  • Parallel JDBC does not support dataset aliases in SQL queries.