Pullup job sizing recommendations

To ensure optimal performance, efficient resource utilization, and successful job execution, you can fine-tune job resource settings from the global job limits page. This is especially important when you work with exceptionally large data sets.

The following sections provide recommendations on how to scale your job in different circumstances.

High-level resource targets

Depending on your data size, your job requires a target total of CPU cores and RAM to run efficiently. The following table outlines these high-level capacity targets.

Total rows    Total columns    Total cores    Total RAM
100K          50               2              3 GB
1M            50               3              6 GB
10M           50               10             52 GB

Tip If your table is extremely large, use Parallel JDBC to improve the performance of the data loading stage when the Pullup job runs.
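With Parallel JDBC, Spark splits the range of a numeric partition column into a fixed number of strides and issues one query per stride, so multiple executors load data concurrently. The following sketch illustrates that splitting logic in plain Python; it is a simplified illustration of the idea, not Spark's actual implementation, and the function name is made up for this example.

```python
def jdbc_strides(lower, upper, num_partitions):
    """Split the [lower, upper) range of a numeric partition column into
    per-partition bounds, one query per partition."""
    stride = (upper - lower) // num_partitions
    bounds = []
    start = lower
    for i in range(num_partitions):
        # The last partition absorbs any remainder so the full range is covered.
        end = upper if i == num_partitions - 1 else start + stride
        bounds.append((start, end))
        start = end
    return bounds

# A 10M-row id column split across 4 parallel readers
print(jdbc_strides(0, 10_000_000, 4))
```

Each pair of bounds becomes a WHERE clause on the partition column, so the table loads in parallel instead of through a single connection.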

Calculating total resources

The global limits page does not directly accept the "total" values described in the previous table. Instead, Spark distributes these total resources between a single driver node and multiple executor nodes. The underlying engine allocates resources based on these two formulas:

Total cores = driver cores + (number of executors × executor cores)
Total RAM = driver memory + (number of executors × executor memory)

Example To reach a target of 6 total cores and 16 GB of total RAM, you can configure 1 driver with 2 cores and 4 GB of RAM, plus 2 executors that each have 2 cores and 6 GB of RAM. The cores add up as 2 + (2 × 2) = 6, and the RAM as 4 + (2 × 6) = 16 GB.
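As a quick sanity check, the two allocation formulas can be expressed in a few lines of Python. The function names are illustrative only, and the driver in the example is assumed to use 2 cores.

```python
def total_cores(driver_cores, num_executors, executor_cores):
    # Total cores = driver cores + (executors x cores per executor)
    return driver_cores + num_executors * executor_cores

def total_ram_gb(driver_mem_gb, num_executors, executor_mem_gb):
    # Total RAM = driver memory + (executors x memory per executor)
    return driver_mem_gb + num_executors * executor_mem_gb

# 1 driver (2 cores, 4 GB) plus 2 executors (2 cores, 6 GB each)
print(total_cores(2, 2, 2))   # -> 6 total cores
print(total_ram_gb(4, 2, 6))  # -> 16 GB total RAM
```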

Recommended settings on the global limits page

To achieve the total targets described in the previous table, use the following baseline settings for the specific fields on the global limits page.

Note The following values are intended for illustrative purposes only.

Total rows  Total columns  Maximum number of driver cores  Maximum driver memory  Maximum number of executors  Maximum executor cores  Maximum executor memory
100K        50             1                               2 GB                   2                            2                       4 GB
1M          50             1                               2 GB                   1                            2                       4 GB
10M         50             1                               4 GB                   4                            2                       16 GB
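If you size jobs programmatically, the baseline table can be encoded as a simple lookup that picks the smallest tier covering the table's row count. The tier encoding and the helper function below are illustrative only, not part of the product.

```python
# Each tier: (max_rows, driver_cores, driver_mem_gb,
#             executors, executor_cores, executor_mem_gb)
BASELINES = [
    (100_000,    1, 2, 2, 2, 4),
    (1_000_000,  1, 2, 1, 2, 4),
    (10_000_000, 1, 4, 4, 2, 16),
]

def baseline_for(total_rows):
    """Return the first baseline tier whose row limit covers total_rows."""
    for tier in BASELINES:
        if total_rows <= tier[0]:
            return tier
    # Larger tables fall outside the published baselines.
    raise ValueError("no published baseline for %d rows" % total_rows)

print(baseline_for(500_000))  # picks the 1M-row tier
```

Tables larger than the largest published tier should be sized case by case rather than extrapolated from these baselines.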

Additional settings

For the remaining fields on the global limits page, use the following guidelines unless your specific workload dictates otherwise:

What's next