Pullup job sizing recommendations
To ensure optimal performance, efficient resource utilization, and successful job execution, you can fine-tune the job resource settings on the global job limits page. This is especially important when working with exceptionally large data sets.
The following tables provide recommendations for scaling your job in different circumstances.
High-level resource targets
Depending on your data size, your job requires a target total amount of RAM and number of CPU cores to run efficiently. The following table outlines these high-level capacity targets.
| Total rows | Total columns | Total cores | Total RAM |
|---|---|---|---|
| 100K | 50 | 2 | 3 GB |
| 1M | 50 | 3 | 6 GB |
| 10M | 50 | 10 | 52 GB |
Tip If your table is extremely large, use Parallel JDBC to improve the performance of the data loading stage when the Pullup job runs.
Calculating total resources
The global limits page does not directly accept the "total" values shown in the previous table. Instead, Spark distributes these total resources between a single driver node and multiple executor nodes. The underlying engine allocates resources according to these two formulas:
- Total cores = Driver cores + (Number of executors × Executor cores)
- Total RAM = Driver memory + (Number of executors × Executor memory)
Example To achieve a target of 6 total cores and 16 GB of total RAM, you can configure 1 driver with 2 cores and 4 GB of RAM, plus 2 executors that each have 2 cores and 6 GB of RAM: 2 + (2 × 2) = 6 cores, and 4 GB + (2 × 6 GB) = 16 GB.
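The two formulas above can be sketched as a small helper for sanity-checking a proposed configuration. This is an illustrative snippet, not part of the product; the function and parameter names are assumptions and do not map to exact UI field labels.

```python
def total_resources(driver_cores, driver_mem_gb,
                    num_executors, executor_cores, executor_mem_gb):
    """Apply the two sizing formulas:
    total cores = driver cores + (executors x executor cores)
    total RAM   = driver memory + (executors x executor memory)
    """
    total_cores = driver_cores + num_executors * executor_cores
    total_ram_gb = driver_mem_gb + num_executors * executor_mem_gb
    return total_cores, total_ram_gb

# The worked example: 1 driver (2 cores, 4 GB) + 2 executors (2 cores, 6 GB each)
cores, ram_gb = total_resources(2, 4, 2, 2, 6)
print(cores, ram_gb)  # 6 16
```

You can plug any candidate settings from the global limits page into this helper and compare the result against the high-level capacity targets.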
Recommended settings on the global limits page
To achieve the total targets described in the previous table, use the following baseline settings for the specific fields on the global limits page.
Note The following values are intended for illustrative purposes only.
| Total rows | Total columns | Maximum number of driver cores | Maximum driver memory | Maximum number of executors | Maximum executor cores | Maximum executor memory |
|---|---|---|---|---|---|---|
| 100K | 50 | 1 | 2 GB | 2 | 2 | 4 GB |
| 1M | 50 | 1 | 2 GB | 1 | 2 | 4 GB |
| 10M | 50 | 1 | 4 GB | 4 | 2 | 16 GB |
Additional settings
For the remaining fields on the global limits page, use the following guidelines unless your specific workload dictates otherwise:
- Maximum worker cores: Set this equal to or slightly higher than your executor cores.
- Maximum worker memory: Set this equal to or slightly higher than your executor memory to allow for system overhead.
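The worker-sizing guidelines above can be expressed as a simple check. This is a hypothetical sketch; the function and variable names are illustrative and do not correspond to exact UI field labels.

```python
def worker_settings_ok(worker_cores, worker_mem_gb,
                       executor_cores, executor_mem_gb):
    """A worker needs at least as many cores and as much memory as a
    single executor, ideally with some headroom for system overhead."""
    return (worker_cores >= executor_cores
            and worker_mem_gb >= executor_mem_gb)

print(worker_settings_ok(3, 5, 2, 4))  # True: slight headroom over the executor
print(worker_settings_ok(1, 4, 2, 4))  # False: too few worker cores
```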