Jobserver memory and CPU usage
The most demanding jobs in terms of computing resources are the ingestion and profiling processes. Make sure that you meet the system requirements to perform ingestion and profiling successfully.
The Jobserver and the Spark Context run in two separate Java Virtual Machines, which means that they do not share memory.
We highly recommend installing the Jobserver on a dedicated server. However, if you install the Jobserver on the same server as other Collibra nodes, add the minimum hardware requirements of the Jobserver to those of the other Collibra nodes on that server.
Ingestion
During the ingestion of a schema, a process of the Jobserver Spark context analyzes the schema, splits it into pages, and sends it to the Jobserver page by page. Each page is stored in memory until Collibra DGC fetches it.
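To make the paging behavior concrete, the following is a minimal, hypothetical Java sketch of the pattern described above. The class, method names, and page size are illustrative assumptions, not the actual Collibra ingestion API:

    import java.util.ArrayDeque;
    import java.util.List;
    import java.util.Queue;

    // Illustrative only: mimics the described pattern of splitting a
    // schema's rows into pages and buffering each page in memory until
    // the consumer fetches it.
    public class PagedIngestion {
        private static final int PAGE_SIZE = 1000; // rows per page (assumed value)
        private final Queue<List<String>> pendingPages = new ArrayDeque<>();

        // Split the analyzed rows into pages and keep them in memory.
        public void ingest(List<String> rows) {
            for (int i = 0; i < rows.size(); i += PAGE_SIZE) {
                pendingPages.add(rows.subList(i, Math.min(i + PAGE_SIZE, rows.size())));
            }
        }

        // The consumer (Collibra DGC in the text) fetches pages one at a
        // time; memory is released as the queue drains.
        public List<String> fetchNextPage() {
            return pendingPages.poll();
        }
    }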
Profiling
The amount of data processed per table is limited to a certain threshold, which you can customize in the Data Governance Center service configuration, up to a maximum of 10 GB of disk space. When that threshold is reached, the profiling restarts on a subset of the data: it extracts a random subset whose size is approximately the threshold. The threshold therefore gives an upper estimate of the largest data set the Spark context may have to process. Because data occupies more space in memory than on disk, roughly a factor of four in this case, we consider a heap size of 40 GB.
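As a rough worked example of that estimate (the factor of four is an assumption inferred from the 10 GB threshold and the 40 GB heap, not a documented constant):

    // Back-of-the-envelope heap estimate for the profiling job.
    public class HeapEstimate {
        public static void main(String[] args) {
            double thresholdGb = 10.0;    // maximum on-disk sample size per table
            double expansionFactor = 4.0; // assumed in-memory vs. on-disk ratio
            // 10 GB on disk * ~4x expansion in memory = ~40 GB of heap
            System.out.printf("Estimated heap: %.0f GB%n", thresholdGb * expansionFactor);
        }
    }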
CPU usage
In the spark section of the jobserver.conf file, located in /opt/collibra/spark-jobserver/conf/, the local[N] parameter determines how many CPUs Jobserver Spark can use for profiling. The default setting, local[*], enables the use of all CPUs available to the machine. We recommend keeping the default setting for best performance.
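For reference, a minimal sketch of what the relevant part of jobserver.conf can look like; the exact keys and surrounding settings can vary by installation:

    spark {
      # local[*] lets Spark use every CPU available to the machine (the default).
      # Replacing * with a number, for example local[4], caps profiling at 4 cores.
      master = local[*]
    }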