Jobserver service

Architecture

The Jobserver is used to ingest data and then to run data profiling or create sample data on the ingested data. You can ingest data when you register a data source.

It is an application that relies on Apache Spark to perform CPU- and memory-intensive computations quickly and efficiently. More specifically, the Jobserver acts as an interface between the Collibra Platform service and Spark, sending Spark job execution requests through a REST interface. The Jobserver also provides control over the individual Spark jobs and the data used by Spark.

When running a profiling operation, the Jobserver starts a new Java Virtual Machine (JVM) that runs a Spark Context. The profiling operation is executed within this JVM, and its results are returned to the Collibra service through the main Jobserver application.

Only one profiling operation can run at a time. If there are several profiling operations, they are queued for execution.
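The one-at-a-time behavior described above can be pictured with a small model. The following is a minimal sketch, not the Jobserver's actual implementation: the class `SerialJobQueue` and its methods are hypothetical names, and a single-worker thread pool stands in for the real queueing mechanism, which this page does not describe.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class SerialJobQueue:
    """Hypothetical model: at most one job runs, the others wait in FIFO order."""

    def __init__(self):
        # A single worker thread means submitted jobs run one at a time.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._running = 0
        self.max_concurrent = 0   # highest number of jobs seen running at once
        self.completed = []       # job names in completion order

    def submit(self, name, operation):
        """Queue a job; returns a Future that resolves to the job's result."""
        return self._executor.submit(self._run, name, operation)

    def _run(self, name, operation):
        with self._lock:
            self._running += 1
            self.max_concurrent = max(self.max_concurrent, self._running)
        try:
            return operation()
        finally:
            with self._lock:
                self._running -= 1
                self.completed.append(name)

    def shutdown(self):
        """Wait for all queued jobs to finish."""
        self._executor.shutdown(wait=True)
```

In this model, submitting three operations in a row returns immediately, while the operations themselves run strictly in submission order and never overlap.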

Data storage

The Jobserver must be installed on a dedicated server and is managed by Collibra Console through an agent.

The data of the Jobserver service is located in:

Linux with root permission: /opt/collibra_data/spark-jobserver
Linux without root permission: ~/collibra_data/spark-jobserver
Windows: C:\collibra_data\spark-jobserver
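For scripts that need to find this directory, the three default locations above can be expressed as a small lookup. This is only an illustrative sketch: the function name `jobserver_data_dir` and its parameters are hypothetical, and it assumes the default paths listed here rather than any custom location chosen during installation.

```python
from pathlib import PurePosixPath, PureWindowsPath


def jobserver_data_dir(os_name, has_root=True, home="~"):
    """Return the default Jobserver data directory listed in this section.

    os_name:  "linux" or "windows" (hypothetical labels for this sketch)
    has_root: whether the Linux installation used root permission
    home:     home directory to use for a non-root Linux installation
    """
    if os_name == "windows":
        return PureWindowsPath(r"C:\collibra_data") / "spark-jobserver"
    if os_name == "linux":
        base = PurePosixPath("/opt") if has_root else PurePosixPath(home)
        return base / "collibra_data" / "spark-jobserver"
    raise ValueError(f"unsupported operating system: {os_name!r}")
```

Pure path classes are used so that the lookup gives the same answer on any machine, regardless of the operating system the script itself runs on.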

It contains the following subfolders:

Directory name	Content
logs	All the log files created by the Jobserver service.
data	The data used by the Jobserver service during runtime. It does not contain any critical state that the application needs to maintain.
config	The configuration of the Jobserver memory and CPU usage.
security	The public and private keys needed to use SSL encryption when communicating with the Jobserver REST API.
pgsql-data	The data that is stored by the Jobserver service, such as job information and the JAR files to register data sources.
spark-warehouse	The directory where the Spark tables are persisted.
temp-files	The directory to store temporary files during ingestion and profiling jobs.
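As a quick sanity check of an installation, the subfolders in the table above can be compared against what is actually on disk. The sketch below is an assumption-laden helper, not a Collibra tool: `missing_subfolders` is a hypothetical name, and the list is taken only from this table. A missing folder is not necessarily an error, because this page does not state when each folder is created.

```python
from pathlib import Path

# Subfolder names taken from the table in this section.
EXPECTED_SUBFOLDERS = (
    "logs",
    "data",
    "config",
    "security",
    "pgsql-data",
    "spark-warehouse",
    "temp-files",
)


def missing_subfolders(data_dir):
    """Return the expected subfolder names that do not exist under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

For example, calling `missing_subfolders("/opt/collibra_data/spark-jobserver")` on a Linux installation with root permission returns an empty list when all seven subfolders are present.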