Jobserver service

Architecture

The Jobserver is used to ingest data and then to profile the ingested data or create sample data from it. You can ingest data when you register a data source.

It is an application that relies on Apache Spark to perform CPU- and memory-intensive computations quickly and efficiently. More specifically, the Jobserver acts as an interface between the Data Governance Center (DGC) service and Spark, sending Spark job execution requests through a REST interface. The Jobserver also provides control over the individual Spark jobs and the data used by Spark.

When running a profiling operation, the Jobserver starts a new Java Virtual Machine (JVM) that runs a Spark context. The profiling operation is executed within this JVM, and its results are returned to the DGC service through the main Jobserver application.

Only one profiling operation can run at a time. If several profiling operations are requested, they are queued and executed one after another.
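The queueing behavior described above can be illustrated with a minimal Python sketch. It is not Jobserver code: the class ProfilingQueue and the helper demo are hypothetical names, and a single-worker thread pool is assumed here only as a simple way to model "one operation at a time, the rest wait in order".

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ProfilingQueue:
    """Hypothetical stand-in for the Jobserver queue (not Collibra code)."""

    def __init__(self):
        # A single worker means at most one operation runs at any moment;
        # further submissions wait in the executor's internal FIFO queue.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def submit(self, operation, *args):
        """Queue an operation and return a Future for its result."""
        return self._executor.submit(operation, *args)

    def shutdown(self):
        """Wait for all queued operations to finish."""
        self._executor.shutdown(wait=True)


def demo():
    """Submit three fake operations; report results, order, and peak concurrency."""
    state = {"active": 0, "peak": 0, "order": []}
    lock = threading.Lock()

    def fake_profile(name):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # simulate profiling work
        with lock:
            state["active"] -= 1
            state["order"].append(name)
        return name

    jobs = ProfilingQueue()
    futures = [jobs.submit(fake_profile, n) for n in ("a", "b", "c")]
    results = [f.result() for f in futures]
    jobs.shutdown()
    return results, state["order"], state["peak"]
```

Calling demo() submits three operations at once; because there is only one worker, they complete in submission order and the recorded peak concurrency never exceeds one.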

Data storage

The Jobserver must be installed on a dedicated server and is managed by Collibra Console through an agent.

The data of the Jobserver service is located in:

  • Linux with root permission: /opt/collibra_data/spark-jobserver
  • Linux without root permission: ~/collibra_data/spark-jobserver
  • Windows: C:\collibra_data\spark-jobserver
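As a rough illustration, the locations listed above could be resolved in a script as follows. This is a sketch under assumptions: the function name jobserver_data_dir and its parameters are hypothetical, and "with root permission" is approximated by checking whether the current process runs as root, which may differ from how the service was actually installed.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath


def jobserver_data_dir(platform=sys.platform, is_root=None, home=None):
    """Return the documented Jobserver data directory for a platform.

    All parameters are illustrative overrides; by default the values are
    detected from the running process.
    """
    if platform.startswith("win"):
        return PureWindowsPath(r"C:\collibra_data\spark-jobserver")
    if is_root is None:
        # Assumption: an effective UID of 0 stands in for "root permission".
        is_root = hasattr(os, "geteuid") and os.geteuid() == 0
    if is_root:
        return PurePosixPath("/opt/collibra_data/spark-jobserver")
    if home is None:
        home = os.path.expanduser("~")
    return PurePosixPath(home) / "collibra_data" / "spark-jobserver"
```

For example, jobserver_data_dir("linux", is_root=False, home="/home/alice") yields /home/alice/collibra_data/spark-jobserver, where /home/alice is a made-up home directory.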

The data directory contains the following subfolders:

Directory name     Content

logs               All the log files created by the Jobserver service.
data               The data used by the Jobserver service at runtime. It does not contain any critical state that the application needs to maintain.
config             The configuration data for the Jobserver memory and CPU usage.
security           The public and private keys needed to use SSL encryption when communicating with the Jobserver REST API.
pgsql-data         The data that is stored by the Jobserver service, such as job information and the JAR files to register data sources.
spark-warehouse    The directory where the Spark tables are persisted.
temp-files         The directory to store temporary files during ingestion and profiling jobs.
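A quick way to check that a data directory has the layout described above is sketched below. The helper missing_subfolders is a hypothetical name, not part of any Collibra tooling; the folder names are taken from the list above.

```python
from pathlib import Path

# Subfolder names as listed in this section.
EXPECTED_SUBFOLDERS = (
    "logs",
    "data",
    "config",
    "security",
    "pgsql-data",
    "spark-warehouse",
    "temp-files",
)


def missing_subfolders(data_dir):
    """Return the expected subfolders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

An empty result means every listed subfolder exists. A non-empty result only shows which folders are absent at that moment; it does not by itself indicate a faulty installation.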