Agent
Diagram
The diagram above provides a high-level overview of how agents work within Collibra DQ. Job execution is driven by DQ Jobs that are written to an agent_q table inside the DQ Metastore (DQ-Postgres) via the Web App or a REST API endpoint. Each active and available agent queries this table every 5 seconds and executes the DQ Jobs for which it is responsible. For example, the EMR agent DQ-Agent3 only executes DQ Jobs scheduled to run on EMR.
When an agent picks up a DQ Job, it launches the job either locally on the agent node itself or on a cluster as a Spark job (if the agent is set up as an edge node of the cluster). Wherever the job launches, its results are written back to the DQ Metastore. The results then display in the DQ Web App, are exposed through the REST API, and become available for direct SQL queries against the DQ Metastore.
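The polling model described above can be sketched in a few lines of shell. This is an illustration only: the text file standing in for the agent_q table, its job_id,target layout, and the function names jobs_for_agent and poll_queue are all assumptions made for the example, not the actual agent_q schema or agent code.

```shell
# Hypothetical stand-in for the agent_q table: a text file with one
# "job_id,target" entry per line. Not the real DQ Metastore schema.

# Print the IDs of queued jobs whose target matches the agent's platform.
jobs_for_agent() (
  queue_file=$1
  agent_target=$2
  awk -F',' -v target="$agent_target" '$2 == target { print $1 }' "$queue_file"
)

# Poll the queue a fixed number of times, pausing between polls.
# The interval defaults to 5 seconds, matching the description above.
poll_queue() (
  queue_file=$1
  agent_target=$2
  polls=$3
  interval=${4:-5}
  i=0
  while [ "$i" -lt "$polls" ]; do
    for job in $(jobs_for_agent "$queue_file" "$agent_target"); do
      echo "launching $job on $agent_target"
    done
    i=$((i + 1))
    # Sleep only between polls, not after the last one.
    if [ "$i" -lt "$polls" ]; then
      sleep "$interval"
    fi
  done
)
```

Calling poll_queue queue.txt EMR 3 against a file that contains job-1,EMR and job-2,STANDALONE reports only job-1 on each poll. A real agent would also need to mark a job as picked up so that it is not launched again; the sketch leaves that out.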
Setting up a DQ Agent with setup.sh as part of the DQ package
Use the setup.sh script located in /opt/owl/ (or the other Base Path that your installation used). The example in the code block below sets up a DQ Agent against a Postgres server running on localhost, port 5432, with the database postgres and the Postgres username/password combination postgres/password.
# PATH TO DIR THAT CONTAINS THE INSTALL DIR
export BASE_PATH=/opt
# PATH TO AGENT INSTALL DIR
export INSTALL_PATH=/opt/owl
# DQ Metadata Postgres Storage settings
export METASTORE_HOST=localhost
export METASTORE_PORT=5432
export METASTORE_DB=postgres
export METASTORE_USER=postgres
export METASTORE_PASSWORD=password
cd $INSTALL_PATH
# Install DQ Agent only
./setup.sh \
-owlbase=$BASE_PATH \
-options=owlagent \
-pguser=$METASTORE_USER \
-pgpassword=$METASTORE_PASSWORD \
-pgserver=${METASTORE_HOST}:${METASTORE_PORT}/${METASTORE_DB}
The setup script automatically generates the /opt/owl/config/owl.properties file and encrypts the provided password.
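Because the -pgserver argument combines three settings into a single host:port/database string, it can help to validate the parts before running the installer. The following is a minimal sketch; the function name build_pgserver is hypothetical and is not part of the DQ package.

```shell
# Hypothetical helper: build the host:port/database value for -pgserver.
# Fails with status 1 if the port is not numeric or a part is empty.
build_pgserver() (
  host=$1
  port=$2
  db=$3
  case $port in
    ''|*[!0-9]*)
      echo "port must be numeric: '$port'" >&2
      exit 1
      ;;
  esac
  if [ -z "$host" ] || [ -z "$db" ]; then
    echo "host and database must not be empty" >&2
    exit 1
  fi
  printf '%s:%s/%s\n' "$host" "$port" "$db"
)
```

With the values from the example above, build_pgserver "$METASTORE_HOST" "$METASTORE_PORT" "$METASTORE_DB" prints localhost:5432/postgres, which is the same string that the setup.sh call assembles inline.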
Setting up a DQ Agent manually
Steps
- Open a terminal session and go to the directory with the installer.
- Run the following command to encrypt your DQ Metastore password before it is stored in the /opt/owl/config/owl.properties file:
# PATH TO AGENT INSTALL DIR
export INSTALL_PATH=/opt/owl
cd $INSTALL_PATH
# Encrypt DQ Metadata Postgres Storage password
./owlmanage.sh encrypt=password
Note: owlmanage.sh generates an encrypted string for the plain text password input. You can use the encrypted string in the /opt/owl/config/owl.properties configuration file to avoid exposing the DQ Metadata Postgres Storage password.
- Run the following command to open the /opt/owl/config/owl.properties configuration file:
vi $INSTALL_PATH/config/owl.properties
- Add the following properties to the configuration file:
spring.datasource.url=jdbc:postgresql://{DB_HOST}:{DB_PORT}/{METASTORE_DB}
spring.datasource.username={METASTORE_USER}
spring.datasource.password={METASTORE_PASSWORD}
spring.datasource.driver-class-name=com.owl.org.postgresql.Driver
spring.agent.datasource.url=jdbc:postgresql://{DB_HOST}:{DB_PORT}/{METASTORE_DB}
spring.agent.datasource.username={METASTORE_USER}
spring.agent.datasource.password={METASTORE_PASSWORD}
spring.agent.datasource.driver-class-name=org.postgresql.Driver
- Restart the DQ Web App.
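To avoid typing the placeholders by hand, the property lines from the steps above can be generated from variables. This is a sketch under stated assumptions: render_datasource_props is a hypothetical helper, not a DQ tool, and the password argument is expected to be the encrypted string produced by owlmanage.sh.

```shell
# Hypothetical helper: print the datasource properties for owl.properties.
# Arguments: host, port, database, user, encrypted password.
render_datasource_props() (
  url="jdbc:postgresql://$1:$2/$3"
  cat <<EOF
spring.datasource.url=$url
spring.datasource.username=$4
spring.datasource.password=$5
spring.datasource.driver-class-name=com.owl.org.postgresql.Driver
spring.agent.datasource.url=$url
spring.agent.datasource.username=$4
spring.agent.datasource.password=$5
spring.agent.datasource.driver-class-name=org.postgresql.Driver
EOF
)
```

The output can be reviewed on screen and then pasted into the configuration file. If you redirect it into owl.properties instead, first check that the file does not already define these keys, because the sketch does not remove existing entries.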
Setting up the DQ Agent from the Admin Console
Steps
- On the Collibra DQ home page, hover your cursor over Settings and select Admin Console.
The Admin Console opens.
- Click Remote Agent.
The Agent Management page opens.
- In the last column of the Agents table, to the right, click the pencil icon to edit your agent.
The Edit Agent modal appears.
- Enter the required information.
- Standalone
- Yarn
- K8s
Standalone

Field: Description
Agent Id: The numerical identifier of your agent. For example, 6. This field auto-generates and cannot be edited.
Agent Name: The unique name of your agent. This field auto-generates and cannot be edited.
Agent Display Name: The descriptive name of your agent that displays anywhere agent information is present in the DQ Web App. You can customize the Agent Display Name to make it easier to identify your agent.
Tip: There are no character restrictions for the Agent Display Name field, but it is best practice to use only alphanumeric characters, hyphens, and underscores.
Is Local: Select this option for Hadoop deployments only.
Use Livy: Not applicable.
Livy Host: Not applicable.
Base Path: The installation folder path for DQ. All other paths in the DQ Agent are relative to this installation path. This is the location that is set as OWL_BASE in Full Standalone Setup and other installation setups, followed by the owl/ folder. For example, if the setup command is export OWL_BASE=/home/centos, then the Base Path in the Agent configuration should be set to /home/centos/owl/. Default: /opt/owl/
Collibra DQ Core JAR: The file path to the DQ Core jar file. Default: <Base Path>/owl/bin/
Collibra DQ Core Logs: The folder path where DQ Core logs are stored. Logs from DQ Jobs are stored in this folder. Default: <Base Path>/owl/log
Collibra DQ Script: The file path to the DQ execution script owlcheck.sh. This script is used to run DQ Jobs via the command line without using the agent. Using owlcheck.sh for running DQ Jobs is superseded by the DQ Agent execution model. Default: <Base Path>/owl/bin/owlcheck
Collibra DQ Web Logs: The folder path where DQ Web logs are stored. Logs from the DQ Web App are stored in this folder. Default: <Base Path>/owl/log
Default Queue: Only used for Yarn.
Default Deployment Mode: The Spark deployment mode can be either Client or Cluster. While we recommend Cluster, there are best practices to follow:
- If you only have one Spark Worker node, it is best practice to select Client.
- If you have more than one Spark Worker node, it is best practice to select Cluster.
Default Master: The Spark Master URL copied from the Spark cluster verification screen. For example, spark://...
Dynamic Spark Allocation: Not applicable.
Spark Configuration Key: Not applicable.
Spark Configuration Value: Not applicable.
Number of Executor(s): The default number of executors allocated per DQ Job when using this Agent to run DQ Scans. The default is 1.
Executor Memory: The default RAM per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1 gigabyte.
Number of Core(s): The default number of cores per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1.
Driver Memory: The default driver RAM allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1 gigabyte.
Free Form (Appended): Other spark-submit parameters to append to each DQ Job when using this Agent to run DQ Scans.

Yarn

Field: Description
Agent Id: The numerical identifier of your agent. For example, 6.
This field auto-generates and cannot be edited.
Agent Name: The unique name of your agent. This field auto-generates and cannot be edited.
Agent Display Name: The descriptive name of your agent that displays anywhere agent information is present in the DQ Web App. You can customize the Agent Display Name to make it easier to identify your agent.
Tip: There are no character restrictions for the Agent Display Name field, but it is best practice to use only alphanumeric characters, hyphens, and underscores.
Is Local: You can select this option to form the driver location path, which is normally applicable only when you run your agent in the master or edge node.
Use Livy: Not applicable.
Livy Host: Not applicable.
Base Path: The installation folder path for DQ. All other paths in the DQ Agent are relative to this installation path. This is the location that is set as OWL_BASE in Full Standalone Setup and other installation setups, followed by the owl/ folder. For example, if the setup command is export OWL_BASE=/home/centos, then the Base Path in the Agent configuration should be set to /home/centos/owl/. Default: /opt/owl/
Collibra DQ Core JAR: The file path to the DQ Core jar file. Default: <Base Path>/owl/bin/
Collibra DQ Core Logs: The folder path where DQ Core logs are stored. Logs from DQ Jobs are stored in this folder. Default: <Base Path>/owl/log
Collibra DQ Script: The file path to the DQ execution script owlcheck.sh. This script is used to run DQ Jobs via the command line without using an agent. Using owlcheck.sh for running DQ Jobs is superseded by the DQ Agent execution model. Default: <Base Path>/owl/bin/owlcheck
Collibra DQ Web Logs: The folder path where DQ Web logs are stored. Logs from the DQ Web App are stored in this folder. Default: <Base Path>/owl/log
Default Queue: The default resource queue to which jobs are submitted.
Default Deployment Mode: The Spark deployment mode for Yarn is Cluster.
Default Master: Set to Yarn. Click Edit Yarn Config to ensure you have the necessary Hadoop xml files. Edit the file templates as necessary:

XML File: Description
core-site.xml: Contains information about where authentication protocol, HDFS_RPC_PROTECTION, and the NAME_NODE run in the Hadoop cluster.
Example template:
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.rpc.protection</name>
    <value>$HDFS_RPC_PROTECTION</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$NAME_NODE:8020</value>
  </property>
</configuration>
hdfs-site.xml: Contains the configuration settings for authentication protocol, the NAME_NODE, and DATA_NODE.
Example template:
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>HDFS/_HOST@$KERBEROS_DOMAIN</value>
  </property>
</configuration>
yarn-site.xml: Contains the Yarn resource manager settings.
Example template:
<configuration>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>$RESOURCE_MANAGER:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>$RESOURCE_MANAGER:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.https.address</name>
    <value>$RESOURCE_MANAGER:8090</value>
  </property>
</configuration>

Dynamic Spark Allocation: Not applicable.
Spark Configuration Key: Not applicable.
Spark Configuration Value: Not applicable.
Number of Executor(s): The default number of executors allocated per DQ Job when using this Agent to run DQ Scans. The default is 1.
Executor Memory: The default RAM per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1 gigabyte.
Number of Core(s): The default number of cores per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1.
Driver Memory: The default driver RAM allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1 gigabyte.
Free Form (Appended): Other spark-submit parameters to append to each DQ Job when using this Agent to run DQ Scans.
Note: Ensure that your service account has the permission to launch Spark executor pods. Refer to the executor launch template and permissions.

K8s

Field: Description
Agent Id: The numerical identifier of your agent. For example, 6. This field auto-generates and cannot be edited.
Agent Name: The unique name of your agent. This field auto-generates and cannot be edited.
Agent Display Name: The descriptive name of your agent that displays anywhere agent information is present in the DQ Web App. You can customize the Agent Display Name to make it easier to identify your agent.
Tip: There are no character restrictions for the Agent Display Name field, but it is best practice to use only alphanumeric characters, hyphens, and underscores.
Is Local: Select this option for Hadoop deployments only.
Use Livy: Not applicable.
Livy Host: Not applicable.
Base Path: The installation folder path for DQ. All other paths in the DQ Agent are relative to this installation path. This is the location that is set as OWL_BASE in Full Standalone Setup and other installation setups, followed by the owl/ folder. For example, if the setup command is export OWL_BASE=/home/centos, then the Base Path in the Agent configuration should be set to /home/centos/owl/. Default: /opt/owl/
Collibra DQ Core JAR: The file path to the DQ Core jar file. Default: <Base Path>/owl/bin/
Collibra DQ Core Logs: The folder path where DQ Core logs are stored. Logs from DQ Jobs are stored in this folder. Default: <Base Path>/owl/log
Collibra DQ Script: The file path to the DQ execution script owlcheck.sh. This script is used to run DQ Jobs via the command line without using an agent. Using owlcheck.sh for running DQ Jobs is superseded by the DQ Agent execution model. Default: <Base Path>/owl/bin/owlcheck
Collibra DQ Web Logs: The folder path where DQ Web logs are stored. Logs from the DQ Web App are stored in this folder. Default: <Base Path>/owl/log
Default Queue: Only used for Yarn.
Default Deployment Mode: The Spark deployment mode for Kubernetes is Cluster.
Default Master: The Kubernetes Master URL copied from the Kubernetes cluster verification screen. Set this value to k8s:// instead of a specific URL. Leaving this value set to k8s:// helps Collibra DQ auto-discover the High Availability endpoint of the Kubernetes control plane at runtime.
Warning: Only set this to a specific URL, such as k8s://{hostname}:443, if you are an advanced user or if your specific use case requires it.
Dynamic Spark Allocation: Not applicable.
Spark Configuration Key: Not applicable.
Spark Configuration Value: Not applicable.
Number of Executor(s): The default number of executors allocated per DQ Job when using this Agent to run DQ Scans. The default is 1.
Executor Memory: The default RAM per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1 gigabyte.
Number of Core(s): The default number of cores per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1.
Driver Memory: The default driver RAM allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information. The default is 1 gigabyte.
Free Form (Appended): Other spark-submit parameters to append to each DQ Job when using this Agent to run DQ Scans.
Note: If you bring in your own Spark executor pod launch template, ensure that the service account used to launch Spark executor pods has the permission to do so. Refer to the executor launch template and permissions for more information.
- Click Save.
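Two of the field rules above are mechanical enough to express as small helpers: the Base Path is the OWL_BASE location followed by the owl/ folder, and the suggested Standalone deployment mode depends on the number of Spark Worker nodes. The sketch below is illustrative; agent_base_path and standalone_deploy_mode are hypothetical names and are not part of Collibra DQ.

```shell
# Hypothetical helper: derive the agent Base Path from the OWL_BASE
# value used at install time (OWL_BASE followed by the owl/ folder).
agent_base_path() (
  owl_base=${1%/}   # drop one trailing slash, if any
  printf '%s/owl/\n' "$owl_base"
)

# Hypothetical helper: suggest a Standalone deployment mode from the
# number of Spark Worker nodes (one worker: Client, more: Cluster).
standalone_deploy_mode() (
  workers=$1
  case $workers in
    ''|*[!0-9]*)
      echo "worker count must be a positive integer" >&2
      exit 1
      ;;
  esac
  if [ "$workers" -lt 1 ]; then
    echo "worker count must be a positive integer" >&2
    exit 1
  fi
  if [ "$workers" -eq 1 ]; then
    echo Client
  else
    echo Cluster
  fi
)
```

For the example in the field description, agent_base_path /home/centos prints /home/centos/owl/. The deployment mode helper only applies to Standalone agents, because the Yarn and Kubernetes modes are described above as Cluster.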
Linking data sources to the DQ Agent from the Admin Console
When you add new Data Sources, the DQ Agent requires permission to run DQ Jobs with them.
Steps
- On the Collibra DQ home page, hover your cursor over Settings and select Admin Console.
The Admin Console opens.
- Click Remote Agent.
The Agent Management page opens.
- In the last column of the Agents table, to the right, click the chain link icon to link your agent to data source connections.
The Agent to Connection Management wizard appears.
Note: The left panel contains a list of available connections that are not yet linked to the DQ Agent and do not yet have permission to run DQ Jobs. The right panel contains a list of connections that are linked to the DQ Agent and have permission to run DQ Jobs.
- Click a connection in the left panel to link connections one at a time or click the double arrow icon to link all available connections at the same time.
- Click Update.
Tip You can unlink connections with the same methods listed above, but click the connections listed in the right panel instead of the left. Successfully unlinked connections appear in the left panel.