Agent

Diagram

[Image: DQ agent diagram]

The diagram above provides a high-level overview of how agents work within Collibra DQ. Job execution is driven by DQ Jobs that are written to an agent_q table inside the DQ Metastore (DQ-Postgres) via the Web App or REST API endpoint. Each active and available agent queries the DQ-Postgres table every 5 seconds to execute DQ Jobs for which the agent is responsible. For example, the EMR agent DQ-Agent3 only executes DQ Jobs scheduled to run on EMR.

When an agent picks up a DQ Job, it launches the job either locally on the agent node itself or on a cluster as a Spark job (if the agent is set up as an edge node of the cluster). Wherever the job launches, its results are written back to the DQ Metastore. The results then display in the DQ Web App, are exposed through the REST API, and become available for direct SQL queries against the DQ Metastore.
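The pick-up behavior described above can be illustrated with a small simulation. This is only a sketch: the in-memory QUEUE variable, its two-column layout, and the helper names jobs_for_agent and poll_once are hypothetical stand-ins, because the real columns of agent_q are not described in this section.

```shell
#!/bin/sh
# Simulated agent_q rows in the form "<job id> <agent name>".
# This layout is illustrative only and is NOT the real agent_q schema.
QUEUE='101 DQ-Agent1
102 DQ-Agent3
103 DQ-Agent3'

# Print the ids of the queued jobs that belong to the given agent.
jobs_for_agent() {
    printf '%s\n' "$QUEUE" | awk -v agent="$1" '$2 == agent { print $1 }'
}

# One polling cycle: report each job that this agent is responsible for.
poll_once() {
    for job in $(jobs_for_agent "$1"); do
        echo "agent $1 launching job $job"
    done
}

# A real agent repeats the cycle every 5 seconds, for example:
#   while true; do poll_once DQ-Agent3; sleep 5; done
poll_once DQ-Agent3
```

In the simulation, DQ-Agent3 ignores job 101 because that row names a different agent, which mirrors how the EMR agent only executes jobs scheduled to run on EMR.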

Setting up a DQ Agent with setup.sh as part of the DQ package

Use the setup.sh script located in /opt/owl/ (or the Base Path that your installation used). The example in the code block below sets up a DQ Agent against a Postgres server running on localhost on port 5432, with the database postgres and the Postgres username/password combination postgres/password.

# PATH TO DIR THAT CONTAINS THE INSTALL DIR
export BASE_PATH=/opt

# PATH TO AGENT INSTALL DIR
export INSTALL_PATH=/opt/owl

# DQ Metadata Postgres Storage settings 
export METASTORE_HOST=localhost
export METASTORE_PORT=5432
export METASTORE_DB=postgres
export METASTORE_USER=postgres
export METASTORE_PASSWORD=password 

cd $INSTALL_PATH

# Install DQ Agent only
./setup.sh \
    -owlbase=$BASE_PATH \
    -options=owlagent \
    -pguser=$METASTORE_USER \
    -pgpassword=$METASTORE_PASSWORD \
    -pgserver=${METASTORE_HOST}:${METASTORE_PORT}/${METASTORE_DB}

The setup script automatically generates the /opt/owl/config/owl.properties file and encrypts the provided password.
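Before running setup.sh, it can help to assemble and sanity-check the value passed to -pgserver. The following is a minimal sketch and not part of the DQ package: build_pgserver is a hypothetical helper that joins the three metastore settings in the host:port/database form used in the example above and rejects obviously invalid input.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Collibra DQ): joins the metastore
# settings into the host:port/database form passed to -pgserver.
# Arguments: $1 = host, $2 = port, $3 = database
build_pgserver() {
    # Reject an empty host or database name.
    if [ -z "$1" ] || [ -z "$3" ]; then
        echo "host and database must not be empty" >&2
        return 1
    fi
    # Reject an empty or non-numeric port.
    case "$2" in
        ''|*[!0-9]*)
            echo "port must be numeric: '$2'" >&2
            return 1
            ;;
    esac
    printf '%s:%s/%s\n' "$1" "$2" "$3"
}

# Example with the values used in this section.
build_pgserver localhost 5432 postgres
```

The result could then be passed as -pgserver="$(build_pgserver "$METASTORE_HOST" "$METASTORE_PORT" "$METASTORE_DB")" in the setup.sh call shown earlier.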

Setting up a DQ Agent manually

Steps

  1. Open a terminal session and go to the directory with the installer.
  2. Run the following command to encrypt your DQ Metastore password before it is stored in the /opt/owl/config/owl.properties file:
    # PATH TO AGENT INSTALL DIR
    export INSTALL_PATH=/opt/owl

    cd $INSTALL_PATH

    # Encrypt DQ Metadata Postgres Storage password
    ./owlmanage.sh encrypt=password

    Note owlmanage.sh generates an encrypted string for the plain text password input. You can use the encrypted string in the /opt/owl/config/owl.properties configuration file to avoid exposing the DQ Metadata Postgres Storage password.

  3. Run the following command to open the /opt/owl/config/owl.properties configuration file:
    vi $INSTALL_PATH/config/owl.properties
  4. Add the following properties to the configuration file:
    spring.datasource.url=jdbc:postgresql://{DB_HOST}:{DB_PORT}/{METASTORE_DB}
    spring.datasource.username={METASTORE_USER}
    spring.datasource.password={METASTORE_PASSWORD}
    spring.datasource.driver-class-name=com.owl.org.postgresql.Driver
     
    spring.agent.datasource.url=jdbc:postgresql://{DB_HOST}:{DB_PORT}/{METASTORE_DB}
    spring.agent.datasource.username={METASTORE_USER}
    spring.agent.datasource.password={METASTORE_PASSWORD}
    spring.agent.datasource.driver-class-name=org.postgresql.Driver
  5. Restart the DQ Web App.
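As an illustration of step 4, the placeholders can be filled in by a small shell function instead of by hand. This is a sketch under assumptions: render_datasource_properties is a hypothetical helper, and ENCRYPTED_STRING stands for whatever string owlmanage.sh printed in step 2. The function only prints the properties. Review the output before adding it to owl.properties.

```shell
#!/bin/sh
# Hypothetical helper: prints the datasource properties from step 4 with the
# placeholders filled in. It does not modify owl.properties.
# Arguments: $1 = host, $2 = port, $3 = database, $4 = user,
#            $5 = password (ideally the encrypted string from step 2)
render_datasource_properties() {
    _url="jdbc:postgresql://$1:$2/$3"
    cat <<EOF
spring.datasource.url=${_url}
spring.datasource.username=$4
spring.datasource.password=$5
spring.datasource.driver-class-name=com.owl.org.postgresql.Driver

spring.agent.datasource.url=${_url}
spring.agent.datasource.username=$4
spring.agent.datasource.password=$5
spring.agent.datasource.driver-class-name=org.postgresql.Driver
EOF
}

# Example with the values used earlier in this section.
render_datasource_properties localhost 5432 postgres postgres 'ENCRYPTED_STRING'
```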

Setting up the DQ Agent from the Admin Console

Steps

  1. On the Collibra DQ home page, hover your cursor over Settings and select Admin Console.
    The Admin Console opens.
  2. Click Remote Agent.
    The Agent Management page opens.
  3. In the last column of the Agents table, to the right, click the pencil icon to edit your agent.
    The Edit Agent modal appears.
  4. Enter the required information.

    Field | Description
    Agent Id

    The numerical identifier of your agent. For example, 6.

    This field auto-generates and cannot be edited.

    Agent Name

    The unique name of your agent.

    This field auto-generates and cannot be edited.

    Agent Display Name

    The descriptive name of your agent that displays anywhere agent information is present in the DQ Web App. You can customize the Agent Display Name to make it easier to identify your agent.

    Tip There are no character restrictions for the Agent Display Name field, but it is best practice to use only alphanumeric characters, hyphens, and underscores.

    Is Local

    Select this option for Hadoop deployments only.

    Use Livy

    Not applicable.

    Livy Host

    Not applicable.

    Base Path

    The installation folder path for DQ. All other paths in the DQ Agent are relative to this installation path.

    This is the location that is set as OWL_BASE in Full Standalone Setup and other installation setups, followed by the owl/ folder. For example, if the setup command is export OWL_BASE=/home/centos, then the Base Path in the Agent configuration should be set to /home/centos/owl/.

    Default: /opt/owl/

    Collibra DQ Core JAR

    The file path to the DQ Core jar file.

    Default: <Base Path>/owl/bin/

    Collibra DQ Core Logs

    The folder path where DQ Core logs are stored. Logs from DQ Jobs are stored in this folder.

    Default: <Base Path>/owl/log

    Collibra DQ Script

    The file path to the DQ execution script owlcheck.sh. This script is used to run DQ Jobs via the command line without using an agent. Using owlcheck.sh for running DQ Jobs is superseded by the DQ Agent execution model.

    Default: <Base Path>/owl/bin/owlcheck

    Collibra DQ Web Logs

    The folder path where DQ Web logs are stored. Logs from the DQ Web App are stored in this folder.

    Default: <Base Path>/owl/log

    Default Queue

    Only used for Yarn.

    Default Deployment Mode

    The Spark deployment mode can be either Client or Cluster. While we recommend Cluster, there are best practices to follow:

    • If you only have one Spark Worker node, it is best practice to select Client.
    • If you have more than one Spark Worker node, it is best practice to select Cluster.
    Default Master

    The Spark Master URL copied from the Spark cluster verification screen. For example, spark://...

    Dynamic Spark Allocation

    Not applicable.

    Spark Configuration Key

    Not applicable.

    Spark Configuration Value

    Not applicable.

    Number of Executor(s)

    The default number of executors allocated per DQ Job when using this Agent to run DQ Scans.

    The default is 1.

    Executor Memory

    The default RAM per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1 gigabyte.

    Number of Core(s)

    The default number of cores per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1.

    Driver Memory

    The default driver RAM allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1 gigabyte.

    Free Form (Appended)

    Other spark-submit parameters to append to each DQ Job when using this Agent to run DQ Scans.

    Field | Description
    Agent Id

    The numerical identifier of your agent. For example, 6.

    This field auto-generates and cannot be edited.

    Agent Name

    The unique name of your agent.

    This field auto-generates and cannot be edited.

    Agent Display Name

    The descriptive name of your agent that displays anywhere agent information is present in the DQ Web App. You can customize the Agent Display Name to make it easier to identify your agent.

    Tip There are no character restrictions for the Agent Display Name field, but it is best practice to use only alphanumeric characters, hyphens, and underscores.

    Is Local

    You can select this option to form the driver location path, which is normally applicable only when you run your agent on the master or edge node.

    Use Livy

    Not applicable.

    Livy Host

    Not applicable.

    Base Path

    The installation folder path for DQ. All other paths in the DQ Agent are relative to this installation path.

    This is the location that is set as OWL_BASE in Full Standalone Setup and other installation setups, followed by the owl/ folder. For example, if the setup command is export OWL_BASE=/home/centos, then the Base Path in the Agent configuration should be set to /home/centos/owl/.

    Default: /opt/owl/

    Collibra DQ Core JAR

    The file path to the DQ Core jar file.

    Default: <Base Path>/owl/bin/

    Collibra DQ Core Logs

    The folder path where DQ Core logs are stored. Logs from DQ Jobs are stored in this folder.

    Default: <Base Path>/owl/log

    Collibra DQ Script

    The file path to the DQ execution script owlcheck.sh. This script is used to run DQ Jobs via the command line without using an agent. Using owlcheck.sh for running DQ Jobs is superseded by the DQ Agent execution model.

    Default: <Base Path>/owl/bin/owlcheck

    Collibra DQ Web Logs

    The folder path where DQ Web logs are stored. Logs from the DQ Web App are stored in this folder.

    Default: <Base Path>/owl/log

    Default Queue

    The default resource queue to which jobs are submitted.

    Default Deployment Mode

    The Spark deployment mode for Yarn is Cluster.

    Default Master

    Set to Yarn.

    Click Edit Yarn Config to ensure you have the necessary Hadoop xml files. Edit the file templates as necessary:

    XML File | Description
    core-site.xml

    Contains information about where authentication protocol, HDFS_RPC_PROTECTION, and the NAME_NODE run in the Hadoop cluster.

    hdfs-site.xml

    Contains the configuration settings for authentication protocol, the NAME_NODE, and DATA_NODE.

    yarn-site.xml

    Contains the Yarn resource manager settings.

    Dynamic Spark Allocation

    Not applicable.

    Spark Configuration Key

    Not applicable.

    Spark Configuration Value

    Not applicable.

    Number of Executor(s)

    The default number of executors allocated per DQ Job when using this Agent to run DQ Scans.

    The default is 1.

    Executor Memory

    The default RAM per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1 gigabyte.

    Number of Core(s)

    The default number of cores per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1.

    Driver Memory

    The default driver RAM allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1 gigabyte.

    Free Form (Appended)

    Other spark-submit parameters to append to each DQ Job when using this Agent to run DQ Scans.

    Note Ensure that your service account has the permission to launch Spark executor pods. Refer to the executor launch template and permissions.

    Field | Description
    Agent Id

    The numerical identifier of your agent. For example, 6.

    This field auto-generates and cannot be edited.

    Agent Name

    The unique name of your agent.

    This field auto-generates and cannot be edited.

    Agent Display Name

    The descriptive name of your agent that displays anywhere agent information is present in the DQ Web App. You can customize the Agent Display Name to make it easier to identify your agent.

    Tip There are no character restrictions for the Agent Display Name field, but it is best practice to use only alphanumeric characters, hyphens, and underscores.

    Is Local

    Select this option for Hadoop deployments only.

    Use Livy

    Not applicable.

    Livy Host

    Not applicable.

    Base Path

    The installation folder path for DQ. All other paths in the DQ Agent are relative to this installation path.

    This is the location that is set as OWL_BASE in Full Standalone Setup and other installation setups, followed by the owl/ folder. For example, if the setup command is export OWL_BASE=/home/centos, then the Base Path in the Agent configuration should be set to /home/centos/owl/.

    Default: /opt/owl/

    Collibra DQ Core JAR

    The file path to the DQ Core jar file.

    Default: <Base Path>/owl/bin/

    Collibra DQ Core Logs

    The folder path where DQ Core logs are stored. Logs from DQ Jobs are stored in this folder.

    Default: <Base Path>/owl/log

    Collibra DQ Script

    The file path to the DQ execution script owlcheck.sh. This script is used to run DQ Jobs via the command line without using an agent. Using owlcheck.sh for running DQ Jobs is superseded by the DQ Agent execution model.

    Default: <Base Path>/owl/bin/owlcheck

    Collibra DQ Web Logs

    The folder path where DQ Web logs are stored. Logs from the DQ Web App are stored in this folder.

    Default: <Base Path>/owl/log

    Default Queue

    Only used for Yarn.

    Default Deployment Mode

    The Spark deployment mode for Kubernetes is Cluster.

    Default Master

    The Kubernetes Master URL copied from the Kubernetes cluster verification screen.

    Set this value to k8s:// instead of a specific URL. Leaving the value as k8s:// lets Collibra DQ auto-discover the High Availability endpoint of the Kubernetes control plane at runtime.

    Warning Only set this to a specific URL, such as k8s://{hostname}:443, if you are an advanced user or if your specific use case requires it.

    Dynamic Spark Allocation

    Not applicable.

    Spark Configuration Key

    Not applicable.

    Spark Configuration Value

    Not applicable.

    Number of Executor(s)

    The default number of executors allocated per DQ Job when using this Agent to run DQ Scans.

    The default is 1.

    Executor Memory

    The default RAM per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1 gigabyte.

    Number of Core(s)

    The default number of cores per executor allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1.

    Driver Memory

    The default driver RAM allocated per DQ Job when using this Agent to run DQ Scans. Go to Hardware Sizing for more information.

    The default is 1 gigabyte.

    Free Form (Appended)

    Other spark-submit parameters to append to each DQ Job when using this Agent to run DQ Scans.

    Note If you bring in your own Spark executor pod launch template, ensure that the service account used to launch Spark executor pods has the permission to do so. Refer to the executor launch template and permissions for more information.

  5. Click Save.
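Two of the rules in the field descriptions above lend themselves to a short sketch: deriving Base Path from OWL_BASE, and the best practice for choosing a deployment mode on a Spark cluster based on the number of Spark Worker nodes. The helper names agent_base_path and suggest_deploy_mode are hypothetical and are not part of Collibra DQ. The deployment mode rule does not apply to Yarn or Kubernetes, where the mode is Cluster.

```shell
#!/bin/sh
# Hypothetical helpers illustrating two rules from the field descriptions.

# Base Path is the OWL_BASE location followed by the owl/ folder.
# Argument: $1 = value of OWL_BASE
agent_base_path() {
    # Strip one trailing slash from OWL_BASE (if present), then append owl/.
    printf '%s/owl/\n' "${1%/}"
}

# Best practice for Spark Worker nodes: Client for one node,
# Cluster for more than one.
# Argument: $1 = number of Spark Worker nodes
suggest_deploy_mode() {
    if [ "$1" -gt 1 ]; then
        echo Cluster
    else
        echo Client
    fi
}

agent_base_path /home/centos   # prints /home/centos/owl/
suggest_deploy_mode 3          # prints Cluster
```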

Linking data sources to the DQ Agent from the Admin Console

When you add new Data Sources, the DQ Agent requires permission to run DQ Jobs with them.

Steps

  1. On the Collibra DQ home page, hover your cursor over Settings and select Admin Console.
    The Admin Console opens.
  2. Click Remote Agent.
    The Agent Management page opens.
  3. In the last column of the Agents table, to the right, click the chain link icon to link your agent to data source connections.
    The Agent to Connection Management wizard appears.

    Note The left panel contains a list of available connections that are not yet linked to the DQ Agent and do not yet have permission to run DQ Jobs. The right panel contains a list of connections that are linked to the DQ Agent and have permission to run DQ Jobs.

  4. Click a connection in the left panel to link connections one at a time or click the double arrow icon to link all available connections at the same time.
  5. Click Update.

Tip You can unlink connections with the same methods listed above, but click the connections listed in the right panel instead of the left. Successfully unlinked connections appear in the left panel.

Adding a connection to a DQ Agent