Troubleshooting Self-hosted Kubernetes Install

This section provides the most common commands to run when troubleshooting a DQ environment that is deployed on self-hosted Kubernetes.

Additional documentation

### Show documentation on kubectl syntax and flags in the terminal

kubectl help

### List the API resources available in the cluster, with details

kubectl api-resources -o wide
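
For field-level documentation on a specific resource, kubectl explain can also be useful; the resource and field below are examples only.

### Show field-level documentation for a resource (the resource and field shown are examples)

kubectl explain deployments.spec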

Viewing Kubernetes resources

### Get Pods, their names & details in all Namespaces

kubectl get pods -A -o wide

### Get all Namespaces in a cluster

kubectl get namespaces

### Get Services in all Namespaces

kubectl get services -A -o wide

### List all Deployments in all Namespaces

kubectl get deployments -A -o wide
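
To narrow the output to a single Namespace or label, you can add the -n and -l flags; the namespace and label values below are placeholders for your environment.

### Filter Pods by Namespace and label (placeholder values)

kubectl get pods -n [my-namespace] -l app=[my-app-label] -o wide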

Logs and events

### List Events sorted by timestamp in all namespaces

kubectl get events -A --sort-by=.metadata.creationTimestamp

### Get logs from a specific Pod

kubectl logs [my-pod-name]
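
When a Pod has restarted or runs in a non-default Namespace, these common kubectl logs variations may help; the pod and namespace names are placeholders.

### Get logs from a Pod in a specific Namespace, follow the output, or view the previous container's logs

kubectl logs [my-pod-name] -n [my-namespace]
kubectl logs [my-pod-name] -f
kubectl logs [my-pod-name] --previous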

Resource allocation

### If the Kubernetes Metrics Server is installed,
### the top command allows you to see the resource consumption for nodes or pods

kubectl top node
kubectl top pod

### If the Kubernetes Metrics Server is NOT installed, use

kubectl describe nodes | grep Allocated -A 10 
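
With the Metrics Server installed, you can also sort the output to find the heaviest consumers.

### Sort Pods by memory or CPU usage across all Namespaces (requires the Metrics Server)

kubectl top pod -A --sort-by=memory
kubectl top pod -A --sort-by=cpu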

Configuration

### Get current-context

kubectl config current-context

### View the merged kubeconfig settings (clusters, users, and contexts)

kubectl config view
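
If multiple clusters are configured, listing and switching contexts can confirm you are pointed at the correct one; the context name below is a placeholder.

### List all contexts and switch to another one (placeholder context name)

kubectl config get-contexts
kubectl config use-context [my-context-name]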

Authorization issues

### Check whether the current user and context can read pod logs

kubectl auth can-i get pods --subresource=log

### Check to see if I can do everything in my current namespace ("*" means all)

kubectl auth can-i '*' '*'
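
You can also test a specific verb and resource in a specific Namespace, or impersonate a service account; the names below are placeholders.

### Check a specific permission in a Namespace, optionally impersonating a ServiceAccount (placeholder names)

kubectl auth can-i create deployments -n [my-namespace]
kubectl auth can-i get pods --as=system:serviceaccount:[my-namespace]:[my-serviceaccount]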

Jobs stuck in the Staged activity

If DQ Jobs are stuck in the Staged activity on the Jobs page, ensure your Helm Chart corresponds to the latest Data Quality & Observability Classic version, then update the following properties:

--set global.configMap.data.metastore_max_wait_time=10000
--set global.configMap.data.metastore_max_active=2000
--set global.configMap.data.metastore_max_idle=100
--set global.configMap.data.metastore_initial_size=150
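
These flags are applied as part of a Helm upgrade of the DQ release. A minimal sketch is shown below; the release, chart, and namespace names are placeholders for your install.

### Example only; the release, chart, and namespace names are placeholders

helm upgrade [my-release] [dq-chart] -n [my-namespace] \
  --set global.configMap.data.metastore_max_wait_time=10000 \
  --set global.configMap.data.metastore_max_active=2000 \
  --set global.configMap.data.metastore_max_idle=100 \
  --set global.configMap.data.metastore_initial_size=150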

Restart the Web and Agent pods after you set the recommended properties.
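
One common way to restart them is a rollout restart of the corresponding Deployments; the deployment and namespace names below are placeholders that depend on your Helm release.

### Placeholder deployment names; adjust them to match your release

kubectl rollout restart deployment [dq-web-deployment] -n [my-namespace]
kubectl rollout restart deployment [dq-agent-deployment] -n [my-namespace]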

Jobs fail with OOMKilled error

If DQ Jobs fail with the error OOMKilled, you may need to allocate additional memory overhead. Follow these steps:

  • Increase the executor memory by modifying the -executormemory option.

  • In the -conf list of properties, add spark.kubernetes.memoryOverheadFactor=0.4, as illustrated in the sketch below.
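
As a rough illustration only, the two options might look like the following on the job's command line; the memory value is an assumption and should be sized for your workload.

### Illustrative values only; size the executor memory for your workload

-executormemory 4g
-conf spark.kubernetes.memoryOverheadFactor=0.4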