Troubleshooting Self-hosted Kubernetes Install

This section provides the most common commands to run when troubleshooting a DQ environment that is deployed on self-hosted Kubernetes.

Additional documentation

### Show documentation on kubectl syntax and flags in the terminal

kubectl help

### List the API resources available in the cluster, with details

kubectl api-resources -o wide
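
For field-level documentation on a specific resource, kubectl explain can also be useful; the resource and field below are examples only.

### Show field-level documentation for a resource (the resource and field shown are examples)

kubectl explain deployments.spec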

Viewing Kubernetes resources

### Get Pods, their names & details in all Namespaces

kubectl get pods -A -o wide

### Get all Namespaces in a cluster

kubectl get namespaces

### Get Services in all Namespaces

kubectl get services -A -o wide

### List all Deployments in all Namespaces

kubectl get deployments -A -o wide
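
To narrow the output to a single Namespace or label, you can add the -n and -l flags; the namespace and label values below are placeholders for your environment.

### Filter Pods by Namespace and label (placeholder values)

kubectl get pods -n [my-namespace] -l app=[my-app-label] -o wide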

Logs and events

### List Events sorted by timestamp in all namespaces

kubectl get events -A --sort-by=.metadata.creationTimestamp

### Get logs from a specific Pod

kubectl logs [my-pod-name]
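
When a Pod has restarted or runs in a non-default Namespace, these common kubectl logs variations may help; the pod and namespace names are placeholders.

### Get logs from a Pod in a specific Namespace, follow the output, or view the previous container's logs

kubectl logs [my-pod-name] -n [my-namespace]
kubectl logs [my-pod-name] -f
kubectl logs [my-pod-name] --previous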

Resource allocation

### If the Kubernetes Metrics Server is installed,
### the top command allows you to see the resource consumption for nodes or pods

kubectl top node
kubectl top pod

### If the Kubernetes Metrics Server is NOT installed, use

kubectl describe nodes | grep Allocated -A 10 
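
With the Metrics Server installed, you can also sort the output to find the heaviest consumers.

### Sort Pods by memory or CPU usage across all Namespaces (requires the Metrics Server)

kubectl top pod -A --sort-by=memory
kubectl top pod -A --sort-by=cpu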

Configuration

### Get current-context

kubectl config current-context

### View the merged kubeconfig settings (clusters, users, and contexts)

kubectl config view
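
If multiple clusters are configured, listing and switching contexts can confirm you are pointed at the correct one; the context name below is a placeholder.

### List all contexts and switch to another one (placeholder context name)

kubectl config get-contexts
kubectl config use-context [my-context-name]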

Authorization issues

### Check whether the current user and context can read pod logs

kubectl auth can-i get pods --subresource=log

### Check to see if I can do everything in my current namespace ("*" means all)

kubectl auth can-i '*' '*'
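
You can also test a specific verb and resource in a specific Namespace, or impersonate a service account; the names below are placeholders.

### Check a specific permission in a Namespace, optionally impersonating a ServiceAccount (placeholder names)

kubectl auth can-i create deployments -n [my-namespace]
kubectl auth can-i get pods --as=system:serviceaccount:[my-namespace]:[my-serviceaccount]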

Jobs stuck in the Staged activity

If DQ Jobs are stuck in the Staged activity on the Jobs page, ensure your Helm Chart corresponds to the latest Data Quality & Observability Classic version, then update the following properties:

--set global.configMap.data.metastore_max_wait_time=10000
--set global.configMap.data.metastore_max_active=2000
--set global.configMap.data.metastore_max_idle=100
--set global.configMap.data.metastore_initial_size=150
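
These flags are applied as part of a Helm upgrade of the DQ release. A minimal sketch is shown below; the release, chart, and namespace names are placeholders for your install.

### Example only; the release, chart, and namespace names are placeholders

helm upgrade [my-release] [dq-chart] -n [my-namespace] \
  --set global.configMap.data.metastore_max_wait_time=10000 \
  --set global.configMap.data.metastore_max_active=2000 \
  --set global.configMap.data.metastore_max_idle=100 \
  --set global.configMap.data.metastore_initial_size=150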

Restart the Web and Agent pods after you set the recommended properties.
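
One common way to restart them is a rollout restart of the corresponding Deployments; the deployment and namespace names below are placeholders that depend on your Helm release.

### Placeholder deployment names; adjust them to match your release

kubectl rollout restart deployment [dq-web-deployment] -n [my-namespace]
kubectl rollout restart deployment [dq-agent-deployment] -n [my-namespace]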

Jobs fail with OOMKilled error

If DQ Jobs fail with the error OOMKilled, you may need to allocate additional memory overhead. Follow these steps:

  • Increase the executor memory by modifying the -executormemory option.

  • In the -conf list of properties, add spark.kubernetes.memoryOverheadFactor=0.4, as illustrated in the sketch below.
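
As a rough illustration only, the two options might look like the following on the job's command line; the memory value is an assumption and should be sized for your workload.

### Illustrative values only; size the executor memory for your workload

-executormemory 4g
-conf spark.kubernetes.memoryOverheadFactor=0.4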