Troubleshooting Self-hosted Kubernetes Install
This section provides the most common commands to run when troubleshooting a DQ environment that is deployed on self-hosted Kubernetes.
Additional documentation
### Provide documentation on syntax and flags in the terminal
kubectl help
### List the API resources available in the cluster and the verbs they support
kubectl api-resources -o wide
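For field-level documentation on a specific resource type, kubectl explain can also be helpful; the resource and field path below are only examples.
### Show documentation for the Pod resource and one of its fields
kubectl explain pods
kubectl explain pods.spec.containers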
Viewing Kubernetes resources
### Get Pods, their names & details in all Namespaces
kubectl get pods -A -o wide
### Get all Namespaces in a cluster
kubectl get namespaces
### Get Services in all Namespaces
kubectl get services -A -o wide
### List all Deployments in all Namespaces
kubectl get deployments -A -o wide
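To drill into a single object, kubectl describe prints its full state and recent events; the pod name and namespace below are placeholders.
### Describe a specific Pod, including its recent Events
kubectl describe pod [my-pod-name] -n [my-namespace]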
Logs and events
### List Events sorted by timestamp in all namespaces
kubectl get events -A --sort-by=.metadata.creationTimestamp
### Get logs from a specific Pod
kubectl logs [my-pod-name]
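If a pod has restarted or you want to stream output, these standard kubectl logs flags may also help; the pod, namespace, and container names are placeholders.
### Stream logs from a Pod in a specific Namespace
kubectl logs -f [my-pod-name] -n [my-namespace]
### Get logs from the previous (crashed) container instance
kubectl logs [my-pod-name] --previous
### Get logs from a specific container in a multi-container Pod
kubectl logs [my-pod-name] -c [my-container-name]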
Resource allocation
### If the Kubernetes Metrics Server is installed,
### the top command allows you to see the resource consumption for Nodes or Pods
kubectl top node
kubectl top pod
### If the Kubernetes Metrics Server is NOT installed, use
kubectl describe nodes | grep Allocated -A 10
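On recent kubectl versions, and assuming the Metrics Server is installed, the top output can also be sorted to find the heaviest consumers.
### Sort Pods in all Namespaces by memory or CPU consumption
kubectl top pod -A --sort-by=memory
kubectl top pod -A --sort-by=cpu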
Configuration
### Get current-context
kubectl config current-context
### See all configs for the entire cluster
kubectl config view
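If commands appear to target the wrong cluster, listing and switching contexts is often the fix; the context name below is a placeholder.
### List all contexts in your kubeconfig
kubectl config get-contexts
### Switch to a different context
kubectl config use-context [my-context-name]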
Authorization issues
### Check to see if I can read Pod logs with the current user & context
kubectl auth can-i get pods --subresource=log
### Check to see if I can do everything in my current namespace ("*" means all)
kubectl auth can-i '*' '*'
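You can also list every action the current identity is allowed to perform, or check a permission on behalf of another user; the user name and namespace below are placeholders.
### List all actions allowed for the current user in the current Namespace
kubectl auth can-i --list
### Check whether a specific user can read Pods in a specific Namespace
kubectl auth can-i get pods --as [user-name] -n [my-namespace]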
Jobs stuck in the Staged activity
If DQ Jobs are stuck in the Staged activity on the Jobs page, ensure that you are using the Helm chart that corresponds to the latest Data Quality & Observability Classic version, then update the following properties:
--set global.configMap.data.metastore_max_wait_time=10000
--set global.configMap.data.metastore_max_active=2000
--set global.configMap.data.metastore_max_idle=100
--set global.configMap.data.metastore_initial_size=150
Restart the Web and Agent pods after you set the recommended properties.
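As a sketch of how these flags might be applied, assuming the release name, chart reference, and namespace below are placeholders for your own deployment and that --reuse-values is appropriate for preserving your other settings:
helm upgrade [my-release-name] [chart-repo-or-path] \
  -n [my-namespace] \
  --reuse-values \
  --set global.configMap.data.metastore_max_wait_time=10000 \
  --set global.configMap.data.metastore_max_active=2000 \
  --set global.configMap.data.metastore_max_idle=100 \
  --set global.configMap.data.metastore_initial_size=150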
Jobs fail with OOMKilled error
If DQ Jobs fail with the error OOMKilled, you may need to allocate additional memory overhead. Follow these steps:
- Increase the executor memory by modifying the -executormemory option.
- In the -conf list of properties, add the property spark.kubernetes.memoryOverheadFactor=0.4 (see the example below).
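As a hypothetical illustration only, where the 4g value is an example and the exact option syntax depends on how your DQ Job is configured, the relevant portion of the job's options might look like:
-executormemory 4g -conf spark.kubernetes.memoryOverheadFactor=0.4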