The problem: wxdaddon gets stuck on upgrade
- Upgrading Cloud Pak for Data from 4.8.2 to 4.8.5.
- When running the apply-cr step, cpd-cli failed while updating wxdaddon to the latest version.
Background:
The wxdaddon upgrade in Cloud Pak for Data was stuck due to issues with the Presto engine. Investigation traced the root cause to the pod failures described below.
Check the status of the pod:
1. Run the following command:
oc describe pod ibm-lh-lakehouse-presto-01-single-blue-0
2. Check the output:
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Normal   Killing    42m                   kubelet  Container ibm-lh-lakehouse-presto failed liveness probe, will be restarted
  Normal   Pulled     42m                   kubelet  Container image "cp.icr.io/cp/watsonx-data/ibm-lh-presto@sha256:7b31c176a1ba13eeec4fb2c0f577f163d87964a77faca7e64279d7ff6c609995" already present on machine
  Normal   Created    42m                   kubelet  Created container ibm-lh-lakehouse-presto
  Normal   Started    42m                   kubelet  Started container ibm-lh-lakehouse-presto
  Warning  BackOff    10m (x15 over 13m)    kubelet  Back-off restarting failed container ibm-lh-lakehouse-presto in pod ibm-lh-lakehouse-presto-01-single-blue-0_cpd-instance(bfd3b36a-4249-4bc4-bc0d-955848255146)
  Warning  Unhealthy  5m42s (x26 over 44m)  kubelet  Liveness probe failed
  Warning  Unhealthy  42s (x108 over 45m)   kubelet  Readiness probe failed: dial tcp 10.129.2.63:8443: connect: connection refused
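In a long event list, the Warning events are the ones that matter here. The following is a minimal sketch of filtering them out; it uses a sample Events table standing in for the live command, so on a cluster you would pipe `oc describe pod <pod-name>` into the same awk/grep instead.

```shell
# Sketch: keep only Warning events from `oc describe pod` output.
# The sample below stands in for live output; the awk range prints
# everything from the "Events:" line to the end of the output.
sample_events='Events:
  Type     Reason     Age    From     Message
  Normal   Started    42m    kubelet  Started container ibm-lh-lakehouse-presto
  Warning  Unhealthy  5m42s  kubelet  Liveness probe failed
  Warning  Unhealthy  42s    kubelet  Readiness probe failed: connection refused'

printf '%s\n' "$sample_events" | awk '/^Events:/,0' | grep 'Warning'
```

This prints only the two Unhealthy warnings, which is usually enough to see whether the liveness or readiness probe is the one failing.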
Output Analysis:
- Liveness probe failures: The Presto container repeatedly failed its liveness probe, causing Kubernetes to restart it continuously.
- Readiness probe failures: The container also failed its readiness probe, indicating it was not ready to accept traffic due to connection issues.
- BackOff events: Kubernetes applied a back-off mechanism due to the repeated failures, delaying further restart attempts.
These failures prevented the Presto engine from stabilizing, leading to the wxdaddon upgrade being stuck.
Another approach to check the pod's health:
oc get po | grep presto
ibm-lh-lakehouse-presto-01-single-blue-0 1/1 Running 0 39h
If the READY column shows 1/1, all containers in the pod are ready and the pod is healthy.
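The READY check can also be scripted. This is a minimal sketch that parses the READY column (ready count vs. container count); the sample line stands in for live output, so on a cluster you would pipe `oc get po | grep presto` into the same awk instead:

```shell
# Sketch: a pod is healthy only when the ready count equals the
# container count in the READY column (e.g. 1/1, not 0/1).
line='ibm-lh-lakehouse-presto-01-single-blue-0   1/1   Running   0   39h'

echo "$line" | awk '{ split($2, r, "/"); print $1, (r[1] == r[2] ? "healthy" : "NOT healthy") }'
```

The same one-liner flags a `0/1` pod in CrashLoopBackOff as NOT healthy.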
Root Cause of the Issue:
A misconfigured AWS Glue database prevented wxdaddon from updating.
Solution steps:
- Provisioned a new Presto instance without connecting it to any catalog in the infrastructure manager.
- Attached the new Presto engine to an existing catalog/bucket pair.
- Removed the old Presto engine (ID 1) via the infrastructure manager.
- The new Presto engine (ID 747) replaced the old engine (ID 1) after its deletion.
- Identified that the ibm-lh-lakehouse-presto-01-single-blue-0 pod was failing, as described earlier.
- Operator pod intervention: a new operator pod completed the necessary work, and the wxdaddon update finished successfully.
#watsonx.data #PrestoEngine
------------------------------
Eric Zakharian
IBM watsonx.data Technical Support
------------------------------