watsonx.data

 View Only
  • 1.  Resolving the wxdaddon Upgrade Issue

    Posted 12 days ago

    The problem: wxdaddon gets stuck on upgrade

    • Upgrading Cloud Pak for data from 4.8.2 to 4.8.5.

    •  When running the apply-cr step, cpd-cli failed when updating wxdaddon to the latest version. 

    Background:

    The wxdaddon upgrade in Cloud Pak for Data was stuck due to issues with the Presto engine. Detailed investigation revealed the root cause tied to these pod failures.

    Check status of the pod:

    1. Run the following command:
    oc describe pod ibm-lh-lakehouse-presto-01-single-blue-0

    2. Check the output:
    Events:

     Type     Reason    Age                 From     Message

     ----     ------    ----                ----     -------

     Normal   Killing   42m                 kubelet  Container ibm-lh-lakehouse-presto failed liveness probe, will be restarted

     Normal   Pulled    42m                 kubelet  Container image "cp.icr.io/cp/watsonx-data/ibm-lh-presto@sha256:7b31c176a1ba13eeec4fb2c0f577f163d87964a77faca7e64279d7ff6c609995" already present on machine

     Normal   Created   42m                 kubelet  Created container ibm-lh-lakehouse-presto

     Normal   Started   42m                 kubelet  Started container ibm-lh-lakehouse-presto

     Warning  BackOff   10m (x15 over 13m)  kubelet  Back-off restarting failed container ibm-lh-lakehouse-presto in pod ibm-lh-lakehouse-presto-01-single-blue-0_cpd-instance(bfd3b36a-4249-4bc4-bc0d-955848255146)

     Warning  Unhealthy 5m42s (x26 over 44m) kubelet Liveness probe failed

     Warning  Unhealthy 42s (x108 over 45m)  kubelet Readiness probe failed: dial tcp 10.129.2.63:8443: connect: connection refused

    Output Analysis:

    • Liveness Probe Failures: The Presto container repeatedly failed its liveness probe, causing Kubernetes to restart it continuously.

    • Readiness Probe Failures: The container also failed readiness probes, indicating it was not ready to accept traffic due to connection issues.

    • BackOff Events: Kubernetes applied a back-off mechanism due to repeated failures, delaying further restart attempts.

    These failures prevented the Presto engine from stabilizing, leading to the wxdaddon upgrade being stuck.

    Another approach to check the pod's health:

    oc get po | grep presto

    ibm-lh-lakehouse-presto-01-single-blue-0                          1/1     Running     0          39h

    If the pod shows up as 1/1, that means the pod is healthy

    Root Cause of the Issue:

    Misconfiguration of the AWS glue database caused the wxdaddon to not update.

    Solution steps:

    1. Provisioning New Presto Instance  without connecting it to any catalog in the infrastructure manager.
    2. Attaching the new Presto engine to an existing catalog/bucket pair.

    3. Removed the old Presto engine (ID 1) via the infrastructure manager.

    4. The new Presto engine (ID 747) replaced the old Presto engine (ID 1) following its deletion.

    5.  Identified that the ibm-lh-lakehouse-presto-01-single-blue-0 pod was failing as mentioned earlier.
    6. Operator Pod Intervention:

        • Killed the operator pod to expedite the start of all playbooks, avoiding a longer wait time.

        • Commands used:

          • oc get pods -n ${PROJECT_CPD_INST_OPERATORS} | grep ibm-lakehouse

          •  oc delete pod -n ${PROJECT_CPD_INST_OPERATORS} <pod_name>

    7. Completion of wxdaddon Update: The new operator pod completed the necessary work, and the wxdaddon update was successfully completed.


    #watsonx.data
    #PrestoEngine

    ------------------------------
    Eric Zakharian
    IBM watsonx.data Technical Support
    ------------------------------


  • 2.  RE: Resolving the wxdaddon Upgrade Issue

    Posted 5 days ago

    Eric, this is great. Thank you for sharing your findings with the community! 



    ------------------------------
    Austin Rexroat IBM AI Community Manager
    ------------------------------