In a real-world customer environment, maintaining the health and security of OpenShift Container Platform (OCP) is crucial. Regular upgrades are essential to address security vulnerabilities, introduce new features, improve compatibility, fix bugs, and ensure stability. Moreover, adhering to Red Hat's official support policy is vital for continued access to technical assistance and updates. However, during the upgrade process, certain Cloud Pak for Business Automation (CP4BA) settings may inadvertently interfere with OCP upgrades. One such setting is PodDisruptionBudget (PDB) protection, which can prevent OCP from automatically evicting specific pods. This can block node restarts, degrade nodes, and potentially cause the OCP upgrade to fail.
This blog post provides a comprehensive guide on handling such situations and preventing pod eviction failures during future upgrades. By understanding and addressing these challenges, you can ensure a smoother upgrade process and maintain the optimal performance of your OCP environment.
In this example, Business Automation Workflow Authoring 24.0.1 and Business Automation Workflow Runtime 24.0.1 are deployed in two separate namespaces, bawaut and bawrun. Both deployments are medium sized and use EDB PostgreSQL as their database. The cluster is currently on OpenShift Container Platform (OCP) 4.18.4, and the objective is to upgrade to the most recent 4.18 patch (z-stream) release. To initiate the upgrade, navigate to the OCP console and select "Cluster Settings" under the "Administration" section. The available upgrade target is then displayed, showing that the latest 4.18 patch release, 4.18.8, is ready for upgrade.
Click "Update" to trigger the OCP upgrade.
During the upgrade process, it is recommended to simultaneously log in to the cluster with oc login from a Linux or macOS terminal and check the status of the ClusterOperators with the command:
oc get co
The following screenshot shows the cluster status before the upgrade.
During the upgrade process, all ClusterOperators will be upgraded, and their versions will change to 4.18.8. The following screenshot shows the status during the upgrade.
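In addition to oc get co, the overall progress and the target version can be followed through the ClusterVersion object; both of the commands below are standard oc calls:
oc get clusterversion
oc adm upgrade
Running oc adm upgrade with no arguments prints the current update status and any recommended updates.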
It is important to note that machine-config is always the last ClusterOperator to be upgraded, and it requires special attention. The following section explains the details.
When machine-config is being upgraded, the end user should check the pod status in the openshift-machine-config-operator namespace. Once all pods are recreated, OCP will begin draining master and worker nodes for node reboots.
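For example, the operator pods and the rollout across the MachineConfigPools can be watched with:
oc get pod -n openshift-machine-config-operator
oc get mcp
The UPDATED, UPDATING, and DEGRADED columns of oc get mcp show how far the node reboots have progressed.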
However, if some pods are protected by PodDisruptionBudgets (PDBs) at this stage, they will not be evicted automatically, which prevents the node from rebooting. This ultimately causes the node's MachineConfigPool to report it as degraded, which can be checked using:
oc get mcp
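To see why a pool is degraded and which node is blocked, describe the pool; worker is the default pool name and may differ if custom pools are defined:
oc describe mcp worker
The Conditions section of the output typically names the node that failed to drain and the pod that could not be evicted.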
Encountering this issue does not automatically imply that the upgrade has failed; it can still be salvaged. The initial step involves pinpointing the node that is unable to be drained.
oc get node
Typically, the node whose status shows SchedulingDisabled is the one affected. Next, check which pods on that node cannot be deleted:
oc get pod -A -o wide | grep <worker_node_name>
In this example, the focus is on the pods created by CP4BA. It was found that bawrun contains icp4adeploy-dba-rr-2034f07d0f, which has not been evicted. Further inspection revealed that this pod is protected by a PDB.
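To confirm which PDB is blocking the eviction, the PDBs in the affected namespace can be listed and inspected; the PDB name is deployment specific, so a placeholder is used here:
oc get pdb -n bawrun
oc describe pdb <pdb_name> -n bawrun
A PDB whose ALLOWED DISRUPTIONS value is 0 (for example, maxUnavailable: 0, or minAvailable equal to the current replica count) blocks every voluntary eviction, which is exactly what prevents the drain from completing.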
Additionally, checking the logs of the machine-config-controller in the openshift-machine-config-operator namespace will reveal the same issue:
oc logs -l k8s-app=machine-config-controller -n openshift-machine-config-operator | tail -10
In this case, the end user needs to manually delete the icp4adeploy-dba-rr-2034f07d0f pod:
oc delete pod icp4adeploy-dba-rr-2034f07d0f -n bawrun
After the deletion, check the logs of the machine-config-controller again.
During the OCP upgrade, the status of the affected node (worker12 in this example) will transition from Ready,SchedulingDisabled to NotReady,SchedulingDisabled, then back to Ready,SchedulingDisabled, and finally to Ready. OCP will then proceed to drain the next node. While nodes are being drained, you can proactively identify and manually delete any pods that cannot be evicted before a node enters a Degraded state, helping to prevent node degradation, as shown in the example below.
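One way to do this, assuming the watch utility is available in your terminal, is to keep the node status, pool status, and controller logs in view while the drain runs:
watch -n 30 'oc get nodes; echo; oc get mcp; echo; oc logs -l k8s-app=machine-config-controller -n openshift-machine-config-operator --tail=5'
Any pod that repeatedly appears in the eviction errors can then be deleted manually before the pool reports the node as degraded.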
The following screenshots show a successful OCP upgrade to 4.18.8.
Conclusion
In this blog post, we guided you through the OCP upgrade process, focusing on the crucial aspect of managing PodDisruptionBudgets (PDB) with Cloud Pak for Business Automation (CP4BA). Effective PDB management during the upgrade is essential for maintaining system stability and reducing downtime. By tackling PDB considerations proactively, you can avert disruptions and ensure a smooth upgrade experience. With meticulous planning and adherence to best practices, upgrading OCP alongside CP4BA becomes a more dependable and efficient endeavor.
References:
https://www.ibm.com/docs/en/cloud-paks/cp-biz-automation/24.0.1?topic=upgrading-red-hat-openshift-container-platform
https://kubernetes.io/docs/tasks/run-application/configure-pdb/