Compute Node Maintenance on IBM Storage Fusion HCI
What is node maintenance?
Node maintenance allows cluster administrators to gracefully power down a node by first moving its workloads to other parts of the cluster, so that those workloads remain healthy and uninterrupted.
Why is compute hardware maintenance required on IBM Storage Fusion HCI?
For the IBM Storage Fusion HCI appliance to provide the expected business value and remain operational, the hardware platform must be kept healthy and up-to-date with regular component upgrades and maintenance.
The appliance is designed and equipped to avoid outages caused by single hardware component failures by using redundant network connections, power supplies, fans, and so on. Whenever there is a loss of redundancy, compute node hardware parts can be repaired or replaced without cluster downtime.
While a failed node is being repaired or replaced, the workload on that node must be moved to other parts of the cluster to ensure service continuity. The IBM Storage Fusion HCI compute maintenance feature serves this purpose.
Generic use cases of node maintenance:
1. A hardware part replacement that requires a reboot of the node.
2. A node reboot during operations such as a firmware upgrade, software upgrade, or software patching.
How are nodes put into maintenance mode on IBM Storage Fusion HCI?
There are two ways in which a node is moved into maintenance mode on IBM Storage Fusion HCI:
• Explicit user action: Node maintenance from the IBM Storage Fusion HCI UI (Nodes > Enable maintenance action).
• Implicit operations: Operations that require a server reboot, such as a firmware upgrade of the node from the IBM Storage Fusion HCI GUI.
Note: Node maintenance performed directly through Red Hat OpenShift is unrelated to IBM Storage Fusion HCI node maintenance; it is summarized below only for comparison.
What is OpenShift node maintenance?
Red Hat OpenShift node maintenance can be performed using the “Mark as unschedulable” action from the Red Hat OpenShift Console > Compute > Nodes menu.
This marks the node status as “Scheduling disabled”, and the following taint is added to the node:
- key: node.kubernetes.io/unschedulable
  effect: NoSchedule
The kube-scheduler then excludes this node when scheduling any new workload pods; pods already running on the node remain unaffected.
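You can confirm the taint from the OpenShift CLI, where <target-node> is a placeholder for the node name:
oc get node <target-node> -o jsonpath='{.spec.taints}'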
An alternative way to move a node into maintenance from Red Hat OpenShift is with the following commands in the OpenShift CLI (oc):
oc adm cordon <target-node>
oc adm drain <target-node>
These commands first block the scheduling of any new workload pods on the node (cordon), and then drain the node by evicting all of its running pods.
Note that the OpenShift CLI drains all pods from the node irrespective of the storage cluster health. The difference between this approach and the node maintenance action from the IBM Storage Fusion HCI menu is that the latter always verifies that the IBM Storage Scale storage cluster is healthy before cordoning the node, so that the GPFS quorum is not disturbed.
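In practice the drain usually needs extra flags, for example to skip DaemonSet-managed pods and to bound the wait time. The values below are illustrative only; on older oc versions, --delete-emptydir-data is spelled --delete-local-data:
oc adm drain <target-node> --ignore-daemonsets --delete-emptydir-data --timeout=300s
# When maintenance is done, make the node schedulable again:
oc adm uncordon <target-node>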
How is this achieved by IBM Storage Fusion HCI compute node maintenance?
1. The node is tainted to mark it as unschedulable for any new pods/workload.
2. All existing evictable pods on the node are drained.
3. Power operations are performed as deemed fit for the use case.
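As a quick illustration of the first two steps, you can watch the node's taints and the pods still running on it (placeholder commands, not a Fusion-documented procedure):
oc describe node <target-node> | grep -A3 'Taints:'
oc get pods --all-namespaces --field-selector spec.nodeName=<target-node>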
What can cause node maintenance to fail on IBM Storage Fusion HCI?
Failures can happen when prerequisites are not met.
Examples are:
1. A node is already in maintenance with the taints added by IBM Storage Fusion HCI.
2. The Scale cluster health is in a DEGRADED state.
3. A machine config operator rollout is in progress on the cluster.
4. Any compute node on the cluster is in a DEGRADED state.
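Some of these preconditions can be checked from the CLI before starting maintenance. The commands below are illustrations; the Scale pod name and namespace are placeholders that vary by installation:
oc get machineconfigpools
oc get nodes
# Scale cluster health, run from inside an IBM Storage Scale core pod:
oc rsh -n <scale-namespace> <scale-core-pod> mmhealth cluster show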
What can prevent nodes from going into maintenance mode?
If Red Hat OpenShift taints are already present on a node due to a machine config operator (MCO) rollout (typically during a Red Hat OpenShift upgrade), the node may be prevented from going into maintenance mode.
Nodes can also fail to drain for multiple reasons, for example, due to an active PodDisruptionBudget on an IBM Storage Scale core pod. In some cases, if an application pod is holding files open, the IBM Storage Scale core pod may be held up and not allowed to go down.
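To see which PodDisruptionBudgets might be blocking a drain, you can list them across all namespaces and look for entries whose allowed disruptions are 0 (an illustrative check, not a Fusion-specific procedure):
oc get poddisruptionbudgets --all-namespaces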
What is a Pod Disruption Budget (PDB)?
A PDB limits the number of pods of a replicated application that can be down simultaneously from voluntary disruptions. For example, a quorum-based application wants to ensure that the number of running replicas never drops below the number needed for a quorum. IBM Storage Scale is one such application: it sets a PDB on its core pods, one associated with each node, to protect the health of its quorum.
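For illustration, here is a minimal PodDisruptionBudget for a hypothetical quorum-based application; the names and counts are examples, not the actual IBM Storage Scale PDB:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-quorum-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: example-quorum-app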
How to identify that node maintenance has failed and how to recover:
If the operation exceeds the configured maintenance window (60 minutes), the node maintenance operation has most likely failed. A warning event (BMYCO0011) is raised on the Events page of the IBM Storage Fusion user interface whenever maintenance exceeds this window.
IBM Storage Fusion continues to drain pods from the node until you stop the node maintenance by deleting the computeMaintenance CR. Whenever the maintenance operation succeeds, an information event (BMYCO0012) is raised.
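A sketch of the recovery, where the namespace and CR name are placeholders to be replaced with the values from your installation:
oc get computemaintenance -n <fusion-namespace>
oc delete computemaintenance <cr-name> -n <fusion-namespace>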
Difference in node maintenance between Red Hat OpenShift and IBM Storage Fusion HCI:
|              | Red Hat OpenShift | IBM Storage Fusion HCI |
| Taints added | unschedulable: true; taints: - key: node.kubernetes.io/unschedulable effect: NoSchedule | unschedulable: true; taints: - key: isf.compute.fusion.io/drain effect: NoSchedule; - key: node.kubernetes.io/unschedulable effect: NoSchedule |
| Pod eviction | Pods will not be evicted. | Existing pods on the node are evicted. Pods that tolerate the taints added to the node continue to exist and restart if the node reboots. |
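To tell which flow put a node into maintenance, list its taint keys; per the table above, the isf.compute.fusion.io/drain key indicates IBM Storage Fusion HCI maintenance:
oc get node <target-node> -o jsonpath='{range .spec.taints[*]}{.key}{"\n"}{end}'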
Conclusion:
It is always advisable to perform node maintenance from the IBM Storage Fusion HCI user interface, as it ensures zero downtime and no data loss.
References:
https://www.ibm.com/docs/en/sfhs/2.7.x?topic=racks-administering-node
https://www.ibm.com/docs/en/sfhs/2.7.x?topic=system-compute-events-error-codes
https://www.ibm.com/docs/en/sfhs/2.7.x?topic=tiisfhs-issues-related-storage-fusion-hci-system-node-drains
https://www.ibm.com/docs/en/scalecontainernative?topic=troubleshooting-identifying-applications-preventing-cluster-maintenance
Acknowledgement:
Sincere thanks to Joe Wigglesworth, Sathyanarayana Ramadas, Shajeer Mohammed and Shyamala Rajagopalan for helping review the article.