Resolving Cloud Pak for Data Volume Multi-Attach Errors: A Step-by-Step Guide
1. Introduction
- This document provides a step-by-step guide to troubleshooting a volume multi-attach error that prevents pods from initializing in a Cloud Pak for Data cluster.
- This error occurs when a Persistent Volume Claim (PVC) in ReadWriteOnce (RWO) mode is already attached to one node, and OpenShift is attempting to attach it to another.
- The goal is to diagnose the root cause, typically an orphaned or zombie container, and resolve the issue to allow the pod to start successfully.
2. Symptom
A pod is stuck in Init status with an error message like the following.
Multi-Attach error for volume "pvc-xxx-yyy" Volume is already exclusively attached to one node and can't be attached to another.
In this example, the pod elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 is stuck in Init status.
In the OpenShift web console, the same error appears on the pod's Events tab.
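You can also surface the same event from the command line. Assuming you have already switched to the project the pod runs in, either of the following should show it:
oc get events | grep -i "multi-attach"
oc describe pod elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 | grep -A 2 "Multi-Attach"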

3. Understanding the Problem
Root Cause: The primary cause is the existence of "orphaned" or "zombie" containers. These are containers that continue running on a node even after their corresponding pod has been deleted from the OpenShift (or Kubernetes) API.
Persistent Volume Claims (PVCs): A PVC is a user's request for storage; it binds to a Persistent Volume (PV) that pods mount. A volume in ReadWriteOnce (RWO) mode can be mounted read-write by only a single node at any given time, so a leftover mount held by a zombie container on one node blocks the volume from attaching to the node where the new pod is scheduled.
Several factors can lead to orphaned containers (a sketch for spotting them follows this list):
- Issues with the kubelet or container runtime.
- Communication problems between the API server and the kubelet, which can delay or lose the delete command.
- Stuck termination processes, for example runtime bugs in CRI-O, Docker, or containerd that prevent a proper container shutdown.
- A container that fails to handle termination signals or is slow to shut down gracefully.
- Network or node problems.
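As a rough way to spot such containers, you can compare the pod sandboxes that CRI-O still knows about on a node with the pods the API server believes are scheduled there. This is a minimal sketch rather than an official procedure; it assumes cluster-admin access, that jq is available where the first command runs, and that <node> is replaced with the actual node name:
# On the node (oc debug node/<node>, then chroot /host): list running pod sandboxes known to CRI-O
crictl pods --state ready -o json | jq -r '.items[].metadata.name' | sort > /tmp/crio-pods.txt
# From a workstation: list pods the API server has scheduled on that node
oc get pods -A -o wide --field-selector spec.nodeName=<node> --no-headers | awk '{print $2}' | sort > /tmp/api-pods.txt
# After copying both lists to one place, names that appear only in the CRI-O list are zombie candidates
comm -23 /tmp/crio-pods.txt /tmp/api-pods.txt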
4. Troubleshooting Steps
Step 1: Identify the Node the Volume is Attached To
Use the following command to find the node where the PVC is attached:
oc get volumeattachment | grep -i pvc | grep pvc-3a7df857-c1ef-4ff0-b7c3-28c109897e36
Note: Replace pvc-3a7df857-c1ef-4ff0-b7c3-28c109897e36 with the volume name of the affected PVC (the pvc-... value shown in the error message).
Output Example:
csi-9c5e2fe8366ec733895654cebe6642806badd2dc94a64a113bfb5b68dccd5d44 openshift-storage.rbd.csi.ceph.com pvc-3a7df857-c1ef-4ff0-b7c3-28c109897e36 10.240.64.99 true 8d
As we can see from the output, 10.240.64.99 is the node to which the volume is attached.
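If you only know the PVC name and its namespace rather than the pvc-... volume name, you can look up the bound volume first; the PVC name and namespace below are placeholders:
oc get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.volumeName}'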
Step 2: Identify the Node the Pod is Scheduled On
Use this command to find the node where the problematic pod is scheduled:
oc get pods -o wide | grep elasticsea-0ac3-ib-6fb9-es-server-esnodes-2
Output Example:
elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 0/2 Init:0/3 0 48m <none> 10.240.64.102 <none> <none>
As we can see from the output, the pod elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 is scheduled on node 10.240.64.102, which is different from node 10.240.64.99, to which the volume pvc-3a7df857-c1ef-4ff0-b7c3-28c109897e36 is attached.
Note: Replace elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 with the relevant pod name.
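Alternatively, a jsonpath query returns the scheduled node directly:
oc get pod elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 -o jsonpath='{.spec.nodeName}'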
Step 3: Check for Zombie Containers
1) Access the node where the volume is attached
Access the node using the following command:
oc debug node/10.240.64.99
A debugging pod will be created.
When you see a message like the one below, you are connected to node 10.240.64.99 and can run commands on it.
To use host binaries, run `chroot /host`
Pod IP: 10.240.64.99
If you don't see a command prompt, try pressing enter.
Once you are inside the debug pod, you'll need to access the host environment:
sh-4.4# chroot /host
2) Check for a zombie container
Check for a zombie container using the following command.
sh-4.4# crictl ps | grep elasticsea-0ac3-ib-6fb9-es-server-esnodes-2
Note: Replace elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 with the relevant pod name.
Output Example:
d3eccbfca3696 1298984fb6ba880b8792d1046b914b45192d8af0aa610ea222f08560b93dc2dc 7 days ago Running elasticsearch 0 9d78d517a9101 elasticsea-0ac3-ib-6fb9-es-server-esnodes-2
As we can see, a container belonging to a pod with the same name as the problematic pod is still running on the node, which confirms the existence of a zombie container.
The first column, d3eccbfca3696, is the container ID; the value 9d78d517a9101 in the POD ID column is the pod sandbox ID, which you will need in the next step.
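Optionally, you can inspect the pod sandbox before removing anything to confirm its state; this is only a sanity check:
sh-4.4# crictl inspectp 9d78d517a9101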
Step 4: Resolve the Zombie Container
Attempt to stop and remove the zombie pod sandbox using the pod sandbox ID from the previous step:
sh-4.4# crictl stopp 9d78d517a9101
sh-4.4# crictl rmp 9d78d517a9101
If the stop command fails with an error like rpc error: code = DeadlineExceeded desc = context deadline exceeded, it indicates the container is not responding and you will need to restart the crio service.
Important Note: Restarting crio will impact all pods running on the node. Schedule a maintenance window to avoid disruptions to production usage.
Restart crio using the following command:
sh-4.4# systemctl restart crio
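After the zombie sandbox is removed (or crio is restarted), verify that the stale attachment is released and that the pod progresses past Init. The attach/detach controller can take a few minutes to reconcile, so repeat the checks if needed:
oc get volumeattachment | grep pvc-3a7df857-c1ef-4ff0-b7c3-28c109897e36
oc get pods -o wide | grep elasticsea-0ac3-ib-6fb9-es-server-esnodes-2
The volume attachment should eventually point to the node where the pod is scheduled (10.240.64.102 in this example), and the pod should leave Init status.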
5. General Advice
It’s generally advisable to avoid force deletion when possible. Force deleting a pod bypasses the normal graceful shutdown sequence, which can leave underlying resources, like persistent volumes, in an uncertain state. This can trigger multi-attach errors if the volume is mistakenly left attached to a node.
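For example, a plain delete lets the kubelet run the normal shutdown sequence, while a force delete only removes the pod from the API and can leave its containers, and therefore the volume attachment, behind on the node (the pod name below is the one from this example):
# Normal delete: waits for graceful termination on the node
oc delete pod elasticsea-0ac3-ib-6fb9-es-server-esnodes-2
# Force delete: removes the pod from the API immediately even if containers are still running; avoid unless necessary
oc delete pod elasticsea-0ac3-ib-6fb9-es-server-esnodes-2 --force --grace-period=0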
Review the resource usage of your nodes and cluster. Insufficient resources can lead to various issues, including pods failing to start. If resources are consistently low, consider scaling up your cluster by adding more nodes or increasing the resources available to existing nodes.
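For a quick view of node utilization, you can use the cluster metrics (assuming the monitoring stack is running) together with the per-node allocation summary:
oc adm top nodes
oc describe node 10.240.64.99 | grep -A 8 "Allocated resources"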
By following these steps, you should be able to identify and resolve volume multi-attach issues caused by orphaned containers, allowing your pods to start successfully.