IBM TechXchange Storage Scale (GPFS) Global User Group


 Recovering data from a partially broken GPFS 3.5 cluster (all NSDs in unrecovered state)

David Rebatto posted Fri October 17, 2025 09:24 AM

Hi everyone,

I’m trying to recover as much data as possible from a very old and partially broken GPFS 3.5.0-26 cluster composed of three NSD servers.

  • One server only provides a tie-breaker disk (no data or metadata).

  • The other two each provide one metadata and four data NSDs, and are configured as cNFS servers.

  • The file system is configured with replica=2 for both data and metadata.

After a power outage, the two data servers started showing issues:

  • One had a disk with multiple read errors and some bad sectors.

  • The other came up with all its NSDs (both data and metadata) in ‘down’ state.

Unfortunately, this went unnoticed for several days, and users continued accessing the file system despite occasional I/O errors.

When I discovered the issue, I suspended the faulty disk and issued mmchdisk start on the ‘down’ NSDs.
During the metadata scan, GPFS triggered an SGPanic on the servers with the ‘up’ disks, causing an automatic reboot.
As a result, the mmchdisk operation failed, leaving those disks in ‘unrecovered’ state.
Moreover, after the reboot, all NSDs on the second server were also in ‘down’ state.
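
For reference, this is roughly the sequence I ran (the file system name gpfs0 and the NSD names below are placeholders, not the real ones):

    # suspend the disk that was showing read errors / bad sectors
    mmchdisk gpfs0 suspend -d "nsd_data_03"

    # attempt to bring the 'down' NSDs back online (this is the step that panicked)
    mmchdisk gpfs0 start -d "nsd_meta_02;nsd_data_05;nsd_data_06;nsd_data_07;nsd_data_08"

    # check the resulting disk states (this is where I now see 'unrecovered')
    mmlsdisk gpfs0 -L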

Now, any subsequent attempt to start or recover the NSDs fails, and all of them are stuck in ‘unrecovered’ state.

File system–level commands (mmlsfs, mmrepquota, etc.) still respond, so the file system doesn’t seem completely lost.

👉 Question:
Is there any way to forcibly mount the file system — even at the risk of losing some data — just to extract whatever is still readable?
Or, alternatively, any procedure to bring NSDs back to a minimal functional state?
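
In case it helps frame the question, the kind of thing I had in mind is something like the following (untested, gpfs0 is again a placeholder, and I realize it may be unsafe on disks in this state):

    # force-start all disks without fixing the underlying problem (risky?)
    mmchdisk gpfs0 start -a

    # check-only file system scan, no repairs
    mmfsck gpfs0 -n

    # then try a read-only mount to copy data off
    mmmount gpfs0 -o ro

I’d appreciate any advice on whether this is a reasonable path or whether it would make things worse.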

Thanks in advance for any suggestions.