IBM Spectrum Scale MMFSCK – Savvy Enhancements

By Archive User posted Fri January 05, 2018 11:31 AM

  
Over the years many enhancements have been made to the mmfsck tool in three areas: performance, functionality, and usability. In this blog I highlight the cool features that were added to mmfsck in the past four IBM Spectrum Scale releases.



Introduction


File System ChecKer (FSCK) is an IBM Spectrum Scale (GPFS) tool that checks and repairs a GPFS file system.
FSCK is invoked via the command 'mmfsck'.

The primary job of mmfsck is to scan all metadata in a given file system and prompt the user to fix any inconsistencies that are discovered.
Such inconsistencies can occur for various reasons, such as:

    • Loss of data (including logs/journal) that was not yet committed to disk
    • Disk failures
    • Faulty hardware (disk, network card, memory)
    • Software errors in GPFS, e.g., incorrect recovery from node failures


When corrupt metadata is accessed during regular file system operations, any of the following symptoms can occur:

    • File system panic (SGPanic)
    • Assertions/signals causing node failure
    • Invalid data reported (“???” seen during ls)
    • Incorrect data returned
    • File system operation failing with I/O error
    • System logs show MMFS_FSSTRUCT log entries


An MMFS_FSSTRUCT log entry is a very good reason to suspect file system corruption. However, in some cases such entries may also be created for non-persistent or non-fatal errors.

Below is an example of an MMFS_FSSTRUCT log entry in the system logs:
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: Invalid disk data structure.
Error code 1108. Volume fs1 . Sense Data
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 04 54 00 01 00 00 00 03
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 00 00 00 11 7F E0 00
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 08 00 01 00 00 00 03
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 00 00 00 11 7F E0 00
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 00 10 00 00 00 00 00
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 00 00 04 09 00 20 03
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 00 00 00 00 00 00 00
...
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052: 00 00 00 00
Jan 1 05:54:36 node1 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=13386052:
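
These messages are emitted through the system log (as in the example above), so a quick way to check whether a node has recorded any such entries is to search the system log file, for example (assuming the common /var/log/messages location, which varies by distribution and syslog configuration):

grep MMFS_FSSTRUCT /var/log/messages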





If you see any of the above symptoms during regular file system operation (especially MMFS_FSSTRUCT log entries), the file system is likely corrupted and should be checked by running mmfsck.

The offline version of mmfsck is the last line of defense for a file system that cannot be used. The mmfsck command runs on the file system manager and reports status to the invoking node. It is mutually incompatible with any other use of the file system: it checks for any running commands and for any nodes that have the file system mounted, and exits if any are found.

The mmfsck command performs a full file system scan looking for metadata inconsistencies. This process can be lengthy on large file systems. It asks the user for permission to repair any problems that are found, which may result in the removal of corrupt files or directories. Hence, offline mmfsck should be run judiciously and under the guidance of the IBM® Support Center. The processing of this command is similar to that of fsck on other file systems.
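
As a rough sketch of the typical offline workflow (illustrative only, with fs1 as a placeholder device name, and as noted above best done under the guidance of the IBM Support Center), the file system is first unmounted everywhere and then checked in report-only mode before any repair is attempted:

mmlsmount fs1 -L     # list the nodes that still have fs1 mounted
mmumount fs1 -a      # unmount fs1 on all nodes
mmfsck fs1 -n        # check only, report inconsistencies without repairing
mmfsck fs1 -y        # repair pass, typically under IBM Support guidance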

For more information on mmfsck, please refer to the IBM Knowledge Center.



Features added in MMFSCK




1. In GPFS V4.2.1
a) mmfsck patch queue feature
Earlier, mmfsck used a fixed-length inode problem list to track file system corruptions on a per-inode basis. If the file system had a large number of corruptions, this problem list would overflow and the customer would need multiple mmfsck runs to completely repair the file system. The extra mmfsck runs also added to the file system downtime.

So, to solve this functionality issue, in 4.2.1 we added the patch queue feature to mmfsck. Now when mmfsck detects a corruption, instead of adding an entry to the inode problem list, it enqueues a patch describing the corruption into a global patch queue. A patch is a complete and self-contained description of a single corruption in the file system; the data in a patch is sufficient both to repair the corruption it represents and to describe the corruption in a user-understandable form. To keep the patch queue from overflowing, a patch dequeue thread runs in parallel in a continuous loop, looking for entries in the patch queue; when it finds one, it dequeues and processes the entry.

Thus, with this feature the need to run mmfsck more than once to fix problems has mostly been eliminated.
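
Related to this, offline mmfsck in these releases also supports a patch file option (it is referenced as a workaround in the defect list later in this blog), where detected corruptions can be recorded to a file during a check-only run and applied in a later repair run. A minimal sketch, assuming a placeholder device fs1 and patch file path (exact option behavior may vary by release):

mmfsck fs1 -n --patch-file /tmp/fs1.patch        # check only; record each corruption as a patch
mmfsck fs1 --patch --patch-file /tmp/fs1.patch   # repair by applying the recorded patches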

2. In GPFS V4.2.2
a) Performance improvement for mmfsck directory scanning
The mmfsck tool did not scale well when the file system either had a substantial number of directories or had a highly uneven distribution of directory inode numbers in the inode space. This caused a performance bottleneck in the directory check phase of mmfsck.

So, to improve performance, in 4.2.2 we changed the way the code reads directories and directory blocks. Because directory scanning is only done in offline mmfsck, there was no reason to take locks on them, so to avoid unnecessary lock contention the code was changed to read directories and directory blocks locklessly. This in turn improved the performance of the directory scan.

b) Optimize performance by avoiding repeated reserved file metadata scan
We found that mmfsck was unnecessarily scanning the reserved file blocks twice, in separate mmfsck scan phases, to populate the in-core block allocation map (which mmfsck builds to check against the on-disk allocation map). So, to optimize performance, in 4.2.2 a change was made to avoid scanning the reserved file blocks twice, in turn reducing the overall file system scan time.

3. In GPFS V4.2.3
a) Improve performance by combining inode scans for multiple storage pools
When scanning a file system with multiple storage pools, mmfsck had to scan all the inodes in the file system multiple times (once for each storage pool). This was required to populate the in-memory version of the block allocation map to check against the on-disk version. There is one block allocation map for each storage pool in the file system, and mmfsck built the in-memory block allocation map for each storage pool in a separate iteration, where each iteration incurred a scan of all inodes in the file system. The multiple inode scans increased mmfsck's time to completion.

So, to solve this performance issue, in 4.2.3 a change was made to create a single in-memory version of the block allocation map that includes segments from the block allocation maps of all the storage pools in the file system. This reduces mmfsck completion time by combining the inode scans for multiple storage pools into a single iteration.

b) Classify file system corruptions as Critical, Noncritical, or Harmless
When scanning a file system, mmfsck reports corruptions in the form of prompts to fix a specific problem and also as a summary list of inodes with corruptions. If any corruptions are found, mmfsck exits with a non-zero status. However, some corruptions are not serious enough to destabilize the file system, or amount to cleanup of incomplete metadata transactions (like lost blocks). Since mmfsck gave no indication of the severity of a corruption, a file system administrator had to assume that all corruptions reported by mmfsck were critical. This caused unnecessary alarm and possibly delayed downtime while the nature of the corruption was verified by IBM support.

So, to solve this usability issue, in 4.2.3 we classified the severity of corruptions reported by mmfsck as Critical, Noncritical, or Harmless. This helps the end user make informed decisions about scheduling downtime to repair the file system based on the severity of the corruptions. If the file system contains only harmless errors, the end user may decide to skip repairing such errors entirely.

Example mmfsck output:
The following command checks file system fs2 and displays inconsistencies but does not try to make repairs:

mmfsck fs2 -vn

The command displays output similar to the following example:
Checking "fs2"
FsckFlags 0x2000009
Stripe group manager
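
Because mmfsck exits with a non-zero status when corruptions are found (as noted above), a check-only run can also be scripted; a minimal sketch, assuming the same fs2 file system and a writable log path:

mmfsck fs2 -n > /var/tmp/fs2.fsck 2>&1
rc=$?
[ $rc -ne 0 ] && echo "fs2 reported inconsistencies (exit status $rc); review /var/tmp/fs2.fsck"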





4. In GPFS V5.0.0
a) Dump detailed current mmfsck scan state using mmfsck "--status-report"
While mmfsck is scanning a file system, the user can use 'mmfsadm dump fsck' to dump the overall mmfsck state. But in some cases it did not provide sufficient data about the current progress made by mmfsck, such as how much has been scanned so far and what each node is currently scanning. And when mmfsck was running on a large file system, or on a file system with many corruptions, it became difficult for IBM support and the end user to determine mmfsck's progress if mmfsck was not displaying any progress output, either because it was run with a lower verbosity level or because it was genuinely hung at a certain check.

So, to solve this usability issue, in 5.0.0 we enhanced the 'mmfsadm dump fsck' output to display more useful information on the current scan state of mmfsck on a file system. Also in 5.0.0 we introduced a new mmfsck option '--status-report', which displays a consolidated status report of the mmfsck scan state with information from all the nodes participating in the scan of the file system. While a long-running instance of mmfsck is running, you can run mmfsck with the '--status-report' option to verify that the long-running instance is still working and to get its current status.
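
For example, while a long-running offline scan of a placeholder file system fs1 is in progress, its state can be queried from a node in the cluster with:

mmfsck fs1 --status-report    # consolidated scan status from all participating nodes
mmfsadm dump fsck             # detailed scan state on the local node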




IBM Spectrum Scale MMFSCK Defect Fixes



Along with the above features, a few defect fixes were also made in mmfsck. Listed below are a few high-impact mmfsck defects fixed in recent IBM Spectrum Scale releases:

1. APAR IV99105 (D.1032034: Offline fsck does not repair all ind block replicas)
Offline mmfsck did not repair all indirect block replicas in reserved files, which led to more corruptions during file system use. This happened because internally mmfsck was using the wrong block size when comparing reserved file indirect block replicas, causing it to wrongly assume that mismatched indirect block replicas of reserved files were identical.
Defect fix is available with 4.2.3 PTF5

2. APAR IV95015 (D.1022080: Fsck reports false positive duplicate fragments when snapshot is corrupt)
Offline mmfsck reported false positive duplicate fragments when the file system had a snapshot with corrupt indirect blocks; fixing those false positives could then lead to data loss.
Defect fix is available with 4.2.3 PTF1

3. APAR IV95015 (D.1022077: Offline fsck repair with patch fails: err 214)
When fixing block allocation map corruption, offline mmfsck aborts with err 214 and the message "Allocation map bit existing value does not match expected value".
The workaround is to either disable the mmfsck patch queue feature using "mmdsh -N all mmfsadm test fsck usePatchQueue 0", or to do the offline mmfsck repair using the patch file feature.
Defect fix is available with 4.2.3 PTF1

4. APAR IV90865 (D.1010788: Offline fsck reports false positive lost blocks in mixed cluster)
In a mixed-version cluster with 4.2.2 and pre-4.2.2 nodes, offline mmfsck would in some cases report false positive lost blocks when a pre-4.2.2 node became the mmfsck master node.
Defect fix is available with 4.2.2 PTF1

5. APAR IV87569 (D.993411: PMR DE28480: Offline fsck hits dbgassert)
When running offline mmfsck on a file system that has one or more files with multiple trailing duplicate blocks, mmfsck would assert with 'iStatusP->lastValidDiskAddrs >= 0'.
Defect fix is available with 4.1.1 PTF9 and 4.2.0 PTF4






Thanks to my colleagues Felipe Knop, Karthik Iyer, Sandeep Ramesh, and Sasikanth Eda for their valued review and work on this topic.