IBM TechXchange Storage Scale (GPFS) Global User Group

  • 1.  GPFS split-brain condition?

    Posted Fri December 27, 2019 08:48 AM
    Edited by Lou White Tue December 31, 2019 07:36 AM
    Posting this query here given the latest IBM Community Forum migration announcement

    I recently created a 7-node GPFS 5.0.4.0 cluster on zLinux for testing purposes, with a single quorum node.  The cluster was working as expected until I decided to use 'mmchcluster --ccr-enable' to configure a cluster configuration repository (CCR).

    Upon running the 'mmchcluster --ccr-enable' command, I received several errors reporting that the new cluster configuration could not be propagated to various nodes, even though all nodes have the latest version of /var/mmfs/gen/mmsdrfs based on both time stamp and file size.

     

    The cluster is now in a state where 'mmstartup' fails on the node originally designated as the primary cluster configuration server, with:

    "get file failed: Not enough CCR quorum nodes available (err 809)

    gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158

    mmstartup: Command failed. Examine previous error messages to determine cause."

     

    Additionally, the node I ran the 'mmchcluster --ccr-enable' command from will start the GPFS daemon upon issuing 'mmstartup', but 'mmgetstate' returns a state of "arbitrating".

     

    In an attempt to recover quorum, I ran 'mmchnode --quorum -N <node_name>' on the one cluster node where the GPFS daemon is running (in the arbitrating state), which returns:

    "mmchnode: Unable to obtain the GPFS configuration file lock. Retrying ...

    mmchnode: Unable to obtain the GPFS configuration file lock.

    mmchnode: GPFS was unable to obtain a lock from node

    mmchnode: Command failed. Examine previous error messages to determine cause."

     

    Using 'mmcommon showLocks', I was able to determine which node holds the configuration file lock and found that an mmSdrLock is active. However, I have been unable to free the lock with 'mmcommon freeLocks mmSdrLock'; not surprisingly, it fails with the following error:

    "vput failed: Not enough CCR quorum nodes available (err 809)

    setRunningCommand: Unexpected error from ccr vput mmRunningCommand .  Return code: 158"

     

    Looking forward to any thoughts regarding recovery options.  Thank you!



    ------------------------------
    Lou White
    ------------------------------


  • 2.  RE: GPFS split-brain condition?

    Posted Tue December 31, 2019 02:08 PM
    Edited by Lou White Fri January 03, 2020 02:38 PM
    Issue resolved after reviewing this useful post:

    https://www.ibm.com/developerworks/community/forums/html/topic?id=7ca10918-7bdd-4327-956e-d392dc651ae0

    It turned out that the Linux firewalld service wasn't configured according to the prescriptive guidance in the Spectrum Scale Administration Guide.
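
    For anyone hitting the same err 809 symptoms, here is a minimal sketch of the kind of firewalld change that may be involved. It is an illustration under assumptions, not the exact fix applied to this cluster: it assumes the GPFS daemon uses its default port 1191/tcp and that the cluster interfaces sit in the firewalld zone "public". The run helper and the GPFS_PORT, ZONE and APPLY variables are hypothetical names introduced for this example. By default the script only prints the commands; set APPLY=1 to execute them as root on each node.

```shell
#!/usr/bin/env bash
# Sketch: allow GPFS daemon traffic through firewalld.
# Assumptions (not stated in the thread): default daemon port 1191/tcp
# and the "public" zone. Override GPFS_PORT / ZONE to match your cluster.
GPFS_PORT="${GPFS_PORT:-1191}"
ZONE="${ZONE:-public}"

# Hypothetical helper: print the command (dry run) unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Permanently open the daemon port in the chosen zone.
run firewall-cmd --permanent --zone="$ZONE" --add-port="${GPFS_PORT}/tcp"
# Load the permanent configuration into the running firewall.
run firewall-cmd --reload
# List the open ports so the change can be verified.
run firewall-cmd --zone="$ZONE" --list-ports
```

    The ports that must actually be open depend on the cluster configuration (for example, any port range configured for GPFS administration commands), so check the firewall section of the Administration Guide before applying anything like this.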



    ------------------------------
    Lou White
    ------------------------------