
Spectrum Scale Node Failure Recovery Times

By GUANG LEI LI posted 9 days ago

GPFS runs a thread that periodically checks and/or renews leases. On the cluster manager (cfgmgr), this thread checks whether the other nodes are renewing their leases in a timely fashion and, if not, starts a failure protocol. On every other node, it checks whether it is time to renew the lease and, if so, sends a lease request message to the cluster manager.

Before declaring a node dead, GPFS pings it every 2 seconds (configurable through pingPeriod). Only the cfgmgr and quorum nodes send these pings. Each time, GPFS checks for a response to the previous ping and sends another one if there is still hope that the node is alive. GPFS stops pinging the node when any of the following conditions is met:

  1. The socket connection is broken.
  2. There are too many consecutive missed pings.
  3. totalPingTimeout is exceeded.
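
The three stop conditions can be sketched in Python as follows. This is only an illustration of the logic described above, not GPFS code: the PingState class, the function name and the field names are hypothetical, and the default limits (15 missed pings, 120 seconds) are simply the values from the sample output in this post.

```python
from dataclasses import dataclass


@dataclass
class PingState:
    """Hypothetical snapshot of the ping loop for one suspected node."""
    socket_connected: bool     # False once the socket connection is broken
    consecutive_missed: int    # pings in a row that got no reply
    elapsed_seconds: float     # time since pinging started


def reason_to_stop_pinging(state, max_missed_pings=15, total_ping_timeout=120.0):
    """Return why pinging should stop, or None if it should continue."""
    if not state.socket_connected:
        return "socket connection broken"
    if state.consecutive_missed >= max_missed_pings:
        return "too many consecutive missed pings"
    if state.elapsed_seconds >= total_ping_timeout:
        return "totalPingTimeout exceeded"
    return None
```

For example, a node that has missed 3 pings after 10 seconds is still pinged, while one that has missed 15 in a row is not.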

The parameters behind the second and third conditions are "missedPingTimeout" and "totalPingTimeout". Both can be found in the output of "mmfsadm dump cfgmgr", e.g.:

lease config: dynamic yes failureDetectionTime 35.0 usePR no recoveryWait 35 dmsTimeout 23
leaseDuration 35.0/23.3 renewalInterval 30.0/11.7 renewalTimeout 5.0 fuzz 3.00/1.17
missedPingTimeout 15x2.0=30.0 totalPingTimeout 60x2.0=120.0
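
If you want to pull these two values out of the dump in a script, a small regular expression is enough. The sketch below assumes the fields always look like the sample line above (name, ping count, "x", ping period, "=", seconds); that format is inferred from this one sample only, and the function name is hypothetical.

```python
import re

# Matches e.g. "missedPingTimeout 15x2.0=30.0" (format assumed from the sample above).
PING_FIELD = re.compile(
    r"(missedPingTimeout|totalPingTimeout)\s+(\d+)x([\d.]+)=([\d.]+)"
)


def parse_ping_timeouts(dump_text):
    """Extract ping count, ping period and timeout in seconds for each field."""
    result = {}
    for name, count, period, seconds in PING_FIELD.findall(dump_text):
        result[name] = {
            "pings": int(count),
            "period": float(period),
            "seconds": float(seconds),
        }
    return result


sample = "missedPingTimeout 15x2.0=30.0 totalPingTimeout 60x2.0=120.0"
timeouts = parse_ping_timeouts(sample)
```

Here timeouts["missedPingTimeout"]["seconds"] is 30.0 and timeouts["totalPingTimeout"]["pings"] is 60.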

missedPingTimeout specifies the maximum time for which consecutive missed pings are allowed. It is initially set to (leaseRecoveryWait - 5.0) for disk leasing without PR, or to leaseDuration/2 when disk fencing is used (PR enabled). It also honors the configurable bounds maxMissedPingTimeout and minMissedPingTimeout: it is adjusted to be no smaller than minMissedPingTimeout and no bigger than maxMissedPingTimeout, and the minimal value is 6*2.0=12 seconds. In the example above, missedPingTimeout is 15x2.0=30, which means that at most 15 consecutive missed pings are allowed and, with a ping period of 2 seconds, the total allowed time of consecutive missed pings is 30 seconds.
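
As a worked example of this rule, here is a minimal Python sketch. It is my reading of the description above, not the actual GPFS implementation: the function name is hypothetical, and I am assuming that the 6-ping minimum is applied as an absolute floor after the min/max bounds.

```python
PING_PERIOD = 2.0   # seconds, the default pingPeriod
MIN_PINGS = 6       # the "6*2.0=12 seconds" minimum, assumed to be an absolute floor


def missed_ping_timeout(lease_recovery_wait=35.0, lease_duration=35.0,
                        use_pr=False, min_missed=3.0, max_missed=60.0,
                        ping_period=PING_PERIOD):
    """Return the missed ping timeout in seconds under the rule described above."""
    # Initial value: leaseDuration/2 with PR, otherwise leaseRecoveryWait - 5.0.
    initial = lease_duration / 2 if use_pr else lease_recovery_wait - 5.0
    # Honor minMissedPingTimeout and maxMissedPingTimeout.
    bounded = max(min_missed, min(initial, max_missed))
    # Never go below 6 pings (assumption about the order of adjustments).
    return max(bounded, MIN_PINGS * ping_period)
```

With the defaults this gives 35 - 5 = 30 seconds, i.e. 30 / 2.0 = 15 pings, which matches the "15x2.0=30.0" in the sample output.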

totalPingTimeout is the maximum total time that pinging is allowed to continue. It is an external config parameter with a default value of 120 seconds. It is further adjusted to be no smaller than missedPingTimeout.
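
The adjustment of totalPingTimeout amounts to taking a maximum. A tiny sketch (the function name is hypothetical, not a GPFS internal):

```python
def effective_total_ping_timeout(configured_total=120.0, missed_ping_timeout=30.0):
    """totalPingTimeout is raised, if needed, so it is never below missedPingTimeout."""
    return max(configured_total, missed_ping_timeout)


# Number of 2-second ping periods covered by the default value.
default_pings = effective_total_ping_timeout() / 2.0
```

With the defaults the result stays at 120 seconds, i.e. 60 ping periods, matching "60x2.0=120.0" in the sample output. If totalPingTimeout were configured to 20 seconds while missedPingTimeout is 30, the effective value would become 30.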

In summary, missedPingTimeout and totalPingTimeout define the timeouts for the pings sent to confirm that a suspected node is really dead before we kick it out of the group and run the node recovery. If the node responds to pings, we wait up to 'totalPingTimeout' seconds after its lease runs out before we declare it dead. If the node does NOT respond to pings, we declare it dead sooner: if at any point within the time given by totalPingTimeout there is a period of 'missedPingTimeout' seconds in which we receive no ping replies, we declare the node dead. The minMissedPingTimeout and maxMissedPingTimeout parameters limit the missed ping timeout in case leaseRecoveryWait is set unusually small or unusually large. If pinging ends before leaseRecoveryWait has expired, the cluster manager expels the node from the cluster right away, but delays the node recovery until leaseRecoveryWait expires. The expelled node is not allowed to re-join the cluster until node recovery completes.
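
To make the interplay of the two timeouts concrete, here is a small simulation in Python. It is a simplified model of the description above, not GPFS code: one list entry stands for one ping, time is counted from the moment pinging starts (taken here to be the lease expiry), and leaseRecoveryWait is assumed to be measured from that same moment. All function names are hypothetical.

```python
def seconds_until_declared_dead(replies, ping_period=2.0,
                                missed_ping_timeout=30.0,
                                total_ping_timeout=120.0):
    """replies: one bool per ping, True if the node answered that ping.

    Returns the seconds after pinging started at which the node is declared
    dead, or None if the replies run out before either timeout is reached.
    """
    elapsed = 0.0
    silent = 0.0   # length of the current run of missed pings, in seconds
    for replied in replies:
        elapsed += ping_period
        silent = 0.0 if replied else silent + ping_period
        if silent >= missed_ping_timeout or elapsed >= total_ping_timeout:
            return elapsed
    return None


def recovery_start(declared_dead_at, lease_recovery_wait=35.0):
    """The node is expelled at once, but recovery waits for leaseRecoveryWait."""
    return max(declared_dead_at, lease_recovery_wait)


never_answers = seconds_until_declared_dead([False] * 100)
always_answers = seconds_until_declared_dead([True] * 100)
answers_then_dies = seconds_until_declared_dead([True] * 5 + [False] * 100)
```

In this model a node that never answers is declared dead after 30 seconds and its recovery starts at 35 seconds. A node that keeps answering but never renews its lease is declared dead after 120 seconds. A node that answers 5 pings and then goes silent is declared dead after 10 + 30 = 40 seconds.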

The external config parameters mentioned above are:
  • minMissedPingTimeout, default 3 seconds
  • maxMissedPingTimeout, default 60 seconds
  • totalPingTimeout, default 120 seconds
  • leaseRecoveryWait, default 35 seconds. It is not documented externally, so I don't recommend changing it.

The behavior of the different nodes after a lease expires is therefore:
  1. If the cluster manager (cfgmgr) detects that the lease for a node has expired, it pings that node. It does not declare the node failed until pinging has ended for that node. The cfgmgr then expels the node immediately. It starts the recovery for that node right away if leaseRecoveryWait has already expired; otherwise it waits until leaseRecoveryWait expires.
  2. If the node is a non-quorum node and it fails to renew its lease in time, it keeps trying to renew its lease from the cluster manager. If it fails to renew its lease before being expelled by the cfgmgr, it is not allowed to re-join until GPFS completes the node recovery for it.
  3. If the node is a quorum node but not the cluster manager, it pings the cfgmgr node when the cfgmgr does not respond to its lease request in time. If pinging ends and this node still cannot renew its lease from the cluster manager, it contacts the other quorum nodes and tries to take over as the new cluster manager.
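
The three cases above can be written down as small decision functions. This Python sketch only restates the list in code form; the function names, arguments and returned strings are hypothetical and do not correspond to GPFS internals.

```python
def cluster_manager_action(pinging_ended, recovery_wait_expired):
    """Case 1: the cfgmgr has seen another node's lease expire."""
    if not pinging_ended:
        return "keep pinging the node"
    if recovery_wait_expired:
        return "expel the node and start recovery"
    return "expel the node and delay recovery until leaseRecoveryWait expires"


def non_quorum_action(expelled):
    """Case 2: a non-quorum node failed to renew its lease in time."""
    if expelled:
        return "wait for node recovery to complete before re-joining"
    return "keep trying to renew the lease"


def quorum_action(lease_renewed, pinging_ended):
    """Case 3: a quorum node (not the cfgmgr) got no timely lease reply."""
    if lease_renewed:
        return "continue normally"
    if not pinging_ended:
        return "keep pinging the cluster manager"
    return "contact other quorum nodes and try to take over as cluster manager"
```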

This graph shows the timeline of events after a node failure, where the failed node was not the cluster manager:

a) The last time the failed node renewed its lease
b) The cluster manager detects that the lease has expired, and starts pinging the node
c) The cluster manager decides that the node is dead and runs the node failure protocol
d) The file system manager starts log recovery