Db2

IBM db2 pureScale cluster peer domain

    Posted 2 days ago

    The GPFS (IBM Spectrum Scale) cluster is operational: all file systems are mounted and the mmgetstate -a output shows every node as active.
    However, the underlying RSCT (Reliable Scalable Cluster Technology) layer, which the Db2 pureScale cluster services rely on for peer domain membership and node failure detection, is not fully healthy.


    What's Working

    • GPFS daemons (mmfsd) are running on all nodes.
    • File systems (db2fs1, logfs) are mounted across all nodes.
    • Cluster manager and filesystem managers are correctly assigned.
    • Data access and I/O are functional.

    What's Broken

    • The RSCT peer domain (db2domain_20241208010205) is online only on node1-database and node1-database-cf.
    • The two node2 systems (node2-cbs-database, node2-database-cf) are offline in RSCT - lsrpnode shows them as Offline.
    • When attempting to rejoin or start them using startrpnode, the system returns:

      2610-441 Permission is denied to access the resource class specified in this command.
      Network Identity UNAUTHENT requires 's' permission for the resource class IBM.PeerDomain on node node2-cbs-database.

    • This means RSCT security trust is broken between the node1 and node2 pairs.
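
As a quick check of the state described above, lsrpnode output can be filtered for members that are not Online. This is a minimal sketch: the function name offline_nodes is hypothetical, the sample text is illustrative and built from the node names in this post, and the column layout (Name first, OpState second, one header line) is an assumption to verify against your RSCT level.

```shell
#!/bin/sh
# Print the names of peer domain nodes whose OpState is not "Online".
# Reads lsrpnode-style text on stdin. Assumed layout: one header line,
# then one node per line with Name in column 1 and OpState in column 2.
offline_nodes() {
    awk 'NR > 1 && NF >= 2 && $2 != "Online" { print $1 }'
}

# Illustrative input only (not captured output); real lsrpnode output
# may carry additional columns, which the filter ignores.
sample='Name               OpState
node1-database     Online
node1-database-cf  Online
node2-cbs-database Offline
node2-database-cf  Offline'

printf '%s\n' "$sample" | offline_nodes
```

On a live system the same filter would be fed directly, for example `lsrpnode | offline_nodes`.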

    Root Cause

    During RSCT reconfiguration (recfgct) on the node2 systems:

    • Their RMC security credentials (certificates and keystores) were regenerated under /var/ct/cfg/.
    • The domain owner (node1) still holds the old certificates.
    • As a result, node1 and node2 can't authenticate each other in the RSCT peer domain.
    • RSCT therefore treats requests from the node2 systems as unauthenticated (UNAUTHENT) and refuses to start them in the peer domain.

    Impact

    • RSCT cluster membership is partial - quorum and event management are unreliable.
    • GPFS continues to run using cached configuration, but:
      • No automatic failover or fencing will occur.
      • If a node restarts or GPFS restarts, it may fail to rejoin until RSCT is fixed.
      • Db2 pureScale or PowerHA services relying on RSCT communication will also fail to detect node state properly.

    Next Steps (Resolution Path)

    1. On the two offline nodes, reset RSCT security and configuration:
      • Stop RMC (rmcctrl -z)
      • Remove /var/ct/cfg/ctrmc.* files
      • Rebuild RSCT (recfgct)
      • Restart RMC (rmcctrl -A)
    2. On all nodes (including the node1 systems), restart RMC so new trust keys propagate.
    3. Use startrpnode from node1 to rejoin the node2 systems.
    4. Verify that lsrpnode shows all four nodes Online and that GPFS remains healthy.
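
The four steps above can be sketched as a small dry-run wrapper. The commands are the ones named in this post and have not been validated here; the function names, the DRY_RUN and RECFGCT variables, and the sub-command names are hypothetical. Because recfgct resets a node's RSCT configuration, it is worth confirming the sequence with IBM documentation or support before setting DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of the resolution steps as listed in the post.
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print commands, 0 = execute them
RECFGCT="${RECFGCT:-recfgct}"  # set to the full path if recfgct is not on PATH

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: run as root on each offline node.
reset_rmc_on_offline_node() {
    run rmcctrl -z                   # stop RMC
    run rm -f /var/ct/cfg/ctrmc.*    # remove the ctrmc.* files
    run "$RECFGCT"                   # rebuild the RSCT configuration
    run rmcctrl -A                   # restart RMC
}

# Step 2: run on every node, including the online ones.
restart_rmc() {
    run rmcctrl -z
    run rmcctrl -A
}

# Step 3: run on an online node; arguments are the nodes to start.
rejoin_nodes() {
    run startrpnode "$@"
}

# Step 4: check peer domain membership and GPFS state.
verify_cluster() {
    run lsrpnode
    run mmgetstate -a
}

case "${1:-plan}" in
    reset)   reset_rmc_on_offline_node ;;
    restart) restart_rmc ;;
    rejoin)  shift; rejoin_nodes "$@" ;;
    verify)  verify_cluster ;;
    *)       DRY_RUN=1   # default: print the whole plan, execute nothing
             reset_rmc_on_offline_node
             restart_rmc
             rejoin_nodes node2-cbs-database node2-database-cf
             verify_cluster ;;
esac
```

Run without arguments, the script only prints the plan. Each step is a separate sub-command because the steps run on different nodes, so a single invocation never performs the whole sequence on one machine.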

    Summary Statement for Report

    The GPFS cluster is operational, but RSCT domain membership is partially degraded due to broken RMC authentication between the node1 and node2 pairs. This occurred after the RSCT reconfiguration regenerated the RMC certificates on the node2 systems, invalidating trust with the domain owner.
    Corrective action involves reinitializing RMC security and rejoining the node2 systems to the RSCT peer domain to restore full cluster quorum and event coordination.



    ------------------------------
    jaison chipuka
    ------------------------------