  • 1.  Potentially Serious HDR Failover Bug

    IBM Champion
    Posted Fri October 14, 2022 12:55 PM


    Over the weekend, we hit a serious issue that created a huge mess from which we're still recovering. We're running 14.10.FC8. We have a three-node HDR cluster (I still want to call it "Mach 10" because I'm An Old) with one node as PRI, one node as HDR and one node as RSS. HA_FOC_ORDER is set to the default of SDS,HDR,RSS. The RSS node has DELAY_APPLY set to 12h.

    A well-meaning troubleshooter decided that an issue we were having meant we needed to cycle the primary. They knew to first take down the secondary but didn't think about taking down the RSS. So when the primary went down, the CM initiated a failover to the RSS. The problem is, it took the RSS directly to on-line mode without first rolling forward through the 12 hours of delayed transactions. This meant that from that point forward, production transactions were being routed to a server that was 12 hours behind. As noted above, I'm still cleaning up the giant mess.

    It strikes me that the CM should never automatically fail to a server that's significantly behind production. It should either refuse to consider that node or roll it forward before making it live (I can see a case for both and for making that configurable).
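
    To make the two options concrete, here is a minimal sketch of the selection rule I have in mind. It is purely illustrative and is not how the CM is implemented: the Node class, pick_failover_target(), and the delayed_policy setting are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster member (not a real CM structure)."""
    name: str
    kind: str               # "SDS", "HDR", or "RSS"
    online: bool
    delay_apply_s: int = 0  # 0 means DELAY_APPLY is not set on this node


def pick_failover_target(
    nodes: Sequence[Node],
    foc_order: Sequence[str] = ("SDS", "HDR", "RSS"),
    delayed_policy: str = "refuse",
) -> Optional[Tuple[Node, str]]:
    """Walk the failover order and return (node, action), or None.

    delayed_policy is an assumed, configurable setting:
      "refuse"       - never consider a node that has DELAY_APPLY set
      "roll_forward" - accept it, but only after applying the staged logs
    """
    if delayed_policy not in ("refuse", "roll_forward"):
        raise ValueError(f"unknown delayed_policy: {delayed_policy!r}")

    for kind in foc_order:
        for node in nodes:
            if node.kind != kind or not node.online:
                continue
            if node.delay_apply_s == 0:
                return node, "promote"
            if delayed_policy == "roll_forward":
                return node, "roll_forward_then_promote"
            # delayed_policy == "refuse": skip this node and keep looking
    return None


# Our situation: secondary already down, RSS delayed by 12 hours.
cluster = [
    Node("hdr_secondary", "HDR", online=False),
    Node("rss_delayed", "RSS", online=True, delay_apply_s=12 * 3600),
]
print(pick_failover_target(cluster, delayed_policy="refuse"))
print(pick_failover_target(cluster, delayed_policy="roll_forward"))
```

    Under "refuse" the sketch finds no eligible target and leaves the cluster without a primary until someone intervenes; under "roll_forward" it picks the RSS but only after the staged logs are applied. Either outcome would have been preferable to what actually happened.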

    It's worth noting that this is the second time I've encountered this; the first was in 12.10.FC14 and I didn't report it at the time because I believed it must have been something I did incorrectly. This time, however, I wasn't involved and it was entirely initiated by the CM.

    I've opened TS010928682 in regard to this incident.


  • 2.  RE: Potentially Serious HDR Failover Bug

    Posted Fri October 14, 2022 01:36 PM
    It mightn't be a 'bug' and might be as per 'functional spec', but it sounds like crap functionality to me.


    Paul Watson
    Oninit LLC
    Oninit®️ is a registered trademark of Oninit LLC

  • 3.  RE: Potentially Serious HDR Failover Bug

    IBM Champion
    Posted Wed October 19, 2022 02:14 PM

    As @Paul Watson predicted, there's now internal debate within HCL/IBM as to whether this constitutes a bug or a feature request. The alleged rationale is that the CM doesn't know that a server has DELAY_APPLY set and therefore can't decide not to fail over there. My counter was that if this isn't a bug in the CM, it's certainly a bug in the engine, _especially_ since the engine knows not to allow a DELAY_APPLY RSS to be promoted to an HDR secondary:

    2022-10-09 11:18:11.249  SCHAPI: Issued Task() or Admin() command "task( 'ha set secondary', 'ids__chaos_hdr__b3' )".
    2022-10-09 11:18:11.561  Secondary Delay or Stop Apply: A server type change from an RS Secondary to an
            HDR Secondary is not allowed when the DELAY_APPLY or STOP_APPLY
            configuration parameters are enabled and the delay or stop
            subsystem is active.
            Disable the DELAY_APPLY or STOP_APPLY configuration parameters and
            retry the operation after any saved data is applied.
    2022-10-09 11:18:12.561  SCHAPI: Issued command "task( 'ha set secondary', 'ids__chaos_hdr__b3' )". This is the only record of the command issu

    I can't see any case to be made that it should refuse to convert DELAY_APPLY RSS->HDR but then happily convert DELAY_APPLY RSS->PRI, _especially_ without first rolling forward.

    Tagging @Art Kagel and @Lester Knutsen for additional feedback.



  • 4.  RE: Potentially Serious HDR Failover Bug

    IBM Champion
    Posted Wed October 19, 2022 05:51 PM
    I agree, Tom! A delayed RSS should not be promotable until it catches up and completes all available rollforwards.


    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.

  • 5.  RE: Potentially Serious HDR Failover Bug

    IBM Champion
    Posted Fri October 28, 2022 01:54 PM

    Just got a bug acknowledgment from support: