As @Paul Watson predicted, there's now internal debate within HCL/IBM as to whether this constitutes a bug or a feature request. The alleged rationale is that the CM doesn't know the server has DELAY_APPLY set and therefore can't decide not to fail over to it. My counter was that if this isn't a bug in the CM, it's certainly a bug in the engine, _especially_ since the engine already knows not to allow a DELAY_APPLY RSS to be promoted to an HDR secondary:
2022-10-09 11:18:11.249 SCHAPI: Issued Task() or Admin() command "task( 'ha set secondary', 'ids__chaos_hdr__b3' )".
2022-10-09 11:18:11.561 Secondary Delay or Stop Apply: A server type change from an RS Secondary to an
HDR Secondary is not allowed when the DELAY_APPLY or STOP_APPLY
configuration parameters are enabled and the delay or stop
subsystem is active.
Disable the DELAY_APPLY or STOP_APPLY configuration parameters and
retry the operation after any saved data is applied.
2022-10-09 11:18:12.561 SCHAPI: Issued command "task( 'ha set secondary', 'ids__chaos_hdr__b3' )". This is the only record of the command issuance.
I can't see any case for refusing to convert a DELAY_APPLY RSS to HDR while happily converting that same RSS to primary, _especially_ without first rolling forward.
Tagging @Art Kagel and @Lester Knutsen for additional feedback.
------------------------------
TOM GIRSCH
------------------------------
Original Message:
Sent: Fri October 14, 2022 12:55 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Failover Bug
All:
Over the weekend, we hit a serious issue that created a huge mess from which we're still recovering. This is on 14.10.FC8. We have a three-node HDR cluster (I still want to call it "Mach 10" because I'm An Old) with one node as PRI, one node as HDR, and one node as RSS. HA_FOC_ORDER is set to the default of SDS,HDR,RSS. The RSS node has DELAY_APPLY set to 12h.
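For reference, the relevant onconfig settings look roughly like this (the values reflect the setup described above; treat it as an illustrative sketch, not a copy of our actual files, and double-check the DELAY_APPLY value syntax against the docs for your version):

```
# All nodes (this is the shipped default):
HA_FOC_ORDER SDS,HDR,RSS

# RSS node only: hold off applying replicated logs for 12 hours
DELAY_APPLY 12H
```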
We had an issue that a well-meaning troubleshooter decided meant we needed to cycle the primary. They knew to take down the HDR secondary first but didn't think about taking down the RSS. So when the primary went down, the CM initiated a failover to the RSS. The problem is that it brought the RSS directly to on-line mode without first rolling forward through the 12 hours of delayed transactions, so from that point forward, production transactions were being routed to a server that was 12 hours behind. As noted above, I'm still cleaning up the giant mess.
It strikes me that the CM should never automatically fail over to a server that's significantly behind production. It should either refuse to consider that node or roll it forward before making it live (I can see a case for each, and for making the behavior configurable).
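For what it's worth, the "refuse" half of that policy is trivial to express. Here's a minimal sketch of the kind of guard I'd want applied before an automatic failover, assuming you can obtain the RSS's current apply delay in seconds yourself (e.g. by inspecting the RSS node; the function name and threshold below are made up for illustration):

```shell
#!/bin/sh
# Hypothetical pre-failover guard: refuse to promote a secondary that is
# more than MAX_LAG seconds behind the primary. Obtaining delay_sec is
# left to the caller.
MAX_LAG=300   # tolerate at most 5 minutes of apply lag

failover_safe() {
    delay_sec=$1
    [ "$delay_sec" -le "$MAX_LAG" ]
}

failover_safe 120   && echo "promote" || echo "refuse"   # prints "promote"
failover_safe 43200 && echo "promote" || echo "refuse"   # 12h behind: prints "refuse"
```

A DELAY_APPLY of 12h (43200 seconds) would obviously never pass a sane threshold, which is exactly the point.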
It's worth noting that this is the second time I've encountered this; the first was in 12.10.FC14 and I didn't report it at the time because I believed it must have been something I did incorrectly. This time, however, I wasn't involved and it was entirely initiated by the CM.
I've opened TS010928682 in regard to this incident.
------------------------------
TOM GIRSCH
------------------------------
#Informix