As @Paul Watson predicted, there's now internal debate within HCL/IBM as to whether this constitutes a bug or a feature request. The alleged rationale is that the CM doesn't know the server has DELAY_APPLY set and therefore can't decide not to fail over to it. My counter was that if this isn't a bug in the CM, it's certainly a bug in the engine, _especially_ since the engine already knows not to allow a DELAY_APPLY RSS to be promoted to an HDR secondary:
2022-10-09 11:18:11.249 SCHAPI: Issued Task() or Admin() command "task( 'ha set secondary', 'ids__chaos_hdr__b3' )".
2022-10-09 11:18:11.561 Secondary Delay or Stop Apply: A server type change from an RS Secondary to an
HDR Secondary is not allowed when the DELAY_APPLY or STOP_APPLY
configuration parameters are enabled and the delay or stop
subsystem is active.
Disable the DELAY_APPLY or STOP_APPLY configuration parameters and
retry the operation after any saved data is applied.
2022-10-09 11:18:12.561 SCHAPI: Issued command "task( 'ha set secondary', 'ids__chaos_hdr__b3' )". This is the only record of the command issuance.
I can't see any case for refusing to convert a DELAY_APPLY RSS to HDR while happily converting that same RSS to primary, _especially_ without first rolling forward.
Tagging @Art Kagel and @Lester Knutsen for additional feedback.
------------------------------
TOM GIRSCH
------------------------------
Original Message:
Sent: Fri October 14, 2022 12:55 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Failover Bug
All:
Over the weekend, we hit a serious issue that created a huge mess from which we're still recovering. This is on 14.10.FC8. We have a three-node HDR cluster (I still want to call it "Mach 10" because I'm An Old) with one node as PRI, one node as HDR, and one node as RSS. HA_FOC_ORDER is set to the default of SDS,HDR,RSS. The RSS node has DELAY_APPLY set to 12h.
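For reference, the relevant onconfig settings look roughly like this (the values reflect the setup described above; treat it as an illustrative sketch, not a copy of our actual files, and double-check the DELAY_APPLY value syntax against the docs for your version):

```
# All nodes (this is the shipped default):
HA_FOC_ORDER SDS,HDR,RSS

# RSS node only: hold off applying replicated logs for 12 hours
DELAY_APPLY 12H
```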
We had an issue that a well-meaning troubleshooter decided meant we needed to cycle the primary. They knew to take down the HDR secondary first but didn't think about taking down the RSS. So when the primary went down, the CM initiated a failover to the RSS. The problem is that it brought the RSS directly to on-line mode without first rolling forward through the 12 hours of delayed transactions, so from that point forward, production transactions were being routed to a server that was 12 hours behind. As noted above, I'm still cleaning up the giant mess.
It strikes me that the CM should never automatically fail over to a server that's significantly behind production. It should either refuse to consider that node or roll it forward before making it live (I can see a case for each, and for making the behavior configurable).
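For what it's worth, the "refuse" half of that policy is trivial to express. Here's a minimal sketch of the kind of guard I'd want applied before an automatic failover, assuming you can obtain the RSS's current apply delay in seconds yourself (e.g. by inspecting the RSS node; the function name and threshold below are made up for illustration):

```shell
#!/bin/sh
# Hypothetical pre-failover guard: refuse to promote a secondary that is
# more than MAX_LAG seconds behind the primary. Obtaining delay_sec is
# left to the caller.
MAX_LAG=300   # tolerate at most 5 minutes of apply lag

failover_safe() {
    delay_sec=$1
    [ "$delay_sec" -le "$MAX_LAG" ]
}

failover_safe 120   && echo "promote" || echo "refuse"   # prints "promote"
failover_safe 43200 && echo "promote" || echo "refuse"   # 12h behind: prints "refuse"
```

A DELAY_APPLY of 12h (43200 seconds) would obviously never pass a sane threshold, which is exactly the point.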
It's worth noting that this is the second time I've encountered this; the first was in 12.10.FC14 and I didn't report it at the time because I believed it must have been something I did incorrectly. This time, however, I wasn't involved and it was entirely initiated by the CM.
I've opened TS010928682 in regard to this incident.
------------------------------
TOM GIRSCH
------------------------------
#Informix