Potentially Serious HDR Bug 14.10.FC7W1

2. RE: Potentially Serious HDR Bug 14.10.FC7W1

Like

IBM Champion

TOM GIRSCH

Posted Wed June 01, 2022 12:52 PM

One additional note: I checked the clocks on both servers, and they seem to be in sync. Both use ntpd. So it's interesting that the messages in the log files are almost a minute apart between primary and secondary.

------------------------------
TOM GIRSCH
------------------------------

Original Message

Original Message:
Sent: Wed June 01, 2022 12:50 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Bug 14.10.FC7W1

All:

I wanted to make you aware of something we've hit a couple of times and also find out if someone else has encountered it. Twice in the past few months, I've had DR Ping faliures on our HDR primary which seem to have left the cluster in an unstable state. It's as if the primary lost track of the secondary but doesn't realize that it's not there. In online.log on the primary, it looks like this:

09:33:14  DR: Needed to send a ping message but failed. 109:33:44  DR: Needed to send a ping message but failed. 109:34:14  DR: Needed to send a ping message but failed. 109:34:44  DR: Needed to send a ping message but failed. 109:34:54  DR: ping timeout

On the secondary, it looks like this:

2022-05-23 09:35:04  DR: ping timeout2022-05-23 09:35:04.639  DR: Receive error2022-05-23 09:35:04.640  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.2022-05-23 09:35:04.640  DR_ERR set to -12022-05-23 09:35:04.640  DR: Receive Btree error2022-05-23 09:35:05.639  DR: Turned off on secondary server

Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.

What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.

When it gets to this point, the primary will no longer respond to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened. And I can re-start the other nodes without incident.

Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.

In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.

I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)

------------------------------
TOM GIRSCH
------------------------------

#Informix

4. RE: Potentially Serious HDR Bug 14.10.FC7W1

Like

Robert Wolanski

Posted Wed June 08, 2022 09:17 AM

Hi,
I had similar problem (on Friday, the thirteenth :) ) - case TS009352831, but on version FC4.
Data did not replicate to the secondary server.
On primary server:
16:42:39 Logical Log 20260756 Complete, timestamp: 0xf0007231.
16:42:39 Logical Log 20260756 - Backup Started
16:42:43 Logical Log 20260756 - Backup Completed
16:43:27 DR: Needed to send a ping message but failed. 1
16:46:27 DR: Needed to send a ping message but failed. 1
16:46:57 DR: Needed to send a ping message but failed. 1
16:47:27 DR: Needed to send a ping message but failed. 1
16:47:57 DR: Needed to send a ping message but failed. 1
16:48:01 DR: ping timeout
16:48:01 DR: Receive error
16:48:01 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

16:48:01 DR_ERR set to -1
16:48:03 DR: Turned off on primary server
16:48:03 DR: Cannot connect to secondary server
16:48:04 Checkpoint Completed: duration was 6 seconds.
16:48:04 Fri May 13 - loguniq 20260757, logpos 0x435f848, timestamp: 0xf00d38cb Interval: 2346815

16:48:04 Maximum server connections 474
16:48:04 Checkpoint Statistics - Avg. Txn Block Time 4.025, # Txns blocked 82, Plog used 9984, Llog used 20880

16:48:13 DR: Primary server connected
16:48:13 DR: Send error
16:48:13 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

16:48:13 DR_ERR set to -2
16:48:13 DR: Failure recovery error (2)
16:48:13 SCHAPI: dbutil threads is already running.
16:48:13 SCHAPI: dbScheduler threads is already running.
16:48:14 DR: Turned off on primary server
16:48:14 DR: Cannot connect to secondary server
16:48:24 DR: Primary server connected
16:48:24 DR: Send error
16:48:24 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

On secondary server:
16:42:32 Maximum server connections 72
16:42:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 1933, Llog used 0

16:42:39 Logical Log 20260756 Complete, timestamp: 0xf0006626.
16:48:12 DR: Received connection request from remote server when DR is not Off
[Local type: Secondary, Current state: ?]
[Remote type: Primary]

16:48:12 DR: ping timeout
16:48:24 DR: Received connection request from remote server when DR is not Off
[Local type: Secondary, Current state: FAILED]
[Remote type: Primary]

16:48:35 DR: Received connection request from remote server when DR is not Off
[Local type: Secondary, Current state: FAILED]
[Remote type: Primary]

We didn't have data replication, but on the primary server the work looked fine.
Because some applications use secondary server, I decided to restart the secondary server.
After restart secondary did not improve and was still out of sync, but on the primary there was a performance problem:
Many sessions had the status of G-BPX-- flags and application processing was slower.
I didn't restart primary for fear of losing data to primary, so I decided to rebuild HDR first and then do a primary restart. And it was done that way.
I was going to upgrade to version FC8, but I don't see any HDR-related fixes there, so I guess I'll wait for critical fixes for HDR.
Regards,
Robert Wolański

------------------------------
Robert Wolanski
------------------------------

Original Message

Original Message:
Sent: Wed June 01, 2022 12:50 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Bug 14.10.FC7W1

All:

I wanted to make you aware of something we've hit a couple of times and also find out if someone else has encountered it. Twice in the past few months, I've had DR Ping faliures on our HDR primary which seem to have left the cluster in an unstable state. It's as if the primary lost track of the secondary but doesn't realize that it's not there. In online.log on the primary, it looks like this:

09:33:14  DR: Needed to send a ping message but failed. 109:33:44  DR: Needed to send a ping message but failed. 109:34:14  DR: Needed to send a ping message but failed. 109:34:44  DR: Needed to send a ping message but failed. 109:34:54  DR: ping timeout

On the secondary, it looks like this:

2022-05-23 09:35:04  DR: ping timeout2022-05-23 09:35:04.639  DR: Receive error2022-05-23 09:35:04.640  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.2022-05-23 09:35:04.640  DR_ERR set to -12022-05-23 09:35:04.640  DR: Receive Btree error2022-05-23 09:35:05.639  DR: Turned off on secondary server

Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.

What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.

When it gets to this point, the primary will no longer respond to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened. And I can re-start the other nodes without incident.

Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.

In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.

I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)

------------------------------
TOM GIRSCH
------------------------------

#Informix

5. RE: Potentially Serious HDR Bug 14.10.FC7W1

Like

IBM Champion

TOM GIRSCH

Posted Tue September 27, 2022 11:28 AM

I'm still running into this issue -- it's been at least five times now -- and this time, the primary does show the secondary as being disconnected. Behavior is the same as before; HDR disconnects from primary; primary tries to run a checkpoint, waits for HDR to acknowledge, but since the HDR ain't there, it never will.

From the primary:

$ onstat -g cluster

IBM Informix Dynamic Server Version 14.10.FC8X8 -- On-Line (Prim) (CKPT REQ) -- Up 6 days 02:05:55 -- 3733504 Kbytes
Blocked:CKPT
2022-09-27 15:26:06

Primary Server:ids__chaos_dev__b1
Current Log Page:159117,6597
Index page logging status: Disabled

Server ACKed Log Applied Log Supports Status
(log, page) (log, page) Updates
ids__chaos_dev__b2 159117,6597 0,0 No NEAR_SYNC(HDR),Connected,Failed

This is a non-critical server so I'm leaving it in this state so that support can investigate. Not much luck yet.

From the HDR secondary's point of view:

$ onstat -g cluster

IBM Informix Dynamic Server Version 14.10.FC8X8 -- Read-Only (Sec) -- Up 3 days 19:25:41 -- 3209216 Kbytes
2022-09-27 15:26:51

Primary Server:ids__chaos_dev__b1
Index page logging status: Disabled

Server ACKed Log Supports Status
(log, page) Updates
ids__chaos_dev__b2 159117,6597 No NEAR_SYNC(HDR),Disconnected,Off

------------------------------
TOM GIRSCH
------------------------------

Original Message

Original Message:
Sent: Wed June 01, 2022 12:50 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Bug 14.10.FC7W1

All:

I wanted to make you aware of something we've hit a couple of times and also find out if someone else has encountered it. Twice in the past few months, I've had DR Ping faliures on our HDR primary which seem to have left the cluster in an unstable state. It's as if the primary lost track of the secondary but doesn't realize that it's not there. In online.log on the primary, it looks like this:

09:33:14  DR: Needed to send a ping message but failed. 109:33:44  DR: Needed to send a ping message but failed. 109:34:14  DR: Needed to send a ping message but failed. 109:34:44  DR: Needed to send a ping message but failed. 109:34:54  DR: ping timeout

On the secondary, it looks like this:

2022-05-23 09:35:04  DR: ping timeout2022-05-23 09:35:04.639  DR: Receive error2022-05-23 09:35:04.640  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.2022-05-23 09:35:04.640  DR_ERR set to -12022-05-23 09:35:04.640  DR: Receive Btree error2022-05-23 09:35:05.639  DR: Turned off on secondary server

Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.

What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.

When it gets to this point, the primary will no longer respond to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened. And I can re-start the other nodes without incident.

Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.

In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.

I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)

------------------------------
TOM GIRSCH
------------------------------

#Informix

6. RE: Potentially Serious HDR Bug 14.10.FC7W1

Like

IBM Champion

TOM GIRSCH

Posted Tue September 27, 2022 02:11 PM

Support is asking me to hop in my time machine, go back to the moment where things screwed up and gather diagnostics THEN. Uhh, thanks?

------------------------------
TOM GIRSCH
------------------------------

Original Message

Original Message:
Sent: Wed June 01, 2022 12:50 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Bug 14.10.FC7W1

All:

I wanted to make you aware of something we've hit a couple of times and also find out if someone else has encountered it. Twice in the past few months, I've had DR Ping faliures on our HDR primary which seem to have left the cluster in an unstable state. It's as if the primary lost track of the secondary but doesn't realize that it's not there. In online.log on the primary, it looks like this:

09:33:14  DR: Needed to send a ping message but failed. 109:33:44  DR: Needed to send a ping message but failed. 109:34:14  DR: Needed to send a ping message but failed. 109:34:44  DR: Needed to send a ping message but failed. 109:34:54  DR: ping timeout

On the secondary, it looks like this:

2022-05-23 09:35:04  DR: ping timeout2022-05-23 09:35:04.639  DR: Receive error2022-05-23 09:35:04.640  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.2022-05-23 09:35:04.640  DR_ERR set to -12022-05-23 09:35:04.640  DR: Receive Btree error2022-05-23 09:35:05.639  DR: Turned off on secondary server

Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.

What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.

When it gets to this point, the primary will no longer respond to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened. And I can re-start the other nodes without incident.

Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.

In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.

I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)

------------------------------
TOM GIRSCH
------------------------------

#Informix

7. RE: Potentially Serious HDR Bug 14.10.FC7W1

Like

IBM Champion

TOM GIRSCH

Posted Wed September 28, 2022 09:54 AM
Edited by System Fri January 20, 2023 04:50 PM

[Tagging @Andreas Legner ]

An update from support. _Finally_ we seem to be getting somewhere. Hopefully we can arrive at root cause and find and fix the bug so that nobody else has to deal with this.

I noticed that two of the encrypt VPs were incredibly busy (each using 100% of a CPU) despite the fact that everything on the server was hanging up. This led to:

We found that the two srvinfx threads are in a loop, it calls a GsKit function where it must receive the data, it puts a lock / mutex, but it does not get anything and returns to same function, it is the same behavior in both VPs (10 & 12).
At this point I need to escalate this investigation to our next level of support.
As soon I have an update about this investigation I will let you to know.
==>>
Tue Sep 27 21:39:49 UTC 2022
#0 0x00007fd2894fb09d in ?? () from /lib64/libgsk8ssl_64.so
#1 0x00007fd2894fb308 in ?? () from /lib64/libgsk8ssl_64.so
#2 0x00007fd2893f23df in gsk_secure_soc_read () from /lib64/libgsk8ssl_64.so
#3 0x00007fd2898a38bf in ISI_ssl_read () from /informix/INST/ids__chaos_dev__b1/lib/libisi.so.2
#4 0x00000000016505bc in isi_ssl_read ()
#5 0x0000000001617146 in ssl_recv ()
#6 0x0000000001611dc2 in recvsocket ()
#7 0x00000000016698ba in slSQIrsp ()
#8 0x00000000016620fc in pfConRes ()
#9 0x00000000015efbbf in ASF_Call ()
#10 0x00000000016045a9 in send_reject.isra.0 ()
#11 0x0000000001603b1b in spawn_thread ()
#12 0x00000000014bcfa3 in th_init_initgls ()
#13 0x00000000015054bf in startup ()
#14 0x0000000000000000 in ?? ()
-----
Tue Sep 27 21:40:34 UTC 2022
#0 0x00007fd288c56478 in gsk_atomic_swap(long*, long) () from /lib64/libgsk8cms_64.so
#1 0x00007fd2894fb329 in ?? () from /lib64/libgsk8ssl_64.so
#2 0x00007fd2893f23df in gsk_secure_soc_read () from /lib64/libgsk8ssl_64.so
#3 0x00007fd2898a38bf in ISI_ssl_read () from /informix/INST/ids__chaos_dev__b1/lib/libisi.so.2
#4 0x00000000016505bc in isi_ssl_read ()
#5 0x0000000001617146 in ssl_recv ()
#6 0x0000000001611dc2 in recvsocket ()
#7 0x00000000016698ba in slSQIrsp ()
#8 0x00000000016620fc in pfConRes ()
#9 0x00000000015efbbf in ASF_Call ()
#10 0x00000000016045a9 in send_reject.isra.0 ()
#11 0x0000000001603b1b in spawn_thread ()
#12 0x00000000014bcfa3 in th_init_initgls ()
#13 0x00000000
-- skipped some lines - mail text was too long --------------------------

------------------------------
TOM GIRSCH
------------------------------

Original Message

Original Message:
Sent: Wed June 01, 2022 12:50 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Bug 14.10.FC7W1

All:

I wanted to make you aware of something we've hit a couple of times and also find out if someone else has encountered it. Twice in the past few months, I've had DR Ping faliures on our HDR primary which seem to have left the cluster in an unstable state. It's as if the primary lost track of the secondary but doesn't realize that it's not there. In online.log on the primary, it looks like this:

09:33:14  DR: Needed to send a ping message but failed. 109:33:44  DR: Needed to send a ping message but failed. 109:34:14  DR: Needed to send a ping message but failed. 109:34:44  DR: Needed to send a ping message but failed. 109:34:54  DR: ping timeout

On the secondary, it looks like this:

2022-05-23 09:35:04  DR: ping timeout2022-05-23 09:35:04.639  DR: Receive error2022-05-23 09:35:04.640  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.2022-05-23 09:35:04.640  DR_ERR set to -12022-05-23 09:35:04.640  DR: Receive Btree error2022-05-23 09:35:05.639  DR: Turned off on secondary server

Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.

What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.

When it gets to this point, the primary will no longer respond to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened. And I can re-start the other nodes without incident.

Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.

In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.

I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)

------------------------------
TOM GIRSCH
------------------------------

#Informix

Informix