Hi,
I had a similar problem (on Friday the thirteenth :) ) - case TS009352831, but on version FC4.
Data did not replicate to the secondary server.
On primary server:
16:42:39 Logical Log 20260756 Complete, timestamp: 0xf0007231.
16:42:39 Logical Log 20260756 - Backup Started
16:42:43 Logical Log 20260756 - Backup Completed
16:43:27 DR: Needed to send a ping message but failed. 1
16:46:27 DR: Needed to send a ping message but failed. 1
16:46:57 DR: Needed to send a ping message but failed. 1
16:47:27 DR: Needed to send a ping message but failed. 1
16:47:57 DR: Needed to send a ping message but failed. 1
16:48:01 DR: ping timeout
16:48:01 DR: Receive error
16:48:01 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
16:48:01 DR_ERR set to -1
16:48:03 DR: Turned off on primary server
16:48:03 DR: Cannot connect to secondary server
16:48:04 Checkpoint Completed: duration was 6 seconds.
16:48:04 Fri May 13 - loguniq 20260757, logpos 0x435f848, timestamp: 0xf00d38cb Interval: 2346815
16:48:04 Maximum server connections 474
16:48:04 Checkpoint Statistics - Avg. Txn Block Time 4.025, # Txns blocked 82, Plog used 9984, Llog used 20880
16:48:13 DR: Primary server connected
16:48:13 DR: Send error
16:48:13 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
16:48:13 DR_ERR set to -2
16:48:13 DR: Failure recovery error (2)
16:48:13 SCHAPI: dbutil threads is already running.
16:48:13 SCHAPI: dbScheduler threads is already running.
16:48:14 DR: Turned off on primary server
16:48:14 DR: Cannot connect to secondary server
16:48:24 DR: Primary server connected
16:48:24 DR: Send error
16:48:24 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
On secondary server:
16:42:32 Maximum server connections 72
16:42:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 1933, Llog used 0
16:42:39 Logical Log 20260756 Complete, timestamp: 0xf0006626.
16:48:12 DR: Received connection request from remote server when DR is not Off
[Local type: Secondary, Current state: ?]
[Remote type: Primary]
16:48:12 DR: ping timeout
16:48:24 DR: Received connection request from remote server when DR is not Off
[Local type: Secondary, Current state: FAILED]
[Remote type: Primary]
16:48:35 DR: Received connection request from remote server when DR is not Off
[Local type: Secondary, Current state: FAILED]
[Remote type: Primary]
Data was no longer replicating, but otherwise the primary server appeared to be working fine.
Because some applications use the secondary server, I decided to restart it.
The restart didn't help - the secondary was still out of sync - and a performance problem appeared on the primary:
Many sessions showed G-BPX-- flags in onstat -u (the leading G indicates a session waiting on a logical-log buffer write) and application processing slowed down.
I didn't want to restart the primary for fear of losing data, so I decided to rebuild HDR first and then restart the primary, and that's what we did.
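For anyone facing the same choice, this was essentially the standard ontape-based HDR re-establishment; a minimal sketch, assuming ontape backups are configured and using placeholder server names prim_server/sec_server:

# On the primary: take a level-0 archive
ontape -s -L 0

# On the primary: (re)register the secondary for DR
onmode -d primary sec_server

# On the secondary: physical restore from that archive (leaves the server in fast recovery)
ontape -p

# On the secondary: rejoin the HDR pair, pointing at the primary
onmode -d secondary prim_server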
I was going to upgrade to version FC8, but I don't see any HDR-related fixes there, so I guess I'll wait for critical fixes for HDR.
Regards,
Robert Wolański
------------------------------
Robert Wolanski
------------------------------
Original Message:
Sent: Wed June 01, 2022 12:50 PM
From: TOM GIRSCH
Subject: Potentially Serious HDR Bug 14.10.FC7W1
All:
I wanted to make you aware of something we've hit a couple of times, and also to find out whether anyone else has encountered it. Twice in the past few months, I've had DR ping failures on our HDR primary that seem to have left the cluster in an unstable state. It's as if the primary lost its connection to the secondary but doesn't realize the secondary is gone. In online.log on the primary, it looks like this:
09:33:14 DR: Needed to send a ping message but failed. 1
09:33:44 DR: Needed to send a ping message but failed. 1
09:34:14 DR: Needed to send a ping message but failed. 1
09:34:44 DR: Needed to send a ping message but failed. 1
09:34:54 DR: ping timeout
On the secondary, it looks like this:
2022-05-23 09:35:04 DR: ping timeout
2022-05-23 09:35:04.639 DR: Receive error
2022-05-23 09:35:04.640 dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken. System error = 4.
2022-05-23 09:35:04.640 DR_ERR set to -1
2022-05-23 09:35:04.640 DR: Receive Btree error
2022-05-23 09:35:05.639 DR: Turned off on secondary server
Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.
What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.
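For anyone hitting this, a few standard onstat views make the stuck state visible (onstat -g cluster is the one that misleadingly still shows the secondary as active):

onstat -            # header only; shows CKPT INP while the checkpoint is pending
onstat -g ckp       # checkpoint history, including why the current one is blocked
onstat -g dri       # HDR replication status as this node sees it
onstat -g cluster   # cluster overview; in our case still listed the secondary as connected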
When it gets to this point, the primary no longer responds to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened, and I can restart the other nodes without incident.
Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.
In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.
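For reference, the relevant onconfig parameters look roughly like this on our side (the DRTIMEOUT value shown is the 30-second default and an assumption on my part; DRTIMEOUT is what drives the ping interval visible in the logs above):

DRAUTO 3                 # failover decided by a Connection Manager arbitrator
DRTIMEOUT 30             # seconds between DR pings before a peer is considered unresponsive
HDR_TXN_SCOPE NEAR_SYNC  # commit returns once the secondary has received (not yet applied) the logs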
I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)
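In case it helps anyone, the demotion was roughly the documented HDR-to-RSS conversion; a sketch with placeholder names prim_server/sec_server (double-check against the docs for your version, since RS secondaries require LOG_INDEX_BUILDS):

# On the primary: RS secondaries require logged index builds
onmode -wf LOG_INDEX_BUILDS=1

# On the primary: register the node as an RSS target
onmode -d add RSS sec_server

# On the current HDR secondary: take the RS secondary role
onmode -d RSS prim_server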
------------------------------
TOM GIRSCH
------------------------------
#Informix