Informix

 View Only
  • 1.  Potentially Serious HDR Bug 14.10.FC7W1

    IBM Champion
    Posted 30 days ago

    All:

    I wanted to make you aware of something we've hit a couple of times and also find out if someone else has encountered it. Twice in the past few months, I've had DR Ping faliures on our HDR primary which seem to have left the cluster in an unstable state. It's as if the primary lost track of the secondary but doesn't realize that it's not there. In online.log on the primary, it looks like this:

    09:33:14  DR: Needed to send a ping message but failed. 1
    
    09:33:44  DR: Needed to send a ping message but failed. 1
    
    09:34:14  DR: Needed to send a ping message but failed. 1
    
    09:34:44  DR: Needed to send a ping message but failed. 1
    
    09:34:54  DR: ping timeout

    On the secondary, it looks like this:

    2022-05-23 09:35:04  DR: ping timeout
    
    2022-05-23 09:35:04.639  DR: Receive error
    
    2022-05-23 09:35:04.640  dr_secrcv thread : asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
    
     System error = 4.
    
    2022-05-23 09:35:04.640  DR_ERR set to -1
    
    2022-05-23 09:35:04.640  DR: Receive Btree error
    
    2022-05-23 09:35:05.639  DR: Turned off on secondary server

    Notice that the "turned off on secondary server" message appears on the secondary but there's no such indication on the primary. An onstat -g cluster from the primary still shows the secondary as connected and active. But it's not.

    What happens next is that the next time the primary tries to run a checkpoint, it waits for the secondary to acknowledge it, and that never happens because the secondary thinks it's disconnected while the primary thinks it is still connected. Users can still connect to the primary and even run queries, but as soon as they execute anything that would do a logged write, they get blocked on the pending checkpoint. An onstat - on the primary shows CKPT_INP.

    When it gets to this point, the primary will no longer respond to onmode commands. The only way to clear the situation is to take down all the other servers in the cluster, then kill the primary using onclean -ky. At that point, it fires right back up as if nothing happened. And I can re-start the other nodes without incident.

    Obviously, this should never happen. If the HDR secondary isn't responding, it should time out and be kicked out of the cluster. The primary shouldn't hang for an eternity waiting around for it.

    In case it's relevant, we have DRAUTO set to 3 and we're using NEAR_SYNC mode.

    I have a case (TS009442018) open on this. Problem is, support wants me to replicate the issue and grab stats next time it happens, which isn't something I'm super keen on doing in production. (For the time being, to avoid the issue, I've demoted the HDR secondary to an RSS.)



    ------------------------------
    TOM GIRSCH
    ------------------------------


  • 2.  RE: Potentially Serious HDR Bug 14.10.FC7W1

    IBM Champion
    Posted 30 days ago
    One additional note: I checked the clocks on both servers, and they seem to be in sync. Both use ntpd.  So it's interesting that the messages in the log files are almost a minute apart between primary and secondary.

    ------------------------------
    TOM GIRSCH
    ------------------------------



  • 3.  RE: Potentially Serious HDR Bug 14.10.FC7W1

    Posted 27 days ago
    When this happens
    It may be that the disk IO performance decreases, causing the CKPT to slow down.
    Or onsoclst thread runs on CPU VP0, but CPU VP0 is very busy and LST does not respond

    ------------------------------
    ZhiWei Cui
    GBASE
    ------------------------------



  • 4.  RE: Potentially Serious HDR Bug 14.10.FC7W1

    Posted 23 days ago
    Hi,
    I had similar problem (on Friday, the thirteenth :) ) - case TS009352831, but on version FC4.
    Data did not replicate to the secondary server.
    On primary server:
    16:42:39 Logical Log 20260756 Complete, timestamp: 0xf0007231.
    16:42:39 Logical Log 20260756 - Backup Started
    16:42:43 Logical Log 20260756 - Backup Completed
    16:43:27 DR: Needed to send a ping message but failed. 1
    16:46:27 DR: Needed to send a ping message but failed. 1
    16:46:57 DR: Needed to send a ping message but failed. 1
    16:47:27 DR: Needed to send a ping message but failed. 1
    16:47:57 DR: Needed to send a ping message but failed. 1
    16:48:01 DR: ping timeout
    16:48:01 DR: Receive error
    16:48:01 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

    16:48:01 DR_ERR set to -1
    16:48:03 DR: Turned off on primary server
    16:48:03 DR: Cannot connect to secondary server
    16:48:04 Checkpoint Completed: duration was 6 seconds.
    16:48:04 Fri May 13 - loguniq 20260757, logpos 0x435f848, timestamp: 0xf00d38cb Interval: 2346815

    16:48:04 Maximum server connections 474
    16:48:04 Checkpoint Statistics - Avg. Txn Block Time 4.025, # Txns blocked 82, Plog used 9984, Llog used 20880

    16:48:13 DR: Primary server connected
    16:48:13 DR: Send error
    16:48:13 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

    16:48:13 DR_ERR set to -2
    16:48:13 DR: Failure recovery error (2)
    16:48:13 SCHAPI: dbutil threads is already running.
    16:48:13 SCHAPI: dbScheduler threads is already running.
    16:48:14 DR: Turned off on primary server
    16:48:14 DR: Cannot connect to secondary server
    16:48:24 DR: Primary server connected
    16:48:24 DR: Send error
    16:48:24 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.


    On secondary server:
    16:42:32 Maximum server connections 72
    16:42:32 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 1933, Llog used 0

    16:42:39 Logical Log 20260756 Complete, timestamp: 0xf0006626.
    16:48:12 DR: Received connection request from remote server when DR is not Off
    [Local type: Secondary, Current state: ?]
    [Remote type: Primary]

    16:48:12 DR: ping timeout
    16:48:24 DR: Received connection request from remote server when DR is not Off
    [Local type: Secondary, Current state: FAILED]
    [Remote type: Primary]

    16:48:35 DR: Received connection request from remote server when DR is not Off
    [Local type: Secondary, Current state: FAILED]
    [Remote type: Primary]

    We didn't have data replication, but on the primary server the work looked fine.
    Because some applications use secondary server, I decided to restart the secondary server.
    After restart secondary did not improve and was still out of sync,  but on the primary there was a performance problem:
    Many sessions had the status of G-BPX-- flags and application processing was slower.
    I didn't restart primary for fear of losing data to primary, so I decided to rebuild HDR first and then do a primary restart. And it was done that way.
    I was going to upgrade to version FC8, but I don't see any HDR-related fixes there, so I guess I'll wait for critical fixes for HDR.
    Regards,
    Robert Wolański






    ------------------------------
    Robert Wolanski
    ------------------------------