Had a "fun" one today. We have a four-node cluster that features a primary, an HDR secondary (NEAR_SYNC mode), a regular RSS secondary and an RSS secondary with DELAY_APPLY=12H. The cluster is running a patched version, with three of the four nodes running 12.10.FC14XO but the primary still on 12.10.FC14XF, pending a scheduled maintenance window to complete the in-place minor upgrade.
Today, I did routine maintenance on the DELAY_APPLY RSS node. When I took that node down, the primary hit big problems: it stopped accepting new connections, it froze write transactions for already-connected sessions, and it stopped talking to the other two nodes in the cluster. The primary stayed in that state until the DELAY_APPLY RSS node rejoined the cluster, at which point everything freed back up again, but it amounted to roughly a 15-minute production outage. The bulk of that time appears to have been spent with the engine stuck in a checkpoint, even though the completed checkpoint shows no block time:
AUTO_CKPTS=On   RTO_SERVER_RESTART=Off

                                                                 Critical Sections                       Physical Log    Logical Log
           Clock                               Total Flush Block #      Ckpt  Wait  Long  # Dirty Dskflu Total    Avg    Total    Avg
Interval   Time     Trigger   LSN              Time  Time  Time  Waits  Time  Time  Time  Buffers /Sec   Pages    /Sec   Pages    /Sec
593376     16:18:51 CKPTINTVL 216176:0x86a018  0.2   0.1   0.0   0      0.0   0.0   0.0   2008    2008   9917     33     2587     8
593377     16:36:41 CKPTINTVL 216177:0xa6f0    770.0 0.7   0.0   1720   768.9 338.3 769.0 3291    3291   9552     8      2875     2
593378     16:36:57 HDR       216177:0x305018  0.8   0.1   0.0   18     0.6   0.5   0.7   41541   41541  7666     450    791      46
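Note the middle row: checkpoint 593377 took 770.0 seconds total with a Block Time of 0.0, while the Critical Sections columns show 1720 waits and Ckpt/Long times around 769 seconds, which lines up with the frozen write sessions. If anyone wants to poke at a similar hang while it's happening, these are the standard onstat views I'd look at (ordinary utility flags, nothing specific to our build):

    onstat -g ckp            # checkpoint history, like the output above
    onstat -g dri            # HDR pair status from the primary's point of view
    onstat -g rss verbose    # per-RSS-node status, including the delayed one
    onstat -g ath            # thread list, to see what everyone is waiting on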
Has anyone else seen anything like this?
Because of some code issues on our end, there were residual data problems for nearly an hour after the engine itself recovered.
OS: CentOS 7, 3.10.0-1127.19.1.el7.x86_64 #1 SMP
------------------------------
TOM GIRSCH
------------------------------