Things That Should Never Happen

  • 1.  Things That Should Never Happen

    Posted Thu February 18, 2021 02:30 PM

    Had a "fun" one today. We have a four-node cluster with a primary, an HDR secondary (NEAR_SYNC mode), a regular RSS secondary, and an RSS secondary with DELAY_APPLY=12H. The cluster is running a patched version and is partway through an in-place minor upgrade: three of the four nodes are on 12.10.FC14XO, but the primary is still on 12.10.FC14XF, pending a scheduled maintenance window to finish the upgrade.

    Today, I did routine maintenance on the DELAY_APPLY RSS node. When I took that node down, the primary ran into big problems: it stopped accepting new connections, it froze write transactions on already-connected sessions, and it stopped talking to the other two nodes in the cluster. The primary stayed in that state until the DELAY_APPLY RSS node rejoined the cluster, at which point everything freed up again. But it was about a 15-minute production outage. It looks like the bulk of that time was spent with the engine stuck in a checkpoint (interval 593377, total time 770.0 seconds), even though the completed checkpoint showed no block time:

                                                                        Critical Sections                          Physical Log    Logical Log    
               Clock                                  Total Flush Block #      Ckpt  Wait  Long  # Dirty   Dskflu  Total    Avg    Total    Avg   
    Interval   Time      Trigger    LSN               Time  Time  Time  Waits  Time  Time  Time  Buffers   /Sec    Pages    /Sec   Pages    /Sec  
    593376     16:18:51  CKPTINTVL  216176:0x86a018   0.2   0.1   0.0   0      0.0   0.0   0.0   2008      2008    9917     33     2587     8     
    593377     16:36:41  CKPTINTVL  216177:0xa6f0     770.0 0.7   0.0   1720   768.9 338.3 769.0 3291      3291    9552     8      2875     2     
    593378     16:36:57  HDR        216177:0x305018   0.8   0.1   0.0   18     0.6   0.5   0.7   41541     41541   7666     450    791      46
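
    For anyone who wants to pull the same numbers out of their own checkpoint history, here is a minimal sketch that parses the three rows above and flags any checkpoint whose total time exceeds a threshold. The field positions are my reading of the column header, and the names `parse_row`, `slow_checkpoints`, and the 60-second threshold are made up for illustration; they are not part of any Informix tool.

```python
# The three data rows from the checkpoint history above, whitespace-normalized.
ROWS = """\
593376 16:18:51 CKPTINTVL 216176:0x86a018 0.2 0.1 0.0 0 0.0 0.0 0.0 2008 2008 9917 33 2587 8
593377 16:36:41 CKPTINTVL 216177:0xa6f0 770.0 0.7 0.0 1720 768.9 338.3 769.0 3291 3291 9552 8 2875 2
593378 16:36:57 HDR 216177:0x305018 0.8 0.1 0.0 18 0.6 0.5 0.7 41541 41541 7666 450 791 46
"""


def parse_row(line):
    """Split one data row into the fields used below.

    Assumed field order (taken from the header above): interval, clock
    time, trigger, LSN, total time, flush time, block time, # waits,
    then the three critical-section times (Ckpt, Wait, Long).
    """
    f = line.split()
    return {
        "interval": int(f[0]),
        "clock": f[1],
        "trigger": f[2],
        "total_s": float(f[4]),
        "block_s": float(f[6]),
        "waits": int(f[7]),
        "long_s": float(f[10]),
    }


def slow_checkpoints(text, threshold_s=60.0):
    """Return parsed rows whose total checkpoint time is >= threshold_s.

    The 60-second default is an arbitrary illustrative value.
    """
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    return [r for r in rows if r["total_s"] >= threshold_s]


for r in slow_checkpoints(ROWS):
    print(
        f'{r["interval"]} at {r["clock"]}: total {r["total_s"] / 60:.1f} min, '
        f'block {r["block_s"]} s, {r["waits"]} waits, Long Time {r["long_s"]} s'
    )
```

    Only interval 593377 is flagged: its 770.0 seconds of total time is roughly 12.8 minutes, which accounts for most of the outage, while its Block Time column still reads 0.0.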

    Has anyone else seen anything like this?

    Because of some code issues on our end, there were residual data problems for nearly an hour after the engine itself recovered.

    OS: CentOS 7, 3.10.0-1127.19.1.el7.x86_64 #1 SMP


  • 2.  RE: Things That Should Never Happen

    Posted Thu February 18, 2021 02:50 PM
    That's a new one on me, Tom. Time for a call to support.

    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.

  • 3.  RE: Things That Should Never Happen

    Posted Thu February 18, 2021 04:08 PM
    Already done. The connected Connection Managers (CMs) also dropped out during that window but did NOT initiate a failover, even though failover was enabled.