Informix

Informix

Connect with Db2, Informix, Netezza, open source, and other data experts to gain value from your data, share insights, and solve problems.

 View Only
Expand all | Collapse all

What triggers a blocking checkpoint?

  • 1.  What triggers a blocking checkpoint?

    Posted 22 days ago

    Hi Y'all family.

    In the system under my purvey, checkpoints occur very often, almost always duration is on the order of 1/10 of a second.  However (Bet'cha I telegraphed that coming :-), a couple of times a day, there is a blocking checkpoint that temporarily blocks all applications for the duration. Most of the time it happens in early morning but often enough it occurs at other times.  I did a grep on the current online.log:

    $ egrep -e 'Block Time [1-9]' /usr/informix/online.log|awk '$10 >= 20.0 {print}'

    05/20/25 06:31:23 Checkpoint Statistics - Avg. Txn Block Time 9.120,  # Txns blocked 16,  Plog used 64451,  Llog used 160854
    05/20/25 09:03:15 Checkpoint Statistics - Avg. Txn Block Time 9.575,  # Txns blocked 190, Plog used 193460, Llog used 111805
    05/21/25 09:02:59 Checkpoint Statistics - Avg. Txn Block Time 7.341,  # Txns blocked 94,  Plog used 157671, Llog used 112949
    05/22/25 06:48:09 Checkpoint Statistics - Avg. Txn Block Time 32.120, # Txns blocked 114, Plog used 263959, Llog used 126449
    05/23/25 06:31:34 Checkpoint Statistics - Avg. Txn Block Time 8.778,  # Txns blocked 21,  Plog used 129296, Llog used 193806
    05/23/25 09:02:47 Checkpoint Statistics - Avg. Txn Block Time 6.448,  # Txns blocked 87,  Plog used 123006, Llog used  87426
    05/25/25 06:00:10 Checkpoint Statistics - Avg. Txn Block Time 3.855,  # Txns blocked 47,  Plog used 39997,  Llog used  55223
    05/25/25 06:36:17 Checkpoint Statistics - Avg. Txn Block Time 30.251, # Txns blocked 55,  Plog used 79072,  Llog used 149772
    05/26/25 06:36:12 Checkpoint Statistics - Avg. Txn Block Time 60.311, # Txns blocked 85,  Plog used 238803, Llog used 213004
    05/27/25 00:27:57 Checkpoint Statistics - Avg. Txn Block Time 22.939, # Txns blocked 38,  Plog used 16940,  Llog used  27395
    05/27/25 06:44:15 Checkpoint Statistics - Avg. Txn Block Time 48.694, # Txns blocked 68,  Plog used 44170,  Llog used 140095

    With the awk command I tried to see only those with duration over 20 seconds.  Why awk is ignoring my filter is out of the scope of this question.)  I have see durations as long as 150 seconds.  Interestingly enough, each of these that I chased down is preceded by a bunch of lines like this, from this morning:

    05/27/25 06:43:26  Physical Logging while in Critical Section: Number of pages logged in critcal section: 40 Remaining Phyical Log: 1003977
    05/27/25 06:43:29  Physical Logging while in Critical Section: Number of pages logged in critcal section: 80 Remaining Phyical Log: 1003936
    05/27/25 06:43:33  Physical Logging while in Critical Section: Number of pages logged in critcal section: 120 Remaining Phyical Log: 1003896
    05/27/25 06:43:36  Physical Logging while in Critical Section: Number of pages logged in critcal section: 160 Remaining Phyical Log: 1003855
    05/27/25 06:43:39  Physical Logging while in Critical Section: Number of pages logged in critcal section: 200 Remaining Phyical Log: 1003815
    05/27/25 06:43:42  Physical Logging while in Critical Section: Number of pages logged in critcal section: 240 Remaining Phyical Log: 1003774
    05/27/25 06:43:45  Physical Logging while in Critical Section: Number of pages logged in critcal section: 280 Remaining Phyical Log: 1003734
    05/27/25 06:43:48  Physical Logging while in Critical Section: Number of pages logged in critcal section: 320 Remaining Phyical Log: 1003694
    05/27/25 06:43:51  Physical Logging while in Critical Section: Number of pages logged in critcal section: 360 Remaining Phyical Log: 1003653
    05/27/25 06:43:54  Physical Logging while in Critical Section: Number of pages logged in critcal section: 400 Remaining Phyical Log: 1003613
    05/27/25 06:43:57  Physical Logging while in Critical Section: Number of pages logged in critcal section: 440 Remaining Phyical Log: 1003572
    05/27/25 06:44:00  Physical Logging while in Critical Section: Number of pages logged in critcal section: 480 Remaining Phyical Log: 1003532
    05/27/25 06:44:04  Physical Logging while in Critical Section: Number of pages logged in critcal section: 520 Remaining Phyical Log: 1003491
    05/27/25 06:44:07  Physical Logging while in Critical Section: Number of pages logged in critcal section: 560 Remaining Phyical Log: 1003451
    05/27/25 06:44:10  Physical Logging while in Critical Section: Number of pages logged in critcal section: 600 Remaining Phyical Log: 1003411
    05/27/25 06:44:15  Checkpoint Completed:  duration was 51 seconds.
    05/27/25 06:44:15  Tue May 27 - loguniq 2967023, logpos 0x41b2444, timestamp: 0xea43c2a9 Interval: 2341876

    05/27/25 06:44:15  Maximum server connections 407
    05/27/25 06:44:15  Checkpoint Statistics - Avg. Txn Block Time 48.694, # Txns blocked 68, Plog used 44170, Llog used 140095

    Referring back to  this post about Physical Logging message, I understand that John Lengyel has rephrased the message. Still, it can't be coincidence that these messages appear during a blocking checkpoint.  And I think in this case it is not useless as JL suggests.  Scary? Yeah, kinda.

    So back to the bottom line question in my subject line: What causes a blocking checkpoint?  Its a pretty good bet that nobody is doing TRUNCATE operation and I don't see any sign of a long-transaction rollback in the log.  If we know what could be casueing it we may be able to re-engineer the application code to stop doing that.

    Thanks for ideas here.



    ------------------------------
    -- Jacob Salomon
    ------------------------------


  • 2.  RE: What triggers a blocking checkpoint?

    Posted 22 days ago

    Jacob,

    I cannot answer your checkpoint question, but your awk command failed
    because field 10 ends with a ',' so the test ($10 >= 20.0) is failing.
    Use sed to strip out the commas first:

    egrep -e 'Block Time [1-9]' /usr/informix/online.log | \
        sed -e 's/,//' x | awk '$10 >= 20.0 {print $10}'

    scot




  • 3.  RE: What triggers a blocking checkpoint?

    Posted 22 days ago

    Oops, extraneous 'x' in there from my testing.
    Correct command is:

    egrep -e 'Block Time [1-9]' /usr/informix/online.log | \
        sed -e 's/,//' | awk '$10 >= 20.0 {print $10}'

    Why do we always catch the error after we hit SEND?




  • 4.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    Scott,

    The interval of time between when you hit [Send] and you realize your mistake is is called an "ohnosecond". Similar for when you press [ENTER] on that delete command that you should have edited. ;-)

    This forum needs emojis! 8-)

    Thanks Scott, for the AWK tidbit.  But I'm still in the dark about the blocked checkpoints.



    ------------------------------
    -- Jacob Salomon
    ------------------------------



  • 5.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    Jacob:

    Need more details like the reported trigger in the onstat -g ckp report or the supporting sysmaster table. Since only a short list of checkpoints are kept online, the best thing to do would be to set up a cron job to run every 15 mins from 6:00 until 10:00 daily and post or send me the output file(s).

    Art



    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 6.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago
    Thanks for the idea, Art.

    Will do.

    It occurred to me that I could set up a daemon to tail the online log and react whenever it sees a line with "duration was". If the duration is more than 20 seconds then run the onstat -g ckp.
    But for now, just using the simple cron.

    I'll be watching te output for any cause other than CKPTINTVL.

    • Jacob S





  • 7.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    Hats off (but not the kipa :-) to Art, and both Andreas's.

    The overwhelming proportion of checkpoints last under 1/2 second, hough some go over 5 seconds. Nobody complains about those.  But this morning had a 2-1/2 minute blockage. OUCH!

    First Art: I indeed set up that 15-minute cron job.  This morning I checked the online log and found, at 6:46 AM, the following:

    05/28/25 06:43:25  Logical Log 2967450 - Backup Started
    05/28/25 06:43:30  Logical Log 2967450 - Backup Completed
    05/28/25 06:43:43  Physical Logging while in Critical Section: Number of pages logged in critcal section: 40 Remaining Phyical Log: 862792
    05/28/25 06:43:46  Physical Logging while in Critical Section: Number of pages logged in critcal section: 80 Remaining Phyical Log: 862752
    .... (total of 49 such lines)

    05/28/25 06:46:05  Physical Logging while in Critical Section: Number of pages logged in critcal section: 1920 Remaining Phyical Log: 860893
    05/28/25 06:46:08  Physical Logging while in Critical Section: Number of pages logged in critcal section: 1960 Remaining Phyical Log: 860852
    05/28/25 06:46:18  Checkpoint Completed:  duration was 157 seconds.
    05/28/25 06:46:18  Wed May 28 - loguniq 2967451, logpos 0x3a163a8, timestamp: 0x304bd7d6 Interval: 2342399

    05/28/25 06:46:18  Maximum server connections 415
    05/28/25 06:46:18  Checkpoint Statistics - Avg. Txn Block Time 149.940, # Txns blocked 81, Plog used 186726, Llog used 124998

    And yes, there was a 2-1/2 minute freeze-up starting at 6:43.  So now, let's look at the 15-minute interval for the onstat -g ckp. (We don't need to see all 20 entries)

    AUTO_CKPTS=On   RTO_SERVER_RESTART=Off

                                                                        Critical Sections                          Physical Log    Logical Log
               Clock                                  Total Flush Block #      Ckpt  Wait  Long  # Dirty   Dskflu  Total    Avg    Total    Avg
    Interval   Time      Trigger    LSN               Time  Time  Time  Waits  Time  Time  Time  Buffers   /Sec    Pages    /Sec   Pages    /Sec

    2342398    06:39:39  CKPTINTVL  2967448:0xc686018 4.6   4.6   0.0   3      0.0   0.0   0.0   150708    33112   244770   995    178586   725
    2342399    06:46:17  CKPTINTVL  2967451:0x3a163a8 157.9 7.6   0.0   81     149.9 82.4  150.0 131521    17328   186726   473    124998   317
    2342400    06:50:19  CKPTINTVL  2967454:0x2ce3280 2.5   2.5   0.0   2      0.0   0.0   0.0   141278    56862   306279   1234   150773   607

    This is really disappointing! I had expected that middle one above to be caused by an LTXEHWM or a physical log over 75% full.  Nope: It was the ordinary checkpoint triggered by the 4-minute setting.

    So, as Thomas Edison was famous for saying: We have not failed; we have a method that does not work.  And actually, it did tell us something: A direction where NOT to bother looking.

    Now for Andreas L's thoughtful thesis.  Bottom line:

    $ onstat -c MAX_FILL_DATA_PAGES
    1

    Exactly as Andreas supposed. There is an oncheck command (-T?) to report on the fullness of pages within a TBLspace but in a database this size that's kinda impractical.

    But I will recommend changing that onconfig parameter.  Certainly worth a try!

    Thanks much to all.  I'll let y'all know if this solved it.



    ------------------------------
    Jacob Salomon
    ------------------------------



  • 8.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    Jacob:

    What version are you running there now (it's been a while so I don't remember)?

    Here's the deal with MAX_FILL_DATA_PAGES:

    1. When originally implemented, if a new row, or an expanded row after update, would not fit on the targeted page, or its original page, the engine had to find a new location for it. That would trigger a search of the table's bitmaps to find a page with at least some space, read it into the cache, then see if the new/updated row would fit there. If so, done, but if not, the next row whose bitmap entry indicated that it was less than 100% full would be tried, and so on.
    2. That was a really slow process in any variable length table that had frequent updates and/or deletes. So, later versions created the PFSC (Partition Free Space Cache). This was enabled by setting the PFSC parameter to 1 in the ONCONFIG file. Then any page from a variable length table that was encountered after startup that had some free space (aka less than 90% full) was added dynamically to an initial empty cache of page numbers that had space along with the number of free bytes (or free percent - that's not clear). Pages with free space that had not been examined since startup were not in the PFSC. However, if the new/updated row did not fit on one of the "encoutered" and so cached pages, the engine went back to #1. So still slow.
    3. Finally, in later releases of v14.10 the new ONCONFIG parameter PFSC_BOOST could be set to 1 which would cause the engine to fill the PFSC's for all variable length tables during startup so that a page with sufficient free space could be quickly located and read in if not already in a buffer cache.

    So, depending whether you are using a version with the PFSC code or not, and if so whether the parameter enabling it is set, determines determines whether to turn off MAX_FILL_DATA_PAGES or try to turn on PFSC.

    If you already have PFSC enabled and have a sufficiently recent release, but do not have PFSC_BOOST turned on, then try that. Both parameters are dynamic so can be set using onmode.

    Art



    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 9.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago
    Hi Jacob,

    when looking at the numbers, you can see that the amount of pages dumped during the long
    checkpoint was not significantly higher than the other ones.
    I would have expected a huge increase in the number of dirty pages handled during the checkpoint.
    Are you sure there is no I/O bottleneck? Caused by another process ?

    In case you want to flush pages between the checkpoint interval, you need to configure the flushers,
    in order to start flushing at x% of the configured LRU queues (which should be below the number of pages,
    which are usually flushed).
    Check with onstat -R for currently configured values.
    In that case, server will start flushing between the checkpoints until the low percentage is reached.

    Best,

    MARCUS HAARMANN






  • 10.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    Jacob:

    On the new data, it looks like the long checkpoints are being caused by high transaction rates (hence the messages about the physical log and the "normal" page flush counts) that are causing the checkpoint to have to wait for many existing critical sections of client code to complete releasing that latch. The MAX_FILL_DATA_PAGES can be one cause of long critical sections, but there can be others, less easily worked around. As Andreas suggested, address this problem first as I suggested in my last post depending your version and config. If that doesn't solve it, then you can take the next steps.

    Art



    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 11.  RE: What triggers a blocking checkpoint?

    Posted 20 days ago
    Art and others who asked about our current release: IDS 12.10.FC9






  • 12.  RE: What triggers a blocking checkpoint?

    Posted 20 days ago

    Jacob:

    OK, so v12.10.XX does not have the PFSC, so you should disable MAX_FILL_DATA_PAGE and see if that relieves the issue.

    Art



    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 13.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    Hallo Jacob,

    It is much simpler. All onstat -g ckp data is in the sysadmin:mon_checkpoint table when the scheduler is running in default configuration. You can also simply select the records with a long runtime there.

     select dbinfo('UTC_TO_DATETIME', clock_time) as time, * from mon_checkpoint 
     where cp_time > 20
     order by 1 desc;

    regards

    Andreas



    ------------------------------
    Andreas Seifert
    CURSOR Software AG
    http://www.admin-scout.com
    ------------------------------



  • 14.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago
    Hi Jacob,

    if I understand you right, you do not want to know afterwards, when long checkpoints occurred,
    but what is (was) happening when such a long checkpoint is triggered.
    A long checkpoint typically is a result of very active writing.
    (or too slow I/O).
    As far as I can see from your request, these are happening roughly at a specific time,
    so it would either be a perpetual user activity or a job running at this time.
    In order to identify the activity that results in such a long checkpoint, I would monitor the DB activity
    with a cron job, running each minute around the suspicious time, which captures the output of onstat -u 
    (sorted desc by column locks or nwrites) and the corresponding onstat -g sql  and onstat -g ses 
    for the top 3 sessions from this output.

    onstat -u | grep ^[0-9] | grep -v informix | sort -r -n -k 10 | head -3 | awk ' { print $3 }' > ${TEMPFILE}
    for i in `cat ${TEMPFILE}
    do
         onstat -g sql $i >> ${OUTFILE}
         onstat -g ses $i >> ${OUTFILE}
    done    

    This should give you a hint which sessions are writing a lot of data during the time
     and what kind of operation is currently done (and which user/machine is doing something) .

    Best,

    MARCUS HAARMANN






  • 15.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago

    In database systems like Informix, a checkpoint is a critical process that synchronizes the database buffers in memory with the data on disk. It ensures that committed transactions are hardened to storage up to a certain point, which is essential for crash recovery.

    A "blocking" checkpoint in Informix specifically refers to the fact that while the checkpoint process is writing modified data pages from memory to disk, it can cause transactions that need to write to the logical logs (like INSERT, UPDATE, DELETE) to pause or slow down until the critical phases of the checkpoint are complete. This "blocking" ensures that the physical state on disk matches the logical state up to the checkpoint LSN (Log Sequence Number) before significant new work proceeds, guaranteeing a clean recovery point.

    Good luck!
    SUMAN



    ------------------------------
    Suman Suhag
    Dev Bhoomi Uttarakhand University
    Data Science Student
    +91 8950196825 [Jhajjar, Haryana, India]
    ------------------------------



  • 16.  RE: What triggers a blocking checkpoint?

    Posted 21 days ago
    Edited by Andreas Legner 21 days ago

    Hi Jacob,

    first of all, there's an important distinction to make here, between blocking checkpoints and blocking checkpoints (you read this right).

    The official term "blocking checkpoint", since introduction of "non-blocking checkpoint" with version 11.10, refers to a checkpoint that intentionally blocks all modification and logging for its entire duration (which typically is determined by the time it takes to flush all the dirty buffers).   You'd be able to tell such checkpoints from an asterisk in front of the 'Trigger' in onstat -g ckp, or from the 'type' column in sysadmin:mon_checkpoint.  Such checkpoints are rare, don't normally occur and get triggered by things like archives, chunk creation, ... I'm assuming this is not what we're talking about here.

    The type of checkpoint blocking you're observing likely is what every checkpoint, even nominally non-blocking ones, has to incur at its start, normally very briefly and not noticeable: for gathering the list of buffers to flush, the checkpoint has to briefly halt the world by

    • blocking any new modifications
    • waiting on any just ongoing (individual) modifications, e.g. insert of a row, to complete

    We're calling this

    • preventing any threads from entering a "critical section", an operation that cannot interfere with a checkpoint
    • waiting on any thread being "in critical section" to exit from it

    The "Txns Blocked" in those message log entries refers to the first, the "Avg. Txn Block Time" (or "Wait Time" + "Long Time" in onstat) to the second.

    These "critical sections" are meant to be very short, in terms of time, so said waiting should be negligible, but there are situations where they can become long, sometimes very long - and this would primarily be noticed by the DML behind them but could also impact a checkpoint should they coincide.

    So first question:  do you see signs of a performance problem also between such long blocking checkpoints?

    The most common scenario causing such long lasting critical sections:

    inserts into variable row length tables searching for a page with sufficient free space for the new row, with MAX_FILL_DATA_PAGES set to 1 and bitmap suggesting free space, for many data pages, that then each time turns out to be insufficient.

    There've been changes in this area, one question is what version you're running, and I'd suggest a support case to clarify the details.

    HTH,
     Andreas



    ------------------------------
    Andreas Legner
    Informix Dev
    HCL Software
    ------------------------------