What triggers a blocking checkpoint?

4. RE: What triggers a blocking checkpoint?

Like

Jacob Salomon

Posted Tue May 27, 2025 04:13 PM

Scott,

The interval of time between when you hit [Send] and you realize your mistake is is called an "ohnosecond". Similar for when you press [ENTER] on that delete command that you should have edited. ;-)

This forum needs emojis! 8-)

Thanks Scott, for the AWK tidbit. But I'm still in the dark about the blocked checkpoints.

------------------------------
-- Jacob Salomon
------------------------------

Original Message

Original Message:
Sent: Tue May 27, 2025 03:39 PM
From: Scot Jenkins
Subject: What triggers a blocking checkpoint?

Oops, extraneous 'x' in there from my testing.
Correct command is:

egrep -e 'Block Time [1-9]' /usr/informix/online.log | \
sed -e 's/,//' | awk '$10 >= 20.0 {print $10}'

Why do we always catch the error after we hit SEND?

Original Message:
Sent: Tue May 27, 2025 03:37 PM
From: Scot Jenkins
Subject: What triggers a blocking checkpoint?

Jacob,

I cannot answer your checkpoint question, but your awk command failed
because field 10 ends with a ',' so the test ($10 >= 20.0) is failing.
Use sed to strip out the commas first:

egrep -e 'Block Time [1-9]' /usr/informix/online.log | \
sed -e 's/,//' x | awk '$10 >= 20.0 {print $10}'

scot

Original Message:
Sent: Tue May 27, 2025 02:29 PM
From: Jacob Salomon
Subject: What triggers a blocking checkpoint?

Hi Y'all family.

In the system under my purvey, checkpoints occur very often, almost always duration is on the order of 1/10 of a second. However (Bet'cha I telegraphed that coming :-), a couple of times a day, there is a blocking checkpoint that temporarily blocks all applications for the duration. Most of the time it happens in early morning but often enough it occurs at other times. I did a grep on the current online.log:

$ egrep -e 'Block Time [1-9]' /usr/informix/online.log|awk '$10 >= 20.0 {print}'

05/20/25 06:31:23 Checkpoint Statistics - Avg. Txn Block Time 9.120, # Txns blocked 16, Plog used 64451, Llog used 160854
05/20/25 09:03:15 Checkpoint Statistics - Avg. Txn Block Time 9.575, # Txns blocked 190, Plog used 193460, Llog used 111805
05/21/25 09:02:59 Checkpoint Statistics - Avg. Txn Block Time 7.341, # Txns blocked 94, Plog used 157671, Llog used 112949
05/22/25 06:48:09 Checkpoint Statistics - Avg. Txn Block Time 32.120, # Txns blocked 114, Plog used 263959, Llog used 126449
05/23/25 06:31:34 Checkpoint Statistics - Avg. Txn Block Time 8.778, # Txns blocked 21, Plog used 129296, Llog used 193806
05/23/25 09:02:47 Checkpoint Statistics - Avg. Txn Block Time 6.448, # Txns blocked 87, Plog used 123006, Llog used 87426
05/25/25 06:00:10 Checkpoint Statistics - Avg. Txn Block Time 3.855, # Txns blocked 47, Plog used 39997, Llog used 55223
05/25/25 06:36:17 Checkpoint Statistics - Avg. Txn Block Time 30.251, # Txns blocked 55, Plog used 79072, Llog used 149772
05/26/25 06:36:12 Checkpoint Statistics - Avg. Txn Block Time 60.311, # Txns blocked 85, Plog used 238803, Llog used 213004
05/27/25 00:27:57 Checkpoint Statistics - Avg. Txn Block Time 22.939, # Txns blocked 38, Plog used 16940, Llog used 27395
05/27/25 06:44:15 Checkpoint Statistics - Avg. Txn Block Time 48.694, # Txns blocked 68, Plog used 44170, Llog used 140095

With the awk command I tried to see only those with duration over 20 seconds. Why awk is ignoring my filter is out of the scope of this question.) I have see durations as long as 150 seconds. Interestingly enough, each of these that I chased down is preceded by a bunch of lines like this, from this morning:

05/27/25 06:43:26 Physical Logging while in Critical Section: Number of pages logged in critcal section: 40 Remaining Phyical Log: 1003977
05/27/25 06:43:29 Physical Logging while in Critical Section: Number of pages logged in critcal section: 80 Remaining Phyical Log: 1003936
05/27/25 06:43:33 Physical Logging while in Critical Section: Number of pages logged in critcal section: 120 Remaining Phyical Log: 1003896
05/27/25 06:43:36 Physical Logging while in Critical Section: Number of pages logged in critcal section: 160 Remaining Phyical Log: 1003855
05/27/25 06:43:39 Physical Logging while in Critical Section: Number of pages logged in critcal section: 200 Remaining Phyical Log: 1003815
05/27/25 06:43:42 Physical Logging while in Critical Section: Number of pages logged in critcal section: 240 Remaining Phyical Log: 1003774
05/27/25 06:43:45 Physical Logging while in Critical Section: Number of pages logged in critcal section: 280 Remaining Phyical Log: 1003734
05/27/25 06:43:48 Physical Logging while in Critical Section: Number of pages logged in critcal section: 320 Remaining Phyical Log: 1003694
05/27/25 06:43:51 Physical Logging while in Critical Section: Number of pages logged in critcal section: 360 Remaining Phyical Log: 1003653
05/27/25 06:43:54 Physical Logging while in Critical Section: Number of pages logged in critcal section: 400 Remaining Phyical Log: 1003613
05/27/25 06:43:57 Physical Logging while in Critical Section: Number of pages logged in critcal section: 440 Remaining Phyical Log: 1003572
05/27/25 06:44:00 Physical Logging while in Critical Section: Number of pages logged in critcal section: 480 Remaining Phyical Log: 1003532
05/27/25 06:44:04 Physical Logging while in Critical Section: Number of pages logged in critcal section: 520 Remaining Phyical Log: 1003491
05/27/25 06:44:07 Physical Logging while in Critical Section: Number of pages logged in critcal section: 560 Remaining Phyical Log: 1003451
05/27/25 06:44:10 Physical Logging while in Critical Section: Number of pages logged in critcal section: 600 Remaining Phyical Log: 1003411
05/27/25 06:44:15 Checkpoint Completed: duration was 51 seconds.
05/27/25 06:44:15 Tue May 27 - loguniq 2967023, logpos 0x41b2444, timestamp: 0xea43c2a9 Interval: 2341876

05/27/25 06:44:15 Maximum server connections 407
05/27/25 06:44:15 Checkpoint Statistics - Avg. Txn Block Time 48.694, # Txns blocked 68, Plog used 44170, Llog used 140095

Referring back to this post about Physical Logging message, I understand that John Lengyel has rephrased the message. Still, it can't be coincidence that these messages appear during a blocking checkpoint. And I think in this case it is not useless as JL suggests. Scary? Yeah, kinda.

So back to the bottom line question in my subject line: What causes a blocking checkpoint? Its a pretty good bet that nobody is doing TRUNCATE operation and I don't see any sign of a long-transaction rollback in the log. If we know what could be casueing it we may be able to re-engineer the application code to stop doing that.

Thanks for ideas here.

------------------------------
-- Jacob Salomon
------------------------------

8. RE: What triggers a blocking checkpoint?

Like

Art Kagel

IBM Champion

Posted Wed May 28, 2025 01:20 PM

Jacob:

What version are you running there now (it's been a while so I don't remember)?

Here's the deal with MAX_FILL_DATA_PAGES:

When originally implemented, if a new row, or an expanded row after update, would not fit on the targeted page, or its original page, the engine had to find a new location for it. That would trigger a search of the table's bitmaps to find a page with at least some space, read it into the cache, then see if the new/updated row would fit there. If so, done, but if not, the next row whose bitmap entry indicated that it was less than 100% full would be tried, and so on.
That was a really slow process in any variable length table that had frequent updates and/or deletes. So, later versions created the PFSC (Partition Free Space Cache). This was enabled by setting the PFSC parameter to 1 in the ONCONFIG file. Then any page from a variable length table that was encountered after startup that had some free space (aka less than 90% full) was added dynamically to an initial empty cache of page numbers that had space along with the number of free bytes (or free percent - that's not clear). Pages with free space that had not been examined since startup were not in the PFSC. However, if the new/updated row did not fit on one of the "encoutered" and so cached pages, the engine went back to #1. So still slow.
Finally, in later releases of v14.10 the new ONCONFIG parameter PFSC_BOOST could be set to 1 which would cause the engine to fill the PFSC's for all variable length tables during startup so that a page with sufficient free space could be quickly located and read in if not already in a buffer cache.

So, depending whether you are using a version with the PFSC code or not, and if so whether the parameter enabling it is set, determines determines whether to turn off MAX_FILL_DATA_PAGES or try to turn on PFSC.

If you already have PFSC enabled and have a sufficiently recent release, but do not have PFSC_BOOST turned on, then try that. Both parameters are dynamic so can be set using onmode.

Art

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com
------------------------------

Original Message

Original Message:
Sent: Wed May 28, 2025 12:40 PM
From: Jacob Salomon
Subject: What triggers a blocking checkpoint?

Hats off (but not the kipa :-) to Art, and both Andreas's.

The overwhelming proportion of checkpoints last under 1/2 second, hough some go over 5 seconds. Nobody complains about those. But this morning had a 2-1/2 minute blockage. OUCH!

First Art: I indeed set up that 15-minute cron job. This morning I checked the online log and found, at 6:46 AM, the following:

05/28/25 06:43:25 Logical Log 2967450 - Backup Started
05/28/25 06:43:30 Logical Log 2967450 - Backup Completed
05/28/25 06:43:43 Physical Logging while in Critical Section: Number of pages logged in critcal section: 40 Remaining Phyical Log: 862792
05/28/25 06:43:46 Physical Logging while in Critical Section: Number of pages logged in critcal section: 80 Remaining Phyical Log: 862752
.... (total of 49 such lines)

05/28/25 06:46:05 Physical Logging while in Critical Section: Number of pages logged in critcal section: 1920 Remaining Phyical Log: 860893
05/28/25 06:46:08 Physical Logging while in Critical Section: Number of pages logged in critcal section: 1960 Remaining Phyical Log: 860852
05/28/25 06:46:18 Checkpoint Completed: duration was 157 seconds.
05/28/25 06:46:18 Wed May 28 - loguniq 2967451, logpos 0x3a163a8, timestamp: 0x304bd7d6 Interval: 2342399

05/28/25 06:46:18 Maximum server connections 415
05/28/25 06:46:18 Checkpoint Statistics - Avg. Txn Block Time 149.940, # Txns blocked 81, Plog used 186726, Llog used 124998

And yes, there was a 2-1/2 minute freeze-up starting at 6:43. So now, let's look at the 15-minute interval for the onstat -g ckp. (We don't need to see all 20 entries)

AUTO_CKPTS=On RTO_SERVER_RESTART=Off

Critical Sections Physical Log Logical Log
Clock Total Flush Block # Ckpt Wait Long # Dirty Dskflu Total Avg Total Avg
Interval Time Trigger LSN Time Time Time Waits Time Time Time Buffers /Sec Pages /Sec Pages /Sec

2342398 06:39:39 CKPTINTVL 2967448:0xc686018 4.6 4.6 0.0 3 0.0 0.0 0.0 150708 33112 244770 995 178586 725
2342399 06:46:17 CKPTINTVL 2967451:0x3a163a8 157.9 7.6 0.0 81 149.9 82.4 150.0 131521 17328 186726 473 124998 317
2342400 06:50:19 CKPTINTVL 2967454:0x2ce3280 2.5 2.5 0.0 2 0.0 0.0 0.0 141278 56862 306279 1234 150773 607

This is really disappointing! I had expected that middle one above to be caused by an LTXEHWM or a physical log over 75% full. Nope: It was the ordinary checkpoint triggered by the 4-minute setting.

So, as Thomas Edison was famous for saying: We have not failed; we have a method that does not work. And actually, it did tell us something: A direction where NOT to bother looking.

Now for Andreas L's thoughtful thesis. Bottom line:

$ onstat -c MAX_FILL_DATA_PAGES
1

Exactly as Andreas supposed. There is an oncheck command (-T?) to report on the fullness of pages within a TBLspace but in a database this size that's kinda impractical.

But I will recommend changing that onconfig parameter. Certainly worth a try!

Thanks much to all. I'll let y'all know if this solved it.

------------------------------
Jacob Salomon

Original Message:
Sent: Tue May 27, 2025 04:39 PM
From: Art Kagel
Subject: What triggers a blocking checkpoint?

Jacob:

Need more details like the reported trigger in the onstat -g ckp report or the supporting sysmaster table. Since only a short list of checkpoints are kept online, the best thing to do would be to set up a cron job to run every 15 mins from 6:00 until 10:00 daily and post or send me the output file(s).

Art

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com

Original Message:
Sent: Tue May 27, 2025 02:29 PM
From: Jacob Salomon
Subject: What triggers a blocking checkpoint?