Hi Ben. It sounds like we need to work together to narrow down where the response time blip tends to happen. If it's after the checkpoint has truly started, then your theory about I/O is probably on target. But it's equally possible that it's happening before the checkpoint has a chance to start: the checkpoint has been requested but is waiting for threads to exit critical sections. As you probably know, during this window no other thread is allowed to enter a critical section, and although the wait should be very short, we have fixed defects in this area that allowed threads to linger there too long.
onstat -g ckp does show the longest time a thread waited to enter a critical section before each checkpoint. Are you seeing anything concerning there? Can you give me a ballpark figure for the application timeout value? Is it seconds, tenths of seconds, hundredths?
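One way to line the blips up with checkpoints is to snapshot that output on a timer. A rough sketch (assuming onstat is on your PATH and the environment for the instance is set; the exact column layout of onstat -g ckp varies by version, so check the header):

```shell
# Capture checkpoint history every 60 seconds so the longest
# critical-section wait can be correlated with the application's
# response-time graphs.
while true; do
    date
    onstat -g ckp
    sleep 60
done >> ckp_history.log
```

Comparing timestamps in that log against your query response-time graph should tell us whether the blip lands before or after the checkpoint start.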
I'd be happy to get together with you and Jeff, either on the phone or over email, to narrow this down further if you like. If this is something that happens with any regularity there's a decent chance we can find the source and address it.
Thanks.
-jc
------------------------------
John Lengyel
------------------------------
Original Message:
Sent: Tue March 08, 2022 11:16 AM
From: Benjamin Thompson
Subject: Adjusting lru_min_dirty and lru_max_dirty when an instance is online
Hi John,
Thanks for the reply. I need some time to evaluate this and the effects of AUTO_LRU_TUNING properly but certainly automatic LRU tuning is something I am interested in. Right now LRU tuning is an iterative exercise, something Informix itself could accomplish much better.
So for us to adopt automatic tuning it would come down to how good the implementation is and what testing we have done (currently very little to none).
I need to ensure adequate application response times for our applications through checkpoints. Even though checkpoints are non-blocking and we have a fast all-flash storage array, they can be quite noticeable at the application layer, at least when we draw a graph of response times for certain queries. Very occasionally the response from Informix exceeds application timeouts.

It is not possible to measure the response time at the server, so my concern would be how you could take this requirement into account. I would be looking to keep checkpoint times under the application timeout the vast majority of the time. I have to add that the exact source of these delays (other than the fact there is a checkpoint in progress), and what the optimal settings for LRUs, checkpoint interval, etc. would be, are not clear right now.
------------------------------
Benjamin Thompson
Original Message:
Sent: Mon March 07, 2022 02:35 PM
From: John Lengyel
Subject: Adjusting lru_min_dirty and lru_max_dirty when an instance is online
Our goal over time has been to replace manual knobs with autonomics wherever it made sense to us, so it's a little painful to go in the opposite direction, but we'll consider bumping up IFMX-I-455. Meanwhile I'm going to trust you guys with an undocumented, unsupported feature that should allow you to do exactly what you're asking for. I have not personally tested this method in a benchmark and cannot vouch for it in a production environment, but I see no reason to warn you off of it for test systems because it makes use of the same internal routines that automatic LRU tuning uses. If you try this and let me know whether it works for you, hacky as it is, that will inform decisions we eventually make on IFMX-I-455.
This method assumes you're beginning with AUTO_LRU_TUNING off. Say you want to set lru_min to 10 and lru_max to 20 for all buffer pools:
onmode -wm AUTO_LRU_TUNING="1,min=10,max=20"
onmode -wm AUTO_LRU_TUNING=0
Note that even though we finish with AUTO_LRU_TUNING off again, the new min and max settings should remain. Check with onstat -R.
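Putting that together, a sketch of the full sequence with a verification step (the exact onstat -R output format varies by version, so eyeball the per-pool summary for the new thresholds):

```shell
# Briefly enable auto-tuning with explicit min/max to push the
# values into all buffer pools, then turn it back off. The new
# thresholds should stick.
onmode -wm AUTO_LRU_TUNING="1,min=10,max=20"
onmode -wm AUTO_LRU_TUNING=0

# Verify the queues picked up the change.
onstat -R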
Say you want to customize lru_min and lru_max for a specific buffer pool. Again these commands will only work if AUTO_LRU_TUNING is off to begin with:
onmode -wm AUTO_LRU_TUNING="1,bpool=0,min=60,max=70"
onmode -wm AUTO_LRU_TUNING=0
Note that buffer pools are numbered 0-7 on 2k systems and 0-3 on 4k systems. Pool numbers are fixed per page size: on a 2k system, even if you have only two buffer pools (a 2k pool and a 16k pool), their numbers are 0 and 7, respectively.
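Applying that numbering rule, here's a sketch of targeting the 16k pool on a 2k-page system (again assuming AUTO_LRU_TUNING starts off; min/max values here are just illustrative):

```shell
# On a 2k system the 16k pool is always pool 7, even when only
# two pools exist, so target it with bpool=7.
onmode -wm AUTO_LRU_TUNING="1,bpool=7,min=60,max=70"
onmode -wm AUTO_LRU_TUNING=0
```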
I hope you find the automatic tuning algorithm works as well as manual tuning in terms of I/O, checkpoint, and overall performance. If manual tuning is significantly better in some scenario though, I'd be interested to hear about it. Have fun and send feedback to me or your customer advocate if you have one.
-jc
------------------------------
John Lengyel
Original Message:
Sent: Thu March 03, 2022 11:40 AM
From: Benjamin Thompson
Subject: Adjusting lru_min_dirty and lru_max_dirty when an instance is online
Hi, is there any way to effect a change to LRU min/max dirty while an instance is online?
I am aware of AUTO_LRU_TUNING, which does adjust these parameters. When turned on, it sets all pools to 50%/60% and then adjusts them as per the rules in the manual. Switching it off again seems to lock in the values.
Thanks in advance,
Ben.
------------------------------
Benjamin Thompson
------------------------------
#Informix