Informix

View Only

Expand all | Collapse all

Block during checkpoint

1. Block during checkpoint

0 Like
ABeMie
Posted Wed May 20, 2020 09:32 PM

Reply
Hi

I need some input advise what could be wrong on my configuration.
The users reported slow / hang performance during their processes.
Is it normal during checkpoint there are blocked?

By refer to online log , i got some weird situation which i never encounter at other places.

09:07:49 Maximum server connections 805
09:07:49 Checkpoint Statistics - Avg. Txn Block Time 0.464, # Txns blocked 2, Plog used 3189, Llog used 65988

09:10:56 Logical Log 4803 Complete, timestamp: 0x33ae5f33.
09:10:57 Logical Log 4803 - Backup Started
09:11:16 Logical Log 4803 - Backup Completed
09:12:53 Checkpoint Completed: duration was 4 seconds.
09:12:53 Thu May 21 - loguniq 4804, logpos 0xd31d0b4, timestamp: 0x33b1cbf7 Interval: 66869

09:12:53 Maximum server connections 805
09:12:53 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 3, Plog used 4935, Llog used 149330

09:18:45 Checkpoint Completed: duration was 51 seconds.
09:18:45 Thu May 21 - loguniq 4804, logpos 0x28723018, timestamp: 0x33ba9426 Interval: 66870

09:18:45 Maximum server connections 805
09:18:45 Checkpoint Statistics - Avg. Txn Block Time 7.717, # Txns blocked 17, Plog used 4777, Llog used 111889

As u can see the # of trxns been blocked are high can be increase.

By doing oncheck -g ckp, the bottom line shown :
Based on the current workload, the physical log might be too small
to accommodate the time it takes to flush the buffer pool during
checkpoint processing. The server might block transactions during checkpoints.
If the server blocks transactions, increase the physical log size to
at least 23244800 KB. ~23 GB

DO i need to increase again?. I did increase previously base on recommended value given. I never had this big physical log at other client.

Appreciate some advise from the member. The server was in RedHat 7.6 with Informix 12.10.FC6WE

------------------------------
ABeMie
------------------------------

#Informix
2. RE: Block during checkpoint

0 Like
IBM Champion

Art Kagel
Posted Wed May 20, 2020 09:47 PM

Reply
You are only using around 5000 pages of physical log space between checkpoints so you definitely do not need more log space. The problem is probably the number of dirty pages at checkpoint time.

What type of checkpoints are being reported by onstat -g ckp?

Original Message
3. RE: Block during checkpoint

0 Like
ABeMie
Posted Wed May 20, 2020 09:57 PM
| view attached (2)

Reply
HI
Here I attach the onstat -g ckp output plus onstat -d

//

------------------------------
ABeMie
------------------------------

Attachment(s)

onstat-d.txt 2 KB 1 version

onstat-ckp.txt 3 KB 1 version

Original Message
4. RE: Block during checkpoint

0 Like
Eric Vercelletto
Posted Thu May 21, 2020 03:02 AM

Reply
Hi,

As Art said, the info in the online.log message may be an indicator of checkpoint slowness, but you NEED to view onstat -g ckp to have a final diagnostic on what is happening.
A long checkpoint may be acceptable if you have low values (i.e say below 0.5 secs or less) in the Wait Time and Long Time columns columns. Those columns reflect during how long users transactions wait for the checkpoint.
In your case, you have 5.5, 1.5 12.6 8.8 wait times, which are "too long" and users waiting at those times.
Looking at the Dirty Buffers column, correlating with the Ckpt Time of the same line, you can see that the time is not proportional to the number of of flushed buffers for instance at 09:18:44 it takes 51 sec to flush 61141 buffers with 12.6 secs of WAIT, and at 09:34:00 it takes 26.3 secs with 0.7 sec of waits time, for about the same amount of buffers. This is not consistant.
60000 buffers is 120 Mb, which is not a lot of data(if your system page size is 2K). It takes 28.5 secs to write, which is slow.

To have an idea of disk performance, check the specs of the disk you have, and you will see that a disk is capable to write much more Megabytes in much less time. Reasons can again be multiple:
- physical disk contention: your onstat -d says that ALL your chunks are located in the same file system, which is not good. Putting all the chunks in the same file system is the worst case situation. Having the physical log on 2 chunks may also not be a very good idea

- if you run on a VM environment, with badly shared CPU and disk resources, this won't help

- you may also have contention on Informix threads, due to the fact that all your chunks are sitting at the same place, or any oher reason. running onstat -z, then twice onstat -g spi with an interval of 10 minutes may point to what is waiting.

- also run onstat -g iof and look at the chunks thruput

- finally run iostat -d (after identifying which disk you are using) and see how the physical disk performs

------------------------------
[eric] [Vercelletto] []
[Founder]
[kandooerp.org]
[Pont l'Abbé] [France]
[+33 626 52 50 68]

Disclaimer: My own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference. Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.
------------------------------

Original Message
5. RE: Block during checkpoint

0 Like
ABeMie
Posted Thu May 21, 2020 03:46 AM

Reply
THnaks Mr Eric for the sharing. For your information the system running in VM. All the tips given still useful and meaning full to narrow down the issues. Correct?

------------------------------
ABeMie
------------------------------

Original Message
6. RE: Block during checkpoint

0 Like
IBM Champion

Art Kagel
Posted Thu May 21, 2020 09:10 AM

Reply
ABeMie:

OK, I've also looked at the onstat -g ckp output. Look at the difference between checkpoint numbers 66859 and 66860. They are flushing nearly the same number of pages, but the first one flushes almost 9000 pages per second while the second checkpoint flushed less than 2600 pages per second. That is why, as Eric points out, that the second checkpoint took more than four time as long as the first one. The most likely reason for this contention for IO resources! Are you sharing SAN structures with some other applications? Machines? Databases sharing storage with Windows systems and especially mail servers is especially bad because the data access patterns are SO different. It's doesn't look like the disks or channels are slow because sometimes the IO is very fast.

You should probably monitor the onstat -g ioh report looking for which chunks and at what times during the day the READ or WRITE service times are longer than say 15ms (0.015s) but only look at lines with significant IO counts (over 1000). The report shows IO performance for every chunk for the past 60 minutes.

Art

Art S. Kagel, President and Principal Consultant
ASK Database Management
www.askdbmgt.com

Blog: http://informix-myview.blogspot.com/

Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference. Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.

Original Message
7. RE: Block during checkpoint

0 Like
ABeMie
Posted Wed July 27, 2022 10:29 PM
| view attached (6)

Reply
Hi Mr Art

I'm refer back to my old 2020.
I wanted to share my onstat result.
We are running Financial application OLTP whereby apps and db in seperate VM.
With 64GB of RAM and 32 CPU been allocated to the VM , necessary config has been tune. Unfortunaly recently we hit system hang and users start complain system took long time for them to get into application menu.
BY looking on DB level attach are what I need to share.
Appreciate an expert advise

BUFFERPOOL default,buffers=10000,lrus=16,lru_min_dirty=50.00,lru_max_dirty=60.50
BUFFERPOOL size=2k,buffers=16000000,lrus=128,lru_min_dirty=50,lru_max_dirty=60

------------------------------
ABeMie
------------------------------

Attachment(s)

onstat-ioh.txt 45 KB 1 version

onstat-d.txt 2 KB 1 version

onstat-iof.txt 3 KB 1 version

onstat-F.txt 1 KB 1 version

onstat-ckp.txt 3 KB 1 version

onstat-c.txt 81 KB 1 version

Original Message
8. RE: Block during checkpoint

0 Like
IBM Champion

Marcus Haarmann
Posted Fri July 29, 2022 03:17 AM

Reply
Hi ABeMie,

from looking at your posted log extracts, I can see you encountered a

very long (>200 sec) checkpoint and a small number of longer checkpoints

around the same time (~40 sec).

The setup is a HDR setup and you gave us some hints from your configuration.

However, this is far from being complete information in order to analyze the issue.

(since it is probably related to a specific operation which was executed around that

time).

The weird situation here is that during the long checkpoint, which caused waits,

the number of dirty buffers flushed was not that high in comparison to other checkpoints,

which would explain why the checkpoint duration was that long.

This might point to an issue with the underlying storage system (or if you are sharing storage

with the OS, a parallel activity on the machine which caused the IO to slow down.

It might also be that the secondary server has a role here, but I cannot see why it should

slow down a checkpoint.

A very intensive read operation (sequential scan over a large table), maybe in parallel

could have caused an IO shorting as well. Without looking at onstat -g sql output at the time

of the long checkpoint, it is not clear what happened.

Regarding your configuration, you would change the number of cleaners (currently 16)

probably to 32 which need to clean 127 LRUs. (VPCLASS num cpu need to be adjusted as well if you want to increase the number of cleaners in order to get a cpu worker assigned).

This could affect checkpoint behaviour (e.g. when a huge load operation occurs and buffers reach

the upper limit of 60%).

However, this seems to be not the case, when looking at the number of dirty buffers flushed.

Also, from onstat -F output we can see that there are no LRU writes at all, which means that the

cleaners were not affected.

So, without knowing what was done in other areas on the system at the time of the long checkpoint

(check for system stats using iostat, ps, top) it is not easy to say what caused this unexpected situation.

Questions coming up:

Are you working in a virtualized environment ? Where are your chunks placed (all on one device

or on separate devices) ? Are the devices probably shared with other virtual machines

or the OS ?

Does the situation occur regularly at a specific time of the day ? (search for a cronjob which

might be the underlying reason for this)

Best,

MARCUS HAARMANN

Original Message
9. RE: Block during checkpoint

0 Like
IBM Champion

David Williams
Posted Sat July 22, 2023 05:08 AM

Reply
Hi,

"It might also be that the secondary server has a role here, but I cannot see why it should

slow down a checkpoint."

Also onstat shows "On-Line (Prim)" that means there is an HDR.

If the HDR gets too far behind or the primary has to wait for the HDR to complete a checkpoint then you will get a "G" in onstat -u (Waiting for Logical Log Buffer) that will not help things.

Regards,
David.

------------------------------
David Williams
------------------------------

Original Message
10. RE: Block during checkpoint

0 Like
IBM Champion

Art Kagel
Posted Fri July 29, 2022 07:32 AM

Reply
I wrote you a LONG reply yesterday, but for some reason it would no save and just kept spinning. Several hours later the form just cleared itself and I have start over. So here goes, this will be shorter so if you need clarification on anything, reach out to me directly.

Several of the chunks in your data dbspace are showing write service times over 120ms which is completely unacceptable. This is one reason for the long checkpoint times. You should be getting write service times under 10ms on any modern IO subsystem. Could the underlying storage by RAID5 or RAID6? That would be consistent with the performance I'm seeing.

The onstat -F output shows 100% of your dirty page flushes happening at checkpoint time. Ideally you should be getting 60% to 75% of your flushes happening at LRU/idle write time. To get there you will have to reduce the lru_min_dirty and lru_max_dirty in the BUFFERPOOL parameter from 50 & 60 down to 5 & 10 then play with it from there. You may ultimately get VERY aggressive and set down to 0.5 & 1.0, but I would start with 5 & 10 and see how that does.

You have 128 LRU queues but rather few CLEANERS. The number of CLEANERS is fine to support checkpoint time flushing of the number of chunks you have, but now that you will be shifting 2/3 of your flushing to LRU writes you need to support those LRU queues. I would increase CLEANERS to 64 as it is unlikely that all 128 LRU queues will be flushing at the same time.

Enable the TEMPTAB_NOLOG parameter (set to 1) which will reduce the number of temp tables that are created in the rootdb space.

There are other things I see in your configuration that I would address with a client after a full healthcheck (see my web site for info on scheduling a healthcheck - ASK Database Management) but that's another story and most of the things I would want to look at more deeply do not affect checkpoint duration.

Art

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com
------------------------------

Original Message
11. RE: Block during checkpoint

0 Like
IBM Champion

Art Kagel
Posted Mon August 01, 2022 08:49 AM

Reply
ABeMie:

You sent me a private response that I think require a reply from me, but for some reason the site won't let me get to it to answer. So, here is a reply that can be public:

I understand your issues with taking down time. However, there is no other way to implement the changes I have recommended and, as far as I know, no other way to remediate your basic problem.

Also, the biggest problem may be the very slow storage and your client will have to address that as well for ultimate resolution of the issues at hand. Since the site is on v12.10 you cannot migrate the data to faster storage without downtime. If the site can be upgraded to v14.10 then the storage can be migrated online using the Informix mirroring feature because v14.10 allows you to switch which of the mirror chunks is primary, so you can mirror the existing chunks to faster storage, then once the mirrors have sync'd you can make the mirror chunks primary then drop the old, now mirror, chunks.

Tha said, the other suggestions will make a significant improvement once you have the maintenance window in which to implement them.

Art

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com
------------------------------

Original Message
12. RE: Block during checkpoint

0 Like
ABeMie
Posted Mon August 08, 2022 05:20 AM
| view attached (5)

Reply
Hi

JUst want to share the statistics.
Currently Secondary DB server been used.

The statistics were collected during the user reported slow in access and number of transaction posting incomplete.
FYI the application created many temp tables and for every programs at least 9-15 temp tables are created. This doesnt include temp table for specific process like reporting, posting process ,and etc.

DUring that problem although there are #of waits but the flush to disk the time taken is low compare to previous statistics I shared. The top load average are low

Hope an expert advise. DO i need to split the temp chunks to different places (disk ).

------------------------------
ABeMie
------------------------------

Attachment(s)

onstat-ckp.out 3 KB 1 version

iostat-5.out 14 KB 1 version

onstat-p.out 1 KB 1 version

onstat-ioh.out 3 KB 1 version

vmstat.out 2 KB 1 version

Original Message
13. RE: Block during checkpoint

0 Like
IBM Champion

Art Kagel
Posted Mon August 08, 2022 09:18 AM

Reply
ABeMie:

Best practice for temp dbspaces:

Have at between three and six temp dbspaces to maximize sort performance.

If you use temp tables extensively in multiple concurrent sessions for intermediate results that are reprocessed, then the more temp dbspaces the better. I have clients running as many as 20.

Placing each dbspace that is highly active (temp spaces tend to be and can be involved in as much as 30-50% or your total IOs) on a different physical structure will improve performance. With modern SAN/NAS/NFS storage the best you can do is to use LUNs created on separate logical structures.

Place temp dbspaces in your fastest storage.

------------------------------
Art S. Kagel, President and Principal Consultant
ASK Database Management Corp.
www.askdbmgt.com
------------------------------

Original Message
14. RE: Block during checkpoint

0 Like
IBM Champion

David Williams
Posted Sat July 22, 2023 05:15 AM

Reply
Hi,

The one thing that Version 14 brings is

https://www.ibm.com/support/pages/node/6832820

14.10.FC8 includes

https://www.ibm.com/support/pages/apar/IT40986

IT40986: ENHANCEMENT FOR CHECKPOINT PROCESSING

which can reduce the number writes to reserved pages during a checkpoint by a factor of 100!

This should help with slow storage, and could be backported in a Version 12 special build if needed

You are welcome!

Regards,
David.

------------------------------
David Williams
------------------------------

Original Message
15. RE: Block during checkpoint

0 Like
IBM Champion

David Williams
Posted Sat July 22, 2023 05:12 AM

Reply
Hi,

"Since the site is on v12.10 you cannot migrate the data to faster storage without downtime. If the site can be upgraded to v14.10 then the storage can be migrated online using the Informix mirroring feature because v14.10 allows you to switch which of the mirror chunks is primary, so you can mirror the existing chunks to faster storage, then once the mirrors have sync'd you can make the mirror chunks primary then drop the old, now mirror, chunks."

This feature "Switch Primary and Mirror Chunks" is also in 12.10.xC10 !

https://www.ibm.com/docs/en/informix-servers/12.10?topic=overview-whats-new-in-informix#concept_v12.10.xc10__switch

Regards,
David.

------------------------------
David Williams
------------------------------

Original Message

Informix

Block during checkpoint

ABeMieWed May 20, 2020 09:32 PM

Art KagelWed May 20, 2020 09:47 PM

ABeMieWed May 20, 2020 09:57 PM

Eric VercellettoThu May 21, 2020 03:02 AM

ABeMieThu May 21, 2020 03:46 AM

Art KagelThu May 21, 2020 09:10 AM

ABeMieWed July 27, 2022 10:29 PM

Marcus HaarmannFri July 29, 2022 03:17 AM

David WilliamsSat July 22, 2023 05:08 AM

Art KagelFri July 29, 2022 07:32 AM

Art KagelMon August 01, 2022 08:49 AM

ABeMieMon August 08, 2022 05:20 AM

Art KagelMon August 08, 2022 09:18 AM

David WilliamsSat July 22, 2023 05:15 AM

David WilliamsSat July 22, 2023 05:12 AM

1. Block during checkpoint

2. RE: Block during checkpoint

3. RE: Block during checkpoint

4. RE: Block during checkpoint

5. RE: Block during checkpoint

6. RE: Block during checkpoint

7. RE: Block during checkpoint

8. RE: Block during checkpoint

9. RE: Block during checkpoint

10. RE: Block during checkpoint

11. RE: Block during checkpoint

12. RE: Block during checkpoint

13. RE: Block during checkpoint

14. RE: Block during checkpoint

15. RE: Block during checkpoint