Informix

 View Only
Expand all | Collapse all

Block during checkpoint

  • 1.  Block during checkpoint

    Posted Wed May 20, 2020 09:32 PM
    Hi 

    I need some input advise what could be wrong on my configuration.
    The users reported slow / hang performance during their processes.
    Is it normal during checkpoint there are blocked?

    By refer to online log  , i got some weird situation which i never encounter at other places.

    09:07:49 Maximum server connections 805
    09:07:49 Checkpoint Statistics - Avg. Txn Block Time 0.464, # Txns blocked 2, Plog used 3189, Llog used 65988

    09:10:56 Logical Log 4803 Complete, timestamp: 0x33ae5f33.
    09:10:57 Logical Log 4803 - Backup Started
    09:11:16 Logical Log 4803 - Backup Completed
    09:12:53 Checkpoint Completed: duration was 4 seconds.
    09:12:53 Thu May 21 - loguniq 4804, logpos 0xd31d0b4, timestamp: 0x33b1cbf7 Interval: 66869

    09:12:53 Maximum server connections 805
    09:12:53 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 3, Plog used 4935, Llog used 149330

    09:18:45 Checkpoint Completed: duration was 51 seconds.
    09:18:45 Thu May 21 - loguniq 4804, logpos 0x28723018, timestamp: 0x33ba9426 Interval: 66870

    09:18:45 Maximum server connections 805
    09:18:45 Checkpoint Statistics - Avg. Txn Block Time 7.717, # Txns blocked 17, Plog used 4777, Llog used 111889

    As u can see the # of trxns been blocked are high can be increase.

    By doing oncheck -g ckp, the bottom line shown :
    Based on the current workload, the physical log might be too small
    to accommodate the time it takes to flush the buffer pool during
    checkpoint processing. The server might block transactions during checkpoints.
    If the server blocks transactions, increase the physical log size to
    at least 23244800 KB.   ~23 GB

    DO i need to increase again?. I did increase previously base on recommended value given. I never had this big physical log at other client.

    Appreciate some advise from the member. The server was in RedHat 7.6 with Informix 12.10.FC6WE


    ------------------------------
    ABeMie
    ------------------------------


  • 2.  RE: Block during checkpoint

    IBM Champion
    Posted Wed May 20, 2020 09:47 PM
    You are only using around 5000 pages of physical log space between checkpoints so you definitely do not need more log space. The problem is probably the number of dirty pages at checkpoint time. 

    What type of checkpoints are being reported by onstat -g ckp?





  • 3.  RE: Block during checkpoint

    Posted Wed May 20, 2020 09:57 PM
    HI 
    Here I attach the onstat -g ckp  output plus onstat -d

    //

    ------------------------------
    ABeMie
    ------------------------------

    Attachment(s)

    txt
    onstat-d.txt   2 KB 1 version
    txt
    onstat-ckp.txt   3 KB 1 version


  • 4.  RE: Block during checkpoint

    Posted Thu May 21, 2020 03:02 AM

    Hi,

    As Art said, the info in the online.log message may be an indicator of checkpoint slowness, but you NEED to view onstat -g ckp to have a final diagnostic on what is happening.
    A long checkpoint may be acceptable if you have low values (i.e say below 0.5 secs or less) in the Wait Time and Long Time columns columns. Those columns reflect during how long users transactions wait for the checkpoint.
    In your case,  you have 5.5, 1.5 12.6 8.8 wait times, which are "too long" and users waiting at those times.
    Looking at the Dirty Buffers column, correlating with the Ckpt Time of the same line, you can see that the time is not proportional to the number of of flushed buffers for instance at 09:18:44 it takes 51 sec to flush 61141 buffers with 12.6 secs of WAIT, and at 09:34:00 it takes 26.3 secs with 0.7 sec of waits  time, for about the same amount of buffers. This is not consistant.
    60000 buffers is 120 Mb, which is not a lot of data(if your system page size is 2K). It takes 28.5 secs to write, which is slow.

    To have an idea of disk performance, check the specs of the disk you have, and you will see that a disk is capable to write much more Megabytes in much less time. Reasons can again be multiple:
    - physical disk contention: your onstat -d says that ALL your chunks are located in the same file system, which is not good. Putting all the chunks in the same file system is the worst case situation. Having the physical log on 2 chunks may also not be a very good idea

    - if you run on a VM environment, with badly shared CPU and disk resources, this won't help

    - you may also have contention on Informix threads, due to the fact that all your chunks are sitting at the same place, or any oher reason. running onstat -z, then twice onstat -g spi with an interval of 10 minutes may point to what is waiting.

    - also run onstat -g iof and look at the chunks thruput

    - finally run iostat -d (after identifying which disk you are using) and see how the physical disk performs



    ------------------------------
    [eric] [Vercelletto] []
    [Founder]
    [kandooerp.org]
    [Pont l'Abbé] [France]
    [+33 626 52 50 68]

    Disclaimer: My own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference. Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.
    ------------------------------



  • 5.  RE: Block during checkpoint

    Posted Thu May 21, 2020 03:46 AM
    THnaks Mr Eric for the sharing. For your information the system running in VM. All the tips given still useful and meaning full to narrow down the issues. Correct?


    ------------------------------
    ABeMie
    ------------------------------



  • 6.  RE: Block during checkpoint

    IBM Champion
    Posted Thu May 21, 2020 09:10 AM
    ABeMie:

    OK, I've also looked at the onstat -g ckp output. Look at the difference between checkpoint numbers 66859 and 66860. They are flushing nearly the same number of pages, but the first one flushes almost 9000 pages per second while the second checkpoint flushed less than 2600 pages per second. That is why, as Eric points out, that the second checkpoint took more than four time as long as the first one. The most likely reason for this contention for IO resources! Are you sharing SAN structures with some other applications? Machines? Databases sharing storage with Windows systems and especially mail servers is especially bad because the data access patterns are SO different. It's doesn't look like the disks or channels are slow because sometimes the IO is very fast.

    You should probably monitor the onstat -g ioh report looking for which chunks and at what times during the day the READ or WRITE service times are longer than say 15ms (0.015s) but only look at lines with significant IO counts (over 1000). The report shows IO performance for every chunk for the past 60 minutes.

    Art

    Art S. Kagel, President and Principal Consultant
    ASK Database Management


    Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference.  Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.










  • 7.  RE: Block during checkpoint

    Posted 20 days ago
    Hi Mr Art 

    I'm refer back to my old 2020.
    I wanted to share my onstat result.
    We are running Financial application OLTP whereby apps and db in seperate VM. 
    With 64GB of RAM and 32 CPU been allocated to the VM , necessary config has been tune. Unfortunaly recently we hit system hang and  users start complain system took long time for them to get into application menu.
    BY looking on DB level attach are what I need to share.
    Appreciate an expert advise

    BUFFERPOOL default,buffers=10000,lrus=16,lru_min_dirty=50.00,lru_max_dirty=60.50
    BUFFERPOOL size=2k,buffers=16000000,lrus=128,lru_min_dirty=50,lru_max_dirty=60


    ------------------------------
    ABeMie
    ------------------------------

    Attachment(s)

    txt
    onstat-ioh.txt   45 KB 1 version
    txt
    onstat-d.txt   2 KB 1 version
    txt
    onstat-iof.txt   3 KB 1 version
    txt
    onstat-F.txt   1 KB 1 version
    txt
    onstat-ckp.txt   3 KB 1 version
    txt
    onstat-c.txt   81 KB 1 version


  • 8.  RE: Block during checkpoint

    Posted 19 days ago
    Hi ABeMie,

    from looking at your posted log extracts, I can see you encountered a
    very long (>200 sec) checkpoint and a small number of longer checkpoints
    around the same time  (~40 sec).
    The setup is a HDR setup and you gave us some hints from your configuration.
    However, this is far from being complete information in order to analyze the issue.
    (since it is probably related to a specific operation which was executed around that
    time). 

    The weird situation here is that during the long checkpoint, which caused waits, 
    the number of dirty buffers flushed was not that high in comparison to other checkpoints,
    which would explain why the checkpoint duration was that long.
    This might point to an issue with the underlying storage system (or if you are sharing storage
    with the OS, a parallel activity on the machine which caused the IO to slow down.
    It might also be that the secondary server has a role here, but I cannot see why it should
    slow down a checkpoint.
    A very intensive read operation (sequential scan over a large table), maybe in parallel
    could have caused an IO shorting as well. Without looking at onstat -g sql output at the time
    of the long checkpoint, it is not clear what happened.

    Regarding your configuration, you would change the number of cleaners (currently 16)
    probably to 32 which need to clean 127 LRUs. (VPCLASS num cpu need to be adjusted as well if you want to increase the number of cleaners in order to get a cpu worker assigned).

    This could affect checkpoint behaviour (e.g. when a huge load operation occurs and buffers reach
    the upper limit of 60%).
    However, this seems to be not the case, when looking at the number of dirty buffers flushed.
    Also, from onstat -F output we can see that there are no LRU writes at all, which means that the
    cleaners were not affected.

    So, without knowing what was done in other areas on the system at the time of the long checkpoint
    (check for system stats using iostat, ps, top) it is not easy to say what caused this unexpected situation.

    Questions coming up:
    Are you working in a virtualized environment ? Where are your chunks placed (all on one device
    or on separate devices) ? Are the devices probably shared with other virtual machines
    or the OS ?
    Does the situation occur regularly at a specific time of the day ? (search for a cronjob which
    might be the underlying reason for this)

    Best,

    MARCUS HAARMANN






  • 9.  RE: Block during checkpoint

    IBM Champion
    Posted 18 days ago
    I wrote you a LONG reply yesterday, but for some reason it would no save and just kept spinning. Several hours later the form just cleared itself and I have start over. So here goes, this will be shorter so if you need clarification on anything, reach out to me directly.
    • Several of the chunks in your data dbspace are showing write service times over 120ms which is completely unacceptable. This is one reason for the long checkpoint times. You should be getting write service times under 10ms on any modern IO subsystem. Could the underlying storage by RAID5 or RAID6? That would be consistent with the performance I'm seeing. 
    • The onstat -F output shows 100% of your dirty page flushes happening at checkpoint time. Ideally you should be getting 60% to 75% of your flushes happening at LRU/idle write time. To get there you will have to reduce the lru_min_dirty and lru_max_dirty in the BUFFERPOOL parameter from 50 & 60 down to 5 & 10 then play with it from there. You may ultimately get VERY aggressive and set down to 0.5 & 1.0, but I would start with 5 & 10 and see how that does.
    • You have 128 LRU queues but rather few CLEANERS. The number of CLEANERS is fine to support checkpoint time flushing of the number of chunks you have, but now that you will be shifting 2/3 of your flushing to LRU writes you need to support those LRU queues. I would increase CLEANERS to 64 as it is unlikely that all 128 LRU queues will be flushing at the same time.
    • Enable the TEMPTAB_NOLOG parameter (set to 1) which will reduce the number of temp tables that are created in the rootdb space.
    There are other things I see in your configuration that I would address with a client after a full healthcheck (see my web site for info on scheduling a healthcheck - ASK Database Management) but that's another story and most of the things I would want to look at more deeply do not affect checkpoint duration.

    Art

    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 10.  RE: Block during checkpoint

    IBM Champion
    Posted 15 days ago
    ABeMie:

    You sent me a private response that I think require a reply from me, but for some reason the site won't let me get to it to answer. So, here is a reply that can be public:

    I understand your issues with taking down time. However, there is no other way to implement the changes I have recommended and, as far as I know, no other way to remediate your basic problem. 

    Also, the biggest problem may be the very slow storage and your client will have to address that as well for ultimate resolution of the issues at hand. Since the site is on v12.10 you cannot migrate the data to faster storage without downtime. If the site can be upgraded to v14.10 then the storage can be migrated online using the Informix mirroring feature because v14.10 allows you to switch which of the mirror chunks is primary, so you can mirror the existing chunks to faster storage, then once the mirrors have sync'd you can make the mirror chunks primary then drop the old, now mirror, chunks.

    Tha said, the other suggestions will make a significant improvement once you have the maintenance window in which to implement them.

    Art

    ------------------------------
    Art S. Kagel, President and Principal Consultant
    ASK Database Management Corp.
    www.askdbmgt.com
    ------------------------------



  • 11.  RE: Block during checkpoint

    Posted 8 days ago