Hi ABeMie,
From your posted log extracts I can see that you hit one very long
checkpoint (>200 sec) and a small number of longer-than-usual checkpoints
(~40 sec) around the same time.
You are running an HDR setup and gave us some hints about your configuration.
However, that is not enough information to analyze the issue, since it is
probably related to a specific operation that was executed around that time.
The odd thing is that during the long checkpoint, which caused waits,
the number of dirty buffers flushed was not that high compared to other checkpoints.
A high number would have explained why the checkpoint took that long; here it does not.
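To make that comparison concrete, here is a rough sketch that computes a flush rate (dirty buffers per second) for each checkpoint and flags the ones far below the median. The durations and buffer counts would come from your own checkpoint history (e.g. onstat -g ckp, if your version provides it). The numbers below are placeholders I made up, not values from your logs, and the names Checkpoint, flush_rate and slow_checkpoints are only illustrative.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    label: str            # any identifier, e.g. the log timestamp
    duration_sec: float   # checkpoint duration in seconds
    dirty_buffers: int    # dirty buffers flushed by this checkpoint


def flush_rate(cp: Checkpoint) -> float:
    """Dirty buffers written per second (duration clamped to 1 s to avoid dividing by zero)."""
    return cp.dirty_buffers / max(cp.duration_sec, 1.0)


def slow_checkpoints(cps, factor=10.0):
    """Return checkpoints whose flush rate is at least `factor` times below the median rate."""
    rates = sorted(flush_rate(c) for c in cps)
    median = rates[len(rates) // 2]
    return [c for c in cps if flush_rate(c) * factor <= median]


# Placeholder numbers, NOT taken from the posted logs.
sample = [
    Checkpoint("cp1", 2, 40000),
    Checkpoint("cp2", 3, 45000),
    Checkpoint("cp3", 200, 30000),
    Checkpoint("cp4", 2, 38000),
]

for cp in slow_checkpoints(sample):
    print(cp.label, round(flush_rate(cp)), "buffers/sec")
```

A checkpoint that is flagged here wrote few buffers very slowly, which is the pattern that points away from buffer pool tuning and towards the I/O path.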
This might point to an issue with the underlying storage system (or, if you are sharing storage
with the OS, to a parallel activity on the machine which slowed down the I/O).
It might also be that the secondary server has a role here, but I cannot see why it should
slow down a checkpoint.
A very I/O-intensive read operation (e.g. a sequential scan over a large table), possibly running
in parallel, could have caused an I/O bottleneck as well. Without onstat -g sql output from the time
of the long checkpoint, it is not clear what happened.
Regarding your configuration, you could raise the number of cleaners (currently 16),
probably to 32, since they have to clean 127 LRUs. (The number of CPU VPs in VPCLASS needs to be
adjusted as well if you increase the number of cleaners, so that they get a CPU worker assigned.)
This could affect checkpoint behaviour (e.g. when a huge load operation occurs and buffers reach
the upper limit of 60%).
However, that does not seem to be the case here, judging by the number of dirty buffers flushed.
Also, from onstat -F output we can see that there are no LRU writes at all, which means that the
cleaners were not affected.
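As a quick back-of-the-envelope check for the cleaner numbers mentioned above, here is a small sketch. It assumes the LRU queues are spread evenly over the cleaners and that the 60% limit applies per queue; both are simplifications on my side. The BUFFERS value is a placeholder, since the real one was not posted.

```python
import math


def lrus_per_cleaner(lru_queues: int, cleaners: int) -> int:
    """Worst-case number of LRU queues one cleaner has to serve (even spread assumed)."""
    return math.ceil(lru_queues / cleaners)


def dirty_limit_per_queue(buffers: int, lru_queues: int, max_dirty_pct: float) -> int:
    """Approximate dirty buffers in one queue at which cleaning would start."""
    return int(buffers / lru_queues * max_dirty_pct / 100)


print(lrus_per_cleaner(127, 16))  # 8 queues per cleaner with the current setting
print(lrus_per_cleaner(127, 32))  # 4 queues per cleaner with 32 cleaners

# 1_000_000 is a placeholder for BUFFERS; substitute your own value.
print(dirty_limit_per_queue(1_000_000, 127, 60))
```

So going from 16 to 32 cleaners roughly halves the number of queues each cleaner has to look after.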
So, without knowing what was done in other areas on the system at the time of the long checkpoint
(check for system stats using iostat, ps, top) it is not easy to say what caused this unexpected situation.
Questions coming up:
Are you working in a virtualized environment? Where are your chunks placed (all on one device
or on separate devices)? Are the devices possibly shared with other virtual machines
or the OS?
Does the situation occur regularly at a specific time of the day? (If so, search for a cronjob
which might be the underlying reason for this.)
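To check the time-of-day question, something like the following sketch could be run over the message log. It assumes log lines of the form "HH:MM:SS  Checkpoint Completed:  duration was N seconds."; if your log looks different, the pattern needs adjusting. The sample lines are made up to show the input shape, and the function name and the 30-second threshold are only illustrative.

```python
import re
from collections import Counter

# Assumed online.log line format; adjust the pattern if your log looks different.
CKPT_RE = re.compile(
    r"^(\d{2}):\d{2}:\d{2}\s+Checkpoint Completed:\s+duration was (\d+) seconds"
)


def long_checkpoints_by_hour(lines, threshold_sec=30):
    """Count checkpoints longer than threshold_sec, grouped by hour of day."""
    hours = Counter()
    for line in lines:
        m = CKPT_RE.match(line.strip())
        if m and int(m.group(2)) > threshold_sec:
            hours[int(m.group(1))] += 1
    return hours


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "02:00:41  Checkpoint Completed:  duration was 212 seconds.",
    "02:05:10  Checkpoint Completed:  duration was 41 seconds.",
    "09:15:02  Checkpoint Completed:  duration was 1 seconds.",
    "02:30:00  Logical Log 17 Complete.",
]

for hour, count in sorted(long_checkpoints_by_hour(sample).items()):
    print(f"{hour:02d}:00  {count} long checkpoint(s)")
```

For a real run you would pass the open log file instead of the sample list. If the long checkpoints pile up in one hour, compare that hour against the crontab entries and backup schedules on the machine.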
Best,