Thank you very much Andrew,
Problem determination is in progress. It seems something was changed in the network between the HQ and DR sites, which increased the iowait and consumed CPU resources. I have stopped the replication between the HQ and DR sites for now, which helped bring the business processes back to normal. I had already opened a case with IBM before posting the question here, as also advised by Om Prakash; thank you, Om Prakash.
I also want to know whether I can change the configuration so that data is replicated between the two sites from one of the passive nodes in HQ, to reduce the overhead on the active node of the RDQM HA group at HQ... just thinking about how to avoid a similar issue in the future.
Thank you,
Regards,
Rajesh
------------------------------
RAJESH VERMA
------------------------------
Original Message:
Sent: Fri January 12, 2024 05:06 AM
From: Andrew Hickson
Subject: RDQM DR/HA -- Performance Impact
You have to be very careful with what you read into IOWAIT in an MQ environment. What follows is mostly generic MQ advice, rather than RDQM-specific advice.
In most MQ environments nearly all of the forced IO should be to the MQ recovery log. The way the recovery log works is that all the active hConns essentially append to the log buffer. Each time some hConn requires its IO to be guaranteed, as much as can be efficiently written from the log buffer is written in a single forced write. When that write completes, the logger checks whether any other hConn has requested further IO to be forced and, if so, immediately schedules another write (again the biggest write that can be efficiently scheduled, based upon what data other tasks have appended to the log buffer). The overall effect is a batching effect where a small number of large writes is issued, rather than a high number of small writes. The algorithm works well with a wide variety of IO latencies, as might be expected given MQ's long history and therefore its exposure to different IO technologies.
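To illustrate that batching effect, here is a toy model I've sketched in C (this is not MQ's actual logger code; the record size, arrival rate and latencies are made-up numbers purely for illustration):
/* Toy model of log-buffer batching: records that arrive while one
 * forced write is in flight are flushed together by the next write,
 * so higher write latency => fewer, larger writes. Not MQ code. */
#include <stdio.h>
int main(void)
{
    const int records = 10000;      /* log appends requested by hConns */
    const int record_bytes = 600;   /* assumed average record size */
    const int arrival_us = 50;      /* assumed gap between appends */
    const int latencies_us[] = { 100, 500, 2000, 8000 };
    for (size_t i = 0; i < sizeof latencies_us / sizeof latencies_us[0]; i++) {
        int lat = latencies_us[i];
        int per_write = lat / arrival_us;   /* records batched per forced write */
        if (per_write < 1)
            per_write = 1;
        int writes = (records + per_write - 1) / per_write;
        printf("latency %5d us -> ~%5d forced writes, ~%6d bytes/write\n",
               lat, writes, per_write * record_bytes);
    }
    return 0;
}
The point is simply that the same workload pushed over a higher-latency (replicated) path shows up as a few big writes with long IOWAIT, which is not necessarily a problem in itself.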
In an HA/DR environment there tends to be more IO latency (as the IO has to be replicated to a remote node), and thus the tendency is towards a smaller number of larger writes (assuming sufficient concurrency in the application workload to keep appending to the log buffer). In such a situation very high IOWAIT times would be expected.
Have you run amqsrua to look at the LOG statistics? In particular the write sizes and the IO latency.
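If useful, something along these lines should show them (QMGR1 is just a placeholder for your queue manager name; the DISK/Log class and type come from the standard amqsrua resource-usage sample, and the exact metric names can vary slightly by MQ level):
amqsrua -m QMGR1 -c DISK -t Log
That publishes metrics each interval such as the log write latency, log write size and log bytes written, which is exactly what you want to compare before and after replication is enabled.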
Regarding the high load average, have you looked at the high-level MQI statistics to compare the number of MQI calls of different types? If you compare the number of successful MQPUTs with the total number of MQI calls in any interval, you'll get some idea of the efficiency of your applications. For example, an application that per message does MQCONN; MQOPEN(request); MQOPEN(reply); MQPUT(request); MQGET(reply); MQCLOSE(request); MQCLOSE(reply); MQDISC will use MUCH more CPU time than one which does
MQCONN; MQOPEN(request); MQOPEN(reply)
while(X)
    MQPUT(request)
    MQGET(reply)
end-while
MQCLOSE(request)
MQCLOSE(reply)
MQDISC
Looking at the high-level MQI stats would be a good first step in looking at unexpectedly high CPU usage.
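The same amqsrua sample can report those MQI counts per interval (again, QMGR1 is a placeholder and the metric names may differ slightly by MQ level):
amqsrua -m QMGR1 -c STATMQI -t CONNDISC
amqsrua -m QMGR1 -c STATMQI -t OPENCLOSE
amqsrua -m QMGR1 -c STATMQI -t PUT
amqsrua -m QMGR1 -c STATMQI -t GET
Comparing the MQCONN/MQDISC and MQOPEN/MQCLOSE counts against the MQPUT and MQGET counts gives a quick view of the ratio described above.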
------------------------------
Andrew Hickson
Original Message:
Sent: Thu January 11, 2024 05:39 PM
From: RAJESH VERMA
Subject: RDQM DR/HA -- Performance Impact
Hello,
Please help me find a remedy for a slow-responding queue manager running in an RDQM DR/HA environment. The message logs show the following errors:
Jan 11 09:51:59 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.200000
Jan 11 09:52:29 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.380000
Jan 11 09:52:59 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.310000
Jan 11 09:53:13 txulmqprd2 su[72087]: (to mqm) root on pts/0
Jan 11 09:53:29 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.410000
Jan 11 09:53:59 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.180000
Jan 11 09:54:05 txulmqprd2 kernel: drbd qm_mqp1_uv.dr _remote: [drbd_s_qm_mqp1_/10522] sending time expired, ko = 6
Jan 11 09:54:29 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.250000
Jan 11 09:54:59 txulmqprd2 pacemaker-controld[1817]: notice: High CPU load detected: 1.270000
Jan 11 09:55:01 txulmqprd2 systemd[1]: Configuration file /usr/lib/forescout/daemon/SecureConnector.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 11 09:55:02 txulmqprd2 kernel: drbd qm_mqp1_uv.dr _remote: [drbd_s_qm_mqp1_/10522] sending time expired, ko = 6
@@@
It's impacting the applications that connect to the queue manager big time. The iowait time is too high. I will be grateful if anybody can advise me.
Thank you,
Rajesh
------------------------------
RAJESH VERMA
------------------------------