MQ Uniform Cluster Scaling Example
MQ Uniform Cluster, first introduced in V9.1.2 CD and rolled into 9.2 LTS, enables applications to be scaled horizontally across a small set of similar queue managers. Applications can be moved as necessary by the cluster to balance the workload as additional cluster members are started.
https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.2.0/com.ibm.mq.con.doc/q132725_.html
In this example, a simple point-to-point messaging test demonstrates how we can start additional cluster members to distribute the load as the rate of message delivery is increased, with the existing applications being balanced across all available cluster members automatically.
Test Topology
In the test that follows, 12 queue managers (‘QMs’) in the uniform cluster ‘UC01’ were used, each QM being started when required.
Each queue manager runs in its own cgroup, restricting it to 4 cores, with 6 QMs on each of two 24-core machines (QMHost1 & QMHost2 below).
This simulates extra compute capacity being added as required to horizontally scale the messaging layer, as you would expect in many virtual deployments such as cloud VMs and containers (Linux cgroups is the same technology used to control the resources used by containers in solutions such as Red Hat OpenShift).
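The exact cgroup configuration used in the test is not shown in this report. Purely as an illustration (one possible mechanism, not necessarily the one used here), a 4-core limit can be applied with cgroups v2 and the queue manager started within it as follows:
# Create a cgroup allowing at most 4 cores (400ms of CPU time per 100ms period).
# Assumes cgroups v2 with the cpu controller enabled; run as root.
mkdir /sys/fs/cgroup/uc01qm01
echo "400000 100000" > /sys/fs/cgroup/uc01qm01/cpu.max
# Move the current shell into the cgroup, then start the queue manager so that all of
# its processes inherit the limit.
echo $$ > /sys/fs/cgroup/uc01qm01/cgroup.procs
strmqm UC01QM01
Illustrative cgroup setup for a 4-core queue manager limit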
For this simple test all application threads are created using the JMS component of the PerfHarness test tool from the start, although in real life these would often be expected to ramp up as the load on the system increases. As we’re using MQ’s Uniform Cluster topology, the application threads will automatically be rebalanced across all available queue managers in the cluster, maximising the capacity of the horizontally scaled set of queue managers.
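PerfHarness sets the relevant connection properties via its command-line flags (shown at the end of this article). For reference, the equivalent settings in a hand-written JMS client look something like the sketch below; this is an illustrative fragment rather than the PerfHarness code, and the class name and CCDT path are placeholders:
import javax.jms.Connection;
import javax.jms.JMSException;
import com.ibm.msg.client.jms.JmsConnectionFactory;
import com.ibm.msg.client.jms.JmsFactoryFactory;
import com.ibm.msg.client.wmq.WMQConstants;

public class UniformClusterClient {
    public static void main(String[] args) throws JMSException {
        JmsFactoryFactory ff = JmsFactoryFactory.getInstance(WMQConstants.WMQ_PROVIDER);
        JmsConnectionFactory cf = ff.createConnectionFactory();

        // Connect as a client, locating the queue managers via the JSON CCDT
        cf.setIntProperty(WMQConstants.WMQ_CONNECTION_MODE, WMQConstants.WMQ_CM_CLIENT);
        cf.setStringProperty(WMQConstants.WMQ_CCDTURL, "file:///nfs1/UC01.json");

        // '*UC01QM' selects any queue manager in the UC01QM group defined in the CCDT
        cf.setStringProperty(WMQConstants.WMQ_QUEUE_MANAGER, "*UC01QM");

        // Automatic client reconnect lets the uniform cluster move this application between QMs
        cf.setIntProperty(WMQConstants.WMQ_CLIENT_RECONNECT_OPTIONS, WMQConstants.WMQ_CLIENT_RECONNECT);

        // The application name (APPLTAG) used by the uniform cluster when balancing instances
        cf.setStringProperty(WMQConstants.WMQ_APPLICATIONNAME, "mqperf.sender");

        Connection conn = cf.createConnection();
        conn.start();
        // ... create a session and producer/consumer and exchange messages here ...
        conn.close();
    }
}
Illustrative JMS client settings for uniform cluster rebalancing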
The PerfHarness sender process was started on host AppHost1 and the PerfHarness receiver process was started on host AppHost2.
All machines in the test are connected via 40Gb network links.
Figure 1: Test Topology at Start of Test
Again, for simplicity, all 12 of the uniform cluster queue managers are created in advance, although they are started individually over the duration of the test as the load ramps up. Pre-creating the queue managers is not necessary in a real deployment, as MQ supports creating and joining queue managers into a cluster dynamically, and applications can dynamically load new connection information. Figure 1 shows the topology of the test at the start, with one queue manager started. Figure 2 shows the topology of the test at the end, when all 12 queue managers have been started to absorb the increased messaging rate.
Figure 2: Test Topology at End of Test
Test Scenario
The objective of the test is to initially establish all the applications with a single queue manager active and progressively increase the messaging load to the point that the queue manager becomes the limiting factor. We periodically start up additional queue managers and see how the messaging traffic is horizontally balanced to increase capacity. The steps are:
1. A single QM (UC01QM01) on QMHost1 is started in its own cgroup slice, restricting it to 4 cores of CPU.
2. 180 re-connectable sender application threads are started on AppHost1, putting 2KB non-persistent messages at a rate of 1 message per second (per thread) to a single queue. Each thread connects to MQ setting its APPLTAG to ‘mqperf.sender’.
3. 180 re-connectable receiver application threads are started on AppHost2. Each thread connects to MQ setting its APPLTAG to ‘mqperf.receiver’.
4. The rate of the existing senders is increased by 115 messages / second (per sender).
5. The rate of the existing senders is increased by a further 115 messages / second (per sender), stressing the started QMs.
6. An additional uniform cluster member is started to accommodate the increase in throughput.
7. Repeat from step 5 until all 12 QMs are started.
The PerfHarness applications use a JSON CCDT to connect to the queue managers using a QM group of ‘UC01QM’. All the cluster QMs are in the UC01QM group, and as only UC01QM01 is up when the receivers and senders are started, they all connect to that QM. Again, this is for simplicity. In real life you can add new entries to the CCDT as you create new queue managers in the uniform cluster dynamically; the connected applications will automatically reload the new information as they need to.
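The CCDT used in the test is not reproduced here, but an abbreviated JSON CCDT covering just the first two queue managers might look like the sketch below (hostnames and ports follow the crtmqm commands later in this article; the real file would contain one entry per cluster member). Setting ‘queueManager’ to the group name ‘UC01QM’ on every entry is what allows the applications to connect using ‘*UC01QM’:
{
  "channel": [
    {
      "name": "PERF.APP.CHL",
      "type": "clientConnection",
      "clientConnection": {
        "connection": [ { "host": "MQHost1", "port": 1414 } ],
        "queueManager": "UC01QM"
      }
    },
    {
      "name": "PERF.APP.CHL",
      "type": "clientConnection",
      "clientConnection": {
        "connection": [ { "host": "MQHost1", "port": 1415 } ],
        "queueManager": "UC01QM"
      }
    }
  ]
}
Abbreviated, illustrative JSON CCDT for the UC01QM queue manager group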
The only controls on the test are to increase the message rate of the existing senders and to start additional uniform cluster QMs. The increase of rate before each additional QM was started caused the existing QMs to be restricted by the cgroup CPU resource limit. When an additional QM is started, the senders and receivers are automatically re-balanced, so the rate achieved increases.
The rate achieved by the senders was measured after each increase and after each additional QM had been started.
Results
Message rates and the CPU% consumed by the two MQ host machines are shown in the table below. Each time the target rate (i.e. the total attempted message delivery rate across all senders) is increased, the QMs that have been started reach the limit of their cgroup slice allocation; the target rate is then achieved when the next QM is started.
| Target Rate (msgs/sec) | Rate Achieved (msgs/sec) | #QMs | MQHost1 CPU% | MQHost2 CPU% |
| --- | --- | --- | --- | --- |
| 20,700 | 20,700 | 1 | 10.96 | 0.00 |
| 41,400 | 21,070 | 1 | 16.06 | 0.06 |
| 41,400 | 41,402 | 2 | 26.93 | 0.02 |
| 62,100 | 42,244 | 2 | 32.63 | 0.00 |
| 62,100 | 62,101 | 3 | 44.59 | 0.00 |
| 82,800 | 63,108 | 3 | 49.13 | 0.00 |
| 82,800 | 82,809 | 4 | 63.02 | 0.00 |
| 103,500 | 84,896 | 4 | 65.63 | 0.00 |
| 103,500 | 103,495 | 5 | 79.76 | 0.00 |
| 124,200 | 106,077 | 5 | 82.00 | 0.00 |
| 124,200 | 124,192 | 6 | 94.53 | 0.00 |
| 144,900 | 125,901 | 6 | 95.75 | 0.00 |
| 144,900 | 144,892 | 7 | 94.45 | 9.12 |
| 165,600 | 146,211 | 7 | 95.06 | 9.13 |
| 165,600 | 165,595 | 8 | 94.28 | 23.02 |
| 186,300 | 165,589 | 8 | 94.25 | 23.06 |
| 186,300 | 186,293 | 9 | 93.33 | 42.60 |
| 207,000 | 186,347 | 9 | 93.81 | 42.69 |
| 207,000 | 206,974 | 10 | 96.45 | 56.05 |
| 227,700 | 206,974 | 10 | 95.25 | 57.81 |
| 227,700 | 227,670 | 11 | 96.16 | 73.03 |
| 248,400 | 227,675 | 11 | 96.00 | 72.81 |
| 248,400 | 248,338 | 12 | 96.88 | 86.95 |
Table 1
Figure 3 below shows a plot of the message rate and CPU consumption of the QM hosts. Two points on the graph are highlighted to show where the message rate was increased and where a QM was subsequently started to accommodate the additional demand. As the test progresses, the fixed increment in rate (representing around 85% of a single QM's messaging capability) is increasingly absorbed by the larger number of QMs already started, so a noticeable step up in rate can be seen when the rate is increased, followed by a further increase when the additional QM is started.
Figure 3: Message Rate and MQHost CPU%
Figure 4 shows the cgroup slice CPU (400% = 4 cores) of a selected number of QMs (UC01QM01, UC01QM02, UC01QM07 & UC01QM08) and the number of receivers connected to the first QM (UC01QM01).
When the rate is incremented for the second time (towards the start of the test), UC01QM01 reaches its slice limit of 400% until the second QM is started. The initial QMs on a machine tend to consume less CPU than subsequent QMs, probably due to the lesser impact of context switching across the machine. CPU consumption of the QMs on MQHost2 (starting with UC01QM07 & UC01QM08, shown) once again starts out lower for the initial QMs, and lower still than that of the initial QMs on MQHost1.
The number of receivers shown connected to UC01QM01 is representative of the number of receivers on any active QM at that point in the test. The number of connected senders per QM will be similar, so only the one plot is shown for clarity.
Note that as the number of applications on each QM approaches parity, re-balancing takes longer, to avoid continuously moving applications in pursuit of a perfect balance; so at the end of the test QMHost2 has some QMs hosting slightly fewer receivers than those on QMHost1. If the test were allowed to continue, the uniform cluster function would balance the applications evenly across all twelve QMs. In production there will probably be applications connecting and disconnecting continuously, so constant re-balancing when the number of applications on each QM is already very close is not desirable.
Figure 4: cgroup CPU% and Receivers Connected to UC01QM01
Comparisons with Alternative Scenarios
In addition to the test above, peak rates and CPU were measured for the following scenarios:
- Fixed High Rate (1-12 QMs): 180 senders and receivers were started, connecting to a single active QM (UC01QM01). Senders were set to deliver 248,400 messages per second from the start. Additional QMs were started, up to a total of 12, to take up the load.
- Fixed High Rate (12 QMs): 180 senders and receivers were started, connecting to 12 active QMs (UC01QM01-UC01QM12). Senders were set to deliver 248,400 messages per second from the start. In this case the queue manager groups functionality distributes the connections across the 12 QMs from the start.
- Fixed High Rate (2 QMs): 180 senders and receivers were started, connecting to 2 unrestricted, active QMs (UC01QM01 & UC01QM07) outside of a cgroup slice, enabling the single QM on each host to consume all of the host's CPU resources. Senders were set to deliver 248,400 messages per second from the start. In this case the queue manager groups functionality distributes the connections across the 2 QMs from the start.
| Test | Rate Achieved (msgs/sec) | MQHost1 CPU% | MQHost2 CPU% |
| --- | --- | --- | --- |
| Base Test | 248,338 | 97 | 87 |
| Fixed High Rate (1-12 QMs) | 248,300 | 94 | 90 |
| Fixed High Rate (12 QMs) | 248,400 | 90 | 90 |
| Fixed High Rate (2 unrestricted QMs) | 248,100 | 96 | 95 |
Table 2
Table 2 above shows that performance was similar across the scenarios with 12 uniform cluster QMs. The load was slightly more balanced at the end when the clients were initially connected across 12 active QMs, but over time the balancing logic of the uniform cluster would be expected to settle to a similar state for the other cases as well.
For 2 large QMs the achieved rate was slightly lower, so in addition to being able to start QMs only as they are needed (with the associated saving in resources such as memory), splitting the load across multiple smaller QMs can achieve better throughput.
In production, workloads will differ from these scenarios, probably combining surges in message rate and possibly in client connections at peak times. A variety of performance metrics can be monitored to establish trigger points for starting additional cluster members to take up the load (e.g. CPU usage, queue depth etc.). Existing applications will be balanced across the cluster as it expands, whilst newly started applications can use queue manager groups to spread connections across the cluster members that are active at any point. Similarly, a uniform cluster can be contracted as demand decreases by issuing ‘endmqm -r <QM>’, which will move applications off the QM being stopped onto the remaining active cluster members.
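For example, expanding or contracting the cluster is simply a matter of starting or ending queue managers (names here follow those used in this test):
# Scale up: start another uniform cluster member; existing re-connectable applications
# are automatically re-balanced onto it
strmqm UC01QM08
# Scale down: end a member; -r asks its re-connectable applications to reconnect to the
# remaining active cluster members before the queue manager stops
endmqm -r UC01QM12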
Queue Manager Setup
The uniform cluster configuration utilises the -ii, -ic and -iv flags of the crtmqm command, allowing every queue manager to use exactly the same configuration, which makes horizontal scaling very easy. So, to create the first QM (UC01QM01), which is also the first full repository of the uniform cluster:
crtmqm -lc -p 1414 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost1(1414) UC01QM01
The -ii parameter points to a file of qm.ini attributes which are used to add or modify entries for the QM being created. For a uniform cluster it must at least contain the essential entries defining the cluster (see below). It can also contain additional entries, applied to any QM created with it, such as enabling fastpath channels in the example below. The same file can be used for all queue managers, so it could be located on an NFS file system for access by QMs being created on different hosts, for example.
AutoCluster:
Repository1Conname=MQhost1(1414)
Repository1Name=UC01QM01
Repository2Conname=MQhost2(1414)
Repository2Name=UC01QM07
ClusterName=UC01
Type=Uniform
#
# Custom entries
Channels:
MQIBindType=FASTPATH
Sample uniform cluster ini file used by the -ii parameter
The -ic parameter points to an mqsc command file that is run, on every start-up, by any queue manager created with it. It must contain at least the cluster receiver channel for the QM and can make use of built-in variables and variables specified on the crtmqm command (-iv parameter).
# Uniform cluster receiver channel
define channel('+AUTOCL+_+QMNAME+') chltype(clusrcvr) trptype(tcp) conname('+CONNAME+') cluster('+AUTOCL+') replace
# Custom entries
alter LISTENER(SYSTEM.LISTENER.TCP.1) TRPTYPE(TCP) BACKLOG(5000)
define channel(PERF.APP.CHL) chltype(svrconn) trptype(tcp) replace
alter channel(PERF.APP.CHL) chltype(SVRCONN) sharecnv(1)
define qlocal(request1) replace
define qlocal(request2) replace
…
Sample mqsc file used by the -ic parameter of crtmqm
In the sample file above, the mqsc commands make use of the +AUTOCL+ and +QMNAME+ built-in variables and the +CONNAME+ variable specified on the command line to create the cluster receiver channel.
When the -ii and -ic files are set up in this way, very little needs to be changed to create multiple cluster members. In this example we merely change the QM name, host and listener port on the crtmqm command. E.g. to create QMs UC01QM01 to UC01QM03 on QMHost1 and QMs UC01QM07 to UC01QM09 on QMHost2:
On QMHost1 execute:
crtmqm -lc -p 1414 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost1(1414) UC01QM01
crtmqm -lc -p 1415 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost1(1415) UC01QM02
crtmqm -lc -p 1416 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost1(1416) UC01QM03
On QMHost2 execute:
crtmqm -lc -p 1414 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost2(1414) UC01QM07
crtmqm -lc -p 1415 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost2(1415) UC01QM08
crtmqm -lc -p 1416 -ii /nfs1/uniclus.ini -ic /nfs1/uniclus.mqsc -iv CONNAME=MQHost2(1416) UC01QM09
Machines used in the Test
QM_host1 ThinkSystem SR630
CPU 2 x 12: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
QM_host2 ThinkSystem SR630
CPU 2 x 12: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
App_host1 System x3550 M5
CPU 2 x 14 Intel(R) Xeon(R) E5-2690 v4 @ 2.60GHz
App_host2 System x3550 M5
CPU 2 x 14 Intel(R) Xeon(R) E5-2690 v4 @ 2.60GHz
PerfHarness Commands
For the main test above, the following PerfHarness commands were used, should you wish to run something similar. Note that you will need an up-to-date version of PerfHarness from GitHub (link below) to pick up some of the recent changes that support uniform cluster testing. The SocketCommandProcessor class was used to control the rate of the senders.
Receivers:
java -Xms768M -Xmx768M -Xmn600M JMSPerfHarness -su -wt 10 -wi 0 -nt 180 -ss 5 -sc BasicStats -rl 0 -id 5 -tc jms.r11.Receiver -d REQUEST -to 20 -db 1 -dx 1 -dn 1 -jb *UC01QM -jt mqc -pc WebSphereMQ -ccdt file:///nfs1/UC01.json -ar WMQ_CLIENT_RECONNECT -an mqperf.receiver -wp true -wc 4
Senders:
java -Xms768M -Xmx768M -Xmn600M JMSPerfHarness -su -wt 10 -wi 0 -nt 180 -ss 5 -sc BasicStats -rl 0 -id 5 -tc jms.r11.Sender -d REQUEST -to 20 -db 1 -dx 1 -dn 1 -jb *UC01QM -jt mqc -pc WebSphereMQ -ms 2048 -rt 1 -ccdt file:///nfs1/UC01.json -ar WMQ_CLIENT_RECONNECT -an mqperf.senders -cmd_c SocketCommandProcessor
Resources
PerfHarness Github repository
https://github.com/ot4i/perf-harness
Queue Manager Groups:
https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.2.0/com.ibm.mq.dev.doc/q027490_.html
Uniform Clusters:
https://www.ibm.com/support/knowledgecenter/SSFKSJ_9.2.0/com.ibm.mq.pla.doc/q132720_.htm