IBM MQ on AWS Deployment Options
After purchasing IBM MQ entitlement via the AWS Marketplace or directly through your sales representative, you need to decide how to deploy your queue managers to meet your messaging requirements. IBM MQ supports a wide variety of platforms and solutions for data replication and service availability. In this blog I’ll talk about the options available for installing IBM MQ on AWS and what you need to consider to make an informed choice. If you need a basic solution with limited availability, getting up and running quickly with IBM MQ on AWS isn’t difficult or time consuming and can easily be achieved even if you are new to IBM MQ. To meet requirements for higher service and data availability, you need to understand some general availability concepts. This blog covers those concepts and how they apply to the IBM MQ capabilities and deployment options available to you.
Getting started quickly
If all you want to do at this stage is get a basic, limited-use system up and running without high availability, one of the easiest and fastest ways to install and run IBM MQ is to use the IBM MQ Ansible solution with a Linux virtual machine. I have written a task-oriented blog with instructions that will get IBM MQ Developer edition installed and running in just a few minutes. The blog can be found here: https://community.ibm.com/community/user/integration/blogs/martin-evans/2023/12/13/a-quick-guide-to-installing-ibm-mq-on-ubuntu-linux
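Once the playbook completes, you can sanity-check the installation with a short client program. The sketch below uses the pymqi Python client library (not part of the install above); the queue manager name, channel, and port are assumptions based on typical developer-edition defaults, so substitute your own values:

```python
# Quick connectivity check using pymqi (pip install pymqi; requires the IBM MQ
# client libraries to be installed). All names and the port below are assumed.
import pymqi

queue_manager = 'QM1'             # assumed queue manager name
channel = 'DEV.APP.SVRCONN'       # assumed application channel
conn_info = 'localhost(1414)'     # assumed host(port) of the listener

qmgr = pymqi.connect(queue_manager, channel, conn_info)
print('Connected:', queue_manager)
qmgr.disconnect()
```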
Defining the right deployment for your business
The IBM MQ Ansible project mentioned above can be used for more capable deployments, but before we discuss automation and highly available IBM MQ solutions, we should first understand what the drivers are for a particular solution or architecture. Your requirements will guide the kind of deployment that is suitable.
Data Availability
What level of message or data loss, if any, can you tolerate? This is called your recovery point objective (RPO); it determines how much data or message loss you can accept in the event of a component failure. An RPO of zero means you cannot tolerate any message loss. If you are using non-persistent messages, then you probably don’t need to be concerned with disk replication for anything other than ensuring the system can tolerate a disk failure and maintain service.
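Whether a message counts towards your RPO at all depends on its persistence. As a hedged illustration, here is how a client can set persistence per message using the pymqi Python client library; the connection details and queue name are assumptions:

```python
# Minimal sketch with the pymqi client library (pip install pymqi; the IBM MQ
# client must also be installed). Connection details below are assumptions.
import pymqi

qmgr = pymqi.connect('QM1', 'DEV.APP.SVRCONN', 'localhost(1414)')
queue = pymqi.Queue(qmgr, 'DEV.QUEUE.1')

md = pymqi.MD()
md.Persistence = pymqi.CMQC.MQPER_PERSISTENT        # hardened to disk, replicable
# md.Persistence = pymqi.CMQC.MQPER_NOT_PERSISTENT  # faster, but lost on failure

queue.put(b'example payload', md)
queue.close()
qmgr.disconnect()
```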
To achieve an RPO of zero you must deploy a solution that replicates the data disks that messages are persisted on. IBM MQ has several options for disk replication, and one of the first considerations when choosing between them is whether you want IBM MQ to do the replication for you or whether you will rely on the platform’s storage.
IBM MQ data replication options are:
- Native HA – runs on Kubernetes, so EKS, ROSA, or your own Kubernetes platform. This is a cloud-native solution that runs on any Kubernetes system, but when coupled with Red Hat OpenShift it offers state-of-the-art availability, security, and replication capabilities and deployment tools. Non-OpenShift platforms must build their own container image and use sample Helm charts for deployment. Native HA offers unparalleled disk replication capabilities and provides the fastest failover of all IBM MQ solutions, reducing your RTO and giving you an RPO of zero. IBM MQ Native HA can use simple block storage such as EBS.
https://www.ibm.com/docs/en/ibm-mq/9.3?topic=operator-native-ha
- RDQM – only available on Red Hat Enterprise Linux (RHEL), so VM or bare metal. This is a great solution for high availability and disaster recovery. Disaster recovery has the option of synchronous and asynchronous cross-region replication. As with Native HA, RDQM also offers an RPO of zero and quick recovery times.
https://www.ibm.com/docs/en/ibm-mq/9.3?topic=multiplatforms-installing-rdqm-replicated-data-queue-managers
Platform replicated data options are:
- Single-instance queue managers with replicated storage provided by the platform. This could be something like EFS or FSx storage, which has comprehensive options for replication. Recovery times depend on the underlying platform and are usually measured in minutes. Always check the AWS documentation for the availability level of the different storage options.
- Multi-instance queue managers with shared network attached storage such as FSx. As with a single-instance queue manager, multi-instance queue managers can have highly available replicated storage, but they have the added advantage of fast failover, which reduces your RTO. Multi-instance works in both OpenShift and virtual machine deployments. Multi-instance queue managers rely on SAN or network attached storage, such as EFS or FSx; whatever storage you use, IBM always recommends following the guidance on storage that can be used for a multi-instance queue manager, see:
https://www.ibm.com/docs/en/ibm-mq/9.3?topic=multiplatforms-requirements-shared-file-systems
Service Availability
How much service downtime can you tolerate? This is called your recovery time objective (RTO). Availability is usually measured in nines; for example, a 99.9% uptime requirement permits 8.76 hours of downtime per year, and it is possible to achieve one hundred percent availability with an appropriate solution. Both RPO and RTO can vary on a per queue manager or application basis, but once we establish the baseline levels of RPO and RTO that are required, we can identify which availability options we need to employ. Five nines availability (around 5 minutes of downtime per year) is unrealistic with a single queue manager; three nines (8.76 hours) is possible but still does not allow much room for unplanned downtime. To achieve 100% service availability, you must deploy at least two queue managers on a highly available, fault tolerant platform.
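To make the arithmetic concrete, here is a small worked example that converts an availability target into a yearly downtime budget:

```python
# Convert availability targets ("nines") into downtime allowed per year.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (99.0, 99.9, 99.99, 99.999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability / 100)
    print('%8.3f%% uptime allows %7.2f hours (%8.1f minutes) down per year'
          % (availability, downtime_hours, downtime_hours * 60))

# 99.900% uptime allows 8.76 hours per year; 99.999% allows about 5.3 minutes.
```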
If you can tolerate more downtime than three nines allows, a single queue manager might be sufficient, but it will still need to be on a resilient platform that can be recovered in a timely manner. For anything that requires greater availability than three nines, you should consider running at least two queue managers that can provide the IBM MQ service. IBM MQ can be used with a load balancer across two queue managers (with some restrictions), or it can be configured to use an IBM MQ Uniform Cluster for high service availability. You can also configure your IBM MQ client code to use a connection list containing more than one queue manager, as the sketch after the list below shows.
For 99.9% or less availability:
- Single-instance queue manager
- Highly available queue manager, Native HA or Multi-instance
For availability greater than 99.9%, or less than 8.76 hours of downtime tolerance per year:
- Two or more queue managers in a Uniform Cluster
- Two or more queue managers with a load-balancer or client connection lists
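As a hedged sketch of the client connection list option, the pymqi example below names two queue manager hosts; the MQ client tries each address in turn, so an application can connect to a surviving queue manager if the first is unavailable. Host names, channel, and port are assumptions:

```python
# Client connection list: a comma-separated list of host(port) addresses.
# The client works through the list until a connection succeeds.
import pymqi

conn_info = 'mq-host-a(1414),mq-host-b(1414)'  # assumed hosts for the two QMs

# A blank queue manager name accepts whichever queue manager answers; both
# queue managers must host the same channel and application queues.
qmgr = pymqi.connect('', 'APP.SVRCONN', conn_info)
print('Connected via one of:', conn_info)
qmgr.disconnect()
```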
Cross-Region Data Availability
Your cross-region or disaster recovery (DR) requirements will further narrow your choice of deployment. When we talk about cross-region or DR, we typically mean replication over a network with >10ms latency between datacentres, or datacentres that are more than 200 miles apart. Cross-region or DR distances cannot be used for something like Native HA with its synchronous replication, as it requires <=10ms network latency; Kubernetes availability zones have a similar requirement. Native HA provides a cloud-native highly available solution managed by IBM MQ, but it does not presently offer a cross-region capability. Platform providers usually offer some form of cross-region data replication at the disk level, and IBM MQ multi-instance can leverage this capability. When spanning regions, data is typically replicated asynchronously, which tolerates greater latency and distance but comes at the cost of potential data loss – albeit small.
IBM MQ solutions that offer cross-region data replication are:
- Single-instance queue manager with platform replicated storage.
- Multi-instance queue manager with network attached storage such as Amazon FSx.
Cross-Region Service Availability
Cross-region service availability can be achieved by running two independent queue managers, one in each region, both able to offer the MQ service. IBM MQ clusters can span regions that are geographically separated and have a latency of >10ms.
Availability Zones
AWS has the concept of an availability zone. There are typically three availability zones in a region: separate datacentres located close enough together that latency does not exceed ~10ms, but far enough apart that a localised event should not impact more than one zone. Distributing your IBM MQ solution across the three zones can provide a highly fault tolerant system, but it does come at the cost of increased latency for disk replication. IBM MQ Native HA can leverage a three-zone region to provide high data and service availability.
Sizing and Performance
A question I get asked on a frequent basis is how big a server needs to be, or what the minimum deployment is. Before we can answer that, we need to understand what governs the performance of a queue manager, and what the size and nature of the workload will be; all of these factors will influence the size of the servers and the queue manager topology.
Resources such as CPU and memory are relatively easy to determine from the performance reports that are available, and the amount of disk space required can easily be calculated from the workload and how long you want to be able to tolerate a consuming application outage. Where it gets a little more complicated is network and disk: their performance is greatly impacted by distance and by your availability options.
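As a back-of-envelope illustration of that disk calculation (all figures assumed), the sketch below estimates the queue storage needed to buffer messages through a consumer outage:

```python
# Estimate queue storage needed to buffer messages during a consumer outage.
msgs_per_second = 500            # assumed arrival rate
avg_msg_size_bytes = 4 * 1024    # assumed average message size (4 KiB)
outage_hours = 4                 # consumer outage you want to tolerate

backlog_msgs = msgs_per_second * outage_hours * 3600
backlog_gib = backlog_msgs * avg_msg_size_bytes / 2**30

print('Backlog: %d messages, roughly %.1f GiB of queue storage'
      % (backlog_msgs, backlog_gib))
# ~7.2 million messages, ~27.5 GiB, plus headroom for per-message overhead.
```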
Something like an SSD attached directly to a server will offer very low disk latency and high throughput, but it offers no availability; a disk that is synchronised across two datacentres won’t perform anywhere near as well as the local SSD, but it will provide a high level of availability in the event of a component or datacentre failure. So, if the nature of the workload requires that message data is replicated across datacentres, it must be accepted that this will probably be one of the main factors governing the performance or capacity of a single queue manager.
This is probably a good time to introduce vertical vs horizontal scaling. A vertically scaled queue manager can be thought of as a single pipe whose size and flow are governed by CPU, memory, disk, and network performance. A single pipe is often big enough, with enough flow, to handle small or medium workloads, but once you hit the limit of that pipe you need to scale horizontally by adding more pipes. It is very common for the nature of the workload to dictate that message data is replicated within a single datacentre or across datacentres, which results in a smaller single pipe.
The size of the workload is typically considered as the number of messages for a given period and the average size of those messages, but as previously mentioned the nature of the workload also plays a part, and it isn’t just replicated data that is a factor: persistent vs non-persistent messages have a huge influence on disk utilisation, and TLS places an equally large demand on CPU. Peak times also need to be considered, along with whether any backlog caused by a burst of messages will be an issue; some systems are happy to allow backlogs to be buffered by a queue manager so long as they can catch up again.
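A quick way to sanity-check the backlog question is a small calculation like the one below (all rates assumed for illustration):

```python
# Will the system catch up after a burst of messages?
peak_rate = 2000       # msg/s arriving during the burst (assumed)
burst_minutes = 10
steady_rate = 300      # msg/s arriving after the burst (assumed)
consume_rate = 800     # msg/s the consumers can sustain (assumed)

backlog = (peak_rate - consume_rate) * burst_minutes * 60  # messages queued
drain_rate = consume_rate - steady_rate                    # spare capacity
catch_up_minutes = backlog / drain_rate / 60

print('Backlog after burst: %d messages' % backlog)
print('Time to catch up: about %.0f minutes' % catch_up_minutes)
# 720,000 messages queued; at 500 msg/s spare capacity, ~24 minutes to drain.
```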
A good place to start for sizing is the IBM MQ performance reports; the reports cover a set of different workloads run on several platforms with different availability options. To some extent resource demands are linear, and that is certainly a good rule of thumb to start with, but keep in mind that once you hit a bottleneck on a resource that you cannot vertically scale any further, increasing other resources won’t make any difference. An example of this is disk storage replicated across datacentres: once you are constrained by the speed of the network, adding more CPU or memory will not make the system any faster, and at this point you will need to consider horizontal scaling.
When you have an idea of the size of the server or servers and the MQ deployment you require, the next step, for the reasons discussed, is to test how much workload your system will handle when deployed on the platform with the exact system specifications you are planning to use.
If you are migrating an existing IBM MQ deployment and understand the workload and utilisation of the systems it uses, you can use that data to make informed sizing decisions. You will still need to consider how your new MQ topology will support your performance requirements as the target solution might not have the same performance characteristics.
One way to calculate how much compute power you need is to deploy your target-state technical solution and run a workload that represents a portion of your intended workload to create a benchmark. Once you have a benchmark for that portion of the workload, you can extrapolate to get an idea of the size of the overall deployment required to meet your performance requirements.
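A simple extrapolation from such a benchmark might look like the sketch below; the numbers are assumptions, and remember the earlier caveat that scaling is only roughly linear until you hit the first bottleneck:

```python
# Linear extrapolation from a benchmarked slice of the workload.
benchmark_rate_msgs = 1200   # msg/s achieved in the benchmark (assumed)
benchmark_vcpus = 4          # vCPUs used for the benchmark (assumed)
target_rate_msgs = 5000      # msg/s required in production (assumed)

scale = target_rate_msgs / benchmark_rate_msgs
print('Estimated vCPUs: %.1f' % (benchmark_vcpus * scale))
# ~16.7 vCPUs; validate by testing, since linearity breaks at the first
# bottleneck (for example, replicated disk or network).
```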
Hot Queues
What do I mean by hot queues? The analogy I like to use is that some queues burn a bit brighter than others: if the maximum wattage your target deployment supports for any single bulb is 60W, and you have some bulbs running at 100W, you’re going to have to replace each of those with two bulbs to get the same lumens. In queue terms, that means you might have to change your applications and split the traffic across queues, and this is probably the most challenging aspect of migration. Systems like the IBM MQ Appliance can combine high availability with very high throughput.
Sizing your IBM MQ Transaction Logs
This blog won’t go into the details of the inner workings of the transaction log or the sizing calculations; full details of how logging works and how to size the log can be found on the IBM MQ documentation site. Here we are going to discuss some of the concepts that will help guide your technical decisions and size your transaction log appropriately.
To help us understand transaction log sizing we can think of the transaction log as being responsible for ensuring messages are safely stored to disk when handed over to the queue manager from client applications, and in some cases providing the capability to recover messages stored on data disks.
Sizing your transaction log will depend on the flavour of logging: circular or linear. With circular logging, the simpler form, the transaction log only looks after your active workload until it has been consumed or written to a queue file; as the name suggests, circular logging rotates over the logs and does not need any purging or clean-up.
Linear logging provides the capability to recover messages or damaged objects. As such it needs capacity for the active workload, just the same as circular logging, but it additionally needs space for a copy of the messages in the queue files held on disk waiting to be consumed. Part of the housekeeping for linear logging requires things to be moved around, and during this time linear logging can require more disk space than you might think; we recommend that the disk space for linear transaction logs should be twice the size of your data disk plus the size of your active log (including primary and secondary log files). The queue manager can be configured to automatically provide housekeeping for linear logs, but it will always retain the logs required for recovery and queue manager restarts.
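As a rough illustration of that rule of thumb, the sketch below sizes a linear log disk from an assumed data disk size and assumed qm.ini log settings (LogFilePages, LogPrimaryFiles, LogSecondaryFiles):

```python
# Rule of thumb: linear log disk = 2 x data disk + active log size.
data_disk_gib = 100        # assumed queue data disk size
log_file_pages = 65535     # LogFilePages: 4 KiB pages per log extent (assumed)
primary_files = 12         # LogPrimaryFiles (assumed)
secondary_files = 4        # LogSecondaryFiles (assumed)

extent_gib = log_file_pages * 4096 / 2**30
active_log_gib = (primary_files + secondary_files) * extent_gib
log_disk_gib = 2 * data_disk_gib + active_log_gib

print('Active log: about %.1f GiB' % active_log_gib)
print('Recommended linear log disk: about %.1f GiB' % log_disk_gib)
# ~4 GiB of active log gives ~204 GiB of log disk for a 100 GiB data disk.
```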
Native HA uses a replicated log that shares many characteristics with linear logs in that it contains both application workload and media images. Native HA records new media images on a regular basis and so for sizing purposes, you should follow the recommendations for sizing a log of linear type.
More details can be found here: https://www.ibm.com/docs/en/ibm-mq/9.3?topic=lost-calculating-size-log
Other Considerations
Platform preference
Platform preference is often determined by an enterprise architecture mandate that will also be determining the platform of choice for your applications. It would be quite unusual for an MQ administrator to choose a platform like OpenShift for IBM MQ when all your other applications are running on VMs, however, it wouldn’t be unusual for an enterprise to mandate that all applications should be containerised and that would include IBM MQ.
Skill level
When it comes to administering and deploying an IBM MQ solution, skills can be broadly broken down into two areas: platform skills and IBM MQ skills. Platform skills can influence your choice of MQ deployment when they are aligned with your enterprise platform strategy; if your enterprise mandates a container platform like OpenShift, you will have those skills in your organisation and you will be encouraged to leverage them. My recommendation is to choose the IBM MQ system that meets your requirements and adheres to your enterprise strategy; this should ensure you have the right solution and the platform skills to support it.
There are many good reasons for automation, but one of them is that it can reduce the skill level required to manage MQ. The IBM MQ Operator that runs on OpenShift automates a lot of the heavy lifting for you; if you have an enterprise strategy that supports running IBM MQ on OpenShift, it is a great choice and offers many business and technical benefits.
Automation Strategy
As I mentioned in the skills section, there are many good reasons for automation, and just as you need to align your platform and system choice with your enterprise strategy, you should adopt an automation strategy that aligns with it too. IBM MQ can be integrated with any automation system, but the cloud-native nature of IBM MQ Native HA lends itself particularly well to modern continuous-deployment and configuration-as-code strategies. The OpenShift platform has built-in tooling that greatly reduces risk and maintains system integrity, and it is fast becoming a de-facto standard. Just as the tooling in OpenShift is gaining in popularity, so too is Ansible; the IBM MQ team have created an open-source project and community to further the development of IBM MQ Ansible automation. Ansible offers a comprehensive solution that can be used with container platforms and provides a great starting point for automating traditional MQ deployments.
Summary
IBM MQ offers a lot of flexibility to build the right solution on AWS to meet the needs of your business. If you take the time to understand your requirements and the constraints of a deployment, you will be able to choose the right solution based on informed decisions. In the meantime, you can get up and running very quickly with the IBM MQ Ansible solution and learn more about developing and running MQ.