The recent significant outages at a variety of cloud providers (e.g. AWS, Google) bring sharply into focus the importance of ensuring your IT landscape avoids single points of failure in any dimension, whether platforms, regions, or indeed even entire cloud providers.
Asynchronous messaging capabilities such as IBM MQ are often used to ensure robust communication between applications in the face of outages. Should connectivity or applications fail, instead of the mesh of interactions instantly reverting to a chaos of timeouts, retries, and potential loss of data, queues can simply take the load for a while until the problem is resolved. Messaging brings robustness to the table in other ways too, assuring exactly-once delivery, simplifying parallelism and workload balancing, and more.
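To make that concrete, here is a minimal sketch of the producer side using the open-source pymqi client library (the queue manager name, channel, connection details, queue name, and message body are all placeholders, not anything from a real system). The message is marked persistent, so it sits safely on the queue even if the consuming application, or the connectivity to it, is down for a while.

    import pymqi

    # Placeholder connection details for an example queue manager.
    queue_manager = 'QM1'
    channel = 'DEV.APP.SVRCONN'
    conn_info = 'mq.example.com(1414)'

    qmgr = pymqi.connect(queue_manager, channel, conn_info)

    # Mark the message persistent so it survives a queue manager restart,
    # then put it and move on -- there is no need for the consuming
    # application to be available right now.
    md = pymqi.MD()
    md.Persistence = pymqi.CMQC.MQPER_PERSISTENT

    queue = pymqi.Queue(qmgr, 'ORDERS.IN')
    queue.put(b'{"orderId": 42, "qty": 3}', md)
    queue.close()
    qmgr.disconnect()

The consumer simply gets from the same queue whenever it is back online; neither side needs to know whether the other is currently running.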
Let’s dwell for a moment on some of the specific features of IBM MQ that enable it to help enterprises stay running even in the most dramatic and unexpected failure scenarios.
Built-in resilience
The beating heart of IBM MQ is the queue manager, where messages are persisted. Queue managers have a variety of resilience options built in, often optimised to the platform on which they are deployed. For a full rundown of the options, check out this recent blog (http://ibm.biz/mq-hadr), but here are some highlights.
Let’s start with containers. Queue managers on Kubernetes benefit automatically from the platform’s ability to restart a failed pod. In addition, we have introduced “Native HA”, which performs synchronous replication to standby instances of the same queue manager. These can be spread across availability zones within a region, providing continuity with zero data loss even in the face of a data centre outage. The real beauty of Native HA is that it has no dependencies: no need for shared storage, no specific operating system features, all replication performed entirely by IBM MQ itself.
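As an illustrative sketch only (assuming the IBM MQ Operator and its mq.ibm.com/v1beta1 QueueManager custom resource; the name, namespace, license identifier, and version below are placeholders to check against the operator documentation), requesting a Native HA queue manager is essentially a one-field choice on the custom resource, here created with the Kubernetes Python client:

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    api = client.CustomObjectsApi()

    # Illustrative QueueManager custom resource asking the IBM MQ Operator for
    # a Native HA deployment (one logical queue manager, replicated pods).
    # Field names follow the mq.ibm.com/v1beta1 CRD as I recall it; check the
    # operator documentation for the exact schema, a valid license identifier,
    # and the MQ version string for your entitlement.
    queue_manager_cr = {
        "apiVersion": "mq.ibm.com/v1beta1",
        "kind": "QueueManager",
        "metadata": {"name": "nativeha-qm", "namespace": "mq"},
        "spec": {
            "license": {"accept": True, "license": "<license-id>", "use": "Production"},
            "queueManager": {
                "name": "QM1",
                "availability": {"type": "NativeHA"},
            },
            "version": "<mq-version>",
        },
    }

    api.create_namespaced_custom_object(
        group="mq.ibm.com",
        version="v1beta1",
        namespace="mq",
        plural="queuemanagers",
        body=queue_manager_cr,
    )

The operator then takes care of creating the replicated instances and the per-instance storage behind that single logical queue manager.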
A brand new, fully resilient, fully configured queue manager can be stood up in a container from nothing in a few minutes. Once up, failover between HA instances takes a matter of seconds. Add to that the ability to load balance across multiple active queue managers using MQ clusters, and you can achieve continuous service availability.
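For applications to ride through those few seconds unscathed, the MQ client’s automatic reconnection helps. Here is a hedged sketch with pymqi, assuming its connect_with_options call accepts MQCNO options via the opts argument (host names, channel, and queue manager name are placeholders; a CCDT is a common alternative way to supply the candidate endpoints):

    import pymqi

    # Placeholder endpoints for the instances of an HA queue manager.
    queue_manager = 'QM1'
    conn_info = 'qm1-a.example.com(1414),qm1-b.example.com(1414),qm1-c.example.com(1414)'

    cd = pymqi.CD()
    cd.ChannelName = b'DEV.APP.SVRCONN'
    cd.ConnectionName = conn_info.encode()
    cd.ChannelType = pymqi.CMQC.MQCHT_CLNTCONN
    cd.TransportType = pymqi.CMQC.MQXPT_TCP

    # MQCNO_RECONNECT asks the client library to reconnect transparently
    # (to whichever instance is now active) if the connection breaks during
    # a failover, instead of surfacing an error to the application.
    qmgr = pymqi.QueueManager(None)
    qmgr.connect_with_options(queue_manager, cd=cd, opts=pymqi.CMQC.MQCNO_RECONNECT)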
What about large-scale disasters? What if a whole region becomes unavailable? What if a cloud provider has a systemic failure? That’s where cross-region replication (CRR) comes in. Highly optimised asynchronous replication enables a standby queue manager group in another region and/or cloud provider to be ready to take over on command, with no effect on the performance of the main queue manager group.

Native HA and CRR are no longer just about containers. The technology has been so popular that in the recent release (MQ 9.4.4) we made it available for Linux on VMs and bare metal too. Indeed, one of our engineers recently had some fun emphasising just how portable and dependency-free it is by running a $100 data centre on a collection of Raspberry Pis!
Of course, there are some environments where we explicitly want to take advantage of platform-optimised behaviour. Queue managers on mainframes make use of the coupling facility to provide highly available shared queues. Another example is the significant efficiency we’re able to gain under the covers within the MQ Appliance by explicitly choosing and aligning the hardware and firmware. These platform optimisations make a huge difference, and we’ve seen well over one million transactions per second on these platforms.
Hybrid cloud messaging networks
So, you have demonstrated your hyper-resilient queue manager, and now all the domains in the business want one for their applications too! Furthermore, you need them all to be able to talk to each other, so you can confidently communicate over messaging across any part of the business. What you need is a messaging network (http://ibm.biz/mq-messaging-network) that’s dynamic, so it can change as your IT landscape moves around and modernises. You also need that network itself to be robust to failures.

It is a very common pattern for IBM MQ to be deployed as a messaging network of multiple queue managers across a hybrid multi-cloud landscape, providing exactly-once delivery, dynamic routing, traffic optimisation, and hardened end-to-end security.
IBM MQ messaging networks are built largely on the MQ clustering capability, which massively simplifies the configuration of channels between the queue managers. Queue managers can simply be created as members of a cluster, and they will instantly be able to find and communicate with the queues available on all the other queue managers in the cluster. Messages put to a queue that isn’t on the current queue manager will be safely stored on a transmission queue until they can be transferred to the destination, so applications are completely decoupled from the span of the network, as the sketch below shows.
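From the application’s point of view, that decoupling looks something like this (again pymqi, with placeholder names): the queue is opened purely by name, and the binding is left open so the cluster workload algorithm can choose among the queue managers that host it.

    import pymqi

    qmgr = pymqi.connect('QM_EDGE', 'DEV.APP.SVRCONN', 'edge.example.com(1414)')

    # PAYMENTS.REQUEST is advertised in the cluster by some other queue
    # manager; this application neither knows nor cares which one.
    # MQOO_BIND_NOT_FIXED lets the cluster workload algorithm pick among
    # the hosting queue managers on each put.
    queue = pymqi.Queue(qmgr, 'PAYMENTS.REQUEST',
                        pymqi.CMQC.MQOO_OUTPUT | pymqi.CMQC.MQOO_BIND_NOT_FIXED)
    queue.put(b'payment request payload')
    queue.close()
    qmgr.disconnect()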
Clusters can be spread across domains, regions, and hyperscalers, and the network is dynamic: if the location of a queue changes, cluster members will re-route accordingly. Priorities can also be set to encourage or enforce the use of locally available queues over more distant ones.
IBM MQ also enables the publish/subscribe pattern across clusters, and topic transmission is highly optimised. Event messages on a topic will only be circulated to queue managers that have subscribers to that topic. Furthermore, only the relevant events will be transmitted, based on each subscriber’s wildcard subscription against the hierarchical topic string. The cluster is all-seeing, and yet efficient at the same time.
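A small pymqi sketch of the publishing side (placeholder names and topic string again); subscribers elsewhere in the cluster might register interest with wildcard topic strings such as price/# or price/fruit/+, and only queue managers hosting matching subscriptions receive the event:

    import pymqi

    qmgr = pymqi.connect('QM1', 'DEV.APP.SVRCONN', 'mq.example.com(1414)')

    # Publish an event on a hierarchical topic string. Within a
    # publish/subscribe cluster, the event is forwarded only to queue
    # managers that host subscriptions matching the topic.
    topic = pymqi.Topic(qmgr, topic_string='price/fruit/apple')
    topic.open(open_opts=pymqi.CMQC.MQOO_OUTPUT)
    topic.pub(b'{"price": 1.25}')
    topic.close()
    qmgr.disconnect()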
So, with the combination of the queue manager’s built-in resilience and portability, and its ability to be part of a cluster spanning a hybrid multi-cloud landscape, IBM MQ can go a long way towards dampening the effects of significant outages, even if they affect whole regions or even whole cloud providers.