Can IBM MQ Native HA provide me with 99.999% availability?

By Jonathan Rumsey posted Sun October 15, 2023 11:58 AM

  

One question that often comes up when I talk to clients about IBM MQ Native HA is around availability targets and the industry terms RPO, RTO and SLA/SLO/SLI. If you aren’t familiar with these acronyms, the easiest way of explaining each of them in messaging service terms is as follows:

  1. RPO (Recovery Point Objective) – In a worst case scenario, how many messages & transactions am I willing to lose in a role swap?
  2. RTO (Recovery Time Objective) – How long am I willing to wait to get back to having full messaging capability in my applications after a role swap?
  3. SLA/SLO/SLI (Service Level Agreement/Objective/Indicator) – Expressed in uptime percentage terms, how available is my messaging service? An availability of “five nines” 99.999% equates to approximately 5 minutes of downtime per year. The objective (SLO) is a target set to meet the contract (SLA) agreed with the users/clients of the messaging service and the indicator (SLI) is effectively a "how well am I actually doing?". 
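
To put those percentages into downtime terms, here is a quick back-of-the-envelope calculation in Python (using a 365-day year):

    # Convert an availability target into an annual downtime budget.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    for availability in (0.999, 0.9999, 0.99999):
        downtime_minutes = SECONDS_PER_YEAR * (1 - availability) / 60
        print(f"{availability:.3%} available -> about {downtime_minutes:.1f} minutes of downtime per year")

    # 99.900% available -> about 525.6 minutes of downtime per year
    # 99.990% available -> about 52.6 minutes of downtime per year
    # 99.999% available -> about 5.3 minutes of downtime per year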

The correct answer to most questions on quantifying availability is “it depends”... Whilst that’s not an entirely useful answer, it is an honest one. In truth, a broad set of factors can impact recovery, and no two client environments are identical, so there is no simple calculation.

To see if a single Native HA queue manager can meet your availability requirements, let’s look at some of the main factors that impact the RPO, RTO and finally the SLO.

RPO (Recovery Point Objective)

Starting with the easiest, provided you don’t suffer a concurrent disk failure of two instances, the RPO of a Native HA queue manager essentially hinges on persistence.

  • Persistent messages (and everything else MQ logs)
    Native HA will provide an RPO of zero for persistent messages, transactions, object definitions, alterations and deletions. That is, no loss of data: the usual assured, exactly-once message delivery that everyone expects from IBM MQ. A Native HA queue manager will only confirm to an application that a commit of a syncpointed MQPUT or MQGET of a persistent message has been successful once at least 2 instances out of 3 have forced the data to their logs (a minimal client-side sketch follows this list).

  • Non-persistent messages
    On the flip side, any role swap of the active instance (even a planned one) will result in an RPO greater than zero for non-persistent messages – and that includes any queues using NPMCLASS(HIGH). It shouldn’t be a great surprise, but all non-persistent messages are discarded when the active instance changes or restarts.
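
To illustrate that persistent-message guarantee from the application side, here is a minimal sketch using the community pymqi client (the queue manager name, channel, connection name and queue below are placeholders to adapt for your environment):

    import pymqi

    # Placeholder connection details for the illustration.
    QMGR = "QM1"
    CHANNEL = "DEV.APP.SVRCONN"
    CONN_INFO = "mq-service(1414)"
    QUEUE_NAME = "APP.REQUEST"

    qmgr = pymqi.connect(QMGR, CHANNEL, CONN_INFO)
    try:
        # Put a persistent message inside a unit of work (syncpoint).
        md = pymqi.MD(Persistence=pymqi.CMQC.MQPER_PERSISTENT)
        pmo = pymqi.PMO(Options=pymqi.CMQC.MQPMO_SYNCPOINT | pymqi.CMQC.MQPMO_FAIL_IF_QUIESCING)
        queue = pymqi.Queue(qmgr, QUEUE_NAME)
        queue.put(b"hello, highly available world", md, pmo)
        queue.close()

        # The commit only returns successfully once the queue manager has hardened
        # the log data -- for Native HA, on at least 2 of the 3 instances.
        qmgr.commit()
    finally:
        qmgr.disconnect()

Nothing about the application needs to change for Native HA; the quorum replication happens entirely inside the queue manager.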

RTO (Recovery Time Objective)

Harder to quantify is how quickly a Native HA queue manager will recover in a role swap and how quickly applications will be able to reconnect and resume useful work, purely because there are a lot of different factors at play:

  • Planned or Unplanned role swap
    Is the role swap planned? Where an active instance is asked to perform a planned role swap, it will perform the usual quiesce process for a queue manager, recording a checkpoint, and it will also nominate which instance should become the next active. For want of a better description, this “succession planning” process can save a few valuable seconds; in its absence, the remaining instances would have a (small) delay in detecting the lack of an active instance and responding by starting an election.

    Native HA implements a consensus algorithm based on Raft to replicate log data. The active instance is expected to send a regular heartbeat to ensure it remains the active instance and to prevent a new election. If a heartbeat is not received by a replica within the heartbeat timeout period, the replica starts the process of electing a new leader/active, which results in a role swap.

    Tuning the heartbeat interval or heartbeat timeout down from their defaults (2.5 seconds and 5 seconds respectively) on a reliable cluster with sufficient resources will result in faster detection of failure. The caveat is that tuning the heartbeat too low could result in unnecessary role swaps: for example, if the cluster infrastructure is over-committed and the active instance encounters periodic resource constraints, it may fail to send a heartbeat in time. See the advanced tuning for Native HA documentation for further details of the heartbeat interval and timeout.
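
    To make the failure-detection mechanics a little more concrete, here is a deliberately simplified Python sketch of the heartbeat/election-timeout idea (illustrative only -- it is not MQ’s implementation; the interval and timeout values are just the defaults mentioned above):

        import time

        HEARTBEAT_INTERVAL = 2.5   # seconds between heartbeats sent by the active (sender side not shown)
        HEARTBEAT_TIMEOUT = 5.0    # a replica starts an election if no heartbeat arrives in time

        def start_election():
            # Stand-in for the real behaviour: request votes from the other instances;
            # the winner becomes the new active, i.e. a role swap.
            print("heartbeat timeout expired -- electing a new active")

        def replica_monitor_loop(heartbeat_received):
            # heartbeat_received() returns True if a heartbeat arrived since the last check.
            last_heartbeat = time.monotonic()
            while True:
                if heartbeat_received():
                    last_heartbeat = time.monotonic()
                elif time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT:
                    start_election()
                    return
                time.sleep(0.1)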

  • Kubernetes probe frequency
    The health-checking (liveness) probe used by Kubernetes checks that Native HA instances are healthy and restarts them if necessary. A similar readiness probe is used to identify the current active instance and route application traffic directed to the service address to the IBM MQ listener port on that instance. The readiness probe frequency defaults to 5 seconds, so in a worst-case scenario it could take up to 5 seconds after a change of active instance before applications connecting to the service address are routed to the new active. Changing the readiness probe to run more frequently can trim a few seconds off the recovery time.
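
    As a rough worked example (the 2-second figure below is purely illustrative, not a recommendation), the wait for the readiness probe to notice the new active scales with the probe period:

        # Worst case, traffic is only re-routed when the next readiness probe fires.
        for period_seconds in (5, 2):   # default period vs an illustrative faster setting
            print(f"probe period {period_seconds}s -> up to {period_seconds}s "
                  f"(typically around {period_seconds / 2:.1f}s) before traffic reaches the new active")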

  • Client application reconnect frequency/strategy
    If using IBM MQ’s automatic client reconnect from an application, the default reconnect frequency uses a simple backoff algorithm that works under the assumption that, if connectivity can’t be restored within a few seconds, the connectivity problem won’t suddenly fix itself.

    An application using this default backoff algorithm might try reconnecting after 100 milliseconds, then 1, 2, 4, 8, 16 seconds and so on. For a client application connecting to a Native HA queue manager, tuning the "ReconDelay" attribute so that the backoff/delay is not doubled quite so quickly is likely to result in a lower recovery time for planned role swaps.

    To provide the ultimate flexibility in reconnect strategy and timing, client applications could choose to code their own custom retry handling of a 2009 (MQRC_CONNECTION_BROKEN).
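
    As a minimal sketch of that kind of custom retry handling, again using the community pymqi client (the connection details and retry schedule are placeholders, not recommendations):

        import time
        import pymqi

        # Placeholder connection details for the illustration.
        QMGR, CHANNEL, CONN_INFO = "QM1", "DEV.APP.SVRCONN", "mq-service(1414)"

        # Retry quickly at first, rather than doubling the delay straight away,
        # because a Native HA role swap usually completes within a few seconds.
        RETRY_DELAYS = (0.1, 0.25, 0.5, 1, 1, 2, 2, 5)

        def connect_with_retry():
            for delay in RETRY_DELAYS:
                try:
                    return pymqi.connect(QMGR, CHANNEL, CONN_INFO)
                except pymqi.MQMIError as err:
                    # Reason codes such as 2009 (MQRC_CONNECTION_BROKEN) are worth
                    # retrying while the role swap completes.
                    if err.reason in (pymqi.CMQC.MQRC_CONNECTION_BROKEN,
                                      pymqi.CMQC.MQRC_HOST_NOT_AVAILABLE,
                                      pymqi.CMQC.MQRC_Q_MGR_NOT_AVAILABLE):
                        time.sleep(delay)
                    else:
                        raise
            raise RuntimeError("queue manager did not become available in time")

    In a real application, the same handling would typically also re-open queues and re-drive any interrupted unit of work after reconnecting.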


  • Application workload
    As with the RPO for non-persistent messages, it shouldn’t be too much of a surprise that the in-flight messaging workload and queue depth will have an impact on the time it takes to complete a role swap. If a large number of unprepared transactions were in flight immediately prior to an unplanned role swap, the next active instance will need to roll back (undo) these transactions.

    If queue manager checkpoints are not being recorded on a frequent basis, the next active instance may need to replay/redo more work from the log to ensure the queue files are consistent.

    If you find that either the undo or redo phase during active startup is taking a long time during an unplanned role swap, consider whether applications could use smaller, short-lived transactions (see the sketch at the end of this item), or whether checkpoints could be taken more frequently.


    The golden rule of MQ performance applies here: don’t use queues like a database for long-term storage, and keep queue depths as low as possible. A shallow queue requires fewer IOPS and hence less time to load back from disk during a role swap.
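
    As an example of the "smaller, short-lived transactions" suggestion, here is a pymqi-style sketch that processes messages in small units of work, committing every few messages rather than holding one long-running transaction (the queue name, batch size and process() callback are illustrative):

        import pymqi

        BATCH_SIZE = 10   # commit after this many messages to keep each unit of work short

        def drain_in_small_batches(qmgr, queue_name, process):
            # process(message) is a hypothetical application callback.
            queue = pymqi.Queue(qmgr, queue_name)
            gmo = pymqi.GMO(Options=pymqi.CMQC.MQGMO_SYNCPOINT
                                    | pymqi.CMQC.MQGMO_WAIT
                                    | pymqi.CMQC.MQGMO_FAIL_IF_QUIESCING)
            gmo.WaitInterval = 5000   # milliseconds
            in_batch = 0
            while True:
                try:
                    message = queue.get(None, pymqi.MD(), gmo)
                except pymqi.MQMIError as err:
                    if err.reason == pymqi.CMQC.MQRC_NO_MSG_AVAILABLE:
                        break
                    raise
                process(message)
                in_batch += 1
                if in_batch >= BATCH_SIZE:
                    qmgr.commit()   # short-lived transaction: little to undo in a role swap
                    in_batch = 0
            if in_batch:
                qmgr.commit()
            queue.close()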

  • Interconnectivity with other queue managers
    Messaging environments are typically architected to link queue managers together with distributed channels, for example sender/receiver pairs or clustering. Following a role swap, the channel initiator is responsible for restarting any channels that move messages between queue managers. Channel short retries (configured using the channel attributes SHORTTMR and SHORTRTY) are intended to handle transient conditions where network connectivity cannot be established; a Native HA role swap is a good example of such a condition.

    By default, channels are configured to make 10 short retry attempts, one every 60 seconds, before dropping into a less frequent long retry cycle. To ensure channels can quickly resume moving messages between queue managers following a role swap, consider tuning the channel short retry interval to a lower value and, at the same time, increasing the number of short retries.
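
    As a rough illustration of why the retry interval matters more than the count here (the tuned values are just an example, not a recommendation):

        # Default short retry settings vs an illustrative tuned example.
        for label, retries, interval in (("SHORTRTY(10) SHORTTMR(60) - default", 10, 60),
                                         ("SHORTRTY(120) SHORTTMR(5) - example", 120, 5)):
            print(f"{label}: up to {interval}s between attempts, "
                  f"about {retries * interval // 60} minutes of short-retry coverage")

        # SHORTRTY(10) SHORTTMR(60) - default: up to 60s between attempts, about 10 minutes of short-retry coverage
        # SHORTRTY(120) SHORTTMR(5) - example: up to 5s between attempts, about 10 minutes of short-retry coverage

    Both settings cover roughly the same ten-minute window, but the tuned example restarts the channel within seconds of the new active being ready.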

  • Incompatible cluster changes (e.g. TLS certificate replacement, CipherSpec change)
    Occasionally, changes are made to a Native HA queue manager that mean a subset of instances in a cluster may not be able to communicate with the other members of the cluster until a majority have rolled out that change.

    Let’s take the example of replacing a TLS certificate with one that has a different CA trust chain, where the new certificate is not going to be trusted by the existing instances. During a rolling update of such a change, the first replica being updated will not be able to form part of the quorum, and the replication link from the active will be unable to send any log data. Once the first replica has completed its update and restarted, the second replica ends and starts its update; there is now an unplanned role swap, as the current active instance has no replicas that it can work with and abdicates.

    Unfortunately there is nothing to prevent an outage when rolling out such a change, but the duration of the outage can be reduced with planning. For example, renew or reissue a certificate ahead of its expiry, but ensure the new certificate uses the same CA trust chain so that existing instances can still communicate. Avoid using self-signed certificates, as the CA trust chain is always effectively replaced. If your security policies allow it, use "ANY_" CipherSpecs to negotiate the strongest security parameters that are supported, rather than mandating a specific protocol and algorithm.

    If an incompatible cluster change is unavoidable, the time that it takes for the second replica to restart during a rolling update is critical. If the second replica is scheduled to run on a worker node that has not previously pulled the new container image, there could be a significant delay in establishing a new active whilst the image is pulled from a registry.


  • Image pull from registry 
    If a Native HA queue manager has to wait for a majority of instances to have identical/compatible configuration (see above), the time it takes to pull and deploy a new image becomes a hugely significant factor. A role swap that would typically take a few seconds could take a few minutes if an image has to be pulled from a remote registry.

    Pre-pulling container images to worker nodes in a cluster before the image is needed can be a very effective mitigation strategy. Putting incompatible configuration changes to one side, pre-pulling images ahead of any rolling update has the advantage that the rolling update across all instances will be quicker. A quicker rolling update is particularly desirable for Native HA as it results in a smaller backlog of 'log catchup' for new messaging workload accepted during the update process.

SLO (Service Level Objective)

A Native HA queue manager continues to provide messaging capability whilst there is an active instance with at least one in-sync replica, so the key to maintaining a high percentage uptime is to ensure that at least 2 out of 3 instances are always available.

Determining whether a Native HA queue manager could meet a 99.999% SLO target can only be based on presumptive data about events that you know about, or think are very likely to happen. Whilst it may be possible to be reasonably confident of the number of scheduled maintenance windows needed to apply fixes and migrate to newer versions, there is always the potential for unexpected events. Unfortunately, there are no crystal balls that can predict the unpredictable, but using reliable infrastructure and applying preventative maintenance on a regular basis is a good starting point.

Let’s assume that planned maintenance/configuration changes are applied every two weeks. To meet a “five nines” 99.999% SLO, all the fortnightly maintenance windows over a year must complete in a combined total of around 5 minutes and 15 seconds, or, expressed another way, that SLO requires each fortnightly role swap to complete in roughly 12 seconds or less.
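
The arithmetic behind that, as a quick sanity check in Python (assuming a 365-day year and that the planned windows consume the whole downtime budget):

    SECONDS_PER_YEAR = 365 * 24 * 60 * 60
    SLO = 0.99999
    WINDOWS_PER_YEAR = 26   # roughly fortnightly

    annual_budget = SECONDS_PER_YEAR * (1 - SLO)
    per_window = annual_budget / WINDOWS_PER_YEAR
    print(f"annual downtime budget: {annual_budget:.0f}s "
          f"({int(annual_budget // 60)}m {annual_budget % 60:.0f}s)")
    print(f"budget per fortnightly window: {per_window:.1f}s")

    # annual downtime budget: 315s (5m 15s)
    # budget per fortnightly window: 12.1s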

The IBM MQ Operator offers enhanced capabilities over Helm charts when managing the lifecycle of a Native HA queue manager, which can reduce the number of active role swaps. The IBM MQ Operator uses the strategy of performing a rolling update on the two replicas first and the active last. Ensuring the active is the last instance to drop in a rolling update means that there should only be one role swap. Without the IBM MQ Operator cherry-picking the best order, you could be unlucky and have three successive changes of active, which would multiply the downtime.

I'm focusing on the potential availability of a single Native HA queue manager. Native HA can be combined with the scalability and availability of an IBM MQ Uniform cluster to provide dynamic partitioning of messaging traffic and further increase messaging availability.
 

TL;DR

Can a single IBM MQ Native HA queue manager provide 99.999% availability? Yes, it can. 

In fact, for many deployments, no tuning will be necessary to easily achieve this target. With some fine-tuning of the client auto-reconnect interval, readiness probes, etc., an MQ application can be reconnected and resume messaging to a new active in 3 seconds or under.

In messaging environments where availability and scalability are critically important, you should also consider deploying a uniform cluster of Native HA queue managers.