MQ Clusters: Why only two Full Repositories?
Anthony Beardsmore | July 14, 2014
Ever seen the 'best practice' advice to have only two Full Repositories in an MQ cluster and wondered why? This post is for you.
To recap, one of the fundamental components of a cluster is the Full Repository (FR). These are queue managers like any other, but chosen to act as a 'bootstrapping' point for a cluster by holding a complete image of all the queue managers and objects (channels, queues, topics) shared in that cluster. When a new object is defined, the definition is sent up to the FRs. Then, when an application on a partial repository (PR) uses an object for the first time, the information is fetched from an FR and cached at the point of use. In both of these situations two full repositories are contacted – pushed to or fetched from – so that if one is temporarily unavailable, the other should still be able to honour the update or request.
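For illustration, here is a minimal MQSC sketch of that setup, run through runmqsc; every name in it – queue managers FR1, FR2 and PR1, cluster MYCLUS, channel and connection names – is hypothetical, standing in for your own definitions:

    runmqsc FR1
    * Mark this queue manager as a full repository for the cluster
    ALTER QMGR REPOS(MYCLUS)
    * Cluster-receiver channel: how other members connect to FR1
    DEFINE CHANNEL(TO.FR1) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
           CONNAME('fr1.example.com(1414)') CLUSTER(MYCLUS)
    * Manually defined cluster-sender channel to the other FR
    DEFINE CHANNEL(TO.FR2) CHLTYPE(CLUSSDR) TRPTYPE(TCP) +
           CONNAME('fr2.example.com(1414)') CLUSTER(MYCLUS)

(FR2 gets the equivalent definitions the other way around.) On a partial repository – assuming PR1 already has its own CLUSRCVR and a CLUSSDR pointing at one of the FRs – advertising an object is then just a matter of the CLUSTER attribute:

    runmqsc PR1
    * This definition is pushed up to two full repositories
    DEFINE QLOCAL(APP.Q) CLUSTER(MYCLUS)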
For high availability, administrators often feel that more than two FRs might be helpful. So what happens in our cluster if we introduce a 3rd or 4th FR?
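(Mechanically, adding one is trivial – on an existing member of the cluster it is a single command, cluster name hypothetical as before:)

    runmqsc QM3
    * Promote this partial repository to a third full repository
    ALTER QMGR REPOS(MYCLUS)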
When a queue manager advertises an object definition, it picks two of the full repositories to inform, as before (which two is decided by the normal workload balancing algorithm). These, upon receipt, forward the definition on to each other and to any other Full Repositories. It is assumed that object definitions change relatively infrequently – you probably don't have applications defining thousands of queues per second and advertising them to the cluster – so this intercommunication is not a big overhead, and it ensures all FRs stay consistent (they compare sequence numbers on the updates to make sure the latest definition flows to, and is stored at, every FR). It also means that the partial repository, which is more likely to be running on a relatively low-powered host, only needs to start two channels to advertise.
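You can see this shape from the PR side. As a sketch (names hypothetical as before), DISPLAY CLUSQMGR shows which cluster members this queue manager knows about and which of them are full repositories:

    runmqsc PR1
    * QMTYPE(REPOS) marks the full repositories; DEFTYPE shows whether
    * each channel was defined manually (CLUSSDR) or learned automatically
    DISPLAY CLUSQMGR(*) QMTYPE DEFTYPE STATUS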
When a queue manager requests a definition, the process is almost the reverse. The request is sent to two FRs, which should already know about any matching definitions because of the 'advertising' flow above, so there is no need to go and fetch from the other Full Repositories. This is helpful because applications coming and going, and accessing new or different objects, is a much more regular occurrence; as applications move around, definitions are expired from some local PR caches and added to others, and keeping the FRs out of this busier path minimises inter-FR chatter.
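The resulting local cache on a PR can be inspected directly; a sketch, again with hypothetical names:

    runmqsc PR1
    * Show the cluster queues currently cached on this partial repository,
    * including which queue manager hosts each instance
    DISPLAY QCLUSTER(*) CLUSTER CLUSQMGR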
So far, so good. However, there is another piece to the puzzle which is where we run into issues. When we request information about an object, the FRs which we contacted remember that we asked for that information (this is sometimes referred to as a ‘cluster subscription’ – not to be confused with publish/subscribe application subscriptions.) These subscriptions mean that if the object changes, is deleted, or another instance is defined somewhere (for instance a secondary queue for workload balancing), the partial repository gets notified. These subscriptions only exist on the FRs we originally contacted with our request.
So what happens now if we have three or four FRs and two of them become unavailable? As always, any definitions already cached at PRs remain valid and can be used for at least the next 60 days, so from the outset most applications will be fine. However, presumably we added the extra FRs precisely to provide high availability for new and changed definitions during an outage like this.
Because we have more than two FRs, our object definition publications and our request subscriptions will have been balanced across the pool. For some given objects on some PRs, the only subscriptions will be on the two which are now out of action. So although new ‘advertising’ and new requests can be processed perfectly well by the remaining FRs, updates on those particular objects will never be received by the unlucky PRs.
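Worse, routine checks at this point can look perfectly healthy. For example, channel status from an affected PR to the surviving FRs (channel names hypothetical):

    runmqsc PR1
    * Channels to the two surviving full repositories report RUNNING even
    * though subscriptions held only on the failed FRs are effectively lost
    DISPLAY CHSTATUS(TO.FR*) STATUS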
The end result for an administrator tends to be that everything 'seems to be working' until we hit a problem, followed by complete confusion: we can see we have two FRs available (and everyone seems to be talking to them happily), yet some particular changed, added, or deleted object definition has not flowed everywhere we expected. At this point, if the failed FRs cannot be recovered soon, the best option is probably to clear out the cache on the affected PR using the REFRESH CLUSTER command, which forces it to remake its subscriptions against the available FRs – but this is not an ideal situation to be in.
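For reference, that recovery step is issued on the affected partial repository and looks like this (cluster name hypothetical; note that IBM's documentation carries warnings about running REFRESH CLUSTER against a large, busy cluster, so treat it as a considered action rather than a routine one):

    runmqsc PR1
    * Discard the locally cached cluster information and rebuild it, which
    * also remakes this queue manager's subscriptions at the available FRs
    REFRESH CLUSTER(MYCLUS) REPOS(NO)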
Hopefully this clarifies why the recommendation is always to keep to two Full Repositories. As long as the two are kept appropriately separate from each other, they provide a sufficient level of reliability, and thanks to the caching at PRs they do not usually need 'nine nines' availability for the cluster to function smoothly. In the rare situation where that is genuinely not sufficient, options include HA clustering or multi-instance queue managers to increase the availability of the FR queue managers themselves. What should be clear from this post is that you cannot compensate for unreliable systems hosting the full repositories through sheer 'weight of numbers'. If you REALLY want to go ahead with three or more FRs the option remains, but it is important to bear the above in mind when planning your 'outage' response.
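As a sketch of that last option: a multi-instance queue manager runs as an active instance on one server and a standby on another, both sharing the queue manager's data and logs, each started with the -x flag (queue manager name hypothetical):

    # On server A - becomes the active instance
    strmqm -x FR1
    # On server B - becomes the standby, ready to take over if A fails
    strmqm -x FR1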