Avoiding MQ cluster problems after Disaster Recovery testing

By Martin Gompertz posted Wed October 18, 2023 07:11 PM

  

Note (12 May 2022): This document is a replacement for the previous version of "Avoiding MQ cluster problems after Disaster Recovery testing".


These notes are intended for people who are planning Disaster Recovery testing with MQ clusters. Such tests must be planned carefully, especially when your system includes MQ clusters. I will not recommend any particular DR design. Instead, I will explain some of the implementation details within MQ clustering that you need to be aware of in order to avoid known problems.

EDIT Oct 2023: In the years since I first wrote this, DR solutions using synchronous replication have become more common. Now if you see the term 'DR', it is more likely that this is something achieved for you by subsystems and layers of code doing synchronous replication. This document might then not be as interesting for you. But it might still repay the effort to read and understand how MQ's clustering state is stored. If at any point in your DR testing you return to disk snapshots even just a few hours old, then what I have written here is probably still for you.

Last update Oct 2023. I wrote the original version in February 2019.

Introduction

You are planning for a Disaster Recovery (DR) test. Your Production network has a cluster of queue managers, serving one or more applications. Your DR test might attempt to replicate equivalents of some or all of these, for a short time, then delete them afterwards.

The MQ cluster internal state data has various timestamps and sequence numbers, which are not normally of concern to an MQ administrator. This document will discuss some of these, in order to explain why care is needed.

Can you give me an executive summary, before we start?

During your DR test, your DR queue managers can send unwanted updates to Production Full Repositories. This can lead to serious problems in Production later. You should therefore consider how to isolate the queue managers that will run in your DR site during your test, to prevent them making unwanted changes to your Production site.

However, in the real world, accidents do happen. If your DR test does change state in your Production site, the damage can be overcome by careful use of the REFRESH CLUSTER command, but you must understand the risks and plan your response.

Background: Updates and sequence numbers

First, some background to help understand sequence numbers.

MQ cluster queue managers communicate with each other using command messages that flow over the same CLUSSDR-CLUSRCVR channels used for application messages. These messages are put by a queue manager to the correct cluster transmission queue and are moved by your normal cluster channels to their intended destination queue manager.

In the simplest configuration this queue is SYSTEM.CLUSTER.TRANSMIT.QUEUE.

These command messages contain updated records for clustered queues, topics and queue managers (represented by their CLUSRCVR channel definitions).

Here is an example from normal operations.

Using the runmqsc command, you make an update to the attributes of a clustered queue. The changed queue record is then sent automatically to the Full Repositories for that cluster. At the same time, the internal sequence number within the queue record will be increased by 1, to provide for sequence number validation checks. Sequence number checks ensure that old information (for example, an update that was delayed by network outages) is ignored, while new updates are accepted. From the Full Repositories the update will then be forwarded to any other interested queue managers.
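
For illustration, here is a minimal sketch of such a change, using the names Q1 and QMA that also appear in the AMQ9456 example later in this document, and assuming Q1 was defined with a CLUSTER attribute:

runmqsc QMA
  * Q1 is a clustered queue (it was defined with CLUSTER set), so this ALTER
  * causes an updated queue record, with its sequence number incremented by 1,
  * to be sent to the Full Repositories for that cluster.
  ALTER QLOCAL(Q1) DESCR('New description')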

Background: Restoring a cluster queue manager from backup

Now, let’s think about a backup/restore scenario, and how sequence numbers behave in this scenario.

Imagine the case where you are recovering from a serious failure (hardware, software, administrative) by restoring your queue manager from a file-level backup taken last week. You achieve the file-level restore successfully, and start the queue manager successfully.

All applications start again, and begin to work fine.

The queue manager you just restored hosts a clustered queue. You (or a member of your MQ administration team) had updated the queue a few times during the week since the backup, to change its description and some other attributes.

(Changing the description might appear to be a trivial operation, but like other changes to the cluster queue attributes, an update will be sent to the Full Repositories, together with an increment in the internal sequence number for the queue record. Note that changing an attribute and then returning the attribute to its previous value will result in 2 increments to the sequence number).

For the sake of our example, let’s say the administrator had made 6 updates to the queue in the week since the backup. Therefore, this restored queue manager believes that the queue sequence number is 14 (its value a week before), but the rest of the cluster believes that its sequence number is 20 (its value just before the queue manager was “lost” due to system failure).

So, an “old” version of your queue manager has been introduced into your cluster network.

As I said above, all applications appear to work fine, for now.

However, its sequence number for the queue it hosts is out of date compared with the rest of the cluster.

How can I overcome this problem?

You must run REFRESH CLUSTER on this queue manager. Doing this will update all the sequence numbers on the clustered queues and CLUSRCVR channels owned by the queue manager.
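
As a sketch, using the queue manager and cluster names from the AMQ9456 example later in this document, and assuming QMA is not itself a Full Repository:

runmqsc QMA
  * Give new sequence numbers to all of QMA's clustered objects and
  * re-publish them to the Full Repositories. REPOS(NO) is the default
  * and is the correct choice for a partial repository queue manager.
  REFRESH CLUSTER(MYCLUS) REPOS(NO)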

I will say a bit more about how this works, in a moment.

Background: But what happens if I forget to run REFRESH CLUSTER?

Several days go by, and you do not run REFRESH CLUSTER on the restored queue manager.

The time comes when our restored queue manager must re-publish the existence of its queue into the cluster (this takes place automatically, normally every 27 days).

It re-publishes a record for the queue, by sending a message to the Full Repositories.

The sequence number it includes within this internal re-publish message is 15 (incremented by 1 over its previous value). However, this is still five less than the other members of the cluster think is the current number (that is, 20).

So, when the re-publish update is received at the Full Repositories, they discard it, believing it to be an old delayed update. They do not forward the re-publish update to the other interested queue managers.

Over the next days and weeks, Full Repositories, and other interested queue managers that know about the queue, will notice that this queue has not been re-published. (It was in fact re-published by its owning queue manager, as we know, but with a lower sequence number than they believe is the current sequence number).

Eventually the other queue managers in the cluster will discard knowledge of the queue, with potentially severe consequences for their putting applications (I will go into a bit more detail about this, in a moment).

In the meantime, an update you make to the queue (even deleting the queue!) will be ignored by the other members of the cluster.

Background: Why do the queue managers discard knowledge of a queue?

All MQ cluster queue managers have a policy that they will discard knowledge of remote cluster queues if they are not re-published within 60 days (roughly two months) after their expiry time.

If applications are using those queues, then you will see error messages on the queue managers hosting those applications, to alert you to the problem. An example of one of these is included a little lower down this document.

The owning queue manager sets an Expiry Time on each of its clustered queues, 30 days after the queue's creation or last update. A separate Expiry Time is stored for each clustered queue.

Full Repositories and other interested queue managers learn about this queue manager’s clustered queues, and remember the Expiry Time it has set for each.

Let’s call the day of creation or last-update “day 1”.

At day 27, the owning queue manager re-publishes the queue to the Full Repositories, with an incremented sequence number. Normally, the Full Repositories immediately forward the same update to all other interested queue managers. The interested queue managers therefore normally receive the update within a few seconds.

So, in normal circumstances, all queue managers receive regular updates for each queue. And the sequence number increases by one, every 27 days.

But what about the case of our restored queue manager? The Full Repositories discarded its update, believing it to be an old delayed update. So, at day 27, none of the interested queue managers receive the update. Let's now return to this problem case.

At day 30, when no re-publish update has been received at a queue manager whose applications have recently put messages to the queue, the queue manager will start to complain with AMQ9456 error messages in its error log. Here is an example of what you would see:

AMQ9456
MESSAGE:
Update not received for queue Q1, queue manager QMA from
full repository for cluster MYCLUS.
EXPLANATION:
The repository manager detected a cluster queue that had been used sometime in
the last 30 days for which updated information should have been sent from a
full repository. However, this has not occurred.
The repository manager will keep the information about this queue for a further
60 days from when the error first occurred.
ACTION:
There are several possible responses:
1) There is a long-running problem with the local queue manager’s CLUSRCVR in
cluster MYCLUS. If this is true, then correct the problem urgently, to
ensure that updates for the cluster are received.
2) There is a long-running problem on the remote queue manager’s CLUSSDR in
cluster MYCLUS. If this is true, then correct the problem urgently, to
ensure that updates for the cluster are sent.
3) Check that the repository manager on the remote queue manager has not ended
abnormally.
4) The remote queue manager is out of step with this queue manager, potentially
due to a restore of the remote queue manager from a backup. The remote queue
manager must issue REFRESH CLUSTER to synchronize with other queue managers in
the cluster.
5) The remote queue manager is out of step with this queue manager, potentially
due to a disaster recovery exercise in which a replacement queue manager with
the same CLUSRCVR channel name was created, was run for a while, then ended.
If this has happened, then the remote queue manager QMA must now issue
REFRESH CLUSTER to synchronize with other queue managers in the cluster.
6) If the above items have been checked, and this problem persists over several
days (causing repeats of this error message in the local queue manager’s error
logs) then contact your IBM support center.

I want to emphasize: the above error message is saying there is a serious problem!

If you see these messages within a month of restoring an old queue manager from backup, or running a DR exercise, then you have probably suffered this problem scenario, or a variation of it.

If I see AMQ9456 errors, have I suffered this backup/restore problem?

Not necessarily.

There are also other problem scenarios that might have caused the error. These are listed in the error message. I'll pick out two examples that could be easy to fix:

  • Cluster channels from the owning queue manager to the Full Repositories were down (and remained down for days) when the update was sent, so the update is still sitting on the owning queue manager’s transmission queue.
  • Cluster channels from the Full Repositories to one of the interested queue managers were down and remained down for days. If this happened for the 3 days between day 27 and day 30, then the update would be sitting on the Full Repositories’ transmission queues.


In either of these two examples, the interested queue manager would start to write AMQ9456 error messages, but the error would clear when you got the channels running (for example by fixing a network problem).
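
If you suspect one of these simpler causes, a quick check like the following sketch (assuming the default single cluster transmission queue) will show whether channels are down and whether messages are waiting to be sent:

runmqsc QMA
  * Are the cluster channels running?
  DISPLAY CHSTATUS(*) STATUS
  * Are messages (including cluster command messages) building up, waiting to be sent?
  DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH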

So, this error message is certainly not always a sign of a deep-seated problem.

See the numbered list of other possible known reasons, mentioned in the AMQ9456 error message.

What happens if I don’t notice these error messages, and REFRESH CLUSTER is not run?

We return to considering the problem caused by the backup/restore of an old queue manager without the necessary REFRESH CLUSTER being run.

Let’s assume that nothing is done to correct this situation.

60 more days go by, during which time the AMQ9456 message is written once per hour.

I hope you would have something in place to alert you to errors, so that they are not missed. But in the real world, accidents do happen, so I need to continue to tell you what happens next.

At this point, 60 days after the queue’s Expiry Time, because no acceptable re-publish update was received, the queue is marked as “deleted” in the cluster caches of the Full Repositories and the other interested queue managers.

Effectively the Full Repositories and other interested queue managers have discarded their record of the queue.

At that point, even changing the queue’s attributes on the owning queue manager does not help. This is because the single increment to the sequence number is not enough to make it greater than 20.

(As an aside, it WOULD be sufficient in this example of mine, to change the queue’s description a few times to bring the sequence number to 21. But note that small sequence numbers such as 14 and 20 that I have used for the sake of simplifying my example are not seen on queue managers created since MQ v8, which was released in 2014. This is because MQ was changed in that release to use a Unix Epoch Time from near the time of queue manager creation, or the time of the most recent REFRESH CLUSTER command, for the sequence numbers of any cluster queue object, even upon first creation of the object. At the time of writing, Unix Epoch Times have recently passed 1.55 billion).

The remotely defined clustered queue just became “invisible” to the Full Repositories and all other interested queue managers.

What happens to the applications?

At this moment, the risks to your Production applications using that remote queue will be severe. One of the following reason codes might be returned from MQOPEN, MQPUT or MQPUT1:

  • 2085 MQRC_UNKNOWN_OBJECT_NAME
  • 2041 MQRC_OBJECT_CHANGED
  • 2082 MQRC_UNKNOWN_ALIAS_BASE_Q
  • 2270 MQRC_NO_DESTINATIONS_AVAILABLE


The situation will not improve by restarting the applications, because even the Full Repositories are under the false impression that the queue does not exist. So, when the queue managers hosting these applications send their queries to the Full Repositories, the Full Repositories will reply that the queue does not exist.

(This, of course, assumes that there are no other instances of your queue name elsewhere in the cluster. If there are other instances of the queue name, hosted on other queue managers, then they will continue to be chosen as destinations, and your putter applications will not see an error. On the assumption that those other queue managers were not restored from backup at about the same time, they will not be susceptible to the same issue. But you will have “lost” one of the instances of the queue, which would cause other downstream issues such as increased load on the other queue managers, their applications or their network connections).

Again, if this situation has happened to you, then running REFRESH CLUSTER on the restored queue manager is the way out of the problem.

Why does REFRESH CLUSTER help resolve this problem?

There’s quite a lot of processing within REFRESH CLUSTER, but the most important part for our consideration is the update to the sequence numbers.

REFRESH CLUSTER updates all the sequence numbers on the clustered queues, topics and CLUSRCVR channels owned by the queue manager where you run the command.

The sequence numbers will be set to the current Unix Epoch time, as given by the local system clock. The Unix Epoch time is the number of seconds since 00:00:00 on 1 January 1970 UTC. So, this number increases once per second, forever.
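
For illustration, on a Unix-like system you can print the value that would be used right now:

# Current Unix Epoch time: seconds since 00:00:00 UTC on 1 January 1970
date +%s

When the original version of this document was written in early 2019, this value had just passed 1.55 billion, as mentioned later in this document.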

When the queues and CLUSRCVR channels have their new sequence number, they are re-published immediately to the Full Repositories.

The new sequence number (now set to the current Unix Epoch time) will be higher than the sequence number the other queue managers have stored, so they will not discard these updates.

So, by running REFRESH CLUSTER on the restored queue manager, you will have allowed its updates to be accepted again.

This document is about DR testing. So far, you’ve only talked about backup/restore.

That’s right. But it is highly relevant background information, which I hope will soon become clear.

Let’s now focus on DR, specifically.

DR scenarios typically include one or both of these two elements:

  • Restore of each queue manager from a backup, as a first step.
  • Creation of a new copy of each queue manager, using similar scripts to those used to create the Production site queue managers.


Also, your plans for when DR is used “for real” might possibly include:

  • DR is online with connectivity to the normal Production site, and DR is now hosting only a critical subset of the services.
  • DR is separate from the Production site (maybe assuming the Production site is completely lost and off the network).
  • Or some other scheme. Remember, this is not a full guide to designing your DR approach.


Lastly, approaches will vary over the naming of queue managers and queues. The DR site might use:

  • The same names for queue managers, CLUSRCVR CONNAMEs, CLUSRCVR channel names and queue names, as Production.
  • Different names. For example, by embedding the letter P in the production queue manager, channel and queue names. And using the letters DR in the disaster recovery site equivalents.


Using different CLUSRCVR channel names (and queue manager names, and maybe even queue names!) for everything in your DR site will reduce risk if you cannot isolate the DR site at a network level.

Also, just be aware: although you may intend to use different names, you are at greatest risk in these situations:

  • After you restore file-level backups, or
  • When you are using the same runmqsc scripts as Production to populate new DR queue managers, or
  • When you intended to use different channel names, CONNAMEs and so on in DR, but a mistake was made when preparing your DR runmqsc scripts, so that they unintentionally use (some of) the same names as Production.


In these scenarios, you might run a queue manager in your DR site with CONNAMEs and channel names the same as Production, even for a few moments, when this was not your intention.

If that happens, the DR queue managers will probably send information to Production queue managers that you did not intend.

Background: DR queue managers can send unwanted updates to Production Full Repositories

I’ll need to paint another specific problem scenario, for the sake of example.

Maybe you want to keep Production working throughout your DR test, and cannot isolate the DR site from your Production network. Maybe also your DR queue managers will be restored from file-level backup, and will use the same CLUSRCVR channel names as in Production.

Caution: This is a situation where you are in some danger, and it would be preferable to avoid it by isolating your DR site from Production, or using different names. But let’s assume you cannot do that, or a mistake happens.

This scenario leads to problems, particularly when you have run REFRESH CLUSTER in DR, thus giving objects high sequence numbers. (Remember, running this command is the right thing to do when restoring from backup, assuming the DR network is separate, or the names are different from Production! So, it is to be expected that this command will be run during a DR test, and when doing DR for real).

But in our scenario here, a DR queue manager where you have run REFRESH CLUSTER will send update messages with new sequence numbers to the Production Full Repositories. And the Production Full Repositories cannot tell that you did not intend those updates to be made.

Your DR queue manager has caused an update to the sequence numbers on some records held by the Production Full Repositories (and other interested Production queue managers), which you did not intend to happen.

After your DR test has ended, you delete all the DR queue managers.

And the Production queue manager continues to run, unaware of what the DR copy of itself has done.

That Production cluster queue manager will now suffer the sequence number problem I have already described for the simple backup/restore case. Its updates about itself and its queues will be ignored by the Full Repositories. You then see the AMQ9456 errors from other Production queue managers after 30 days, and all the rest.

Will your DR site have network connectivity to your Production site, during the test?

You should plan how you will avoid the DR queue managers sending new information into the Production queue managers that updates the sequence numbers on objects.

The “cleanest” solution is that the DR and Production networks are isolated from each other.

That is, no route between them, enforced by the IP routing tables.

Then you can rest in the knowledge that it is not possible for the DR queue managers to send anything to the Production queue managers.

But if a mistake happens in the networking setup, and a DR queue manager with the same identity did contact the Production Full Repositories and updated a sequence number, then you have a problem to solve.

To correct this problem, you would need to:

  • End the DR copy of your queue manager, and prevent it from restarting.
  • Run REFRESH CLUSTER on the Production queue manager, to give its objects new sequence numbers (overriding those inserted by the DR copy of the queue manager), so that it may continue to work within the cluster. (See the sketch after this list.)
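
A minimal sketch of those two steps, reusing the illustrative names QMA and MYCLUS, and assuming the DR copy runs on a separate host:

# On the DR host: end the DR copy immediately, and do not allow it to restart
endmqm -i QMA

# On the Production host: give the genuine objects new, higher sequence numbers
runmqsc QMA
  REFRESH CLUSTER(MYCLUS) REPOS(NO)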

Will you end your Production queue managers while performing the DR test?

Maybe you cannot isolate your DR network from Production.

If in your business you can end all your Production queue managers for the duration of the DR test, this is a simple (and very effective!) way to ensure they receive no updates from your DR queue managers.

Then, when you restart the unaltered Production queue managers (having first ended all of your DR queue managers, of course!) they will continue exactly as before the DR test.
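
A sketch of that approach for one Production queue manager (QMA is illustrative; repeat for each queue manager):

# Before the DR test: end the Production queue manager cleanly,
# waiting for applications and channels to finish
endmqm -w QMA

# After the DR test, once every DR queue manager has been ended and deleted:
# restart the unaltered Production queue manager
strmqm QMA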

What QMID will be used by your queue managers in DR?

The QMID is an attribute that matters when queue managers are restored.

You may have heard of the QMID before. It is mentioned in the output from the following runmqsc commands:

  • DISPLAY QMGR
  • DISPLAY CLUSQMGR
  • DISPLAY QCLUSTER


The QMID is generated when you run crtmqm, and it stays constant within the QMGR attributes forever.

Therefore, a queue manager that is restored from a file-level backup has the same QMID as the time crtmqm was run, to create that queue manager originally.
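
You can easily check and compare QMIDs. A sketch, using runmqsc on the illustrative queue manager QMA (the QMID value in the comment is only an example of the usual format):

runmqsc QMA
  * The local queue manager's own QMID, typically of the form
  * QMA_2019-02-14_10.53.22 (name plus creation timestamp)
  DISPLAY QMGR QMID
  * The QMIDs this queue manager has recorded for other cluster members
  DISPLAY CLUSQMGR(*) QMID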

So, what QMID will your DR queue managers have?

  • If your DR queue managers are file-level restores from Production backups, then they have the same QMIDs as in Production. And, until you alter them, they will have the same CLUSRCVR channel names.
  • If you ran new crtmqm commands to create your DR queue managers, they will have new QMIDs. If you run the same runmqsc scripts to create CLUSRCVR channels etc. then the channel names will also be the same.
  • If you give your queue managers and CLUSRCVR channels different names in DR, then you can skip all further mentions of the QMID, as it is not relevant to your situation.


Here are three backup/restore approaches:

Backup/restore at the file level. You use operating system “copy” commands to copy the files. For data integrity the queue manager must be ended while the file-level backup is taken. When restored to a different machine in your DR site, a queue manager reinstated in this way will have the same name and the same QMID as the one in your Production site.

Backup/restore of definitions, added to a recreated queue manager. You keep daily (or maybe weekly, or monthly) dumps of the queue manager definitions using dmpmqcfg. To reinstate the queue manager, you run crtmqm in your DR site with the same parameters as the original queue manager, and then use runmqsc to add the object definitions to the newly created queue manager. When recreated in this way, the queue manager will have the same name but a different QMID compared to your Production site.

“Backup queue manager”. This is a particular method of creating and maintaining a DR queue manager, relying on linear logging, and is documented in the Knowledge Center. The method used to create one of these means that it is equivalent to the “Backup/restore at the file level” section above. So, a queue manager reinstated in this way will have the same name and the same QMID as the one in your Production site.
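
As a sketch of the second approach, the commands might look like this; the file name is illustrative, and the crtmqm options must match those used to create the original Production queue manager:

# On Production, taken regularly: dump all object definitions in MQSC format
dmpmqcfg -m QMA -a > QMA.defs.mqsc

# In the DR site: recreate the queue manager (new QMID), start it,
# then replay the saved definitions into it
crtmqm QMA
strmqm QMA
runmqsc QMA < QMA.defs.mqsc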

Why does it matter whether the QMID is the same in DR?

Let’s describe a scenario where the QMID matters, in the context of DR tests.

Maybe you want to keep Production working throughout your DR test, and cannot isolate the DR site from your Production network.

Maybe you also want to use the same CLUSRCVR channel names in DR as are used in Production.

Caution: Just to say it again, this is a situation where you are in some danger, and it would be preferable to avoid it by isolating your DR site from Production, or use different names. But let’s assume you cannot do that, or a mistake happens.

If there was connectivity from DR to Production, a queue manager’s CLUSSDR channel running in DR could connect to the listener of a Full Repository queue manager in Production and send cluster state update messages to it via its CLUSRCVR channel.

The precise behavior on the Full Repository is quite different, depending upon whether or not your DR queue manager has the same QMID as the same-named queue manager in Production.

As I mentioned above, the QMID of the DR queue manager depends upon how you created or reinstated it.

Here’s what happens in the two cases.

If the QMID in DR is the same. You’ve reinstated the queue manager in DR from file-level backup, so its QMID is the same as Production.

When this DR queue manager sends updates to the Full Repositories about itself, or about its queues, it will be using the same QMID and the same CLUSRCVR channel name (I’m assuming you didn’t change it, remember) as the Production queue manager of the same name.

The Full Repository will treat the updates as though they came from the Production queue manager of that QMID. If they have higher sequence numbers (for example, after you run REFRESH CLUSTER in DR), these updates will override your normal Production objects.

(Why doesn’t the Full Repository inspect the source IP address to distinguish that this is a different queue manager? It does not do this, by design. The Full Repository ignores the source IP address, as it assumes you might at any time validly want to move your queue manager to new hardware, or replacement hardware, maybe even with a different IP address).

The problems that follow from this same-QMID scenario are fully discussed above, already.

If the QMID in DR is different. You’ve created a new queue manager in DR using a fresh crtmqm command, so its QMID is different from the same-named queue manager in Production.

When this DR queue manager sends updates to the Full Repositories about itself, or about its queues, it will be using the new QMID, but still the same CLUSRCVR channel name to identify itself in the MQ network.

The Full Repositories will notice the new QMID, and will treat the new queue manager separately from the “old” version of the queue manager. In fact, they (and all the other interested queue managers in the cluster) will store entirely separate records for the new queue manager and its queues.

Helpfully, the sequence numbers held in the records for the newer queue manager do not override the sequence numbers in the similar records for the older one. This is true because they are separate records.

Separate QMID, separate sequence numbers. So, no confusion, right?

Well, yes, there is some confusion. It is never a good idea to try to have two queue managers of the same name in your networks. It only leads to misunderstandings, for people and applications! Seriously, don’t try to do it.

However, MQ’s cluster cache management routines do hold separate records relating to queue managers with the same name but different QMID, to allow for genuine cases where a replacement queue manager is introduced following (say) hardware failure.

But within any network of queue managers, there can be only one CLUSRCVR channel in use with a particular name.

You might have thought that uniqueness of the channel name was a secondary concern.

But this restriction is necessary, to avoid ambiguity on any queue manager that wants to send a message.

(The sending queue managers use CLUSSDR channels for this, of course. And the CLUSSDR channel has a name. And that channel name must uniquely identify a recipient, from the perspective of the sending queue manager. Imagine what confusion there would be if we were able to have two channels of the same name, known throughout the cluster, hosted on two separate queue managers at the same time! In short, the mechanism for sending messages via cluster transmission queues is what drives the need for uniqueness of the CLUSRCVR channel name).

So, if your DR queue manager has a different QMID, but is using the same CLUSRCVR channel names as Production, then the more recent channel record is marked as “in use” by the Full Repositories. Similarly, the channel for the previous QMID is marked “not in use”. It is kept for a while without being deleted, but is not used for anything.

The Full Repositories and other queue managers that know about your queue manager (possibly the whole of the cluster!) will immediately be told about the newer “in use” definition for this channel name / queue manager, and will mark the older one “not in use”.

What problems happen after this?

The queue managers in the Production cluster now think that the DR queue manager, with its separate QMID, is the one “in use”.

That is, they think it has taken over from the previous QMID as the genuine owner of the CLUSRCVR channel it has just sent to the Full Repositories.

The new QMID from your DR test – and, importantly, its “in use” status for the CLUSRCVR channel name – now persist in your Production cluster! This was not what you intended.

The precise effects from this would depend on your configuration, but Production problems are inevitable.

(The problems might even be short-lived, if the Production queue manager continues to run, and its channel re-asserts itself to the Full Repositories as the “in use” one. If a CLUSRCVR channel re-asserting itself seems odd behavior, remember that this mechanism is meant for a takeover scenario where the previous queue manager or its hardware is permanently lost, and therefore it cannot re-assert itself!)

If you have the same queues defined in Production, then application messages from Production will go to your DR site. That’s bad, because (I would assume) you don’t have your normal Production applications working in DR to receive those messages and process them.

If you don’t have the same queues defined in Production, then application messages might suddenly have nowhere to go, and either:

  • will end up on Dead Letter Queues, or
  • your applications will begin to fail in their MQOPEN, MQPUT1 or MQPUT calls.


Either of these scenarios is very bad for your Production applications!

Recovering from this situation would not necessarily require a REFRESH CLUSTER command to be run on the normal Production queue manager, though this would indeed solve it.

First, though, you must shut down that DR queue manager that has inserted itself and its new QMID into the Production cluster.

Then, you can visit the normal Production queue manager, and alter its CLUSRCVR definitions (for example, simply to alter their Description attributes). Doing this will make the Production queue manager re-assert its channels as the “in use” ones.
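
For example (a sketch; the channel name TO.QMA is illustrative, and CHLTYPE must be specified on ALTER CHANNEL):

runmqsc QMA
  * Touching the CLUSRCVR causes the Production queue manager to re-publish it,
  * re-asserting it as the "in use" channel with the genuine Production QMID
  ALTER CHANNEL(TO.QMA) CHLTYPE(CLUSRCVR) DESCR('Re-asserted after DR test')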

Just to repeat, you must shut down the DR copy. If you were to leave the DR copy of the queue manager running, it would probably re-assert itself at some point in the near future, and give the same problems all over again.

Running a full REFRESH CLUSTER on the Production queue manager is not needed in this precise scenario, but I would still recommend running it, to ensure all the genuine objects in Production are given new sequence numbers.

Lastly, if you wanted to remove all remaining information about the DR queue manager that inserted records with its new QMID, you can visit the Full Repositories and use the RESET CLUSTER command. (Note: this is very different from the REFRESH CLUSTER command; don’t get them confused!)
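
A sketch of that cleanup, run on one of the Full Repositories (the queue manager name FR1 and the QMID value are illustrative); because the two queue managers share the same name, you identify the unwanted one by its QMID rather than by QMNAME:

runmqsc FR1
  * Force-remove the DR queue manager's records, including its queues,
  * identifying it by the QMID created during the DR test
  RESET CLUSTER(MYCLUS) QMID('QMA_2023-10-01_09.00.00') ACTION(FORCEREMOVE) QUEUES(YES)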

Are there other times I would need to use REFRESH CLUSTER?

In this document I might not have described your situation precisely. But hopefully you now understand the principles of the MQ implementation well enough to understand what is happening to sequence numbers in your DR test.

If, after your DR test you are using an “old” cluster queue manager in Production, compared to what the rest of the queue managers expect, you are now aware that it might be failing in its attempt to send its regular updates to the Full Repositories.

If its records have sequence numbers that are “old” compared to those held by the rest of the cluster, then the problem exists.

In such a situation, running REFRESH CLUSTER is necessary, to cause your Production queue manager to re-assert itself as the correct and current version.

I heard there are downsides to running REFRESH CLUSTER

Do read the sections of the Knowledge Center that describe the REFRESH CLUSTER command. Some of the pitfalls are mentioned there.

REFRESH CLUSTER causes extra processing on the machine where it runs, and on the cluster Full Repositories, which can make temporary application errors and timeouts more likely for a while after running it.

A section has been added to the Knowledge Center recently, describing failure return codes that applications might receive while it runs. Search for “Application issues seen when running REFRESH CLUSTER”.

However, if your queue manager has old sequence numbers, you must run REFRESH CLUSTER, as there is no alternative.

If the command is not run, your queue manager and its queues will fade away from the cluster, as I have described.

Is there an alternative to running REFRESH CLUSTER?

REFRESH CLUSTER is the simplest way to fix these issues, but it does come with some risk, as I have said.  You might want to find out if your configuration can be corrected without resorting to that command.

Remember the simple scenario I painted at the beginning, in which:

  • We made six changes to one clustered queue that then flowed to the Full Repositories, causing them to store sequence number 20 for that queue.
  • We restored the queue manager from backup, causing it to have a sequence number of 14.
  • Future automatic re-publications of the queue by its owning queue manager, on its 27-day schedule, were ignored at the Full Repositories because the sequence number was less than 20.
  • Other queue managers in the cluster began to complain with AMQ9456 error messages, saying the queue was not being re-published.


In this scenario there is a workaround that is focused on just the one queue where you have the problem.

The workaround is to change the Description on the clustered queue, on the queue manager that hosts it.  In order to be effective, you would have to change the Description as many times as needed to get its sequence number (in the example scenario) to 21.

This workaround could be very much quicker and easier than running REFRESH CLUSTER.  But on the other hand, you are responsible for working out whether you have found all queues that might have been affected by the backup/restore.  At the time of writing, the only way to discover internal MQ clustering sequence numbers is to use non-publicized internal MQ debug tools.
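
In the example scenario, the workaround might look like the following sketch on the owning queue manager; each ALTER increments the queue's internal sequence number by 1, so seven alterations take it from 14 past 20, to 21:

runmqsc QMA
  ALTER QLOCAL(Q1) DESCR('bump 1')
  ALTER QLOCAL(Q1) DESCR('bump 2')
  ALTER QLOCAL(Q1) DESCR('bump 3')
  ALTER QLOCAL(Q1) DESCR('bump 4')
  ALTER QLOCAL(Q1) DESCR('bump 5')
  ALTER QLOCAL(Q1) DESCR('bump 6')
  * Final change restores the original description text
  ALTER QLOCAL(Q1) DESCR('Original description')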

Are there any other techniques for reducing risk?

The one that springs to mind is to use the -ns switch on the strmqm command.

The description of this switch, from the strmqm Knowledge Center page at time of writing, is:

Prevents any of the following processes from starting automatically when the queue manager starts:

  • The channel initiator
  • The command server
  • Listeners
  • Services


This parameter also runs the queue manager as if the CONNAUTH attribute is blank, regardless of its current value. This allows unauthenticated access to the queue manager for locally bound applications; client applications cannot connect because there are no listeners. Administrative changes must be made by using runmqsc because the command server is not running.

I would advise using this form of the command whenever you want to perform administration on a queue manager just after restoring it from backup, during which time you do not want it to start outbound or inbound channels.
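
A sketch of that sequence, for the illustrative queue manager QMA:

# Start the restored queue manager with no channel initiator, command server,
# listeners or services, so no channels can start in either direction
strmqm -ns QMA

# Make your administrative changes locally with runmqsc
# (for example, REFRESH CLUSTER as described earlier)
runmqsc QMA

# When finished, end the queue manager and restart it normally
endmqm -w QMA
strmqm QMA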

Footnote: topics and channels

The things I have said about sequence numbers on queues are applicable to other objects on which sequence numbers are stored. In particular: topics and CLUSRCVR channels.

Topics and CLUSRCVR channel definitions also have Expiry Times, which operate similarly to queues.

Within this document I did not want to say “or topics, or CLUSRCVR channels” all the time, as this would have made everything much more difficult to read.

Summary

Preventing network connectivity from DR to Production is an ideal way to avoid any risk.

However, network connectivity might be needed, for reasons you cannot avoid. Or making the needed changes to your networking might be difficult (or expensive) to achieve. Still, it might be worth the cost, to avoid problems in Production.

Another way of removing risk is to end all your Production queue managers during your DR test, if your business will allow this.

Yet another way of reducing risk is to use different names for queues, channels, etc. (In fact, using different names for different sites that interconnect is a good idea in any circumstances).

As a last resort, if mistakes are made during your DR test, allowing updates to be made to your Production queue managers, then the REFRESH CLUSTER command issued to the right queue manager(s) after the test is complete can recover a good state.



Comments

Mon October 23, 2023 11:53 AM

Thank you Andy for taking the time to comment. I made an edit near the top to attract attention to the important point you raise.

Sat October 21, 2023 06:15 AM

Martin,

In these notes you seem to assume that DR testing involves recovering queue managers from backups. I think it would be sensible to point out that none of these issues should arise in the case of synchronous replication to a DR site.

There are now several candidates for synchronous DR replication, ranging from independent synchronous replication by the storage subsystem, through RDQM (which is essentially a flavour of the above) to native HA (not currently deployable in many scenarios).

I agree that the vast majority of DR scenarios will currently involve asynchronous replication of some form (e.g. backup/restore) but I think it would be sensible to at least mention the alternative.

best wishes

Andy.