MQ Poison Messages & Backout Queues
In this post Connor Smith and I give a brief overview of poison messages, how the various IBM MQ clients handle them, and what you need to do as an application writer.
A poison message is a message that a receiving application is unable to process. Some possible reasons for this are:
- The IBM MQ client code itself fails to handle the message due to e.g. a bad header in the message.
- The message causes an error in the underlying system e.g. in the XML parser.
- The message has bad data in it which the application is not expecting and causes an exception in the application.
The above examples show that the error caused by a poison message is not necessarily something that the application itself should have foreseen.
Once a message has been found to be poisoned, the next question is what to do about it. If the message is GOT from the queue within a transaction, and the failure happens while the message is being processed within that same transaction, then the message is automatically rolled back onto the queue and is available to be re-processed.
The figure above shows the result of just such a transaction rollback. The message is removed from the queue, processed until an exception occurs, and then rolled back onto the queue, ready to be read again. There are a number of caveats to this method:
- The message must be read from the queue under the transaction.
- The message must be processed under that transaction.
- The exception that causes the message to fail processing must not be caught. If the exception is caught then it is up to the coder to figure out what to do next (options include throwing another exception, which causes the message to roll back!).
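The third caveat is the one that tends to bite, so here is a minimal sketch of it. It uses a toy in-memory queue rather than any real MQ API, and the names TxQueue, swallowing_handler and rethrowing_handler are made up for the illustration. The point is only that a swallowed exception looks like success to the transaction, while a re-thrown one triggers the rollback.

```python
from collections import deque


class TxQueue:
    """Toy in-memory stand-in for a queue read under a transaction (not an MQ API)."""

    def __init__(self, messages):
        self.messages = deque(messages)

    def consume(self, handler):
        """GET one message, run the handler, commit on success, roll back on an exception."""
        message = self.messages.popleft()          # GET under the transaction
        try:
            handler(message)                       # process under the same transaction
        except Exception:
            self.messages.appendleft(message)      # rollback: the message returns to the queue
            return "rolled back"
        return "committed"                         # commit: the message is gone for good


def swallowing_handler(message):
    try:
        raise ValueError("bad data in " + message)
    except ValueError:
        pass  # caught and ignored, so the transaction sees no failure and commits


def rethrowing_handler(message):
    try:
        raise ValueError("bad data in " + message)
    except ValueError as err:
        # Log or clean up here, then raise again so the transaction rolls back.
        raise RuntimeError("processing failed") from err


q1 = TxQueue(["order-1"])
print(q1.consume(swallowing_handler), len(q1.messages))   # committed 0
q2 = TxQueue(["order-1"])
print(q2.consume(rethrowing_handler), len(q2.messages))   # rolled back 1
```

In the first case the bad message has silently disappeared; in the second it is back on the queue, ready to be read again.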
This mechanism works fine if we have a failure that won’t happen again. However, if the message is poisoned such that it fails every time it is read from the queue, then we need a way to stop processing the message and deal with it differently. This could be handled by a try-catch block in some circumstances, but let’s assume that we failed to put those in correctly, or have chosen (or been forced) to use other methods.
Backout Count and Thresholds
Each time a message is backed out to the queue, the BackoutCount field in the message descriptor (MQMD) is incremented by MQ itself. This means that the next time the message is GOT from the queue, its BackoutCount will be greater than zero. The next thing to consider is what to do with the knowledge that a newly received message has already been processed unsuccessfully and rolled back.
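To make the counting concrete, here is a small sketch that models it with a plain Python list as the queue. The Message class, get_and_process and always_fails are hypothetical names for this illustration; in real MQ the queue manager maintains BackoutCount and the application only reads it.

```python
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    backout_count: int = 0   # models the MQMD BackoutCount field


def get_and_process(queue, handler):
    """GET the first message; on failure, back it out and bump its count."""
    message = queue.pop(0)
    try:
        handler(message)
        return True                      # processed and committed
    except Exception:
        message.backout_count += 1       # in real MQ the queue manager does this
        queue.insert(0, message)         # the message is back on the queue
        return False


def always_fails(message):
    raise ValueError("cannot parse " + message.body)


queue = [Message("<broken-xml")]
for _ in range(3):
    get_and_process(queue, always_fails)
print(queue[0].backout_count)   # 3
```

After three failed attempts the same message is still at the head of the queue, carrying a count of 3 that the next reader can inspect.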
IBM MQ gives the administrator the ability to set an attribute on the queue called the Backout Threshold. This attribute is a hint to the developer, so that they can decide what to do if the BackoutCount of a newly GOT message is equal to or greater than the BackoutThreshold of the queue that they just read the message from. Yes, it’s "just a hint"; we’ll see why in a minute. Also, be aware that the default threshold is zero, i.e. the threshold is not set, and in that case the value cannot be used to ascertain whether the message has been rolled back too many times.
IBM MQ also lets the administrator define, on the queue, the name of another queue (the backout queue) onto which the developer is expected to place any message that has failed the backout threshold test.
So, if we put those pieces together, we can see that the developer is meant to read a message from a queue and look at the message’s backout count. They should then compare that with the queue’s backout threshold property (which they should have acquired earlier in the code). If the message’s backout count equals or exceeds the queue’s backout threshold, then the developer should retrieve the name of the backout queue, as defined on the queue the message was read from, and PUT the message onto that backout queue.
NOTE: If the administrator has not set a backout queue on the original queue, then the message should be PUT to the Dead Letter Queue.
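The decision described above can be sketched as one small function. This is a model of the logic only, with no MQ calls in it: QueueAttributes and choose_destination are hypothetical names, and the queue names in the examples are placeholders. In a real application the threshold and backout queue name would come from inquiring on the queue (the BOTHRESH and BOQNAME attributes in MQSC terms) and the count from the message's MQMD.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueAttributes:
    backout_threshold: int = 0       # 0 is the default and means "not set"
    backout_queue_name: str = ""     # blank means no backout queue is defined


def choose_destination(backout_count: int,
                       attrs: QueueAttributes,
                       dead_letter_queue: str) -> Optional[str]:
    """Return the queue a message should be moved to, or None to process it normally."""
    if attrs.backout_threshold <= 0:
        return None                                  # threshold not set: nothing to compare against
    if backout_count < attrs.backout_threshold:
        return None                                  # still below the threshold: try again
    name = attrs.backout_queue_name.strip()
    return name if name else dead_letter_queue       # no backout queue defined: use the DLQ


DLQ = "SYSTEM.DEAD.LETTER.QUEUE"     # placeholder; use your queue manager's DLQ name
attrs = QueueAttributes(backout_threshold=3, backout_queue_name="APP.IN.BACKOUT")

print(choose_destination(0, attrs, DLQ))                                  # None
print(choose_destination(3, attrs, DLQ))                                  # APP.IN.BACKOUT
print(choose_destination(5, QueueAttributes(backout_threshold=3), DLQ))   # SYSTEM.DEAD.LETTER.QUEUE
print(choose_destination(5, QueueAttributes(), DLQ))                      # None
```

The last call shows why the threshold is "just a hint": with the default of zero, even a message that has been backed out five times is handed to the application as normal.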
This sounds like a lot of work, and it is! That is why so much confusion lies in this area. However, although these were the original semantics of IBM MQ, newer messaging protocols and solutions have come along which have made this job a little easier for the developer.
Backout handling in JMS
There is a lot of detail around this subject that I only summarise here; see the IBM info center for the intricacies (Handling Poison Messages in IBM MQ classes for JMS). JMS clients behave differently from some of the original MQ clients because JMS offers a level of automation. The MessageConsumer and ConnectionConsumer classes automatically re-route messages once the backout count has reached the defined backout threshold. As before, they follow the protocol of putting messages onto the defined backout queue, or onto the DLQ if the backout queue is not defined. With other types of application, as we have seen, this logic would need to be implemented manually. It’s “interesting” to note that if the message cannot be put to the defined backout queue and cannot be put to the DLQ either, then the message is discarded. So, beware!
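The fallback order summarised above (backout queue, then DLQ, then discard) can be sketched as follows. This models the behaviour as described in this post, not IBM's actual implementation, and every name in it (PutFailed, make_put, requeue_poison_message, the queue names) is an assumption made for the illustration.

```python
class PutFailed(Exception):
    """Raised by the toy put function when a queue cannot accept the message."""


def make_put(unavailable):
    """Build a fake put function plus the dict it stores messages in."""
    stored = {}

    def put(queue_name, message):
        if queue_name in unavailable:
            raise PutFailed(queue_name)
        stored.setdefault(queue_name, []).append(message)

    return put, stored


def requeue_poison_message(message, backout_queue, dead_letter_queue, put):
    """Try the backout queue, then the DLQ; return the queue used, or None if discarded."""
    for target in (backout_queue, dead_letter_queue):
        if not target:
            continue                    # this queue is not defined, move on to the next
        try:
            put(target, message)
            return target
        except PutFailed:
            continue                    # could not put here, fall through to the next option
    return None                         # nowhere left to put it: the message is lost


put, stored = make_put(unavailable={"APP.IN.BACKOUT"})
print(requeue_poison_message("msg", "APP.IN.BACKOUT", "SYSTEM.DEAD.LETTER.QUEUE", put))
# SYSTEM.DEAD.LETTER.QUEUE

put, stored = make_put(unavailable={"APP.IN.BACKOUT", "SYSTEM.DEAD.LETTER.QUEUE"})
print(requeue_poison_message("msg", "APP.IN.BACKOUT", "SYSTEM.DEAD.LETTER.QUEUE", put))
# None
```

The second call is the "beware" case: with both queues unavailable there is nowhere for the message to go.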
WebSphere Application Server (WAS) and IBM MQ Application ASF
Paul Titheridge has done an excellent write-up of how WAS handles rollbacks and I won’t repeat it here. However, in summary, he concludes that you really need to have both the backout threshold and backout queue names defined on your original queue to avoid poison messages being repeatedly read in and creating a loop - just as you do for JMS.
IBM Integration Broker (aka Message Broker) and Poison Messages
IBM Integration Broker is no exception to the rule here. If you are using a JMS node then the JMS semantics apply. However, if you are using an MQ node then it’s your responsibility, as the message flow writer, to ensure that you check the backout thresholds. IBM have provided a sample error handler flow.
In this article we have described the basics of how backout queues and thresholds are implemented within IBM MQ and MQ clients. Originally, the paradigm was that it was up to the developer to check the various attributes and properties to make sure that they were not processing poison messages too many times and getting into loops. Over time, newer protocols and solutions such as JMS and application servers have added convenience to this checking, which relieves the coder of much of the burden. It is very important to note, though, that the new methods are only as good as the MQ queue definitions of backout queue and threshold, which they still depend on.