MQ


Message Grouping via Transmissionqueuemanager not working?

  • 1.  Message Grouping via Transmissionqueuemanager not working?

    Posted Thu March 16, 2023 02:57 AM
    Edited by Sebastian Wilk Thu March 16, 2023 06:28 AM

    Hello dear MQ people,

    we have the following setup at hand:

    A client sends grouped messages while connected to a transmission queue manager, which relays them to the target queue manager via the SCTQ (SYSTEM.CLUSTER.TRANSMIT.QUEUE) and into a local queue.

    The problem arises after the messages have been put to the SCTQ and are moved to the local queue: the messages are committed one by one, which becomes a problem when the local queue fills up before any complete message group has arrived. We tested and verified this: the reading application only reads complete message groups, so the queue ends up in a soft lock.

    Our assumption was that the messages can only be committed once the last message in the group has been transferred, but this is not the case here.

    Here's a simple illustration:

    Is this intended behaviour, or is there a way to configure it so that the messages are only committed as a full group in the setup described above?

    Kind regards

    Sebastian



    ------------------------------
    Sebastian Wilk
    ------------------------------



  • 2.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Thu March 16, 2023 11:19 AM

    The queue manager knows if the sequence of messages (and segments) in the group is complete,  regardless of the order in which those messages actually arrive.

    With message groups (and segments) you can choose how much help the queue manager gives you in managing the reassembly of messages and whether messages in incomplete groups can be consumed.  The option MQGMO_ALL_MSGS_AVAILABLE indicates whether all of the messages in the group must be available before any message in the group is returned.

    Channels move messages (and segments) individually; they are not concerned with the completeness of the group (for example, the channel has no knowledge of how the messages will be consumed).

    You state "Our assumption was, that the messages can only be committed once the last message in group has been transferred". If that were the case, the full set of messages would still need to be on the target queue, and hence MAXDEPTH would need to be high enough to permit this.

    So the obvious answer to your question is 'no, there is no option to only transmit/commit whole groups'. The more interesting thing is what you were actually hoping to achieve if there were?

    What am I missing ?? For example do you have multiple message groups accumulating on the target queue and were hoping not to need MAXDEPTH to be big enough for all of the groups concurrently (not possible I'm afraid). What issue do you anticipate through increasing the max depth on the target queue ?
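The effect of MQGMO_ALL_MSGS_AVAILABLE mentioned above can be pictured with a small model. This is a plain-Python sketch, not the MQI: `Msg`, `group_complete` and `get_complete_group` are hypothetical names, and the model simply hands back a whole group at once, which is a simplification of how a real getter would work through a group message by message.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Msg:
    group_id: str
    seq: int            # 1-based position within the group
    last: bool = False  # True only on the final message of the group


def group_complete(queue: List[Msg], group_id: str) -> bool:
    """True when the last message and every earlier sequence number are on the queue."""
    members = [m for m in queue if m.group_id == group_id]
    finals = [m.seq for m in members if m.last]
    if not finals:
        return False
    return {m.seq for m in members} >= set(range(1, finals[0] + 1))


def get_complete_group(queue: List[Msg]) -> List[Msg]:
    """Remove and return the first complete group in sequence order, or [] if none is complete."""
    for head in queue:
        if group_complete(queue, head.group_id):
            group = sorted((m for m in queue if m.group_id == head.group_id),
                           key=lambda m: m.seq)
            queue[:] = [m for m in queue if m.group_id != head.group_id]
            return group
    return []


# Arrival order does not matter: group A is complete, group B still lacks its last message.
queue = [Msg("B", 1), Msg("A", 2, last=True), Msg("B", 2), Msg("A", 1)]
first = get_complete_group(queue)   # the two messages of group A, in sequence order
second = get_complete_group(queue)  # nothing: only the incomplete group B is left
```

In this model the messages of group B stay on the queue and keep occupying depth until the rest of the group arrives, which is the situation the MAXDEPTH discussion is about.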



    ------------------------------
    Andrew Hickson
    ------------------------------



  • 3.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Thu March 16, 2023 12:16 PM

    Your statement is correct in regards to the maxdepth, and this is also the issue we ran into.

    Our local queue reached its maxdepth while holding only incomplete message groups, so the reading application could not consume any of them, and the putting job's remaining messages were parked in the SCTQ.

    The takeaway we are looking for here is this scenario:

    We have 700,000 messages to be put into the target queue, and the queue has a capacity of 50,000.

    How can we make sure that there are only complete groups in the target queue so that it does not lock itself up?

    The obvious choice is to increase the maxdepth (which we already did), but it sounds like a weird design that senders and receivers have to respect the message group while the communication between queue managers is able to put and commit messages one by one. This could always lead to the situation described above if the messages exceed the maxdepth of the target queue.

    I'll have a look at the MQGMO_ALL_MSGS_AVAILABLE option.
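The lock-up described in this post can be reproduced with a scaled-down model. The sketch below is plain Python, not MQ code. `simulate` is a hypothetical name, and two things are assumed for illustration: messages of concurrently sent groups arrive interleaved round-robin, and the reader removes a group the moment its last message lands.

```python
from collections import Counter
from itertools import zip_longest
from typing import List, Tuple


def simulate(group_sizes: List[int], maxdepth: int) -> Tuple[int, int, int]:
    """Deliver interleaved groups into a bounded queue read by a whole-group consumer.

    Returns (groups consumed, messages stuck on the queue, messages still upstream).
    """
    # Arrival order: first message of every group, then the second of every group, ...
    per_group = [[g] * size for g, size in enumerate(group_sizes)]
    arrivals = [g for column in zip_longest(*per_group) for g in column if g is not None]

    on_queue = Counter()  # group number -> messages currently on the target queue
    consumed = 0
    sent = 0
    # The channel stops delivering once the queue is full.
    while sent < len(arrivals) and sum(on_queue.values()) < maxdepth:
        g = arrivals[sent]
        on_queue[g] += 1
        sent += 1
        if on_queue[g] == group_sizes[g]:  # group complete: the reader takes all of it
            del on_queue[g]
            consumed += 1
    return consumed, sum(on_queue.values()), len(arrivals) - sent


stuck = simulate([4, 4, 4], maxdepth=6)    # queue fills with three half-delivered groups
clear = simulate([4, 4, 4], maxdepth=12)   # room for every in-flight group
```

With three groups of four and a depth of 6, the model stops with six messages on the queue, none of them consumable, and six more waiting upstream. With a depth that covers all in-flight groups, every group completes and is consumed.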



    ------------------------------
    Sebastian Wilk
    ------------------------------



  • 4.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Thu March 16, 2023 12:39 PM

    The messages themselves contain no state indicating whether they were committed as an entire group, nor any state indicating whether they must be consumed as a complete group; that is entirely an application decision. As such, the channel can't possibly know which messages or segments were put as part of a complete group/message, let alone know how the messages will be consumed. Even if there were an option requiring the sending channel to only send complete groups, that would violate MQ message ordering guarantees. Why are you concerned about a depth of 700000 messages ?

    One of the difficulties in dealing with message groups and segmented messages is how to deal with 'lost' messages/segments. Are these messages persistent? If so, then the most common reason for a long-term incomplete group would be messages finding their way onto the DLQ.

    You state "it sounds like a weird design that sender and receivers do have to respect the message group", but that is not true. Both the senders and receivers can choose whether to only deal in complete groups or not. There's enough rope in the message group and segmentation options with which you could hang yourselves several times over. In the Hursley lab some of the options used to be referred to as "for grown ups only", and you have to be extremely careful if you start to manually fill in the group and segment status. The more friendly MQPMO/MQGMO options do NOT require you to put and receive complete messages, but even these options need you to think very carefully about how any errors would be handled.



    ------------------------------
    Andrew Hickson
    ------------------------------



  • 5.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Fri March 17, 2023 03:14 AM
    Edited by Sebastian Wilk Fri March 17, 2023 03:14 AM

    I'm not concerned with 700k messages, as this was an example, which reflects but a fraction of what the bigger jobs are sending. And the numbers are increasing every year.

    The problem remains. As far as I understand your comments, as soon as you use a transmission queue manager the groups will be 'split' and the messages will be put one by one, because the channel does not know about or look at the message groups, the last-message-in-group flags, etc. If that is the case, that is a situation we have to work with.

    Message order is not of importance in our case, but complete groups are, thus we are exploring what options we have other than blowing up the queue to a size that would allow to fit all messages at once to make sure the message groups will be completed at some point.

    Our messages were parked in the SCTQ because the target queue was filled with incomplete groups, preventing the getting application from reading any further messages. No message was put into the DLQ, but the queuemanager behaved a bit funky during the ordeal.

    The sending application only commits full groups, and the receiver only reads full groups, so our 'problem' (if it is one) is the transport between the transmission queue manager and the target queue on the target queue manager.

    We cannot allow either side to deal with incomplete groups, as there are dependencies that require the entire package to be there before the data is processed further.



    ------------------------------
    Sebastian Wilk
    ------------------------------



  • 6.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Fri March 17, 2023 06:26 AM

    Sebastian,

    I'm a bit confused by a couple of things in your reply:

    1. If there's a DLQ, and the target queue was full, why were the messages not put to the DLQ ? With the messages "parked in the SCTQ", what was the state of the appropriate channels ?
    2. If you've got numerous groups passing concurrently and you're only ever going to consume messages from whole groups, then how could you possibly avoid needing maxdepth to cover the total number of messages in the possible set of concurrent groups ?
    3. Can you expand on what you observed that led you to state that "the queuemanager behaved a bit funky during the ordeal" ?
    4. Can you confirm that the channels being used have the default batching options ? You state the messages arrive "one by one", which would be a bit of a surprise. When the sending application commits the group, it's likely that those messages would be batched together (the oldest available message is sent first, and messages are not available until committed, but "oldest" reflects put time rather than commit time), so I'd expect them to largely arrive 50 by 50.

    How deep would you need to set maxdepth to be comfortable that you wouldn't have this issue, now or in the foreseeable future ? (and how big are the individual messages).



    ------------------------------
    Andrew Hickson
    ------------------------------



  • 7.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Fri March 17, 2023 07:54 AM
    1. The channel (there was only one involved) was running; why the messages did not go to the DLQ is a mystery to us as well. They sat in the transmit queue until we made space on the target queue, and then transmission worked as usual.
    2. Since we receive complete groups from the sender's previous transaction, only complete groups make it into the SCTQ, which is why I am so adamant about the whole "why does the queuemanager send single messages instead of groups" question. The required size is not really capped, since the number of clients and their data is continuously increasing. What we know so far is that the biggest estimated job will be around 14m messages. These will be processed while being written, so the limit actually needed would realistically be lower, but with a lot of possible "ifs" that is the theoretical limit. While not absurdly high, having that many messages lying around is quite a pain to move in a timely manner.
    3. The funky behaviour was that the queuemanager was only semi-available, most likely because the reading application was hogging resources while checking all the messages and group IDs to find a complete group. The other queues were not affected, but the queuemanager itself was very clunky: MQSC and other commands either did not complete at all or completed with a massive delay (30 seconds or more), which is very unusual. The actual load on the server was no higher than usual; both memory and CPU sat at comfortable values of no more than 60%.
      Stopping the reading client processes allowed the queuemanager to behave normally again. We then provided smaller batches of messages so as not to endanger the other messaging tasks.
    4. I am not sure whether I am looking at the parameter you mentioned. I checked the cluster-sender channel from the transmission to the target queuemanager (assuming that this would be the culprit), and I can see the value of 50 in the field.

      What I can tell you is that when we tested this after the problem, we created groups of 8 messages and a maxdepth of 30. It put 24 messages without a problem, then clogged the remaining 6 "empty spaces" on the queue with an incomplete group, and then put the remaining 2 on the DLQ (which should have happened in the original problem as well).
      I suppose we can try and play with the parameter, but it wouldn't really affect the grouping itself.

      As mentioned before, the theoretical limit (as of today) would be 14m, but that will go up in the future; we could probably do with a limit of 3-5m right now. The messages themselves can range from a few bytes all the way to 32 MB (images, TIFFs etc.).
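The numbers from the test in point 4 (groups of 8, maxdepth 30, 24 + 6 on the queue and 2 on the DLQ) can be checked with a few lines of arithmetic. This is a plain-Python sketch, not MQ code. `deliver` is a hypothetical name, and it assumes four groups were sent one after another with no reader running, which the post does not state explicitly but which matches the 32 messages it accounts for.

```python
from typing import Dict, List, Tuple


def deliver(groups: int, group_size: int, maxdepth: int) -> Tuple[int, int, int]:
    """Put groups message by message onto a bounded queue; overflow goes to a DLQ.

    Returns (complete groups on the queue, messages of incomplete groups on the
    queue, messages on the DLQ).
    """
    queue: List[int] = []  # group number of each message on the target queue
    dlq: List[int] = []    # group number of each dead-lettered message
    for g in range(groups):
        for _ in range(group_size):
            (queue if len(queue) < maxdepth else dlq).append(g)

    per_group: Dict[int, int] = {}
    for g in queue:
        per_group[g] = per_group.get(g, 0) + 1
    complete = sum(1 for n in per_group.values() if n == group_size)
    stranded = sum(n for n in per_group.values() if n < group_size)
    return complete, stranded, len(dlq)


result = deliver(groups=4, group_size=8, maxdepth=30)
```

Under these assumptions the model ends with three complete groups (24 messages), six messages of a fourth group that can never complete on the queue, and two messages on the DLQ.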


    ------------------------------
    Sebastian Wilk
    ------------------------------



  • 8.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Fri March 17, 2023 08:42 AM

    In response to your points above:

    1. You should really raise a ticket with IBM about this; the channel should not stall just because one target queue is full. A channel stalling is more consistent with the discussion under point 3 below, but it would be a great coincidence for this to happen at exactly the time the queue became full.
    2. We've already discussed this: the channel sends the messages in the order prescribed by the MQI. In your specific case you don't care about message order, but MQ does specify a specific order and it's following that specification. There would be loads of other concerns with sending groups contiguously (for example, a group of millions of messages could lock out higher priority traffic).
    3. Some work was done on the performance of grouped messages, but that was probably the best part of 20 years ago (I did it!). I don't think any specific performance work has been done in this area since then. As I believe you are putting your groups in the correct logical sequence, it should be easy for the queue manager to identify which groups are incomplete (there's a tree in storage for each group, and a flag indicating if the group is potentially complete and therefore needs to be examined in more detail). There's also a more general queue index consisting of 48 bytes of data per message. Clearly 48 bytes isn't enough to hold the whole MQMD; the 48 bytes are intended to contain sufficient info to make searching for messages relatively efficient. As your target queue has lots of messages in incomplete groups, an MQGET is going to have to go through the message chain looking for the first available message in a whole group. It's possible that this is causing more overhead than is intended and that the queue is locking up due to this activity. This would be massively magnified if the applications were polling for messages. Similarly, the queue manager tries to give some priority to MQGET over MQPUT, and so if you've got multiple consumers (there's a single channel putting to the queue) they might overwhelm the CPU processing gets that return MQRC_NO_MSG_AVAILABLE. The queue manager keeps a bunch of counters for each queue which measure how efficiently the search for messages is behaving. You have to use the amqldmpa service utility to access these counters, and that would be best done under the direction of MQ support.
    4. Yes, that's the parameter. If you're looking at curdepth then you have to bear in mind that curdepth includes uncommitted puts and so will be rising 1 by 1 even if the batch completion is making messages available 50 at a time. 
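The difference between curdepth and what a getter can actually see, as described in point 4, can be sketched as follows. This is a toy model in plain Python, not MQ code. `receive` is a hypothetical name, and it assumes a batch ends only after `batchsz` messages or when the sender runs out of messages, ignoring the other conditions that can end a batch on a real channel.

```python
from typing import List, Tuple


def receive(total: int, batchsz: int = 50) -> List[Tuple[int, int]]:
    """Return (curdepth, available) after each message the receiving channel puts.

    Every put counts towards curdepth immediately, but the messages only become
    available to getters when the batch they belong to is committed.
    """
    samples = []
    available = 0
    for curdepth in range(1, total + 1):
        if curdepth % batchsz == 0 or curdepth == total:
            available = curdepth  # batch committed: everything put so far is visible
        samples.append((curdepth, available))
    return samples


samples = receive(120)
```

Watching curdepth in this model shows it climbing 1, 2, 3, and so on, while the available count stays at 0 until the 50th message and then jumps to 50, 100 and finally 120.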


    ------------------------------
    Andrew Hickson
    ------------------------------



  • 9.  RE: Message Grouping via Transmissionqueuemanager not working?

    IBM Champion
    Posted Fri March 17, 2023 11:14 AM

    I see you talking about the SCTQ... Would that also mean that you have multiple instances of the target queue?

    What is the default bind option (either DEFBIND on the queue or the open option in the application)? I do hope you use 'GROUP' and not 'NOTFIXED'!



    ------------------------------
    Francois Brandelik
    ------------------------------



  • 10.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Wed March 22, 2023 02:51 AM

    1. You probably know as well as I do that a PMR without data captured from an actual occurrence will mostly result in no solution. We did, however, prepare commands so that, should this case arise again (which is very unlikely), we can grab the data.
    2. That answers our question: the message grouping gets ignored at that point, so the only solution we have is to make sure that the target queue is big enough. I still find it confusing how so many incomplete message groups made their way in, but at least it is technically possible for this to occur.
    3. If enough interest should arise, we will try to analyse that particular point in more detail. As I mentioned, it was a guess/assumption that the getting application is what caused the weird and unresponsive behaviour, despite the actual server resources being plentiful, so it seems the bottleneck was somewhere else or at least not visible in the glance we took. (It might have been too many I/Os or disk usage; we have not checked every possible piece of the puzzle because we had to solve it fast.)
    4. I see, another thing to play around with, though I do not think it will serve a purpose in our particular case.

    Thank you so much for the detailed answers, it provided some insight and we know how to handle the situation in the future :)

    @Francois Brandelik 

    In normal setups, we do have 2 target queuemanagers with the queue, and yes, we do use bind on group to not split them. In our case we only had one assigned to have the other one handle other business due to the sheer amount of messages that were scheduled.



    ------------------------------
    Sebastian Wilk
    ------------------------------



  • 11.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Wed March 22, 2023 05:13 AM

    Sebastian,

    I've been thinking about this problem a little more since my last post. My knowledge here is limited to my memory of how the code worked a couple of years ago, however the grouping/segmentation code has been very stable for many  years and I'd be very surprised if there have been any significant changes in this area since I left.

    As I stated previously, there are 48 bytes of data in memory for every message on a currently accessed queue. This data is intended to allow the most common message searches to complete without having to look outside of this 48-byte area. Beyond this 48 byte/msg index, there is a memory cache of the MQMDs of recently accessed messages, beyond that cache there's a set of MQ managed buffers, and finally there's the disk cache itself. As a search falls out of each level of cached information it gets less efficient. Now 48 bytes is a very small amount, and these 48 bytes are packed as tightly as possible. My memory of the code is that the structure of these 48 bytes would not currently allow your case to search efficiently. In particular, I don't remember there being any indication in this structure of the first message in a group. As such, a search for the first complete group from a deep queue full of incomplete groups would need to access the MQMD of almost every message, which would exhaust the cache of MQMDs and the MQ managed buffers very quickly. I wouldn't think it would be a big task to fix this (it is simply a case that was never considered), and it could be done either by adding a single bit to the 48-byte structure or by making more efficient use of the existing bits related to groups and segments. If this IS the issue you are facing, then simply increasing the maximum allowed depth of the target queue will not resolve your issue! (the main problem being that the search cost isn't linear).
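Why a search that touches almost every MQMD can fall off a cliff rather than slow down gradually can be illustrated with a generic cache model. This is a toy in plain Python and says nothing about how the queue manager's caches are really organised: the least-recently-used eviction policy, the cache size and the name `misses_for_scans` are all assumptions made for illustration.

```python
from collections import OrderedDict


def misses_for_scans(depth: int, cache_size: int, scans: int) -> int:
    """Count cache misses when each of `scans` searches reads the header of every
    one of `depth` messages in queue order through an LRU cache of `cache_size` entries."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    misses = 0
    for _ in range(scans):
        for msg in range(depth):
            if msg in cache:
                cache.move_to_end(msg)         # hit: mark as most recently used
            else:
                misses += 1                    # miss: header fetched from a slower level
                cache[msg] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used entry
    return misses


fits = misses_for_scans(depth=1000, cache_size=1000, scans=10)
spills = misses_for_scans(depth=1001, cache_size=1000, scans=10)
```

In this model, ten scans of a queue that just fits the cache miss only during the first scan, while a queue one message deeper misses on every single access, because each scan evicts exactly the entries the next scan needs first.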

    While there is no guarantee this is the problem you are facing, and no guarantee that IBM will address a performance issue via a PMR, I would strongly encourage you to open a PMR for this issue. At the very least it should ensure that you collect the relevant doc should the problem recur, and if you're very lucky they might address this (I'd like to think I would have done so if I'd still been there!).

    Andy.



    ------------------------------
    Andrew Hickson
    ------------------------------



  • 12.  RE: Message Grouping via Transmissionqueuemanager not working?

    Posted Wed March 22, 2023 06:50 AM

    I'm not entirely convinced, but you could be onto something. The reason I have some doubts is that once we increased the maxdepth of the queue, the slowdown of the queuemanager was gone, despite the messages still sitting there, which does not fit your explanation.

    The entire thing is quite nebulous, so I'm going out on a limb here.

    Unfortunately I do not think that we can fully understand the situation unless we find a way to replicate the issue in a test environment.

    We will try to create this scenario in test so that we are able to analyze things properly.

    If we have new findings, I'll get back here, ideally with some official intel from the IBM support.

    Cheers Andy!



    ------------------------------
    Sebastian Wilk
    ------------------------------