Possible use of BATCHHB for in-doubt channels

  • 1.  Possible use of BATCHHB for in-doubt channels

    Posted Mon April 24, 2023 04:09 AM

    Hi folks,
    we recently had some CSQX507E events (again) in our queue sharing groups on z/OS.
    Our topology:
    - production LPARs LPA1-LPA8
    - MQ level V9.2
    - two QSGs, call them QSG1 and QSG2, each with 8 members spread over the LPARs
    - sender/receiver group channels QSG1.QSG2 and group xmit queues QSG2

    The CSQX507E events happened during a z/OS change.
    Some of our batch schedules had problems with this situation because their messages got stuck in the xmit queues.
    In the end we had to resolve the situation manually.

    In this KC topic the BATCHHB (Batch Heartbeat Interval) attribute is mentioned.
    We are considering using it but have no experience with this attribute,
    so we do not know which value would be appropriate.

    Regards, Norbert



    ------------------------------
    Norbert Pfister
    system engineer
    Nuremberg
    Germany
    ------------------------------


  • 2.  RE: Possible use of BATCHHB for in-doubt channels

    Posted Tue April 25, 2023 10:30 AM
    Edited by Mayur RAJA Tue April 25, 2023 10:40 AM

    Hi Norbert, 

    Tony Sharkey (MQ Performance) and I have been looking at the MQ docs for BATCHHB and we must admit that the docs are not very clear on this. We will work with the MQ publications team to improve the wording in due course.

    As it happens, a customer had raised a case on BATCHHB. The case number is TS008114032. I do not know if you can access the case or not but, in the case, Mark Womack (MQ L2 Service Team) explains precisely how BATCHHB is used. I have cut and pasted the following three paragraphs from the case:

    "What is it that constitutes "communication from the receiving channel"?" this communication from the receiving side can be either (a) a batch confirmation flow which the receiving side will send to the sending side when it confirms receipt of the previous batch, or (b) a 'normal' heartbeat flow, to indicate to the sending side that it is still alive/available. If the amount of time that has passed since either (a) or (b) has occurred (from the sending side's perspective), is longer than the setting for BATCHHB, then this special batch heartbeat flow is sent. To be sure, after that special flow is sent, it is the HBINT timer that is used to await the next response, before the batch is then backed out.

    Regarding "But if the channel is busy and communicating with the receiver, then MQ won't send a BATCHHB flow?" your understanding is correct, since, if the other end is already determined to be active, then there is no need to send the special batch heartbeat flow and processing continues as usual.

    Regarding "Does the batch heartbeat interval (500 milliseconds for example) refer to the length of time MQ waits for a response from the receiver to a BATCHHB flow?" In the BATCHHB topic it mentions "The sending channel waits for a response from the receiving end of the channel for an interval, based on the number of seconds specified in the channel Heartbeat Interval (HBINT) attribute."

    The key thing to note here is that BATCHHB is the interval compared against the time elapsed since (a) or (b) mentioned above. If that time exceeds BATCHHB, a heartbeat flow is sent to the receiver to determine whether the receiver is still active. The HBINT value is what is actually used to wait for a response from the receiver channel to this heartbeat flow. If a response is not received, the current batch of messages (if there is an active and in-doubt batch of messages) is backed out, the channel ends, and it enters channel retry processing. The heartbeat flow is just a regular heartbeat flow, and BATCHHB essentially allows you to narrow the window in which a channel could go in-doubt.
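    To make the interplay of the two timers concrete, here is a minimal Python sketch of the decision described above. It is only a model of the explanation in this post, not MQ code: the names ChannelTimers, needs_batch_heartbeat and end_of_batch_action are made up for illustration, and the sketch assumes that BATCHHB is given in milliseconds (0 meaning batch heartbeating is not used) and HBINT in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChannelTimers:
    """Hypothetical holder for the two channel attributes discussed here."""
    batchhb_ms: int  # BATCHHB in milliseconds; 0 = batch heartbeating not used
    hbint_s: int     # HBINT in seconds


def needs_batch_heartbeat(ms_since_last_flow: int, timers: ChannelTimers) -> bool:
    """True if more than BATCHHB has passed since the last flow from the receiver.

    'Last flow' means either (a) a batch confirmation or (b) a normal heartbeat.
    """
    if timers.batchhb_ms == 0:
        return False
    return ms_since_last_flow > timers.batchhb_ms


def end_of_batch_action(ms_since_last_flow: int,
                        timers: ChannelTimers,
                        reply_delay_s: Optional[float]) -> str:
    """Model what the sender does before the batch would go in-doubt.

    reply_delay_s is how long the receiver takes to answer the heartbeat
    flow, or None if it never answers.
    """
    if not needs_batch_heartbeat(ms_since_last_flow, timers):
        # Receiver was heard from recently enough: no extra flow is sent.
        return "proceed"
    # A heartbeat flow is sent; HBINT (not BATCHHB) bounds the wait for the reply.
    if reply_delay_s is not None and reply_delay_s <= timers.hbint_s:
        return "proceed"
    return "back out batch, end channel, retry"


if __name__ == "__main__":
    timers = ChannelTimers(batchhb_ms=500, hbint_s=300)
    print(end_of_batch_action(100, timers, None))   # receiver heard from recently
    print(end_of_batch_action(800, timers, 2.0))    # probed, receiver answers
    print(end_of_batch_action(800, timers, None))   # probed, no answer
```

    The point the sketch tries to show is that a small BATCHHB only controls how readily the probe is sent; how long the sender then waits for the answer is still governed by HBINT.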

    Please take a look at the case. If you cannot access it, let me know and I'll see if I can send you a copy.

    If it does not address your question, please do get back to us.

    Regards .. Mayur and Tony.



    ------------------------------
    Mayur RAJA
    ------------------------------



  • 3.  RE: Possible use of BATCHHB for in-doubt channels

    Posted Wed April 26, 2023 08:26 AM

    Hi @Mayur RAJA ,

    unfortunately I cannot access TS008114032.

    We presume that a batch job on z/OS with many messages was running (connected to MQ61) at exactly the time MQ12 was shut down.
    If I understand your information correctly, BATCHHB does not really help in such cases.

    Regards, Norbert



    ------------------------------
    Norbert Pfister
    system engineer
    Nuremberg
    Germany
    ------------------------------



  • 4.  RE: Possible use of BATCHHB for in-doubt channels

    Posted Thu April 27, 2023 04:49 AM
    Edited by Mayur RAJA Thu April 27, 2023 10:47 AM

    Hi Norbert,
    I checked with our MQ L3 team and unfortunately, due to GDPR rules, I am not permitted to forward cases. However, in my earlier append, I had covered the key points raised in the case. The other thing that the case does touch on is that BATCHHB was introduced to address issues with clustering (Morag mentions this in her update below). So, it will not help in your case. 

    One thing to note is that for planned outages, you could consider running a CSQUTIL job to terminate resources (listener, channels, channel initiator, queue manager) gracefully before taking down the LPAR for maintenance. However, this clearly would not help for unplanned outages. 

    If I understand correctly, you want the sender channel in QSG1 to rebind with another instance of the receiver channel in QSG2 when the LPAR that the receiver channel was originally running on is taken down. To achieve this, the receiver channel synchronisation state would need to be held on the shared syncq (SYSTEM.QSG.CHANNEL.SYNCQ). However, you mention that you are using the private SYSTEM.CHANNEL.SYNCQs. As you took down LPA1 and the receiving partner queue manager MQ12, the channel status held on the private syncq on queue manager MQ12 cannot be accessed. Hence, you end up with the in-doubt channel situation.

    If your sender channel targets an INDISP(GROUP) listener port (as per Morag's update below), the receiver channel will be started as a shared channel and hence the channel synchronisation information will be held in the shared sync queue.

    You have been focusing on the receiving end, but I think you also need to consider whether your sender channel instances should be serving a shared transmission queue. If your sender channel is running on MQ11 and the receiver on MQ12 and you take down LPA1, you want the sender channel to have access to the correct state so that it can bind with a receiver without running into channel sequence number or in-doubt channel issues. Shared channels also benefit from peer level recovery in the event of a queue manager failure.

    FYI, for shared queues, messages put to the CF are limited to 63 KB. Larger messages can be put to a shared queue but they end up on shared message data sets (SMDS). The capacity of the CF can also be increased with the use of storage class (also known as flash) memory. See: https://www.ibm.com/docs/en/ibm-mq/9.3?topic=groups-where-are-shared-queue-messages-held

    Regards .. Mayur



    ------------------------------
    Mayur RAJA
    ------------------------------



  • 5.  RE: Possible use of BATCHHB for in-doubt channels
