Possible use of BATCHHB for in-doubt channels

5. RE: Possible use of BATCHHB for in-doubt channels

Norbert Pfister

Posted Fri April 28, 2023 09:17 AM

Hi Mayur,

I would have been surprised if you had been allowed to share another customer's case, so don't worry!
Your excerpts for BATCHHB were clear enough for me.

Thank you for enlightening me regarding shared transmission queues. The reasons to use them instead of private xmitqs are plausible.
But switching from a group definition to one common shared queue will be a huge challenge in a production system.
Getting the old xmitq QSG2 into a state where it holds no messages and is not in use by applications, so that it can be deleted, is nearly impossible.
We will have to set them to "put inhibited" and hope we do not disrupt the applications ...

Thank you for all your suggestions!

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany
------------------------------

Original Message:
Sent: Thu April 27, 2023 04:48 AM
From: Mayur RAJA
Subject: Possible use of BATCHHB for in-doubt channels

Hi Norbert,
I checked with our MQ L3 team and unfortunately, due to GDPR rules, I am not permitted to forward cases. However, in my earlier append, I had covered the key points raised in the case. The other thing that the case does touch on is that BATCHHB was introduced to address issues with clustering (Morag mentions this in her update below). So, it will not help in your case.

One thing to note is that for planned outages, you could consider running a CSQUTIL job to terminate resources (listener, channels, channel initiator, queue manager) gracefully before taking down the LPAR for maintenance. However, this clearly would not help for unplanned outages.
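
As a rough sketch only, the MQSC commands for such a graceful shutdown on the queue manager being taken down might look something like the following. The channel name and listener values are the ones quoted elsewhere in this thread; the ordering and the choice of MODE(QUIESCE) are assumptions rather than anything taken from the case.

* Assumed order: channels first, then listener, then channel initiator, then queue manager
STOP CHANNEL('QSG1.QSG2') MODE(QUIESCE)
STOP LISTENER TRPTYPE(TCP) PORT(1414) IPADDR(QSG2.Nuremberg.DE) INDISP(QMGR)
STOP CHINIT
STOP QMGR MODE(QUIESCE)

Quiescing the channel allows the current batch to complete, so the channel should not be left in-doubt when the LPAR goes away.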

If I understand correctly, you want the sender channel in QSG1 to rebind with another instance of the receiver channel in QSG2 when the LPAR that the receiver channel was originally running on is taken down. To achieve this, the receiver channel synchronisation state would need to be held on the shared syncq (SYSTEM.QSG.CHANNEL.SYNCQ). However, you mention that you are using the private SYSTEM.CHANNEL.SYNCQs. As you took down LPA1 and the receiving partner queue manager MQ12, the channel status held on the private syncq on queue manager MQ12 cannot be accessed. Hence, you end up with the in-doubt channel situation.

If your sender channel targets an INDISP(GROUP) listener port (as per Morag's update below), the receiver channel will be started as a shared channel and hence the channel synchronisation information will be held in the shared sync queue.

You have been focusing on the receiving end, but I think you also need to consider whether your sender channel instances should be serving a shared transmission queue or not. If your sender channel is running on MQ11 and the receiver on MQ12 and you take down LPA1, you want the sender channel to have access to the correct state to be able to bind with a receiver without running into channel sequence number or channel in-doubt issues. Shared channels also benefit from peer-level recovery in the event of a queue manager failure.

FYI, for shared queues, messages put to the CF are limited to 63K. Larger messages can be put to a shared queue but they end up on shared message datasets (SMDS). The capacity of the CF can also be increased with the use of storage class (also known as flash) memory. See: https://www.ibm.com/docs/en/ibm-mq/9.3?topic=groups-where-are-shared-queue-messages-held.

Regards .. Mayur

------------------------------
Mayur RAJA

Original Message:
Sent: Wed April 26, 2023 08:25 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels

Hi @Mayur RAJA ,

Unfortunately I cannot access TS008114032.

We suspect that a batch job on z/OS sending many messages was running (connected to MQ61) at exactly the time MQ12 was shut down.
As I understand your information, BATCHHB does not really help in such cases.

Regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany

Original Message:
Sent: Tue April 25, 2023 10:30 AM
From: Mayur RAJA
Subject: Possible use of BATCHHB for in-doubt channels

Hi Norbert,

Tony Sharkey (MQ Performance) and I have been looking at the MQ docs for BATCHHB and we must admit that the docs are not very clear on this. We will work with the MQ publications team to improve the words in due course.

As it happens, a customer had raised a case on BATCHHB. The case number is TS008114032. I do not know if you can access the case or not but, in the case, Mark Womack (MQ L2 Service Team) explains precisely how BATCHHB is used. I have cut and pasted the following three paragraphs from the case:

"What is it that constitutes "communication from the receiving channel"?" this communication from the receiving side can be either (a) a batch confirmation flow which the receiving side will send to the sending side when it confirms receipt of the previous batch, or (b) a 'normal' heartbeat flow, to indicate to the sending side that it is still alive/available. If the amount of time that has passed since either (a) or (b) has occurred (from the sending side's perspective), is longer than the setting for BATCHHB, then this special batch heartbeat flow is sent. To be sure, after that special flow is sent, it is the HBINT timer that is used to await the next response, before the batch is then backed out.

Regarding "But if the channel is busy and communicating with the receiver, then MQ won't send a BATCHHB flow?" your understanding is correct, since, if the other end is already determined to be active, then there is no need to send the special batch heartbeat flow and processing continues as usual.

Regarding "Does the batch heartbeat interval (500 milliseconds for example) refer to the length of time MQ waits for a response from the receiver to a BATCHHB flow?" In the BATCHHB topic it mentions "The sending channel waits for a response from the receiving end of the channel for an interval, based on the number of seconds specified in the channel Heartbeat Interval (HBINT) attribute."

The key thing to note here is that the value of BATCHHB is used to determine if it has elapsed (since (a) or (b) mentioned above) and if so, to then send a heartbeat flow to the receiver to determine if the receiver is still active or not. The HBINT value is actually used to wait for a response from the receiver channel to this heartbeat flow. If a response is not received, the current batch of messages (if there is an active and indoubt batch of messages) is backed out, the channel ends, and enters channel retry processing. The heartbeat flow is just a regular heartbeat flow and the BATCHHB essentially allows you to narrow the window for when a channel could go indoubt.
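
To make that concrete, here is a hedged sketch of how the two attributes could be set on a sender channel. The channel name is the one used in this thread and the values are purely illustrative assumptions, not recommendations. BATCHHB is specified in milliseconds and HBINT in seconds, and the channel has to be restarted before changed values take effect. QSGDISP(GROUP) assumes the channel is held as a group definition.

* Illustrative values only
* BATCHHB(500): send a batch heartbeat if nothing has been heard from the receiver for 500 ms
* HBINT(300): the wait for the reply is then based on this heartbeat interval
ALTER CHANNEL('QSG1.QSG2') CHLTYPE(SDR) QSGDISP(GROUP) +
BATCHHB(500) +
HBINT(300)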

Please take a look at the case. If you cannot access it, let me know and I'll see if I can send you a copy.

If it does not address your question, please do get back to us.

Regards .. Mayur and Tony.

------------------------------
Mayur RAJA

Original Message:
Sent: Mon April 24, 2023 04:08 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels

Hi folks,
we recently had some CSQX507E events (again) in our Queue Sharing Groups on z/OS.
Our topology:
- production LPARs LPA1-LPA8
- MQ level V9.2
- two QSGs, call them QSG1 and QSG2, each with 8 members spread over the LPARs
- Sender/Receiver group channels QSG1.QSG2 and group xmit queues QSG2

During a z/OS change the CSQX507E occurred.
Some of our batch schedules had problems with this situation since their messages got stuck in the xmit queues.
In the end we had to resolve the situation manually.
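
For other readers, a manual resolution of this kind is typically done with commands along the following lines on the sending queue manager. This is only a sketch using the channel name from this thread. Whether BACKOUT or COMMIT is the right action is an assumption that has to be checked first by comparing the saved LUWIDs with the receiving end, because the wrong choice can lose or duplicate messages.

* Show the saved (in-doubt) status of the sender channel
DISPLAY CHSTATUS('QSG1.QSG2') SAVED INDOUBT CURLUWID LSTLUWID
* Only after comparing with the receiver: back out (or commit) the in-doubt batch
RESOLVE CHANNEL('QSG1.QSG2') ACTION(BACKOUT)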

In this KC topic the BATCHHB (Batch Heartbeat Interval) is mentioned.
We are considering using it but have no experience with this attribute,
so we do not know which value would be appropriate.

Regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany
------------------------------

6. RE: Possible use of BATCHHB for in-doubt channels

Mayur RAJA

Posted Fri April 28, 2023 10:44 AM
Edited by Mayur RAJA Mon May 01, 2023 09:18 AM

Hi Norbert,
You're welcome, and I now understand the challenge that you would face.

Put-disabling either the remote, xmit or alias (if this points to the remote) queue will certainly stop applications from putting new messages to the xmitq. However, I would appreciate it if you could confirm whether you are likely to do the following in your environment, please?

By running the sending end of the channel as private and the receiving end as shared, I believe there is the potential to introduce channel sequence number errors.

Let's assume that:

- Sender channel QSG1.QSG2 is started on MQ11 and bound to receiver channel QSG1.QSG2 on MQ12. As messages are transferred from MQ11 to MQ12, the sender stores its state on the private syncq defined on MQ11 while the receiver stores its state in the shared syncq which is accessible by all QSG2 queue managers.

- You stop the sender channel (which results in the receiver channel stopping too).

- You take down LPA1.

- You start the sender channel on MQ21 on LPA2 and this binds with the receiver channel on MQ22 on LPA2. As the sender channel is private, it will have no state on the private sync defined on MQ21. As the receiver is shared, it will read the shared state for the channel from the shared syncq in QSG2. 

- Apologies but the above paragraph is not actually true. I had forgotten that we can have multiple receiver channels from different partners and in this case, we would not read the shared state for the previous receiver channel instance but create a new shared state for the new receiver channel instance on MQ22.

- When you move messages, since the sender creates new state in the private sync and the receiver uses the shared state that is already in the shared syncq on QSG2, I believe this will result in a channel sequence number mismatch and hence an error.

- Apologies but as the shared state for the receiver channel on the shared syncq is based on the partner Queue Manager name, it will be different from the previous instance and so we will create a new status entry and we will NOT see any channel sequence number issues.

Of course, if your private sender channel always starts on MQ11, then it will only ever have a single channel state in the private sync on MQ11 hence this state should always be consistent with the shared state for the partner receiver channel in the shared syncq on QSG2.

I'd like to apologise for misleading you and for any confusion that I may have caused. Also, I'd like to thank Morag for reminding me about multiple receiver channel instances and for correcting my understanding of it. I think I was being a bit too eager here :-). I will be more careful before appending in future !

Regards .. Mayur

------------------------------
Mayur RAJA
------------------------------

12. RE: Possible use of BATCHHB for in-doubt channels

Norbert Pfister

Posted Thu April 27, 2023 02:20 AM

Damn, I think I've got it now!
All of our QSGs have definitions like START LISTENER TRPTYPE(TCP) PORT(1414) IPADDR(QSG2.Nuremberg.DE) INDISP(QMGR)
There is the page "Shared channels" in the docs (very useful and clarifying!).

And I found some notes in our team records (2012) regarding how to manage the listeners:
Make listeners INDISP(QMGR) for client connections, INDISP(GROUP) is not useful for them !
So we switched to this topology.

Instead of just changing the INDISP, we should have added a new listener for inter-QSG communication:
START LISTENER TRPTYPE(TCP) PORT(nnnn) IPADDR(QSG2.Nuremberg.DE) INDISP(GROUP)
and adjusted all CONNAME attributes that reference QSG2.Nuremberg.DE on channels of type SDR/SVR/CLUSSDR to use the newly established port.
This is described in the section "Configuring SVRCONN channels for a queue sharing group" of "Shared channels":
"The optimal configuration for SVRCONN channels in a queue sharing group is to set up private listeners in each CHINIT which use a different port number from the point to point channels."
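
A sketch of what the CONNAME change might look like for the sender channel defined in QSG1. Here nnnn stands for the group listener port that has not been chosen yet, and QSGDISP(GROUP) assumes the channel is held as a group definition, as in the definition shown elsewhere in this thread. The channel would need to be restarted to pick up the new connection name.

* nnnn is a placeholder for the new INDISP(GROUP) listener port
ALTER CHANNEL('QSG1.QSG2') CHLTYPE(SDR) QSGDISP(GROUP) +
CONNAME('QSG2.Nuremberg.de(nnnn)')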

Fortunately this looks like a smooth transition, since those point-to-point channel connections between qmgrs are our responsibility and do not affect clients.

Thank you very much, @Morag Hughson , as always your tips and hints are very useful.
Hopefully I have understood it :-)

Regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany
------------------------------

Original Message:
Sent: Wed April 26, 2023 10:01 AM
From: Morag Hughson
Subject: Possible use of BATCHHB for in-doubt channels

I agree that you likely have an architectural problem. You appear to be using an IP Address in your SDR channel CONNAME that is a DVIPA / Sysplex Distributor or some other type of address that represents all the members of QSG2, and thus each time the SDR reconnects it is routed to one of the members. However, the targeted port number would appear to be the QSG2 INDISP(QMGR) port number rather than the INDISP(GROUP) port number, as evidenced by the contents of your SYNCQ. The RCVR channel is providing its partner name as the QMgr name and not the QSG name.

This means the SDR has an in-doubt batch with a partner QMgr, and if it ends in-doubt, retries, and connects to a different member of QSG2, it cannot continue because it is in-doubt with someone else. This is what your CSQX507E messages are all about.

If you are going to use such an IP Address it must target an INDISP(GROUP) listener port.

Cheers,
Morag

------------------------------
Morag Hughson
MQ Technical Education Specialist
MQGem Software Limited
Website: https://www.mqgem.com

Original Message:
Sent: Wed April 26, 2023 08:08 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels

Hi @Morag Hughson ,

regarding the in-doubt situation:
Multiple channels are not the case here; there is only this one channel, QSG1.QSG2, using xmitq QSG2 (proved by MO71 :-) ).
SYSTEM.CHANNEL.SYNCQ is QSGDISP(QMGR), so also private.

But I admit that we possibly have an architectural problem, because we have had this incident occasionally:
When one of LPA1 to LPA8 was re-IPLed for maintenance, it took some minutes until the receiving qmgr was available again.
This happened at weekends and only for some minutes, so the in-doubt channels vanished afterwards (for sure, my colleague and I observed this once).
But this time, last Friday, LPA1 was down for 22 hours.

Some more information about our configuration:
QSG1 has members MQ11 to MQ81 over all 8 LPARs.
QSG2 has members MQ12 to MQ82 over all 8 LPARs.
Both channels and xmit queues are private (I have omitted superfluous attributes), here MQ61 for example:

DEFINE CHANNEL('QSG1.QSG2') +
CHLTYPE(SDR) +
QSGDISP(GROUP) +
CONNAME('QSG2.Nuremberg.de') +
XMITQ('QSG2') +
MAXMSGL(4194304) +
HBINT(300) +
KAINT(AUTO) +
DISCINT(6000) +
SEQWRAP(999999999) +
REPLACE

DEFINE QLOCAL('QSG2') +
QSGDISP(GROUP) +
USAGE(XMITQ) +
INDXTYPE(NONE) +
STGCLASS('XMIT') +
MAXDEPTH(999999999) +
MAXMSGL(4194304) +
DEFPRTY(0) +
DEFPSIST(NO) +
DEFPRESP(SYNC) +
DEFREADA(NO) +
DEFBIND(OPEN) +
MSGDLVSQ(PRIORITY) +
PUT(ENABLED) +
GET(ENABLED) +
NOHARDENBO +
BOTHRESH(0) +
NOSHARE +
DEFSOPT(EXCL) +
RETINTVL(999999999) +
PROPCTL(COMPAT) +
TRIGGER +
INITQ('SYSTEM.CHANNEL.INITQ') +
TRIGTYPE(FIRST) +
TRIGMPRI(0) +
TRIGDPTH(1) +
TRIGDATA('QSG1.QSG2') +
REPLACE

Shortly before the z/OS maintenance, LPA1 was shut down together with MQ12, which was the receiving partner for sender MQ61 at that time.
Here is the joblog of MQ61CHIN:
16:08:35.900 CSQX599E M61P CSQXRCTL Channel P08P.P01P ended abnormally
16:08:35.900 CSQX206E M61P CSQXRCTL Error sending data,
channel QSG1.QSG2
connection qsg2 (1.2.3.4)
(queue manager MQ12)
TRPTYPE=TCP RC=0000008C reason=76697242
16:08:46.410 CSQX599E M61P CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:08:46.410 LPA6 M61PCHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ22)
16:09:47.310 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:09:47.310 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ72)
16:10:48.210 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ82)
16:10:48.220 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:11:49.190 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:11:49.190 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ42)
16:12:52.200 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:12:52.200 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ72)

I had a look at the messages on SYSTEM.CHANNEL.SYNCQ. There are as many entries for QSG1.QSG2 as there are members of QSG2 that MQ61 has ever connected to.
That is understandable, as MQ61 has to save the channel status information such as MSGSEQNO.
MQ61 tries all other QSG members round-robin (as presumed) but always mentions the original qmgr MQ12 in CSQX206E.
That confuses me...

Best regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany

Original Message:
Sent: Tue April 25, 2023 05:57 PM
From: Morag Hughson
Subject: Possible use of BATCHHB for in-doubt channels

Hi Norbert,

The CSQX507E message, an example shown in full below as a reminder to other readers, is indicating that when a channel tried to start it discovered that another batch of messages for a different channel from this same transmission queue was already in-doubt.

+CSQX507E cpf CSQXRCTL Channel CSQ1.TO.CSQ2.T01 is in-doubt, connection CSQ2 (queue manager ????)
+CSQX599E cpf CSQXRCTL Channel CSQ1.TO.CSQ2.T02 ended abnormally

I would ask why you have more than one channel using the same transmission queue. This is an unusual situation to be in.

You also ask about Batch Heartbeat Interval. This is a helpful additional flow added to the channel protocol where the sender channel will check that the partner is still there before marking the batch in-doubt and proceeding with the end of batch processing. This is helpful in situations where you have an unstable network. It is mainly helpful in clustering situations where, if the messages had not become part of an in-doubt batch, they would have been reassigned to another channel to send somewhere else. For non-cluster channels, the messages will be moved by the same channel once the network comes back, so there is little benefit to be gained.

For more reading on this, try page 17 of Keeping MQ Channels Up and Running

I would investigate why you have two different channels using the same transmission queue. You mention that this is in a QSG. Are the transmission queues shared or private? Is the running disposition of the channels shared or private (i.e. is the SyncQ in use a shared or private queue).

Cheers,
Morag

------------------------------
Morag Hughson
MQ Technical Education Specialist
MQGem Software Limited
Website: https://www.mqgem.com

13. RE: Possible use of BATCHHB for in-doubt channels

IBM Champion

Morag Hughson

Posted Thu April 27, 2023 04:46 AM

Hi Norbert,

Yes, it sounds like you have understood fully, and your changes will make your channels run much more smoothly. I am glad you also found the appropriate sections in IBM Docs to explain it once you knew what you were looking for.

Sounds like the extra listener was the perfect solution for you, not too disruptive and yes, separating clients and QMgr-QMgr channels in a QSG environment is definitely a good thing to do.

If you're not already full to the brim with information on this subject, this blog post might also be a good read:-

SVRCONNs and INDISP(SHARED) listeners

Glad I was able to help.

Cheers,
Morag

------------------------------
Morag Hughson
MQ Technical Education Specialist
MQGem Software Limited
Website: https://www.mqgem.com
------------------------------

Original Message

Original Message:
Sent: Thu April 27, 2023 02:20 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels

Damn, i think i got it now !
All of our QSG's have definitions like START LISTENER TRPTYPE(TCP) PORT(1414) IPADDR(QSG2.Nuremberg.DE) INDISP(QMGR)
There is this page Shared channels in the docs (very useful and clarifying !).

And i found some notes in our team recordings(2012) regarding how to manage the listeners:
Make listeners INDISP(QMGR) for client connections, INDISP(GROUP) is not useful for them !
So we switched to this topology.

Instead jut changing the INDISP we should have added a new listener for QSG inter-communication :
START LISTENER TRPTYPE(TCP) PORT(nnnn) IPADDR(QSG2.Nuremberg.DE) INDISP(GROUP)
and adjust all CONNAME attributes regarding QSG2.Nuremberg.DE for channels type SDR/SVR/CLUSSDR to the new established port.
This is described in section "Configuring SVRCONN channels for a queue sharing group" of Shared channels:
The optimal configuration for SVRCONN channels in a queue sharing group is to set up private listeners in each CHINIT which use a different port number from the point to point channels.

Fortunately this looks like a smooth transition since those point to point channel connections between qmgrs are in our responsibility and don't bother clients.

Thank you very much, @Morag Hughson , as always your tips and hints are very useful.
Hopefully i did understand it :-)

Regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany

Original Message:
Sent: Wed April 26, 2023 10:01 AM
From: Morag Hughson
Subject: Possible use of BATCHHB for in-doubt channels

I agree that you likely have an architectural problem. You appear to be using an IP Address in your SDR channel CONNAME that is a DVIPA / Sysplex Distributor or some other type of address that represents all the members of QSG2 and thus each time the SDR reconnects it is routed to one of the members. However the targeted port number would appear to be the QSG2 INDISP(QMGR) port number rather than the INDISP(GROUP) port number as evident by the contents of your SYNCQ. The RCVR channel is providing its partner name as the QMgr name and not the QSG name.

This means the SDR has an in doubt batch with a partner QMgr and if it ends indoubt, retries, and connects to a different member of QSG2 it cannot continue because it is indoubt with someone else. This is what your CSQX507E messages are all about.

If you are going to use such an IP Address it must target an INDISP(GROUP) listener port.

Cheers,
Morag

------------------------------
Morag Hughson
MQ Technical Education Specialist
MQGem Software Limited
Website: https://www.mqgem.com

Original Message:
Sent: Wed April 26, 2023 08:08 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels

Hi @Morag Hughson ,

regarding the in-doubt situation:
Different channels isn't the case here, there is only this one channel QSG1.QSG2 using xmitq QSG2 (prooved by MO71 :-) ).
SYSTEM.CHANNEL.SYNCQ is QSGDISP(QMGR), so also private.

But I admit that we possibly have an architectural problem, because we have had this incident occasionally:
When one of LPA1 to LPA8 was re-IPLed for maintenance, it took some minutes until the receiving qmgr was available again.
This happened at weekends and only for a few minutes, so the in-doubt channels vanished afterwards (for sure; my colleague and I observed this once).
But this time, last Friday, LPA1 was down for 22 hours.

Some more information about our configuration:
QSG1 has members MQ11 to MQ81 across all 8 LPARs.
QSG2 has members MQ12 to MQ82 across all 8 LPARs.
Both channels and xmit queues are private (I suppressed superfluous attributes), here MQ61 for example:

DEFINE CHANNEL('QSG1.QSG2') +
CHLTYPE(SDR) +
QSGDISP(GROUP) +
CONNAME('QSG2.Nuremberg.de') +
XMITQ('QSG2') +
MAXMSGL(4194304) +
HBINT(300) +
KAINT(AUTO) +
DISCINT(6000) +
SEQWRAP(999999999) +
REPLACE

DEFINE QLOCAL('QSG2') +
QSGDISP(GROUP) +
USAGE(XMITQ) +
INDXTYPE(NONE) +
STGCLASS('XMIT') +
MAXDEPTH(999999999) +
MAXMSGL(4194304) +
DEFPRTY(0) +
DEFPSIST(NO) +
DEFPRESP(SYNC) +
DEFREADA(NO) +
DEFBIND(OPEN) +
MSGDLVSQ(PRIORITY) +
PUT(ENABLED) +
GET(ENABLED) +
NOHARDENBO +
BOTHRESH(0) +
NOSHARE +
DEFSOPT(EXCL) +
RETINTVL(999999999) +
PROPCTL(COMPAT) +
TRIGGER +
INITQ('SYSTEM.CHANNEL.INITQ') +
TRIGTYPE(FIRST) +
TRIGMPRI(0) +
TRIGDPTH(1) +
TRIGDATA('QSG1.QSG2') +
REPLACE

Shortly before the z/OS maintenance, LPA1 was shut down together with MQ12, which was the receiving partner for sender MQ61 at that time.
Here is the joblog of MQ61CHIN:
16:08:35.900 CSQX599E M61P CSQXRCTL Channel P08P.P01P ended abnormally
16:08:35.900 CSQX206E M61P CSQXRCTL Error sending data,
channel QSG1.QSG2
connection qsg2 (1.2.3.4)
(queue manager MQ12)
TRPTYPE=TCP RC=0000008C reason=76697242
16:08:46.410 CSQX599E M61P CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:08:46.410 LPA6 M61PCHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ22)
16:09:47.310 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:09:47.310 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ72)
16:10:48.210 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ82)
16:10:48.220 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:11:49.190 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:11:49.190 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ42)
16:12:52.200 CSQX599E MQ61 CSQXRCTL Channel QSG1.QSG2 ended abnormally
16:12:52.200 LPA6 MQ61CHIN CSQX507E CSQX507E MQ61 CSQXRCTL Channel QSG1.QSG2 is in-doubt,
connection MQ12
(queue manager MQ72)

I had a look at the messages on SYSTEM.CHANNEL.SYNCQ. There are as many entries for QSG1.QSG2 as there are members of QSG2 that MQ61 has ever connected to.
That is understandable, as MQ61 has to save channel status information such as MSGSEQNO.
MQ61 tries all the other QSG members round-robin (as presumed) but always mentions the original qmgr MQ12 in CSQX206E.
That confuses me...

Best regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany

Original Message:
Sent: Tue April 25, 2023 05:57 PM
From: Morag Hughson
Subject: Possible use of BATCHHB for in-doubt channels

Hi Norbert,

The CSQX507E message, an example shown in full below as a reminder to other readers, is indicating that when a channel tried to start it discovered that another batch of messages for a different channel from this same transmission queue was already in-doubt.

+CSQX507E cpf CSQXRCTL Channel CSQ1.TO.CSQ2.T01 is in-doubt, connection CSQ2 (queue manager ????)
+CSQX599E cpf CSQXRCTL Channel CSQ1.TO.CSQ2.T02 ended abnormally

I would ask why you have more than one channel using the same transmission queue. This is an unusual situation to be in.

You also ask about Batch Heartbeat Interval. This is a helpful additional flow added to the channel protocol where the sender channel will check that the partner is still there before marking the batch in-doubt and proceeding with the end of batch processing. This is helpful in situations where you have an unstable network. It is mainly helpful in clustering situations where, if the messages had not become part of an in-doubt batch, they would have been reassigned to another channel to send somewhere else. For non-cluster channels, the messages will be moved by the same channel once the network comes back, so there is little benefit to be gained.
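
For readers who want to try the attribute anyway, here is a minimal sketch. BATCHHB is specified in milliseconds, and 0 means batch heartbeating is not used. The value 1000 is only an illustrative assumption, not a recommendation, and the channel name is the one used later in this thread.

```
* Illustrative value only: heartbeat the partner before committing a batch
* if nothing has been received from it within the last 1000 ms
ALTER CHANNEL('QSG1.QSG2') CHLTYPE(SDR) +
      QSGDISP(GROUP) +
      BATCHHB(1000)
```

The new value is picked up the next time the channel starts.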

For more reading on this, try page 17 of Keeping MQ Channels Up and Running

I would investigate why you have two different channels using the same transmission queue. You mention that this is in a QSG. Are the transmission queues shared or private? Is the running disposition of the channels shared or private (i.e. is the SyncQ in use a shared or private queue)?

Cheers,
Morag

------------------------------
Morag Hughson
MQ Technical Education Specialist
MQGem Software Limited
Website: https://www.mqgem.com

Original Message:
Sent: Mon April 24, 2023 04:08 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels

Hi folks,
we recently had some CSQX507E events (again) in our Queue Sharing Groups on z/OS.
Our topology:
- production LPARs LPA1-LPA8
- MQ level V9.2
- two QSGs, call them QSG1 and QSG2, each with 8 members spread over the LPARs
- Sender/Receiver group channels QSG1.QSG2 and group xmit queues QSG2

The CSQX507E events happened during a z/OS change.
Some of our batch schedules had problems with this situation, since their messages got stuck in the xmit queues.
In the end we had to resolve the situation manually.
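
For readers in the same situation, manual resolution typically follows the pattern sketched below. The commands are standard MQSC, but whether COMMIT or BACKOUT is right depends on comparing the saved status at both ends, so the ACTION shown is only a placeholder.

```
* On the sending queue manager: show the saved status of the in-doubt channel
DISPLAY CHSTATUS('QSG1.QSG2') SAVED ALL

* On the receiving queue manager the batch was sent to: note its LSTLUWID
DISPLAY CHSTATUS('QSG1.QSG2') SAVED ALL

* Back on the sender: COMMIT if the receiver's LSTLUWID matches the
* sender's LSTLUWID, otherwise use ACTION(BACKOUT) so the batch is resent
RESOLVE CHANNEL('QSG1.QSG2') ACTION(COMMIT)
```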

In this KC topic the BATCHHB (Batch Heartbeat Interval) attribute is mentioned.
We are considering using it but have no experience with this attribute,
so we do not know which value would be suitable.

Regards, Norbert

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany
------------------------------

14. RE: Possible use of BATCHHB for in-doubt channels

Norbert Pfister

Posted Fri January 19, 2024 04:50 AM

At the moment we are "in full swing" changing our whole topology.
We have several tasks to do and are rolling them out from stage to stage:
Lab -> Dev -> QA -> Preproduction -> Production

- Established new listeners to isolate qmgr-to-qmgr (and QSG) connections from client connections
- Changed the port for sender/cluster channels accordingly
- Created CF structures for the new xmit queues

Now we wanted to create the new xmit queues, e.g.:
DEFINE QLOCAL('QSG2') +
QSGDISP(SHARED) +
DESCR(' ') +
CFSTRUCT('XMIT') +
CLUSTER(' ') +
CLUSNL(' ') +
CLWLRANK(0) +
CLWLPRTY(0) +
CLWLUSEQ(QMGR) +
USAGE(XMITQ) +
CLCHNAME(' ') +
STREAMQ(' ') +
STRMQOS(BESTEF) +
INDXTYPE(NONE) +
STGCLASS('XMIT') +
MAXDEPTH(999999999) +
MAXMSGL(4194304) +
DEFPRTY(0) +
DEFPSIST(NO) +
DEFPRESP(SYNC) +
DEFREADA(NO) +
DEFBIND(OPEN) +
MSGDLVSQ(PRIORITY) +
PUT(ENABLED) +
GET(ENABLED) +
NOHARDENBO +
BOTHRESH(0) +
BOQNAME(' ') +
SHARE +
DEFSOPT(SHARED) +
RETINTVL(999999999) +
PROPCTL(COMPAT) +
CUSTOM(' ') +
TRIGGER +
INITQ('SYSTEM.CHANNEL.INITQ') +
PROCESS(' ') +
TRIGTYPE(FIRST) +
TRIGMPRI(0) +
TRIGDPTH(1) +
TRIGDATA('QSG1.QSG2') +
QDEPTHHI(80) +
QDEPTHLO(40) +
QDPMAXEV(ENABLED) +
QDPHIEV(DISABLED) +
QDPLOEV(DISABLED) +
QSVCINT(999999999) +
QSVCIEV(NONE) +
STATQ(QMGR) +
ACCTQ(QMGR) +
MONQ(QMGR) +
REPLACE
That brings us to the following considerations:

The attributes SHARE and DEFSOPT(SHARED) should be a logical consequence, since the queues are now shared between the members of the QSG (instead of NOSHARE and DEFSOPT(EXCL) for a private xmit queue).
We normally have INDXTYPE(NONE) for xmit queues, but my colleague stumbled over the following entry in the Documentation Center:
Migrating non-shared queues to shared queues
Note:
1. Messages on shared queues are subject to certain restrictions on the maximum message size, message persistence, and queue index type, so you might not be able to move some non-shared queues to a shared queue.
2. You must use the correct index type for shared queues. If you migrate a transmission queue to be a shared queue, the index type must be MSGID.

Does changing the INDXTYPE have any further consequences?
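
Going by the note quoted above, the shared definition would have to differ from our draft in the index type. A cut-down sketch, showing only the attributes relevant to this question and assuming everything else stays as in the full definition:

```
* Cut-down sketch: a shared xmit queue with the index type the note asks for
DEFINE QLOCAL('QSG2') +
       QSGDISP(SHARED) +
       CFSTRUCT('XMIT') +
       USAGE(XMITQ) +
       INDXTYPE(MSGID)
```

REPLACE is left out on purpose here, so that this shortened form cannot overwrite an existing queue with default attributes.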

Best regards,

------------------------------
Norbert Pfister
system engineer
Nuremberg
Germany
------------------------------

Original Message:
Sent: Thu April 27, 2023 04:45 AM
From: Morag Hughson
Subject: Possible use of BATCHHB for in-doubt channels

Hi Norbert,

Yes, sounds like you have understood fully and your changes will make your channels run much more smoothly. I am glad you also found the appropriate sections in IBM Docs to explain it once you knew what you were looking for.

Sounds like the extra listener was the perfect solution for you, not too disruptive and yes, separating clients and QMgr-QMgr channels in a QSG environment is definitely a good thing to do.

If you're not already full to the brim with information on this subject, this blog post might also be a good read:-

SVRCONNs and INDISP(SHARED) listeners

Glad I was able to help.

Cheers,
Morag

------------------------------
Morag Hughson
MQ Technical Education Specialist
MQGem Software Limited
Website: https://www.mqgem.com

Original Message:
Sent: Thu April 27, 2023 02:20 AM
From: Norbert Pfister
Subject: Possible use of BATCHHB for in-doubt channels