Following on from the blog post at https://community.ibm.com/community/user/integration/blogs/trevor-dolby/2022/11/17/ace-and-mq-reconnect-overview, this post describes in greater detail one useful tool when working with ACE and MQ in containers: MQ's automatic reconnect capability.
ACE servers running in containers often use a "remote default queue manager" to interact with MQ, as this allows flows to be moved from existing integration nodes without needing every MQ node to be reconfigured to use MQEndpoint policies. However, the configuration aspect of the move is not the only thing that needs to be considered: containers (including MQ containers) have much shorter lifetimes than traditional VMs, and so handling "connection broken" errors is much more important in the container world. To help with this, ACE provides a way to enable MQ's automatic client reconnect option so that flows see fewer errors (and in many cases none at all).
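For reference, a remote default queue manager is configured by pointing the server at an MQEndpoint policy rather than at a locally-installed queue manager; a minimal server.conf.yaml sketch (with an illustrative policy project and policy name) looks like this:

# Use the RemoteMQ MQEndpoint policy in the MQPolicies project as the default queue manager
remoteDefaultQueueManager: '{MQPolicies}:RemoteMQ'

In CP4i, server.conf.yaml overrides of this kind are normally supplied through a Configuration object rather than by editing the file directly.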
This post uses an application from https://github.com/trevor-dolby-at-ibm-com/ace-mq-reconnect to illustrate the reconnect option in action, building on the ACE and MQ blog posts describing how to configure and create a queue manager and ACE server in CP4i. As the capabilities are built into the products themselves, CP4i is not required for the examples to work, but is a convenient way to create the required servers and queue managers.
Automatic reconnect as seen from an HTTP client
Background
HTTP-driven ACE integrations provide a good example of how automatic reconnect can help in the container world. However, to understand how the interactions change with the different options, it is best to start with a classic multi-instance setup using a local default queue manager (without any reconnect technology at all), such as the following example of HTTP input/reply flows that interact with MQ:

Under normal conditions, the HTTP client will connect to the active ACE server (managed by an integration node in this case), which will be interacting with the local queue manager to process messages. If the queue manager fails for any reason, then the active server will notice this and exit once it has stopped all of the flows; the flows may see MQ errors for a short time before the server exits, with reason codes 2009 (MQRC_CONNECTION_BROKEN) and/or 2195 (MQRC_UNEXPECTED_ERROR) showing that the connection to the queue manager has gone away. As the server is about to exit, however, the MQ errors are often not sent back to the client, so the client will often see no data returned until the connection is closed when the server shuts down.
At this point the HTTP client will not be able to connect to either system (even if the client has multiple addresses to connect to from a DNS entry with multiple A records) because there are no integration servers running on either machine. Once the queue manager has restarted on the standby VM, the integration node will start the server and the flows will become available again. The sequence will look as follows:

Containers
For remote default queue manager scenarios in container systems, the queue manager is in a separate set of containers, and the ACE server has credentials to connect to the queue manager (or set of queue managers, depending on the chosen setup). In this example, a single resilient queue manager is used:

and MQ’s native HA technology could be used to speed up failover:
If the queue manager becomes unavailable, then in the single-queue-manager case it will need to be restarted, while with native HA one of the other replicas will take over with no data loss. In both cases, there will be a break in the client connection from the ACE server to the queue manager (reason code 2009). By default, this will cause the server to exit, and the container will be restarted by Kubernetes; the server will only restart successfully once the queue manager is available again.
As native HA recovers more quickly, the gap in that case is smaller than in the previous picture:
However, regardless of the HA option chosen, this still leaves clients with a lot of error handling and retries, and also involves restarting the ACE server, which may itself be a somewhat slow process depending on how many flows are deployed and their startup actions. Simply telling the server not to exit (by setting the server.conf.yaml parameter stopIfDefaultQMUnavailable to false) changes the errors but does not eliminate the gap:
Keeping the server from exiting can also have unexpected consequences if Aggregate or other stateful nodes are used in the flows; the server exits by default to ensure consistent state in all threads in the server.
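For reference, that behaviour is controlled by a single setting; a minimal server.conf.yaml sketch (shown here at the top level of the file) would be:

# Keep the server running when the connection to the default queue manager is lost
stopIfDefaultQMUnavailable: false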
Using MQ’s automatic reconnect capability allows flows to simply delay their MQ operations until the queue manager has recovered. This is of limited use in cases where queue manager recovery can take minutes, but with the expected native HA takeover time measured in seconds, the picture looks more like this:
There are various configuration parameters that can help the queue manager to recover more quickly (see the https://www.ibm.com/docs/en/ibm-mq/9.2?topic=ha-advanced-tuning-native topic in the IBM MQ docs), but the default should be sufficient for many cases. Using MQ clustering can also help with reconnect time, as the ACE server can connect to another (still available) queue manager in the cluster without waiting for the original queue manager to restart.
ACE can be configured to take advantage of MQ’s automatic reconnect in one of several ways:
- An MQEndpoint policy can explicitly specify the reconnect option as "enabled" (or "queueManager" for same-QM only).
- An MQEndpoint policy can specify the reconnect option as "default" and use a client connection channel with DEFRECON set to YES.
- The default reconnect option (and associated timeouts) can be set using the mqclient.ini file, as sketched below; for more details, see https://www.ibm.com/docs/en/ibm-mq/9.2?topic=file-channels-stanza-client-configuration in the IBM MQ docs.
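As an example of the third option, the CHANNELS stanza of mqclient.ini could look something like the following sketch (the timeout value is illustrative, and the file needs to be picked up by the ACE server's MQ client, for example via the MQCLNTCF environment variable):

# mqclient.ini fragment enabling automatic reconnect by default
CHANNELS:
   DefRecon=YES
   # Optionally limit how long the client keeps trying to reconnect (seconds)
   MQReconnectTimeout=600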
Example showing reconnect in action
The example builds on top of the configuration described in the ACE and MQ blog posts by adding in the ReconnectDemo application from https://github.com/trevor-dolby-at-ibm-com/ace-mq-reconnect to put HTTP messages to an MQ queue. We assume the existence of the QUICKSTART queue manager with a queue called BACKEND (from the MQ blog post) and the availability of the github-barauth, ace-mq-policy, and remote-mq configurations (from the ACE blog post) so the demo application can be deployed and run successfully.
The demo flow is relatively simple, with one MQOutput node connecting to the remote queue manager:

Running without reconnect
The initial tests will use the existing configuration without reconnect enabled and will show the client-visible failures when the queue manager restarts. The ReconnectDemo application can be deployed in one of two ways:
- Using the ACE dashboard, with the application BAR either being built locally or pulled in from https://github.com/trevor-dolby-at-ibm-com/ace-mq-reconnect/raw/main/ReconnectDemo/ReconnectDemo.bar and the github-barauth, ace-mq-policy, and remote-mq configurations specified during the server creation.
- Using the command line (either kubectl or oc) with the appropriate namespace:
kubectl apply -n cp4i -f https://github.com/trevor-dolby-at-ibm-com/ace-mq-reconnect/raw/main/ReconnectDemo/IS-github-bar-ReconnectDemo-no-reconnect.yaml
Once the application is deployed, use "oc get route" to find the HTTP URL for the "reconnect-demo" application and then use curl (or a browser) to invoke the service:
tdolby@hostname:~$ curl http://reconnect-demo-http-ace.apps.cp4i-demo.xxx.com/HTTPInPutNoPause
{"result":"success"}
This command could be run repeatedly while the queue manager is restarted, but this is more easily achieved with a script; run-curl.sh in the repo will call the service once per second, printing out the call time and also the time at which curl received a response.
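A minimal version of such a loop might look like the following (an illustrative sketch using the URL from above; the actual run-curl.sh in the repo may differ):

#!/bin/bash
# Call the demo service once per second, printing the time of each call and of each response
URL=http://reconnect-demo-http-ace.apps.cp4i-demo.xxx.com/HTTPInPutNoPause
while true; do
  echo "Running curl at $(date)"
  echo "$(curl -s $URL) at $(date)"
  sleep 1
done

Running the script produces output like this: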
tdolby@hostname:~$ /tmp/run-curl.sh
Running curl at Wed 16 Nov 2022 09:32:14 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:32:14 PM UTC
Running curl at Wed 16 Nov 2022 09:32:15 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:32:15 PM UTC
and at this point the queue manager can be restarted to see how the curl output changes. Restarting the queue manager can be achieved by deleting the pod using either the OpenShift console or the command line (using the appropriate namespace):
kubectl delete pod -n cp4i quickstart-cp4i-ibm-mq-0
Once the pod is deleted, ACE should start to see errors from MQ API calls, and the curl output will change:
Running curl at Wed 16 Nov 2022 09:32:16 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:32:16 PM UTC
Running curl at Wed 16 Nov 2022 09:32:17 PM UTC
<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>
at Wed 16 Nov 2022 09:32:21 PM UTC
where the response from the OpenShift router changed to a 502 and also took four seconds to return (curl started at 09:32:17 and finished at 09:32:21) instead of the usual sub-second response time. The ACE server is in the process of restarting due to the broken MQ connection, which leads to an invalid or incomplete response being returned to the router and hence the 502 sent back to the client.
The curl output changes again while the ACE server restarts:
Running curl at Wed 16 Nov 2022 09:32:22 PM UTC
[html trimmed]
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.
[more html trimmed]
at Wed 16 Nov 2022 09:32:26 PM UTC
and this continues until the service becomes available again, 37 seconds after the queue manager was shut down:
Running curl at Wed 16 Nov 2022 09:32:54 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:32:54 PM UTC
As can be seen, the HTTP client sees errors during the restart of the queue manager, and a real application (rather than curl) would have to handle the errors by waiting and retrying. The recovery time could be reduced by using native HA (this demo is using a single resilient queue manager) but not reduced to zero. MQ's automatic reconnect capability, however, can virtually eliminate the errors, and the next section illustrates this in action.
With reconnect
To enable reconnection, the RemoteMQ policy described in the MQ blog post needs to be changed and then deployed as a new configuration called "ace-mq-policy-reconnect". The key change is setting "reconnectOption" to "enabled" as shown in the UI
and in the policyxml file:
<reconnectOption>enabled</reconnectOption>
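For context, a complete MQEndpoint policyxml with reconnect enabled might look something like the following sketch; the queue manager name matches the QUICKSTART queue manager used in this post, but the hostname, port, channel, and security identity shown here are illustrative and need to match your own MQ configuration:

<?xml version="1.0" encoding="UTF-8"?>
<policies>
  <policy policyType="MQEndpoint" policyName="RemoteMQ" policyTemplate="MQEndpoint">
    <connection>CLIENT</connection>
    <destinationQueueManagerName>QUICKSTART</destinationQueueManagerName>
    <!-- hostname, port, channel, and securityIdentity below are placeholders -->
    <queueManagerHostname>quickstart-cp4i-ibm-mq</queueManagerHostname>
    <listenerPortNumber>1414</listenerPortNumber>
    <channelName>ACE_SVRCONN</channelName>
    <securityIdentity>mq-credentials</securityIdentity>
    <!-- the key change for this post: enable MQ automatic client reconnect -->
    <reconnectOption>enabled</reconnectOption>
  </policy>
</policies>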
Once this new policy has been deployed under the new name, the ReconnectDemo application can be reconfigured to use the new policy. This can be achieved using the dashboard or by running the following command (with the appropriate namespace):
kubectl apply -n cp4i -f https://github.com/trevor-dolby-at-ibm-com/ace-mq-reconnect/raw/main/ReconnectDemo/IS-github-bar-ReconnectDemo-with-reconnect.yaml
A diff of the two YAML files shows that the policy is the only thing that changed, as no changes are needed to the application itself to enable reconnect.
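Assuming the standard IntegrationServer custom resource layout (with the configurations listed under spec.configurations), the relevant part of that diff would look roughly like the following sketch rather than being the exact output:

     configurations:
       - github-barauth
-      - ace-mq-policy
+      - ace-mq-policy-reconnect
       - remote-mq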
Once the ACE server has restarted in order to use the new policy, the same run-curl.sh script can be run, and the queue manager restarted. This leads to results that look like this:
tdolby@hostname:~$ /tmp/run-curl.sh
Running curl at Wed 16 Nov 2022 09:25:58 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:25:58 PM UTC
Running curl at Wed 16 Nov 2022 09:25:59 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:26:00 PM UTC
Running curl at Wed 16 Nov 2022 09:26:01 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:26:34 PM UTC
Running curl at Wed 16 Nov 2022 09:26:35 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:26:35 PM UTC
Running curl at Wed 16 Nov 2022 09:26:36 PM UTC
{"result":"success"} at Wed 16 Nov 2022 09:26:36 PM UTC
where the queue manager was restarted around 09:26:01, leading to the delayed response to that request: the response arrived at 09:26:34 without any errors being received.
The delay could be reduced from 33 seconds by enabling native HA and/or changing tuning parameters (or using an MQ cluster), but the client did not see any errors during the restart and so enabling reconnect may be sufficient for many applications.
Summary
MQ automatic client reconnect provides a way to shield clients of ACE flows from MQ queue manager restarts and failovers, significantly reducing (and in many cases eliminating) the MQ-related errors sent back to those clients. No application changes are needed in most cases, as the option can be enabled when the application is deployed by changing the MQEndpoint policy.