With the general move to containerization across the industry, ACE is more frequently deployed to use MQ queue managers remotely, and this can bring challenges when migrating from existing infrastructure that relied on co-location of MQ and ACE (and predecessor products such as IIB). Quite a few questions have come up about how ACE manages MQ connections in the container world, and how reconnection takes place (if it does) when MQ becomes unavailable for any reason. This post explains some of the key behaviors in the hope that this will help people moving into containers.
Those with long familiarity with ACE and IIB might want to skip to the scenarios.
Background
The original "broker" product was MQSI (MQSeries Integrator) and originally required a local queue manager to be configured, relying on it for admin communication as well as business messaging. A broker would fail to start if the queue manager was not available, and would shut down if the queue manager was stopped. The use of MQ for admin purposes meant that there was always a thread waiting in an MQGET and so broken connections were detected almost immediately, and also meant that there was no way to communicate with an execution group or broker if MQ was not available. In effect, MQSI was considered an extension of the queue manager, and in fact non-MQ transport nodes such as HTTP were not introduced until version 5.
The broker queue manager was also the only queue manager usable by the MQ nodes, and as a result MQ traffic had to be routed to the broker queue manager for processing unless the JMS nodes were used instead of the MQ nodes. This meant MQ topologies had to be built around the broker rather than the broker being able to go where the messages were, and it was inconvenient enough that IBM Integration Bus (IIB) v10 lifted the restriction with new capabilities to use multiple local queue managers as well as remote queue managers. New MQEndpoint policies allowed flow writers and administrators to direct the MQ nodes to queue managers of their choosing, allowing better integration with existing MQ networks.
IIB v10 still required the broker default queue manager to be local, however, which proved somewhat limiting with the move to container-based systems: flows needed to be changed in order to work with remote queue managers provisioned in separate containers, and some nodes (such as the Aggregate and Collector nodes) could not be run without a default queue manager (which had to be local). Running a queue manager in the same container as the ACE server to allow local access is possible, but it requires the configuration of persistent storage and also means managing ACE containers as persistent resource managers: ACE is much closer to a transient application runtime, with the usual scale-on-demand requirements, than to a resource manager, while MQ requires much more careful state management.
ACE v11 introduced the idea of remote default queue managers to enable flows to run unchanged in containers without the need to modify every flow or provide a local queue manager. The remote default would be used in nearly all the same places as the local default had been (see here for details), allowing Aggregate and Collector nodes (and others) to be used:
Non-default queue manager connections are also supported, allowing a single server to connect to multiple queue managers should that be needed. A default queue manager can be either local or remote, but it is not possible to configure both: there can be only one “default” that is used when other options are not specified.
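As a minimal sketch of how this looks in server.conf.yaml (the queue manager and policy names below are placeholders, and the comments in the file shipped with your ACE version are the definitive reference for the property names and formats), the default is selected by setting at most one of the two properties:

```yaml
# server.conf.yaml fragment - set at most ONE of the two default-QM properties.

# Local default queue manager: the named QM must be available locally.
#defaultQueueManager: 'LOCALQM'

# Remote default queue manager: reference an MQEndpoint policy (which holds the
# client connection details) in the form '{PolicyProjectName}:PolicyName'.
remoteDefaultQueueManager: '{MQPolicies}:RemoteDefaultQM'
```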
As ACE and IIB no longer use MQ for internal admin communication, servers can stay up in many cases when MQ becomes unavailable; some forms of high availability (HA) such as multi-instance still require integration nodes to shut down when the queue manager terminates. In other cases the servers can stay up and either retry the connection themselves or rely on the MQ client to reconnect automatically; subsequent sections describe this in more detail.
ACE v12 MQ connections and reconnections
ACE responds to broken queue manager connections by either retrying the connection or shutting the server down; in the latter case, container infrastructure is expected to restart independent servers. If a queue manager connection is broken on one thread in a server, the queue manager is marked as potentially down for all threads, giving the other threads a chance to test their connections before using them; if the test fails, a new connection can be made without errors being returned to the flow.
Different connection options exist for integration servers:
Local default QM
- Used by message flow nodes that have no other MQ connection specified, including Aggregate and Collector nodes.
- Independent servers start even if the QM is unavailable, and stay up if the queue manager goes away. Errors detected when QM is used.
- Integration nodes will not start if the QM is unavailable, and shut down if the queue manager goes away. Errors detected when the QM is used or when a polling thread detects the QM has shut down.
Other local QM
- Used by message flow nodes that specify a queue manager using an MQEndpoint policy or node properties.
- Servers start even if the QM is unavailable, and stay up if the queue manager goes away for both independent and node-associated servers. Errors detected when QM is used.
Remote default QM
- Used by message flow nodes that have no other MQ connection specified, including Aggregate and Collector nodes.
- Not available for integration nodes (as of ACE 12.0.6).
- Independent servers will not start if the QM is unavailable, and shut down if the queue manager goes away. Errors detected when QM is used or when a polling thread detects the QM has shut down.
- Setting stopIfDefaultQMUnavailable to false (default is true) in server.conf.yaml causes the server to stay up if the QM becomes unavailable, but is not advised for use with Aggregate or Collector nodes as internal state may not remain in sync with the QM data in those cases.
Other remote QM
- Used by message flow nodes that specify a queue manager using an MQEndpoint policy or node properties (a policy sketch follows this list).
- Servers start even if the QM is unavailable, and stay up if the queue manager goes away for both independent and node-associated servers. Errors detected when QM is used.
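To make the "other remote QM" case concrete, the sketch below shows an MQEndpoint policy of the kind a flow node (or the remote default setting above) could reference; the element names and values are illustrative only and should be checked against the MQEndpoint policy documentation for the ACE version in use:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- RemoteQM.policyxml in a policy project; names and values are placeholders -->
<policies>
  <policy policyType="MQEndpoint" policyName="RemoteQM" policyTemplate="MQEndpoint">
    <connection>CLIENT</connection>
    <destinationQueueManagerName>QM1</destinationQueueManagerName>
    <queueManagerHostname>mq.example.com</queueManagerHostname>
    <listenerPortNumber>1414</listenerPortNumber>
    <channelName>ACE.SVRCONN</channelName>
  </policy>
</policies>
```

MQ nodes can then be pointed at this policy from their MQ connection properties, or the policy can be named as the remote default in server.conf.yaml as shown earlier.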
MQ automatic client reconnect overview
IBM MQ’s client library has an automatic reconnect capability that allows MQ API calls (MQGET, MQPUT, etc) to reconnect to an MQ queue manager without returning an error code to the calling application, subject to timeouts and transaction considerations. This allows applications to interact with highly-available queue managers or clusters of queue managers without needing complicated error-handling code, as the client ensures that the most common errors are handled automatically. Serious errors (such as a whole queue manager cluster being deleted) would still result in the application seeing error codes, but short-lived outages would not be visible. See https://www.ibm.com/docs/en/ibm-mq/9.2?topic=restart-automatic-client-reconnection for more details on automatic reconnect and associated configuration options.
The reconnect capability is often transparent to ACE flows as it mainly results in MQ succeeding when it would otherwise have returned errors, but in some cases the flow will see errors due to broken connections if they occur in the middle of a transaction (and in some other cases). For the non-transactional cases such as sending out-of-syncpoint logging messages to a queue, the reconnect will be invisible as long as the server connection can be restored quickly enough. Flows will usually see MQRC_BACKED_OUT (2003) if the reconnect happens in the middle of a transaction, as the live state of the transaction will not always be preserved across a failover or other reconnect, and can then retry the message flow with the expectation of success.
Longer delays in flow node operation are another visible effect of automatic reconnection, where the flow will appear to be stuck in a node for an extended period of time while MQ reconnects. The maximum time delay is controlled by the MQ reconnect options (see doc link above), but as long as the delay is not longer than (for example) an HTTPInput node timeout, then no errors will be returned to the client.
See https://www.ibm.com/docs/en/app-connect/12.0?topic=properties-mqendpoint-policy for the description of the reconnectOption settings in ACE.
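Assuming an MQEndpoint policy like the sketch shown earlier, enabling automatic reconnection is a single additional property; the permitted values (the documentation linked above lists options along the lines of default, disabled, queueManager and anyQueueManager) should be confirmed for the version in use:

```xml
<!-- Added inside the MQEndpoint policy sketched earlier; the value is illustrative -->
<reconnectOption>queueManager</reconnectOption>
```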
Reconnect scenarios for remote queue managers
The following scenarios show a highly-available queue manager failing over at key points during ACE flow operation, illustrating different aspects of reconnection.
HTTPInput and MQOutput
This scenario involves a fire-and-forget MQOutput node, putting a message to an MQ queue without using a transaction (HTTP being non-transactional). Without MQ reconnect, the HTTPInput node will see an exception thrown from the MQOutput node if the queue manager has failed over:
With MQ reconnect enabled, the MQ client will detect the connection failure and switch to the live replica:

This scenario is described in more depth in https://community.ibm.com/community/user/integration/blogs/trevor-dolby/2022/11/16/ace-and-mq-reconnect-as-seen-from-an-http-client including an example that can be run to show the behavior when a queue manager is restarted.
MQInput node
MQInput nodes appear to reconnect even without MQ reconnect being enabled, which has caused confusion in the past. As mentioned above, this happens because ACE itself detects the connection failure and either shuts the server down or else continues to retry. The latter sequence, with ACE retrying, looks as follows:
With MQ reconnect enabled, ACE does not see the error at all unless the queue manager takes too long to fail over, and so continues to wait as normal even in cases (such as using a remote default queue manager) which would normally result in the server shutting down:

MQInput and MQOutput to a remote default queue manager
Combining the MQInput and MQOutput nodes with MQ reconnect allows the server to avoid shutting down if the connection to the queue manager breaks due to transient conditions such as failover. Without MQ reconnect, the ACE server would detect the MQRC_CONNECTION_BROKEN (2009) and shut down, but with reconnect enabled the most severe error the server would see is an MQRC_BACKED_OUT (2003) for transactional operations, and the flow would simply retry the message. In cases where the failover happened while the MQInput node was waiting for messages, or when all operations are non-transactional, no errors would be seen at all.
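A possible configuration for this scenario (a sketch only, reusing the placeholder names from earlier) is a remote default queue manager whose MQEndpoint policy has reconnectOption enabled:

```yaml
# server.conf.yaml fragment - the referenced MQEndpoint policy (sketched earlier)
# has reconnectOption set, so transient failovers are handled by the MQ client.
remoteDefaultQueueManager: '{MQPolicies}:RemoteDefaultQM'

# Optional, as described above: keep the server running even if the remote
# default QM stays unavailable for longer than the reconnect timeout allows.
#stopIfDefaultQMUnavailable: false
```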
In the case where the failover happens during transactional flow operation, the sequence looks as follows:
HTTPInput and MQOutput marking queue manager as potentially offline
This scenario applies to non-default connections and also to servers with stopIfDefaultQMUnavailable set to false (assuming a remote default queue manager): the ACE MQ connector code notices that a queue manager was unavailable on one thread and then marks it as suspect on all other threads.
In this case, after the failure is detected on the first thread, the ACE MQ connector knows to test the connection on the other threads, and is able to establish a new connection without returning an error to the flow for the second thread:
Summary
As can be seen from the above, ACE and MQ provide a lot of error handling and fault tolerance in a container environment, allowing integration flow writers to focus on the messages themselves and making migration from previous topologies into containers easier. See
https://community.ibm.com/community/user/integration/blogs/trevor-dolby/2022/11/16/ace-and-mq-reconnect-as-seen-from-an-http-client for a client-centered view, with working examples to allow testing and experimentation.
Further discussion and feedback are welcome, in the comments and elsewhere.