
RDQM networking setup, best practices and diagnosing issues

By Alex Chatt

  
RDQM is a high availability (HA) and disaster recovery (DR) solution for IBM MQ, which is available under the MQ Advanced licence.  It replicates MQ queue manager data and logs to up to 5 other nodes (3 per HA group), providing the resiliency and availability that is essential in the world of messaging today. This is achieved with the help of some third-party packages: Pacemaker, Corosync, and DRBD, which provide the cluster management and data replication capabilities within RDQM.
 
To take advantage of these features, a specific network configuration and setup is needed, as Pacemaker and DRBD need to be able to “talk” to the other nodes present in the configuration. You might naturally ask questions such as “what are the best practices?” and “how do I diagnose issues I may encounter?”. 
In this blog, we try to answer these questions as we revisit the essential networking steps needed for an RDQM environment.

RDQM Quick recap

Before we go into too much detail about the networking configuration needed for RDQM, let’s have a quick recap (or introduction) on the possible setups that a user can configure within RDQM. 
There are three potential setups for an RDQM queue manager:
  • HA only, which involves a 3-node quorum configuration with synchronous data replication between them. 
  • DR only, which involves just 2 nodes with either synchronous or asynchronous replication between them.
  • HA and DR (HA/DR), which involves 6 nodes split into 2 HA groups, with asynchronous DR replication between the two groups.
Which setup you pick for a queue manager will be based on your individual needs, but there are a few things to keep in mind when choosing. One of these factors is network latency. The following list contains the recommended and maximum latencies supported for each potential RDQM setup:
  • HA: Up to 5ms latency is supported, 1-2ms latency is recommended
  • Synchronous DR: Up to 5ms latency is supported
  • Asynchronous DR: Up to 100ms latency is supported
Another factor to consider is the bandwidth of the network interface you plan to use for data replication. You should ensure that the interface has sufficient bandwidth to support the replication requirements, given the expected workload of all the replicated data queue managers you plan to create. You can split the network interfaces that you use for DR and HA data replication, but we will go into more detail on that later in the blog.
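As a rough first check that your nodes are within the latency figures above, you can look at the round-trip times reported by ping (the hostname here is illustrative):

[user1@node1 ~]$ ping -c 20 -q node2.hostname.com

The rtt min/avg/max summary printed at the end should sit comfortably within the recommended figures for the setup you are planning.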

Essential Setup

For an HA configuration, the group membership and management of the three nodes is handled by Pacemaker/Corosync, which continuously check the state of the group to detect whether a node has become unavailable. To do this, Pacemaker/Corosync communicate between the RDQM nodes using UDP on ports 5404-5407. If those ports are not open for UDP traffic, you will not be able to configure the RDQM group (and Pacemaker will not be able to monitor the group after configuration).

For HA replication, DRBD is used to replicate data to the other nodes. To achieve this, TCP traffic must be enabled on ports 7000-7100 for HA or HA/DR queue managers. Additionally, TCP traffic must be allowed on the port specified when creating DR or HA/DR queue managers, to allow data to be replicated between the DR primary and the RDQM nodes at the recovery site.


If you are using the default RHEL firewalld, a sample script is supplied with IBM MQ Advanced “MQ_INSTALLATION_PATH/samp/rdqm/firewalld/configure.sh” which can be run to add the required firewalld service rules for the HA group (UDP) and HA replication (TCP), as well as a single default IBM MQ listener (TCP port 1414).
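If you are not using the supplied script, or need to open the DR replication port afterwards, a minimal sketch using firewall-cmd (assuming the default firewalld zone, with 8653 as an illustrative DR port) would be:

[user1@node1 ~]$ sudo firewall-cmd --permanent --add-port=5404-5407/udp   # Pacemaker/Corosync heartbeats
[user1@node1 ~]$ sudo firewall-cmd --permanent --add-port=7000-7100/tcp   # DRBD HA replication
[user1@node1 ~]$ sudo firewall-cmd --permanent --add-port=8653/tcp        # DR replication port (example)
[user1@node1 ~]$ sudo firewall-cmd --reload

The same rules need to be applied on every node in the configuration.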

Best Practices

For an RDQM configuration, the HA primary and secondary/alternative heartbeat addresses, and the HA replication address, are specified in /var/mqm/rdqm.ini. These addresses must be tied to a network interface. Users have three choices when setting this up in rdqm.ini:

  1. Specify only the HA_Replication interface, which is then used for all HA heartbeat and replication traffic (our recommended approach).
  2. In addition to HA_Replication, specify an HA_Primary interface, so that the HA heartbeat traffic uses a separate interface from data replication.
  3. Additionally, specify an HA_Secondary interface, so that HA heartbeats are sent independently on two interfaces, which can provide increased redundancy and therefore increased resilience to network outages.
As noted above, we generally recommend the use of one interface over two or three, paired with network redundancy (such as link aggregation) to provide resiliency in case of network issues. The use of three interfaces (both the HA_Primary and HA_Secondary definitions) does give you some redundancy for HA heartbeats, but you won't have any redundancy for HA replication. As a result, if the data replication link goes down on one node, the queue manager won't be able to run there. If this happens on two nodes, quorum is lost, and the RDQM queue manager is unable to run anywhere within the HA group.
 
This is why it is often more beneficial to use one interface rather than three and configure redundancy at the network level, ensuring that both your heartbeat and replication traffic have some fallback in the event of a network or interface issue.
 
The following is an example of the rdqm.ini file if you choose to use all 3 interfaces/definitions:

Node:
    Name: node1.hostname.com
    HA_Primary: 192.168.4.1
    HA_Secondary: 192.168.5.1
    HA_Replication: 192.168.6.1
Node:
    Name: node2.hostname.com
    HA_Primary: 192.168.4.2
    HA_Secondary: 192.168.5.2
    HA_Replication: 192.168.6.2
Node:
    Name: node3.hostname.com
    HA_Primary: 192.168.4.3
    HA_Secondary: 192.168.5.3
    HA_Replication: 192.168.6.3

It is a good idea to weigh up what advantages you gain from using three interfaces versus two or one, and decide which configuration makes the most sense for your setup.
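For comparison, a sketch of the recommended single-interface configuration (option 1 above, using only HA_Replication and the same illustrative addresses) looks like this:

Node:
    Name: node1.hostname.com
    HA_Replication: 192.168.6.1
Node:
    Name: node2.hostname.com
    HA_Replication: 192.168.6.2
Node:
    Name: node3.hostname.com
    HA_Replication: 192.168.6.3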
 
You should ensure that the addresses on each interface are in a unique subnet. Pacemaker and DRBD expect responses to their health and connectivity checks to arrive on the same interface they were sent from. If two interfaces are within the same subnet, it is quite possible that the response packet will be routed via a different interface, which can result in false-positive actions being taken by RDQM, such as moving a queue manager to a different node.
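A quick way to sanity-check this on each node is to list the interface addresses and confirm that the addresses you plan to use in rdqm.ini each sit in a different subnet (interface names and output below are illustrative):

[user1@node1 ~]$ ip -brief addr show
eth1             UP             192.168.4.1/24
eth2             UP             192.168.5.1/24
eth3             UP             192.168.6.1/24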

Diagnosing issues

If a user tries to configure their RDQM HA group with the command “rdqmadm -c” (run on all nodes, or on one node when using passwordless SSH), they might see that every node reports itself as online but its peers as unconfigured:

[user1@node1 ~]$ rdqmstatus -n
Node node1.hostname.com is online
Node node2.hostname.com is unconfigured
Node node3.hostname.com is unconfigured
 
[user1@node2 ~]$ rdqmstatus -n
Node node1.hostname.com is unconfigured
Node node2.hostname.com is online
Node node3.hostname.com is unconfigured
 
[user1@node3 ~]$ rdqmstatus -n
Node node1.hostname.com is unconfigured
Node node2.hostname.com is unconfigured
Node node3.hostname.com is online

If this is seen, the most likely cause is that the HA heartbeat (primary/secondary) interfaces can't send packets to each other. As mentioned above, Pacemaker/Corosync uses UDP on ports 5404-5407, so the next step would be to check that UDP traffic can flow between the nodes on those ports.
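One way to check this manually (a rough sketch, assuming the node names used above and that nothing else is currently bound to the port) is to use ncat to push a UDP packet over one of the Corosync ports:

[user1@node2 ~]$ ncat -u -l 5405                                  # listen for UDP on node2
[user1@node1 ~]$ echo "test" | ncat -u node2.hostname.com 5405    # send a test packet from node1

If the test string appears on node2's terminal, UDP traffic can flow on that port; repeat for the other ports and node pairs as needed.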
 
On the DRBD side of things, if you try to create an HA queue manager when TCP traffic is not allowed on ports 7000-7100, you might see something like this:
[user1@node1 ~]$ crtmqm -sx -fs 512m HAQM
Creating replicated data queue manager configuration.
AMQ3879E: Resource 'haqm' not connected to 'node2.hostname.com' after
'10' seconds.
AMQ3812E: Failed to create replicated data queue manager configuration.
Command '/opt/mqm/bin/crtmqm' run with sudo.

Within the system log (rsyslog), you might also see the following, indicating that the DRBD resource could not connect to its peer nodes:

Jun  3 15:38:28 node1 kernel: drbd haqm node2.hostname.com: conn( Connecting -> Disconnecting ) [down]
Jun  3 15:38:28 node1 kernel: drbd haqm node2.hostname.com: helper command: /sbin/drbdadm disconnected
Jun  3 15:38:28 node1 disconnected[4429]: rdqmhandler(haqm) /var/mqm/rdqm/disconnected
Jun  3 15:38:28 node1 disconnected[4429]: rdqmhandler(haqm) Disconnecting
Jun  3 15:38:28 node1 disconnected[4429]: rdqmhandler(haqm) /var/mqm/rdqm/disconnected rc(0)

The same issue can be seen if a user creates a DR queue manager specifying a port on which TCP traffic has not been allowed. After creating the primary and secondary instances of the DR queue manager on the primary and recovery nodes, you might see the following when looking at the status:

Node:                                   node1.hostname.com
Queue manager status:                   Running
CPU:                                    0.69%
Memory:                                 181MB
Queue manager file system:              65MB used, 2.9GB allocated [2%]
DR role:                                Primary
DR status:                              Waiting for initial connection
DR type:                                Asynchronous
DR port:                                8653
DR local IP address:                    192.168.104.7
DR remote IP address:                   192.168.104.13

If you cast your eye to the “DR status” field, you can see that the status is stuck at “Waiting for initial connection”. This indicates that DRBD has not been able to start the data synchronisation between the two nodes, because TCP traffic has not been allowed on port 8653 on both nodes.

To diagnose these issues, there are several OS tools that can be used to check whether TCP/UDP traffic can be sent over a required port, see whether a port is in use (or being used by a different process), or see which ports other nodes are listening on (example usage is sketched after this list):

  • netstat - to see if something is using a local port.
  • nmap - to (remotely) scan for the ports being listened on.
  • ncat - run in listen mode on one node and connect mode on the other to test pushing data over a port.
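For example (a rough sketch, reusing the node names and the DR port 8653 from the status output above), you might check the replication ports like this:

[user1@node1 ~]$ netstat -tulnp | grep 7000                      # is anything already using a local DRBD port?
[user1@node1 ~]$ nmap -p 7000-7100 node2.hostname.com            # which DRBD ports is node2 listening on?
[user1@node2 ~]$ ncat -l 8653                                    # listen for TCP on the DR port on node2
[user1@node1 ~]$ echo "test" | ncat node2.hostname.com 8653      # push data to it from node1

If the nmap scan shows the expected ports closed or filtered, or the ncat test data never arrives, the problem is almost certainly in the network or firewall configuration rather than in MQ itself.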

Conclusion

Hopefully, you now have a better understanding of the network requirements for an RDQM setup, a better ability to diagnose issues, and some food for thought around how to implement your RDQM network solution. Every implementation can be very different depending on your network requirements and setup, but hopefully this blog has shown how you can fit RDQM into it without much pain.

Useful Links

IBM MQ RDQM network interface Best Practices: https://www.ibm.com/support/pages/node/7107210

Requirements for RDQM HA solution: https://www.ibm.com/docs/en/ibm-mq/9.4?topic=availability-requirements-rdqm-ha-solution

Requirements for RDQM DR solution: https://www.ibm.com/docs/en/ibm-mq/9.4?topic=recovery-requirements-rdqm-dr-solution
