Hi All,
Hope you are all doing well.
I need help with an RDQM DR setup. I am testing it in a POC environment before implementing it in the live environment.
I'm using MQ 9.2.06 on RHEL 8.7 machines hosted on AWS, and I have confirmed that all traffic between the servers is allowed.
Controlled switchovers between Node A and Node B, where I stop the QM and swap the primary and secondary roles, work in both directions without issues.
When I abruptly shut down the primary server to simulate a real DR scenario, I can also promote the secondary to primary and start the queue manager successfully.
The problem arises when I bring the stopped node back up and designate it as secondary. At this point, the DR status on the new primary shows 'Partitioned', and the restarted node, now secondary, reports 'Remote unavailable'.
I recognize this as a split-brain issue and have followed the steps in the link below, but the problem is not resolved. Any assistance would be appreciated.
https://www.ibm.com/docs/en/ibm-mq/9.1?topic=oidre-resolving-partitioned-split-brain-problem-in-dr-rdqm
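For context, here is a rough sketch of the generic manual split-brain recovery described in the DRBD user guide, adapted to this setup. It is a dry run that only prints commands and executes nothing. The resource name `rdqm_dr` and the helper names `victim_plan` and `survivor_plan` are hypothetical, and the IBM page linked above remains the authoritative procedure.

```shell
#!/bin/sh
# Dry run: prints the recovery commands instead of executing them.
QM=RDQM_DR      # queue manager name from this post
RES=rdqm_dr     # ASSUMPTION: hypothetical DRBD resource name, check with `drbdadm status`

# Node whose changes are discarded (here the restarted NODE_A).
victim_plan() {
    echo "rdqmdr -m $QM -s"                          # make sure it is DR secondary
    echo "drbdadm disconnect $RES"
    echo "drbdadm connect --discard-my-data $RES"    # resync from the partner
}

# Node whose data is kept (here NODE_B, the current primary).
survivor_plan() {
    echo "drbdadm connect $RES"    # only needed if this side is StandAlone
}

echo "== on NODE_A =="; victim_plan
echo "== on NODE_B =="; survivor_plan
```

Printing the plan first makes it easy to review which node would lose its changes before anything is run for real.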
NODE A: 172.31.42.66
NODE B: 172.31.34.7
After shutting down the NODE_A machine (172.31.42.66), below is the status on NODE_B (172.31.34.7) after making it primary:
[root@ip-172-31-34-7 ~]# rdqmstatus -m RDQM_DR
Node:
ip-172-31-34-7.ap-south-1.compute.internal
Queue manager status: Running
CPU: 0.01%
Memory: 104MB
Queue manager file system: 49MB used, 2.9GB allocated [2%]
DR role: Primary
DR status: Partitioned
DR type: Synchronous
DR port: 7000
DR local IP address: 172.31.34.7
DR remote IP address: 172.31.42.66
DR out of sync data: 696KB
DR last in sync: 2024-06-09 11:56:38
[root@ip-172-31-34-7 ~]# rdqmstatus
Node:
ip-172-31-34-7.ap-south-1.compute.internal
OS kernel version: 4.18.0-425.19.2
DRBD OS kernel version: 4.18.0-425.10.1
DRBD version: 9.1.12
DRBD kernel module status: Loaded
Queue manager name: RDQM_DR
Queue manager status: Running
DR role: Primary
DR status: Partitioned
After starting the failed NODE_A (172.31.42.66), status on NODE_B:
[root@ip-172-31-34-7 ~]# rdqmstatus -m RDQM_DR
Node:
ip-172-31-34-7.ap-south-1.compute.internal
Queue manager status: Running
CPU: 0.01%
Memory: 104MB
Queue manager file system: 49MB used, 2.9GB allocated [2%]
DR role: Primary
DR status: Partitioned
DR type: Synchronous
DR port: 7000
DR local IP address: 172.31.34.7
DR remote IP address: 172.31.42.66
DR out of sync data: 696KB
DR last in sync: 2024-06-09 11:56:38
[root@ip-172-31-34-7 ~]# rdqmstatus
Node:
ip-172-31-34-7.ap-south-1.compute.internal
OS kernel version: 4.18.0-425.19.2
DRBD OS kernel version: 4.18.0-425.10.1
DRBD version: 9.1.12
DRBD kernel module status: Loaded
Queue manager name: RDQM_DR
Queue manager status: Running
DR role: Primary
DR status: Partitioned
After changing NODE_A (172.31.42.66) to secondary:
[root@ip-172-31-42-66 ~]# rdqmstatus -m RDQM_DR
Node:
ip-172-31-42-66.ap-south-1.compute.internal
Queue manager status: Ended immediately
DR role: Secondary
DR status: Remote unavailable
DR type: Synchronous
DR port: 7000
DR local IP address: 172.31.42.66
DR remote IP address: 172.31.34.7
DR out of sync data: 28672KB
DR last in sync: 2024-06-09 11:58:06
[root@ip-172-31-42-66 ~]# rdqmstatus
Node:
ip-172-31-42-66.ap-south-1.compute.internal
OS kernel version: 4.18.0-425.19.2
DRBD OS kernel version: 4.18.0-425.10.1
DRBD version: 9.1.12
DRBD kernel module status: Loaded
Queue manager name: RDQM_DR
Queue manager status: Ended immediately
DR role: Secondary
DR status: Remote unavailable
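To make the two outputs above easier to compare, a small sketch like the following can pull the DR role and DR status out of `rdqmstatus -m` output and flag the condition. The helper names `dr_field` and `dr_summary` are hypothetical, and the parsing assumes the "Label: value" layout shown in this post.

```shell
#!/bin/sh
# Print the value of one "Label: value" field from rdqmstatus-style text on stdin.
# $1 is the label exactly as printed, e.g. "DR status".
dr_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)              # tolerate indented output
        }
        !found && index(line, key ":") == 1 {
            val = substr(line, length(key) + 2)   # text after the colon
            sub(/^[ \t]+/, "", val)
            sub(/[ \t]+$/, "", val)
            print val
            found = 1
        }'
}

# Summarise one node: "<role> / <status>: <note>".
dr_summary() {
    out=$(cat)
    role=$(printf '%s\n' "$out" | dr_field "DR role")
    status=$(printf '%s\n' "$out" | dr_field "DR status")
    case "$status" in
        Partitioned)          note="split brain suspected" ;;
        "Remote unavailable") note="cannot reach partner" ;;
        *)                    note="no issue flagged" ;;
    esac
    printf '%s / %s: %s\n' "$role" "$status" "$note"
}

# Usage on a live node:
#   rdqmstatus -m RDQM_DR | dr_summary
```

Running it on both nodes gives a one-line view per node, which is handy when repeating the failover test several times.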
------------------------------
Regards,
Bharat Puri
Infrastructure Architect(IBM/Kyndryl)
------------------------------