MQ

  • 1.  RDQM DR : DR status Partitioned

    Posted 25 days ago

    Hi All, 

    Hope you are all doing well.

    I need help with an RDQM DR setup. I am testing it in a POC environment before implementing it in the live environment.

    I'm using MQ 9.2.06 on an RHEL 8.7 machine hosted on AWS, and I have ensured that all traffic between the servers is allowed.

    Controlled switchovers from Node A to Node B (stopping the queue manager, then switching primary to secondary and vice versa) work without issues.

    However, during aggressive shutdown tests on the primary server to simulate actual DR scenarios, I can successfully promote the secondary to primary and start the Queue Manager.

    The problem arises when I bring the stopped node back up and designate it as secondary. At that point the primary (NODE_B) reports DR status 'Partitioned', and the restarted node (NODE_A), now secondary, reports 'Remote unavailable'.

    I recognize this as a split-brain issue and have followed the steps in the link below, but the problem persists. Any assistance would be appreciated.

    https://www.ibm.com/docs/en/ibm-mq/9.1?topic=oidre-resolving-partitioned-split-brain-problem-in-dr-rdqm

    NODE A: 172.31.42.66
    NODE B: 172.31.34.7

    After shutting down the 172.31.42.66 (NODE_A) machine, below is the status on 172.31.34.7 (NODE_B) after making it primary:

    [root@ip-172-31-34-7 ~]# rdqmstatus -m RDQM_DR
    Node:
    ip-172-31-34-7.ap-south-1.compute.internal
    Queue manager status:                   Running
    CPU:                                    0.01%
    Memory:                                 104MB
    Queue manager file system:              49MB used, 2.9GB allocated [2%]
    DR role:                                Primary
    DR status:                              Partitioned
    DR type:                                Synchronous
    DR port:                                7000
    DR local IP address:                    172.31.34.7
    DR remote IP address:                   172.31.42.66
    DR out of sync data:                    696KB
    DR last in sync:                        2024-06-09 11:56:38
    [root@ip-172-31-34-7 ~]# rdqmstatus
    Node:
    ip-172-31-34-7.ap-south-1.compute.internal
    OS kernel version:                      4.18.0-425.19.2
    DRBD OS kernel version:                 4.18.0-425.10.1
    DRBD version:                           9.1.12
    DRBD kernel module status:              Loaded

    Queue manager name:                     RDQM_DR
    Queue manager status:                   Running
    DR role:                                Primary
    DR status:                              Partitioned


    After starting the failed NODE_A (172.31.42.66):

    [root@ip-172-31-34-7 ~]# rdqmstatus -m RDQM_DR
    Node:
    ip-172-31-34-7.ap-south-1.compute.internal
    Queue manager status:                   Running
    CPU:                                    0.01%
    Memory:                                 104MB
    Queue manager file system:              49MB used, 2.9GB allocated [2%]
    DR role:                                Primary
    DR status:                              Partitioned
    DR type:                                Synchronous
    DR port:                                7000
    DR local IP address:                    172.31.34.7
    DR remote IP address:                   172.31.42.66
    DR out of sync data:                    696KB
    DR last in sync:                        2024-06-09 11:56:38
    [root@ip-172-31-34-7 ~]# rdqmstatus
    Node:
    ip-172-31-34-7.ap-south-1.compute.internal
    OS kernel version:                      4.18.0-425.19.2
    DRBD OS kernel version:                 4.18.0-425.10.1
    DRBD version:                           9.1.12
    DRBD kernel module status:              Loaded

    Queue manager name:                     RDQM_DR
    Queue manager status:                   Running
    DR role:                                Primary
    DR status:                              Partitioned

    Changing NODE_A (172.31.42.66) state to secondary:


    [root@ip-172-31-42-66 ~]# rdqmstatus -m RDQM_DR
    Node:
    ip-172-31-42-66.ap-south-1.compute.internal
    Queue manager status:                   Ended immediately
    DR role:                                Secondary
    DR status:                              Remote unavailable
    DR type:                                Synchronous
    DR port:                                7000
    DR local IP address:                    172.31.42.66
    DR remote IP address:                   172.31.34.7
    DR out of sync data:                    28672KB
    DR last in sync:                        2024-06-09 11:58:06
    [root@ip-172-31-42-66 ~]# rdqmstatus
    Node:
    ip-172-31-42-66.ap-south-1.compute.internal
    OS kernel version:                      4.18.0-425.19.2
    DRBD OS kernel version:                 4.18.0-425.10.1
    DRBD version:                           9.1.12
    DRBD kernel module status:              Loaded

    Queue manager name:                     RDQM_DR
    Queue manager status:                   Ended immediately
    DR role:                                Secondary
    DR status:                              Remote unavailable
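
    When comparing the two nodes, it can help to pull single fields out of the `rdqmstatus -m` output rather than reading the whole listing. The following is only a sketch: the function name `rdqm_field` is hypothetical, and it assumes the "Label:   value" layout shown in the output above.

```shell
# Hypothetical helper: read `rdqmstatus -m <qm>` output on stdin and print
# the value of the field whose label is given as the first argument.
# Assumes each field is printed as "Label:<spaces>value", as in the listings above.
rdqm_field() {
    awk -v label="$1" -F': +' '
        {
            key = $1
            sub(/^[ \t]+/, "", key)        # ignore any leading indentation
            if (key == label) {
                val = $2
                sub(/[ \t]+$/, "", val)    # drop trailing whitespace
                print val
                exit
            }
        }'
}
```

    For example, `rdqmstatus -m RDQM_DR | rdqm_field 'DR status'` run on each node shows the two DR states side by side.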



    ------------------------------
    Regards,
    Bharat Puri
    Infrastructure Architect(IBM/Kyndryl)
    ------------------------------


  • 2.  RE: RDQM DR : DR status Partitioned

    IBM Champion
    Posted 25 days ago

    Hi Bharat,

    Since you say, "[here] is status on .. NODE_B after making it primary" I am assuming that you have decided to keep the data on NODE_B.

    You show rdqmstatus output which shows that the queue manager is running. You also say that you have followed the instructions in the linked webpage.

    You don't mention anything about the synchronisation, nor do you show any rdqmstatus output during the synchronisation.

    So to be clear, are you saying that you have followed this set of steps:

    1. Ensure both queue manager instances are stopped.
    2. Specify that the queue manager on NODE_A is the secondary:
      rdqmdr -m RDQM_DR -s
    3. Specify that the queue manager on NODE_B is the primary:
      rdqmdr -m RDQM_DR -p
      Synchronization begins, with the data from the queue manager on the main node being copied to the recovery node.
    4. Check the status of the synchronization:
      rdqmstatus -m RDQM_DR
    5. When the synchronization is complete, start the queue manager on the main node:
      strmqm RDQM_DR
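
    The "check the status, then start when complete" part of the steps above could be scripted roughly as follows. This is a sketch, not a tested procedure: the function name `wait_for_dr_sync` is hypothetical, and it assumes that `rdqmstatus` reports a DR status of `Normal` once the two nodes are back in sync.

```shell
# Hypothetical helper: poll `rdqmstatus -m <qm>` until the DR status is "Normal".
# Arguments: queue manager name, maximum number of polls (default 60),
# seconds to wait between polls (default 5).
# Assumption: "Normal" is the value shown when synchronization has completed.
wait_for_dr_sync() {
    local qm="$1" tries="${2:-60}" delay="${3:-5}" status i=0
    while [ "$i" -lt "$tries" ]; do
        status=$(rdqmstatus -m "$qm" |
            awk -F': +' '/DR status:/ { sub(/[ \t]+$/, "", $2); print $2; exit }')
        echo "DR status: ${status:-unknown}"
        [ "$status" = "Normal" ] && return 0
        sleep "$delay"
        i=$((i + 1))
    done
    return 1    # still not in sync after the last poll
}

# Usage on the node that keeps its data (NODE_B in this thread), after
# rdqmdr -s has been run on NODE_A and rdqmdr -p on NODE_B:
#   wait_for_dr_sync RDQM_DR && strmqm RDQM_DR
```

    If the function returns non-zero, the last printed status (for example Partitioned) is the one worth posting here.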

    Can you tell us what happened while the data was being synchronised?

    Can you confirm that you followed ALL of the above steps?

    Cheers,
    Morag



    ------------------------------
    Morag Hughson
    MQ Technical Education Specialist
    MQGem Software Limited
    Website: https://www.mqgem.com
    ------------------------------