MQ

  • 1.  MQ Operator: Error after upgrading the OpenShift Cluster

    Posted Wed February 26, 2025 06:29 AM

    Hi, 

    I have 2 QMs with Native HA deployed in our test environment. I executed a minor upgrade on OpenShift today, but one of the QMs didn't start.

    2025-02-26T11:19:13.460Z Using queue manager name: TEST_MQ
    2025-02-26T11:19:13.460Z CPU architecture: amd64
    2025-02-26T11:19:13.460Z Linux kernel version: 5.14.0-427.50.1.el9_4.x86_64
    2025-02-26T11:19:13.461Z Base image: Red Hat Enterprise Linux 9.5 (Plow)
    2025-02-26T11:19:13.461Z Running as user ID 1000740000 with primary group 0, and supplementary groups 0,1000740000
    2025-02-26T11:19:13.461Z Capabilities: none
    2025-02-26T11:19:13.461Z seccomp enforcing mode: filtering
    2025-02-26T11:19:13.461Z Process security attributes: system_u:system_r:container_t:s0:c19,c27
    2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm
    2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm-data
    2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm-log
    2025-02-26T11:19:13.491Z Error creating directory structure: the 'crtmqdir' command returned with code: 20. Reason: The filesystem object
    '/mnt/mqm/data/web/installations/Installation1/servers/mqweb/mqwebuser.xml' is
    a symbolic link.
    AMQ6245E: Error executing system call 'open' on file
    '/mnt/mqm-data/qmgrs/TEST_MQ/qm.ini' error '0'.
    AMQ6245E: Error executing system call 'mkdir' on file
    '/mnt/mqm-data/qmgrs/TEST_MQ/autocfg' error '2'.
    AMQ6245E: Error executing system call 'mkdir' on file
    '/mnt/mqm-data/qmgrs/TEST_MQ/ssl' error '2'.
    AMQ6245E: Error executing system call 'mkdir' on file
    '/mnt/mqm-data/qmgrs/TEST_MQ/plugcomp' error '2'.
    
    2025-02-26T11:19:13.492Z /opt/mqm/bin/crtmqdir: exit status 20

    As a consequence, I have 2 pods failing with this error and I cannot start the queue manager (luckily this is only happening in the test environment).

    I guess it is a problem with the volumes, as the QM is trying to start again from its existing configuration. I have seen this issue before when I removed a QM but not its volumes.
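
    A quick way to confirm what the failing pods are actually mounting (a minimal sketch; the namespace and PVC names below are assumptions, check the real ones with 'oc get pvc'):

    # List the claims created for the queue manager and check they are Bound
    oc get pvc -n mq-test

    # Inspect the claim behind /mnt/mqm-data on the failing instance (name assumed)
    oc describe pvc persisted-data-test-mq-ibm-mq-1 -n mq-test

    # Confirm which claims the failing pod actually references
    oc get pod test-mq-ibm-mq-1 -n mq-test -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'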

    Here is my queue manager definition (I removed some fields for simplicity):

    spec:
      web:
        console:
          authentication:
            provider: manual
          authorization:
            provider: manual
        enabled: true
        manualConfig:
          configMap:
            name: mq-web-config
      version: 9.4.1.1-r1
      template:
        pod:
          containers:
            - env:
                - name: MQ_ENABLE_EMBEDDED_WEB_SERVER
                  value: 'true'
              name: qmgr
              resources: {}
      queueManager:
        route:
          enabled: true
        name: TEST_MQ
        mqsc:
          - configMap:
              items:
                - 91-startup.mqsc
              name: test-mq-mqsc-startup
          - secret:
              items:
                - 92-ldapauth.mqsc
              name: test-mq-mqsc-ldapauth
        logFormat: Basic
        availability:
          type: NativeHA
          updateStrategy: RollingUpdate
        storage:
          defaultClass: thin-csi
          persistedData:
            enabled: true
            size: 2Gi
            type: persistent-claim
          queueManager:
            class: thin-csi
            size: 20Gi
            type: persistent-claim
          recoveryLogs:
            enabled: true
            size: 2Gi
            type: persistent-claim

    ------------------------------
    Andres Colodrero
    ------------------------------


  • 2.  RE: MQ Operator: Error after upgrading the OpenShift Cluster

    Posted Thu February 27, 2025 05:38 AM

    The mqwebuser.xml warning is benign; however, I agree the AMQ6245E errors suggest that the persistent volume mount either doesn't contain the queue manager data that MQ is expecting to find, or there is some other problem with MQ accessing it. A group of 3 instances provides redundancy for 1 instance to be unavailable at a time, so the inability of 2 instances to access their data will cause an availability outage. If you are not able to spot any obvious differences between the working instance's PVs and the failing ones, are you able to raise a support ticket with IBM?
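
    One way to see what is (or isn't) on a failing instance's volume is to mount its PVC in a throwaway pod and compare it with the working queue manager. A minimal sketch, assuming the namespace mq-test and the PVC names generated by the StatefulSet claim templates (confirm the real names with 'oc get pvc'):

    # pv-inspect.yaml -- temporary pod that mounts the suspect PVC read-only
    # (namespace, PVC name and image are assumptions; it may need the same fsGroup/SCC as the MQ pods to read the files)
    apiVersion: v1
    kind: Pod
    metadata:
      name: pv-inspect
      namespace: mq-test
    spec:
      restartPolicy: Never
      containers:
        - name: inspect
          image: registry.access.redhat.com/ubi9/ubi
          command: ["sleep", "3600"]
          volumeMounts:
            - name: qmdata
              mountPath: /inspect
      volumes:
        - name: qmdata
          persistentVolumeClaim:
            claimName: persisted-data-test-mq-ibm-mq-1
            readOnly: true

    Then: oc apply -f pv-inspect.yaml && oc exec -n mq-test pv-inspect -- ls -lR /inspect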



    ------------------------------
    Jonathan Rumsey
    Senior Software Engineer
    ------------------------------



  • 3.  RE: MQ Operator: Error after upgrading the OpenShift Cluster

    Posted Thu February 27, 2025 02:10 PM

    Hi,

    This is a test instance that I can destroy and redeploy, but I would like to investigate what happened. I will check if I can open a ticket.

    I have 3 worker nodes. On these 3 nodes, I run 2 queue managers with Native HA. 1 QM is OK, and the other is failing.

    I performed a small upgrade (4.16.1 to the latest 4.16). During the upgrade, each worker node is drained, upgraded, and only after that can new pods be scheduled on it again. I guess during the upgrade process I ended up with only 1 available IBM MQ pod.

    ➜  ~ oc get pods -o wide
    NAME                                      READY   STATUS             RESTARTS        AGE   IP            NODE                             NOMINATED NODE   READINESS GATES
    test-mq-ibm-mq-0                    0/1     Running            1 (32h ago)     32h   10.129.2.5    tisocpp01-hjb5q-worker-0-8btzn   <none>           <none>
    test-mq-ibm-mq-1                    0/1     CrashLoopBackOff   351 (38s ago)   29h   10.128.2.34   tisocpp01-hjb5q-worker-0-kgdnq   <none>           <none>
    test-mq-ibm-mq-2                    0/1     CrashLoopBackOff   392 (44s ago)   31h   10.128.4.50   tisocpp01-hjb5q-worker-0-7cz7j   <none>           <none>
    test-pki-ibm-mq-0             1/1     Running            0               32h   10.129.2.4    tisocpp01-hjb5q-worker-0-8btzn   <none>           <none>
    test-pki-ibm-mq-1             0/1     Running            0               32h   10.128.2.4    tisocpp01-hjb5q-worker-0-kgdnq   <none>           <none>
    test-pki-ibm-mq-2             0/1     Running            0               32h   10.128.4.8    tisocpp01-hjb5q-worker-0-7cz7j

    Is there a way to recover from this situation? Drop the volumes?

    Lessons learnt:

    1. It is simpler to have 4 nodes.

    2. Control the upgrade process (see the PodDisruptionBudget sketch below).
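
    On the "control the upgrade process" point, one option is a PodDisruptionBudget so that node drains can only take down one Native HA instance at a time. Check first whether the operator has already created one ('oc get pdb'); if not, a minimal sketch, where the label selector and namespace are assumptions:

    # pdb-test-mq.yaml -- allow at most one MQ pod to be voluntarily evicted at a time
    # (label selector and namespace are assumptions; confirm the pod labels with 'oc get pod --show-labels')
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: test-mq-nativeha-pdb
      namespace: mq-test
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app.kubernetes.io/instance: test-mq

    Applied with: oc apply -f pdb-test-mq.yaml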



    ------------------------------
    Andres Colodrero
    ------------------------------



  • 4.  RE: MQ Operator: Error after upgrading the OpenShift Cluster

    Posted Fri February 28, 2025 03:47 AM

    It would be good to understand what happened with access to the data. I'm not aware of any issues with 4.16.x, and there is nothing obvious in the list of bugfixes in "Chapter 1. OpenShift Container Platform 4.16 release notes" (Red Hat Documentation) that might explain this. Native HA instances do have the ability to automatically recover from localised filesystem damage or corruption, provided they can still establish connectivity with other healthy instances, but that isn't possible in this situation.

    Dropping all persistent volumes would indeed cause the queue manager to be recreated.



    ------------------------------
    Jonathan Rumsey
    Senior Software Engineer
    ------------------------------



  • 5.  RE: MQ Operator: Error after upgrading the OpenShift Cluster

    Posted Fri February 28, 2025 09:41 AM

    Removing all volumes (data, persisted-data and logs) made the pods start again.

    Then I ended up with running pods, but the QM was still not available.

    I fixed pod number 3 as well, and the QM came back up.
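
    For reference, a minimal sketch of the kind of commands involved in dropping one failing instance's volumes so it is recreated as a blank; the PVC names follow the usual StatefulSet claim-template pattern and are assumptions, confirm the real names with 'oc get pvc' before deleting anything:

    # Delete only the claims belonging to ONE failing instance (here instance 1);
    # the claims stay in 'Terminating' until the pod that uses them is deleted
    oc delete pvc data-test-mq-ibm-mq-1 persisted-data-test-mq-ibm-mq-1 recovery-logs-test-mq-ibm-mq-1 -n mq-test

    # Delete the pod so the StatefulSet recreates it together with fresh, empty claims
    oc delete pod test-mq-ibm-mq-1 -n mq-test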

    So I suppose the sequence that led to this situation was:

    1. I have 3 nodes.
    2. During the upgrade, 1 node is drained and restarted, leaving only 2 pods running the QM.
    3. OpenShift doesn't know which pod is the active instance of the QM, so it can drain any of the 3 pods.
    4. Once the node is up and running again, we have 3 pods running the QM.
    5. Then another node is drained before the QM is active again.

    So I guess I ended up with 2 pods in a split-brain situation? If that is the case, 4 worker nodes should be the norm.

    But I didn't see this behaviour in my dev environment.



    ------------------------------
    Andres Colodrero
    ------------------------------



  • 6.  RE: MQ Operator: Error after upgrading the OpenShift Cluster

    Posted Fri February 28, 2025 10:32 AM

    Removing the PV data (data, persisted-data & logs) for the first failing instance causes the container to recreate it as a 'blank'. Whilst this instance would now be able to communicate with other running instances, it would still need an elected active instance to bootstrap its log data.

    An election vote from a 'blank' instance intentionally does not carry the same weight as one from a 'full' instance, so after the recreate there would still be insufficient voting weight for the instance that could still access its PV data to be elected as the best copy of the data.

    Once the second instance that had the same PV access issues was recreated as 'blank' and restored network connectivity with the other two instances, the instance that could still access its PV data was able to win the election and then rebase the two blanks with its copy of the data.

    Did you retain a copy of the PVs for investigation?
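
    (If the volumes are still around, a copy can be taken before anything is deleted; a minimal sketch, reusing the throwaway pv-inspect pod from earlier, where the pod, PVC and namespace names are assumptions:)

    # Stream the contents of the mounted volume out as a tar archive for offline analysis
    oc exec -n mq-test pv-inspect -- tar -C /inspect -cf - . > test-mq-instance-1-persisted-data.tar

    # Alternatively, if the CSI driver supports snapshots, take a VolumeSnapshot of each claim before deleting it
    oc get volumesnapshotclass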

    A split-brain (or partitioned state) is the term for a situation where you have two active instances that are both accepting application work and diverging the state of the queue manager; this cannot occur within a single Native HA group due to quorum rules. The reason the queue manager was unable to start was that there were two failed instances that were not able to communicate, and hence it was not possible for the remaining instance to win an election.
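
    For anyone hitting this, the Native HA roles and replication state can be checked from inside a running instance; a minimal sketch, assuming the pod, container and queue manager names used in this thread:

    # Show which instance is active and whether the replicas are in sync and have quorum
    # (namespace and pod name are assumptions; 'qmgr' is the container name from the spec above)
    oc exec -n mq-test test-mq-ibm-mq-0 -c qmgr -- dspmq -m TEST_MQ -o nativeha -x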



    ------------------------------
    Jonathan Rumsey
    Senior Software Engineer
    ------------------------------