Removing the PV data (data, persisted-data and logs) for the first failing instance causes the container to recreate it as a 'blank' instance. While this instance would now be able to communicate with the other running instances, it would still need an elected active instance to bootstrap its log data.
An election vote from a 'blank' instance intentionally does not carry the same weight as one from a 'full' instance, so after the recreate there would still be insufficient voting weight for the instance that could still access its PV data to be elected as holding the best copy of the data.
Once the second instance with the same PV access issues was recreated as a 'blank' and network connectivity was restored with the other two instances, the instance that could still access its PV data was able to win the election and then rebase the two blanks with its copy of the data.
A split-brain (or partitioned state) is the term used when two active instances are both accepting application work and diverging the state of the queue manager; this cannot occur within a single Native HA group because of the quorum rules. The reason the queue manager was unable to start was that two instances had failed and could not communicate, so it was not possible for the remaining instance to win an election.
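If it is useful for next time, the role and replication state that each instance believes it has can be checked from inside the pods, along the lines of the sketch below (this assumes an MQ level where dspmq supports the Native HA output option; the pod, container and queue manager names are the ones used later in this thread):

# Ask each instance for its Native HA view (role, in-sync state, quorum)
for i in 0 1 2; do
  echo "--- test-mq-ibm-mq-$i ---"
  oc exec test-mq-ibm-mq-$i -c qmgr -- dspmq -m TEST_MQ -o nativeha
done

An instance that cannot reach quorum, or that reports it is not in sync, points you at which PVs to look at before deciding what to reset.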
Original Message:
Sent: Fri February 28, 2025 09:40 AM
From: Andres Colodrero
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
Removing all volumes (data, persisted-data and logs) makes the pod start again.
Then I ended up with the pods running, but the QM was still not available.
I fixed pod number 3, and the QM came back to work.
So I suppose I ended up in this situation as follows:
- I have 3 worker nodes.
- During the upgrade, 1 node is drained and restarted, leaving only 2 pods running the QM.
- OpenShift doesn't know which pod is the active one running the QM, so it can drain any of the 3 pods.
- Once the node is back up and running, we again have 3 pods running the QM.
- Then another node is drained before the QM is active again.
So I guess I ended up with 2 pods in a split-brain situation? If that is the case, 4 worker nodes should be the norm.
But I didn't see this behaviour in my dev environment.
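One way to stop more than one queue manager pod being evicted at a time during node drains is a PodDisruptionBudget over the StatefulSet's pods. The sketch below is illustrative only: the label selector is an assumption based on the pod names, and recent MQ Operator versions may already manage a PDB for Native HA, so check with 'oc get pdb' first.

# See whether a PodDisruptionBudget already covers the queue manager pods
oc get pdb

# Illustrative PDB allowing at most one MQ pod to be voluntarily evicted at a time
oc apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-mq-ibm-mq-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-mq   # assumed label; confirm with 'oc get pod test-mq-ibm-mq-0 --show-labels'
EOF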
------------------------------
Andres Colodrero
Original Message:
Sent: Fri February 28, 2025 03:46 AM
From: Jonathan Rumsey
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
It would be good to understand what happened with access to the data. I'm not aware of any issues with 4.16.x, and there is nothing obvious in the list of bugfixes (Chapter 1. OpenShift Container Platform 4.16 release notes, Red Hat Documentation)
that might explain this. Native HA instances do have the ability to automatically recover from localised filesystem damage or corruption, provided they can still establish connectivity with other healthy instances, but that isn't possible in this situation.
Dropping all persistent volumes would indeed cause the queue manager to be recreated.
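For a single failing instance, that would look something like the following (the PVC names here are examples only; list them first and delete only the ones that belong to the instance being reset):

# Find the data, persisted-data and log PVCs for the failing instance
oc get pvc | grep test-mq-ibm-mq-1

# Delete that instance's PVCs, then the pod, so the StatefulSet recreates it as a 'blank'
# (names below are illustrative; use the exact names returned above)
oc delete pvc data-test-mq-ibm-mq-1 persisted-data-test-mq-ibm-mq-1 log-test-mq-ibm-mq-1
oc delete pod test-mq-ibm-mq-1

# If the recreated pod sticks in Pending while the old PVCs finish terminating,
# delete the pod once more so it comes back with freshly created claims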
------------------------------
Jonathan Rumsey
Senior Software Engineer
Original Message:
Sent: Thu February 27, 2025 02:10 PM
From: Andres Colodrero
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
Hi,
This is a test instance that I can destroy and redeploy, but I would like to investigate what happened. I will check if I can open a ticket.
I have 3 worker nodes. On these 3 nodes there are 2 queue managers with Native HA. 1 QM is OK and the other is failing.
I performed a small upgrade (4.16.1 to latest 4.16). During the upgrade, 1 worker node is drained and upgraded, and after that new pods can be deployed to it. I guess during the upgrade process I ended up having only 1 available IBM MQ pod:
➜ ~ oc get pods -o wide
NAME                READY   STATUS             RESTARTS        AGE   IP            NODE                             NOMINATED NODE   READINESS GATES
test-mq-ibm-mq-0    0/1     Running            1 (32h ago)     32h   10.129.2.5    tisocpp01-hjb5q-worker-0-8btzn   <none>           <none>
test-mq-ibm-mq-1    0/1     CrashLoopBackOff   351 (38s ago)   29h   10.128.2.34   tisocpp01-hjb5q-worker-0-kgdnq   <none>           <none>
test-mq-ibm-mq-2    0/1     CrashLoopBackOff   392 (44s ago)   31h   10.128.4.50   tisocpp01-hjb5q-worker-0-7cz7j   <none>           <none>
test-pki-ibm-mq-0   1/1     Running            0               32h   10.129.2.4    tisocpp01-hjb5q-worker-0-8btzn   <none>           <none>
test-pki-ibm-mq-1   0/1     Running            0               32h   10.128.2.4    tisocpp01-hjb5q-worker-0-kgdnq   <none>           <none>
test-pki-ibm-mq-2   0/1     Running            0               32h   10.128.4.8    tisocpp01-hjb5q-worker-0-7cz7j
Is there a way to recover from this situation? Drop the volumes?
Lessons learnt:
1. Simplest to have 4 nodes.
2. Control the upgrade process.
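On the second point, one way to control when worker nodes are drained during a cluster upgrade is to pause the worker MachineConfigPool and only let the node updates roll once the Native HA group is healthy, for example:

# Hold back worker node updates (and therefore drains) while the upgrade rolls out
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# Check the queue manager pods are all back and healthy before letting nodes update
oc get pods -o wide

# Allow the worker nodes to update (they are still drained one at a time by default)
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'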
------------------------------
Andres Colodrero
Original Message:
Sent: Thu February 27, 2025 05:37 AM
From: Jonathan Rumsey
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
The mqwebuser.xml warning is benign; however, I agree the AMQ6245E errors would suggest that the persistent volume mount either doesn't contain the queue manager data that MQ is expecting to find, or that there is some other problem with MQ accessing it. A group of 3 instances provides redundancy for 1 instance to be unavailable at a time, so the inability of 2 instances to access their data will cause an availability outage. If you are not able to spot any obvious differences between the PVs of the working instances and the failing ones, are you able to raise a support ticket with IBM?
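For example, to compare what is actually on the volumes, the queue manager data directory of the instance that is still running can be listed alongside a failing one (the container is named 'qmgr' in the spec further down; adjust the paths if your mounts differ):

# Instance that is still running
oc exec test-mq-ibm-mq-0 -c qmgr -- ls -laR /mnt/mqm-data/qmgrs/TEST_MQ

# A crash-looping instance usually won't stay up long enough for 'oc exec';
# 'oc debug' starts a copy of the pod with the same volume mounts instead
# (with RWO storage the debug copy needs to land on the same node as the volume)
oc debug pod/test-mq-ibm-mq-1 -- ls -laR /mnt/mqm-data/qmgrs/TEST_MQ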
------------------------------
Jonathan Rumsey
Senior Software Engineer
Original Message:
Sent: Wed February 26, 2025 06:29 AM
From: Andres Colodrero
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
Hi,
I have 2 QMs with Native HA deployed in our test environment. I executed a minor upgrade on OpenShift today, but one of the QMs didn't start.
2025-02-26T11:19:13.460Z Using queue manager name: TEST_MQ
2025-02-26T11:19:13.460Z CPU architecture: amd64
2025-02-26T11:19:13.460Z Linux kernel version: 5.14.0-427.50.1.el9_4.x86_64
2025-02-26T11:19:13.461Z Base image: Red Hat Enterprise Linux 9.5 (Plow)
2025-02-26T11:19:13.461Z Running as user ID 1000740000 with primary group 0, and supplementary groups 0,1000740000
2025-02-26T11:19:13.461Z Capabilities: none
2025-02-26T11:19:13.461Z seccomp enforcing mode: filtering
2025-02-26T11:19:13.461Z Process security attributes: system_u:system_r:container_t:s0:c19,c27
2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm
2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm-data
2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm-log
2025-02-26T11:19:13.491Z Error creating directory structure: the 'crtmqdir' command returned with code: 20. Reason: The filesystem object '/mnt/mqm/data/web/installations/Installation1/servers/mqweb/mqwebuser.xml' is a symbolic link.
AMQ6245E: Error executing system call 'open' on file '/mnt/mqm-data/qmgrs/TEST_MQ/qm.ini' error '0'.
AMQ6245E: Error executing system call 'mkdir' on file '/mnt/mqm-data/qmgrs/TEST_MQ/autocfg' error '2'.
AMQ6245E: Error executing system call 'mkdir' on file '/mnt/mqm-data/qmgrs/TEST_MQ/ssl' error '2'.
AMQ6245E: Error executing system call 'mkdir' on file '/mnt/mqm-data/qmgrs/TEST_MQ/plugcomp' error '2'.
2025-02-26T11:19:13.492Z /opt/mqm/bin/crtmqdir: exit status 20
As a consequence, I have 2 pods failing with this error and I cannot start the queue manager (I'm lucky this is only happening in the test environment).
I guess it is a problem with the volumes, as the QM is trying to start again from the configuration. I saw this issue before when I was removing a QM but not its volumes.
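The error numbers in those AMQ6245E messages look like ordinary errno values (2 = ENOENT, i.e. /mnt/mqm-data/qmgrs/TEST_MQ or its parent directories are missing), which fits data going missing from the volume rather than a permissions problem. One quick check is whether the PVCs are still bound to the same volumes after the upgrade (names below are examples; use the real ones from the first command):

# List the claims for this queue manager and check STATUS/VOLUME
oc get pvc | grep test-mq

# Inspect one claim in detail, including events, to see if it was rebound or failed to attach
oc describe pvc data-test-mq-ibm-mq-1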
Here is my QM definition (I removed some fields for simplicity):
spec:
  web:
    console:
      authentication:
        provider: manual
      authorization:
        provider: manual
    enabled: true
    manualConfig:
      configMap:
        name: mq-web-config
  version: 9.4.1.1-r1
  template:
    pod:
      containers:
        - env:
            - name: MQ_ENABLE_EMBEDDED_WEB_SERVER
              value: 'true'
          name: qmgr
          resources: {}
  queueManager:
    route:
      enabled: true
    name: TEST_MQ
    mqsc:
      - configMap:
          items:
            - 91-startup.mqsc
          name: test-mq-mqsc-startup
      - secret:
          items:
            - 92-ldapauth.mqsc
          name: test-mq-mqsc-ldapauth
    logFormat: Basic
    availability:
      type: NativeHA
      updateStrategy: RollingUpdate
    storage:
      defaultClass: thin-csi
      persistedData:
        enabled: true
        size: 2Gi
        type: persistent-claim
      queueManager:
        class: thin-csi
        size: 20Gi
        type: persistent-claim
      recoveryLogs:
        enabled: true
        size: 2Gi
        type: persistent-claim
------------------------------
Andres Colodrero
------------------------------