Removing the PV data (data, persisted-data and logs) for the first failing instance causes the container to recreate it as a 'blank' instance. While this instance would now be able to communicate with the other running instances, it would still need an elected active instance to bootstrap its log data.
An election vote from a 'blank' instance intentionally does not carry the same weight as one from a 'full' instance, so after the recreate there would still be insufficient voting weight for the instance that could still access its PV data to be elected as holding the best copy of the data.
Once the second instance with the same PV access issues was recreated as a 'blank' and network connectivity was restored with the other two instances, the instance that could still access its PV data was able to win the election and then rebase the two blanks with its copy of the data.
A split-brain (or partitioned state) is the term used when two active instances are both accepting application work and diverging the state of the queue manager; this cannot occur within a single Native HA group because of the quorum rules. The reason the queue manager was unable to start was that two instances had failed and could not communicate, so it was not possible for the remaining instance to win an election.
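If it is useful for next time, the role and replication state that each instance believes it has can be checked from inside the pods, along the lines of the sketch below (this assumes an MQ level where dspmq supports the Native HA output option; the pod, container and queue manager names are the ones used later in this thread):

# Ask each instance for its Native HA view (role, in-sync state, quorum)
for i in 0 1 2; do
  echo "--- test-mq-ibm-mq-$i ---"
  oc exec test-mq-ibm-mq-$i -c qmgr -- dspmq -m TEST_MQ -o nativeha
done

An instance that cannot reach quorum, or that reports it is not in sync, points you at which PVs to look at before deciding what to reset.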
Original Message:
Sent: Fri February 28, 2025 09:40 AM
From: Andres Colodrero
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
Removing all volumes (data, persisted-data and logs) makes the pod start again.
Then I ended up with the pods running, but the QM was still not available.
I fixed pod number 3, and the QM came back to work.
So I suppose I ended up in this situation as follows:
- I have 3 worker nodes.
- During the upgrade, 1 node is drained and restarted, leaving only 2 pods running the QM.
- OpenShift doesn't know which pod is the active one running the QM, so it can drain any of the 3 pods.
- Once the node is back up and running, we again have 3 pods running the QM.
- Then another node is drained before the QM is active again.
So I guess I ended up with 2 pods in a split-brain situation? If that is the case, 4 worker nodes should be the norm.
But I didn't see this behaviour in my dev environment.
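One way to stop more than one queue manager pod being evicted at a time during node drains is a PodDisruptionBudget over the StatefulSet's pods. The sketch below is illustrative only: the label selector is an assumption based on the pod names, and recent MQ Operator versions may already manage a PDB for Native HA, so check with 'oc get pdb' first.

# See whether a PodDisruptionBudget already covers the queue manager pods
oc get pdb

# Illustrative PDB allowing at most one MQ pod to be voluntarily evicted at a time
oc apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-mq-ibm-mq-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: test-mq   # assumed label; confirm with 'oc get pod test-mq-ibm-mq-0 --show-labels'
EOF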
------------------------------
Andres Colodrero
Original Message:
Sent: Fri February 28, 2025 03:46 AM
From: Jonathan Rumsey
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
It would be good to understand what happened with access to the data. I'm not aware of any issues with 4.16.x, and there is nothing obvious in the list of bugfixes (Chapter 1. OpenShift Container Platform 4.16 release notes, Red Hat Documentation)
that might explain this. Native HA instances do have the ability to automatically recover from localised filesystem damage or corruption, provided they can still establish connectivity with other healthy instances, but that isn't possible in this situation.
Dropping all persistent volumes would indeed cause the queue manager to be recreated.
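For a single failing instance, that would look something like the following (the PVC names here are examples only; list them first and delete only the ones that belong to the instance being reset):

# Find the data, persisted-data and log PVCs for the failing instance
oc get pvc | grep test-mq-ibm-mq-1

# Delete that instance's PVCs, then the pod, so the StatefulSet recreates it as a 'blank'
# (names below are illustrative; use the exact names returned above)
oc delete pvc data-test-mq-ibm-mq-1 persisted-data-test-mq-ibm-mq-1 log-test-mq-ibm-mq-1
oc delete pod test-mq-ibm-mq-1

# If the recreated pod sticks in Pending while the old PVCs finish terminating,
# delete the pod once more so it comes back with freshly created claims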
------------------------------
Jonathan Rumsey
Senior Software Engineer
Original Message:
Sent: Thu February 27, 2025 02:10 PM
From: Andres Colodrero
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
Hi,
This is a test instance that I can destroy and redeploy, but I would like to investigate what happened. I will check if I can open a ticket.
I have 3 worker nodes. On these 3 nodes there are 2 queue managers with Native HA. 1 QM is OK and the other is failing.
I performed a small upgrade (4.16.1 to latest 4.16). During the upgrade, 1 worker node is drained and upgraded, and after that new pods can be deployed to it. I guess during the upgrade process I ended up having only 1 available IBM MQ pod:
➜ ~ oc get pods -o wide
NAME                READY   STATUS             RESTARTS        AGE   IP            NODE                             NOMINATED NODE   READINESS GATES
test-mq-ibm-mq-0    0/1     Running            1 (32h ago)     32h   10.129.2.5    tisocpp01-hjb5q-worker-0-8btzn   <none>           <none>
test-mq-ibm-mq-1    0/1     CrashLoopBackOff   351 (38s ago)   29h   10.128.2.34   tisocpp01-hjb5q-worker-0-kgdnq   <none>           <none>
test-mq-ibm-mq-2    0/1     CrashLoopBackOff   392 (44s ago)   31h   10.128.4.50   tisocpp01-hjb5q-worker-0-7cz7j   <none>           <none>
test-pki-ibm-mq-0   1/1     Running            0               32h   10.129.2.4    tisocpp01-hjb5q-worker-0-8btzn   <none>           <none>
test-pki-ibm-mq-1   0/1     Running            0               32h   10.128.2.4    tisocpp01-hjb5q-worker-0-kgdnq   <none>           <none>
test-pki-ibm-mq-2   0/1     Running            0               32h   10.128.4.8    tisocpp01-hjb5q-worker-0-7cz7j
Is there a way to recover from this situation? Drop the volumes?
Lessons learnt:
1. Simplest to have 4 nodes.
2. Control the upgrade process.
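On the second point, one way to control when worker nodes are drained during a cluster upgrade is to pause the worker MachineConfigPool and only let the node updates roll once the Native HA group is healthy, for example:

# Hold back worker node updates (and therefore drains) while the upgrade rolls out
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# Check the queue manager pods are all back and healthy before letting nodes update
oc get pods -o wide

# Allow the worker nodes to update (they are still drained one at a time by default)
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'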
------------------------------
Andres Colodrero
Original Message:
Sent: Thu February 27, 2025 05:37 AM
From: Jonathan Rumsey
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
The mqwebuser.xml warning is benign; however, I agree the AMQ6245E errors would suggest that the persistent volume mount either doesn't contain the queue manager data that MQ is expecting to find, or that there is some other problem with MQ accessing it. A group of 3 instances provides redundancy for 1 instance to be unavailable at a time, so the inability of 2 instances to access their data will cause an availability outage. If you are not able to spot any obvious differences between the PVs of the working instances and the failing ones, are you able to raise a support ticket with IBM?
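For example, to compare what is actually on the volumes, the queue manager data directory of the instance that is still running can be listed alongside a failing one (the container is named 'qmgr' in the spec further down; adjust the paths if your mounts differ):

# Instance that is still running
oc exec test-mq-ibm-mq-0 -c qmgr -- ls -laR /mnt/mqm-data/qmgrs/TEST_MQ

# A crash-looping instance usually won't stay up long enough for 'oc exec';
# 'oc debug' starts a copy of the pod with the same volume mounts instead
# (with RWO storage the debug copy needs to land on the same node as the volume)
oc debug pod/test-mq-ibm-mq-1 -- ls -laR /mnt/mqm-data/qmgrs/TEST_MQ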
------------------------------
Jonathan Rumsey
Senior Software Engineer
Original Message:
Sent: Wed February 26, 2025 06:29 AM
From: Andres Colodrero
Subject: MQ Operator: Error after upgrading the OpenShift Cluster
Hi,
I have 2 QMs with Native HA deployed in our test environment. I executed a minor upgrade on OpenShift today, but one of the QMs didn't start.
2025-02-26T11:19:13.460Z Using queue manager name: TEST_MQ
2025-02-26T11:19:13.460Z CPU architecture: amd64
2025-02-26T11:19:13.460Z Linux kernel version: 5.14.0-427.50.1.el9_4.x86_64
2025-02-26T11:19:13.461Z Base image: Red Hat Enterprise Linux 9.5 (Plow)
2025-02-26T11:19:13.461Z Running as user ID 1000740000 with primary group 0, and supplementary groups 0,1000740000
2025-02-26T11:19:13.461Z Capabilities: none
2025-02-26T11:19:13.461Z seccomp enforcing mode: filtering
2025-02-26T11:19:13.461Z Process security attributes: system_u:system_r:container_t:s0:c19,c27
2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm
2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm-data
2025-02-26T11:19:13.462Z Detected 'ext4' volume mounted to /mnt/mqm-log
2025-02-26T11:19:13.491Z Error creating directory structure: the 'crtmqdir' command returned with code: 20. Reason: The filesystem object '/mnt/mqm/data/web/installations/Installation1/servers/mqweb/mqwebuser.xml' is a symbolic link.
AMQ6245E: Error executing system call 'open' on file '/mnt/mqm-data/qmgrs/TEST_MQ/qm.ini' error '0'.
AMQ6245E: Error executing system call 'mkdir' on file '/mnt/mqm-data/qmgrs/TEST_MQ/autocfg' error '2'.
AMQ6245E: Error executing system call 'mkdir' on file '/mnt/mqm-data/qmgrs/TEST_MQ/ssl' error '2'.
AMQ6245E: Error executing system call 'mkdir' on file '/mnt/mqm-data/qmgrs/TEST_MQ/plugcomp' error '2'.
2025-02-26T11:19:13.492Z /opt/mqm/bin/crtmqdir: exit status 20
As a consequence, I have 2 pods failing with this error and I cannot start the queue manager (I'm lucky this is only happening in the test environment).
I guess it is a problem with the volumes, as the QM is trying to start again from the configuration. I saw this issue before when I was removing a QM but not its volumes.
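The error numbers in those AMQ6245E messages look like ordinary errno values (2 = ENOENT, i.e. /mnt/mqm-data/qmgrs/TEST_MQ or its parent directories are missing), which fits data going missing from the volume rather than a permissions problem. One quick check is whether the PVCs are still bound to the same volumes after the upgrade (names below are examples; use the real ones from the first command):

# List the claims for this queue manager and check STATUS/VOLUME
oc get pvc | grep test-mq

# Inspect one claim in detail, including events, to see if it was rebound or failed to attach
oc describe pvc data-test-mq-ibm-mq-1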
Here is my QM definition (I removed some fields for simplicity):
spec:
  web:
    console:
      authentication:
        provider: manual
      authorization:
        provider: manual
    enabled: true
    manualConfig:
      configMap:
        name: mq-web-config
  version: 9.4.1.1-r1
  template:
    pod:
      containers:
        - env:
            - name: MQ_ENABLE_EMBEDDED_WEB_SERVER
              value: 'true'
          name: qmgr
          resources: {}
  queueManager:
    route:
      enabled: true
    name: TEST_MQ
    mqsc:
      - configMap:
          items:
            - 91-startup.mqsc
          name: test-mq-mqsc-startup
      - secret:
          items:
            - 92-ldapauth.mqsc
          name: test-mq-mqsc-ldapauth
    logFormat: Basic
    availability:
      type: NativeHA
      updateStrategy: RollingUpdate
    storage:
      defaultClass: thin-csi
      persistedData:
        enabled: true
        size: 2Gi
        type: persistent-claim
      queueManager:
        class: thin-csi
        size: 20Gi
        type: persistent-claim
      recoveryLogs:
        enabled: true
        size: 2Gi
        type: persistent-claim
------------------------------
Andres Colodrero
------------------------------