Thank you for your thoughtful advice.
I agree that it should be upgraded, but unfortunately we cannot do that at the moment.
We managed to recreate this problem on a non-production system and confirmed that recreating the CAA repository could be a solution for cluster recovery.
We are therefore considering applying this solution to the production system.
Original Message:
Sent: Mon March 11, 2024 10:31 AM
From: Michal Wiktorek
Subject: One of PowerHA nodes failed to start after storage migration.
If you've migrated the storage, I think it's worth checking a few things.
- check the SCSI reservation
# lsattr -El hdiskX -a reserve_policy
The CAA disk must have the no_reserve policy. If the setting is different (e.g., single_path), you must change it using the command:
# chdev -l hdiskX -a reserve_policy=no_reserve (if the disk is in use, you have to add the -P parameter and restart the system)
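As a small illustration (hdisk3 here is just a placeholder for your repository disk), the full check-and-fix on a disk that is in use could look like this:
# lsattr -El hdisk3 -a reserve_policy (should report no_reserve for the CAA disk)
# chdev -l hdisk3 -a reserve_policy=no_reserve -P (with -P the change is only recorded in the ODM)
# shutdown -Fr (reboot so the new reserve policy takes effect)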
- check the CAA disk identifiers. In newer versions of PowerHA this is not so important, but in older versions such as 7.1 the following are significant in the cluster configuration: the PVID, the UUID, and the hdisk name.
# clmgr view report repository (this command only works on an active node)
Check the PVID of the repository disk in the ODM on both nodes and compare it with the lspv output:
# odmget HACMPsircol
Check the current hdisk name, PVID, and UUID of the disk on both nodes (the UUID should be visible in the last column):
# lspv -u
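To make the comparison easier, you can filter the output for the repository disk (the grep filters below are just a convenience; hdiskX is a placeholder):
# odmget HACMPsircol | grep repository (the repository field holds the PVID stored in the cluster configuration)
# lspv | grep hdiskX (PVID as seen on this node)
# lspv -u | grep hdiskX (same disk, with the UUID in the last column)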
The UUID of the disk might have changed after the storage migration, so compare the UUID from lspv -u with the value stored on the cluster0 device:
# lsattr -El cluster0 -a clvdisk
If the value is different from the one shown in the lspv -u output, you can change it using the following command (on both nodes):
# chdev -l cluster0 -a clvdisk=NEW_UUID
After changing this, the cluster should be restarted and synchronized.
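A minimal sketch of that sequence, assuming NEW_UUID is the value taken from the last column of lspv -u (run the chdev on both nodes):
# lsattr -El cluster0 -a clvdisk (UUID currently known to CAA)
# chdev -l cluster0 -a clvdisk=NEW_UUID (repeat on the second node)
# clmgr sync cluster (after restarting cluster services on both nodes)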
If you notice an entry in the log like the one below, find the previous CAA disk in the Defined state and delete it, then try to resynchronize the cluster. (The old disk name might still be in the cluster configuration; as long as that device is visible but in the Defined state, the cluster will keep trying to use a non-existent disk, so the old device has to be removed before the cluster can find the new one under its different identifier.)
:get_local_nodename[63] : No match - nodename must have changed or must be a new cluster.
# lsdev | grep hdisk | grep Defined
# rmdev -dl hdiskY
After this, try to synchronize the cluster.
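If you prefer the command line for that last step, the verification and synchronization can be started with, for example:
# clmgr sync cluster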
I hope you manage to quickly upgrade this cluster to a supported version and enjoy the benefits of having IBM support.
Best regards,
Michal Wiktorek
------------------------
https://www.linkedin.com/in/michal-wiktorek-83b2b47b/
------------------------
------------------------------
Michal Wiktorek
Original Message:
Sent: Wed January 31, 2024 09:31 AM
From: SHINGO NAGAI
Subject: One of PowerHA nodes failed to start after storage migration.
We have trouble starting up one of the PowerHA cluster nodes after migrating the storage to a new unit. During the migration, all the disks were copied to the new storage except for the repository disk. The old repository disk was replaced with a new one from the smit menu.
The cluster has 2 nodes; node#2 can run alone, but node#1 cannot start up. I have seen the error message below, but I am sure rhosts is correct because it is identical to the one on node#2.
Error Message: "cl_rsh: node2 cannot be resolved to a valid CAA node name. Check the contents of /etc/cluster/rhosts."
The problem is probably that the "cthags" service on node1 has stayed in the "inoperative" state, and the following command to start CAA failed on that node:
# clmgr online node node1 START_CAA=yes
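For reference, the state of that subsystem can be checked on each node with:
# lssrc -s cthags (shows whether cthags is active or inoperative)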
Does anyone have an idea how to identify the cause of this problem? Does it look like there is something wrong with CAA? In that case, do we need to rebuild the CAA and PowerHA cluster?
Regards,
------------------------------
SHINGO NAGAI
------------------------------