PowerHA for AIX

  • 1.  One of PowerHA nodes failed to start after storage migration.

    Posted Wed January 31, 2024 11:19 AM

    We are having trouble starting one of our PowerHA cluster nodes after migrating to new storage. During the migration, all disks were copied to disks on the new storage except for the repository disk. The old repository disk was replaced with a new one from the smit menu.

    The cluster has two nodes. node2 can run alone, but node1 cannot start. I see the error message below, but I am sure rhosts is correct because it is identical to the one on node2.

    Error Message: "cl_rsh: node2 cannot be resolved to a valid CAA node name. Check the contents of /etc/cluster/rhosts."

    The likely problem is that the "cthags" service on node1 has stayed in the "inoperative" state, and the following command to start CAA failed on that node.

    # clmgr online node node1 START_CAA=yes
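A minimal sketch for confirming the subsystem state described above. It assumes the usual `lssrc -s` layout (a header line, then one line per subsystem with the name in the first field and the status in the last field). `subsys_status` is a hypothetical helper name, not a PowerHA or RSCT command.

```shell
#!/bin/sh
# Sketch: extract the status column for one SRC subsystem from `lssrc -s` output.
# subsys_status is a hypothetical helper; the lssrc output is read from stdin.
subsys_status() {
    # $1 = subsystem name, e.g. cthags
    # The header line is skipped because its first field is "Subsystem".
    awk -v name="$1" '$1 == name { print $NF }'
}

# On a cluster node (not executed here):
#   lssrc -s cthags | subsys_status cthags
```

Running the same check on both nodes makes it easy to see whether only the failing node reports "inoperative".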

    Does anyone have an idea how to identify the cause of this problem? Does it look like something is wrong with CAA? If so, do we need to rebuild CAA and the PowerHA cluster?

    Regards,



    ------------------------------
    SHINGO NAGAI
    ------------------------------



  • 2.  RE: One of PowerHA nodes failed to start after storage migration.

    Posted Wed February 07, 2024 05:36 PM
    Edited by Mostafa Mahmoud Wed February 07, 2024 05:36 PM

    Hi Shingo,

    That does look like an issue in the CAA layer. It depends on which hostname the CAA cluster was initially created with, in conjunction with the node's AIX hostname and the contents of the /etc/hosts file.

    The best way to resolve this issue is to open a case with IBM support. Such issues need a full assessment to find the culprit.
    ------------------------------
    Regards,
    Mostafa Mahmoud
    AIX / PowerHA / CAA / VMRM / RSCT Development Support Engineer
    ------------------------------



  • 3.  RE: One of PowerHA nodes failed to start after storage migration.

    Posted Thu February 08, 2024 10:50 AM
    Hi Mostafa,
    Thank you for your advice. Actually, this PowerHA version is 7.1.3 SP1, which is out of support. For various reasons we cannot upgrade it and need to solve this issue on this version.
     
    If there is something wrong with CAA, I think recreating CAA could be a solution. Do you have any thoughts on this?
    Specifically, we would perform step 3 (CAA repository disk scrub) in the link below, then synchronize the cluster configuration from node2 to recreate the CAA cluster.
    https://www.ibm.com/support/pages/remove-powerha-systemmirror-cluster-configuration-and-rebuild-it-again



    ------------------------------
    SHINGO NAGAI
    ------------------------------



  • 4.  RE: One of PowerHA nodes failed to start after storage migration.

    Posted Sat February 10, 2024 11:27 AM

    Hi Shingo,

    Yes, those steps should be helpful to rebuild the CAA cluster from scratch. Give it a try and post the outcome.

    You may also want to consider upgrading PowerHA to a supported level.



    ------------------------------
    Regards,
    Mostafa Mahmoud
    AIX / PowerHA / CAA / VMRM / RSCT Development Support Engineer
    ------------------------------



  • 5.  RE: One of PowerHA nodes failed to start after storage migration.

    Posted Sun February 11, 2024 10:48 PM

    Hi Mostafa,
    Thanks. We will probably try the steps in a few weeks, after a review and scheduling for the production system. Once completed, I'll post the outcome.



    ------------------------------
    SHINGO NAGAI
    ------------------------------



  • 6.  RE: One of PowerHA nodes failed to start after storage migration.

    Posted Wed March 13, 2024 11:15 PM

    We reproduced the same problem on a non-production system and tried step 3 in the link below, so I'd like to report the result.
    https://www.ibm.com/support/pages/remove-powerha-systemmirror-cluster-configuration-and-rebuild-it-again

    It worked well once I added a few steps, as follows (for the case where node2 has the problem):
    - before doing step 3, stop PowerHA and CAA (clmgr offline node <nodename1> STOP_CAA=yes)
    - do step 3
    - after doing step 3, delete the ODM data on node2 (clmgr delete cluster NODES=<nodename2>)
    - synchronize from node1
    - start the PowerHA services (both nodes started successfully)
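The order of the steps above can be sketched as a dry-run plan that only prints commands for review and runs nothing. `recover_plan` is a hypothetical helper. The `clmgr offline node` and `clmgr delete cluster` lines come from the post; `clmgr sync cluster` and `clmgr online cluster` are assumptions about how the "sync" and "start PowerHA service" steps were carried out, and the scrub itself is left as a pointer to step 3 of the linked IBM page.

```shell
#!/bin/sh
# Sketch: print the recovery order reported in the post (dry run, nothing is executed).
# recover_plan is a hypothetical helper name.
recover_plan() {
    good=$1   # node that still starts (node1 in the reported test)
    bad=$2    # node that fails to start (node2 in the reported test)
    echo "clmgr offline node $good STOP_CAA=yes"
    echo "# step 3 of the linked IBM page: scrub the CAA repository disk"
    echo "clmgr delete cluster NODES=$bad"
    # Assumption: the post only says "do sync from node1".
    echo "clmgr sync cluster     # run on $good"
    # Assumption: the post only says "start PowerHA service".
    echo "clmgr online cluster"
}

# Example: recover_plan node1 node2
```

Printing the plan first gives a reviewable checklist for the change window before anything is run on production.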

    We haven't tried this on the production system yet, but I assume it will work there as well.



    ------------------------------
    SHINGO NAGAI
    ------------------------------



  • 7.  RE: One of PowerHA nodes failed to start after storage migration.

    User Group Leader
    Posted Mon March 11, 2024 12:16 PM

    If you've migrated the storage, I think it's worth checking a few things.

    - check the SCSI reservation

    # lsattr -El hdiskX -a reserve_policy
    The CAA disk must use the no_reserve policy. If the setting is different (e.g., single_path), change it with the command:
    # chdev -l hdiskX -a reserve_policy=no_reserve (if the disk is active, add the -P flag and reboot the system)
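A minimal sketch of the reserve policy check, assuming `lsattr -El <disk> -a reserve_policy` prints one line with the attribute name in the first field and its value in the second. Both function names are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: decide whether a disk still needs reserve_policy=no_reserve.
# Both helpers read `lsattr -El <disk> -a reserve_policy` output from stdin.
reserve_policy_value() {
    awk '$1 == "reserve_policy" { print $2 }'
}

needs_no_reserve_fix() {
    # Succeeds (exit 0) when the value is anything other than no_reserve,
    # including the case where no reserve_policy line was found at all.
    [ "$(reserve_policy_value)" != "no_reserve" ]
}

# On a cluster node (not executed here):
#   lsattr -El hdiskX -a reserve_policy | needs_no_reserve_fix \
#       && echo "hdiskX: reserve_policy is not no_reserve"
```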

    - check the CAA disk identifiers. In newer versions of PowerHA this matters less, but in old versions like 7.1 the following are significant in the cluster configuration: the PVID, the UUID, and the hdisk name.

    # clmgr view report repository (this command works only on an active node)

    Check the PVID of the disk in the ODM on both nodes and compare it with the lspv output

    # odmget HACMPsircol

    Check the current hdisk name, PVID, and UUID of the disk on both nodes (the UUID should be visible in the last column)
    # lspv -u 

    The UUID of the disk might have changed after the storage migration, so compare it with the value reported by the command

    # lsattr -El cluster0 -a clvdisk

    If the value differs from the one shown in the lspv -u output, you can change it using the command (on both nodes)

    # chdev -l cluster0 -a clvdisk=NEW_UUID

    After changing this, the cluster should be restarted and synchronized. 
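A minimal sketch of the UUID comparison described above. It assumes, as the post states, that the UUID is the last column of `lspv -u`, and that `lsattr -El cluster0 -a clvdisk` prints the attribute name followed by its value. All three function names are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: compare the repository disk UUID from `lspv -u` with cluster0's clvdisk.
# All helpers are hypothetical and read command output from stdin.
uuid_from_lspv() {
    # $1 = hdisk name; prints the last column of that disk's `lspv -u` line
    awk -v disk="$1" '$1 == disk { print $NF }'
}

clvdisk_value() {
    # Prints the second field of the `lsattr -El cluster0 -a clvdisk` line
    awk '$1 == "clvdisk" { print $2 }'
}

uuid_matches() {
    # True only when both values are non-empty and identical
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# On each node (not executed here):
#   disk_uuid=$(lspv -u | uuid_from_lspv hdiskX)
#   caa_uuid=$(lsattr -El cluster0 -a clvdisk | clvdisk_value)
#   uuid_matches "$disk_uuid" "$caa_uuid" || echo "clvdisk differs from lspv -u"
```

Treating an empty value as a mismatch keeps a failed lookup from being read as "identical".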

    The old disk name might still be in the cluster configuration. As long as the old device is still visible in the Defined state, the cluster keeps trying to use a disk that no longer exists, so it has to be removed before the cluster can find the new disk by its new identifier. If you see a log entry like the one below, find the previous CAA disk in the Defined state and delete it:

    :get_local_nodename[63] : No match - nodename must have changed or must be a new cluster.

    # lsdev | grep hdisk | grep Defined
    # rmdev -dl hdiskY
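A minimal sketch of a stricter filter than the two grep calls above, assuming `lsdev` prints the device name in the first field and its state in the second. `defined_hdisks` is a hypothetical helper; it only lists candidates and leaves the `rmdev -dl` decision to the administrator.

```shell
#!/bin/sh
# Sketch: list hdisk devices whose state is exactly "Defined".
# defined_hdisks is a hypothetical helper; lsdev output is read from stdin.
defined_hdisks() {
    # Matching on fields avoids hits where "Defined" or "hdisk" only
    # appears somewhere in the description column.
    awk '$1 ~ /^hdisk[0-9]+$/ && $2 == "Defined" { print $1 }'
}

# On a cluster node (not executed here):
#   lsdev | defined_hdisks
```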

    After this, try to synchronize the cluster.

    I hope you manage to upgrade this cluster to a supported version soon and enjoy the benefits of having IBM support.

    Best regards,
    Michal Wiktorek

    ------------------------
    https://www.linkedin.com/in/michal-wiktorek-83b2b47b/
    ------------------------



    ------------------------------
    Michal Wiktorek
    ------------------------------



  • 8.  RE: One of PowerHA nodes failed to start after storage migration.

    Posted Wed March 13, 2024 10:55 PM

    Michal,

    Thank you for your thoughtful advice.

    I checked what you pointed out on the servers. 
    - SCSI reservation: Confirmed that reserve_policy is set to "no_reserve".
    - PVID: Confirmed all the PVIDs are identical. (lspv, odmget, clmgr)
    - UUID: Confirmed the UUIDs on both nodes are identical. (lspv, lsattr)
    - log: Confirmed there is no entry like what you showed. 

    I agree with the opinion that it should be upgraded, but unfortunately, we cannot do that at this moment. 
    We managed to reproduce this problem on the non-production system and confirmed that recreating CAA can recover the cluster.
    We are therefore considering applying this solution to the production system.

    Regards,



    ------------------------------
    SHINGO NAGAI
    ------------------------------