AIX

  • 1.  Hacmp clstop hung

    Posted Mon April 20, 2009 03:15 AM

    Originally posted by: SystemAdmin


    I have a two-node cluster on HACMP 5.3, and I am not able to stop the cluster. When I check the processes, I can see a clstop process and a config_too_long process that is more than a month old. In cspoc.log, I can see that the current state of the cluster is ST_BARRIER. I just need to know whether it is safe to kill the hung clstop and config_too_long processes in order to bring down the cluster. Experts, please help me.


  • 2.  Re: Hacmp clstop hung

    Posted Mon April 20, 2009 02:53 PM

    Originally posted by: Casey_B


    Hello,

    First, there is a new forum specifically for HACMP/PowerHA questions.

    Here is the location: http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1611

    Second, the question is a bit too vague for me to give reliable advice.

    Here are some things that you can look at:

    1) You need to know why the cluster is in a barrier state.
    Simplified, a barrier state means that the cluster nodes are waiting for all nodes to finish the current
    step in the recovery plan.

    Look back in hacmp.out, before the config_too_long messages start, to see what the last message
    on each node was.
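
    As a rough illustration of "look back before the config too long messages", here is a minimal Python sketch that returns the few log lines preceding the first line that mentions the marker. It assumes the marker text config_too_long appears literally in the log; the function name and the context size are hypothetical, and you would pass in your own hacmp.out path, since its location differs between releases.

```python
from collections import deque


def lines_before_marker(lines, marker="config_too_long", context=5):
    """Return up to `context` lines that precede the first line containing `marker`.

    `lines` can be any iterable of strings, for example an open log file.
    Returns an empty list if the marker never appears.
    """
    recent = deque(maxlen=context)  # keeps only the most recent lines seen so far
    for line in lines:
        if marker in line:
            return list(recent)
        recent.append(line.rstrip("\n"))
    return []
```

    Run it on each node with something like `lines_before_marker(open(path_to_hacmp_out, errors="replace"))` and compare the results across nodes to see which step never finished.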

    2) Killing the config_too_long process will not help the cluster progress any further in the recovery plan.
    If there is a process that has been hanging, and you have confirmed it is the last thing
    to be run in hacmp.out (for example, an application stop script that appears never to have exited),
    then you COULD try killing that process. The cluster recovery plan would probably continue and enter
    a failed state. From the failed state, you can use the smit panel for "Recovery from script failure"
    to continue.
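
    To help confirm which process has been hanging and for how long, here is a hedged Python sketch that filters ps output by elapsed time. It assumes the input comes from something like `ps -eo pid,etime,args`, where the elapsed-time column uses the POSIX [[dd-]hh:]mm:ss format; the function names and the default process names are only illustrative.

```python
def etime_to_seconds(etime):
    """Convert a ps elapsed-time field in [[dd-]hh:]mm:ss form to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad missing hours (and minutes) with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running(ps_lines, names=("clstop", "config_too_long"), min_days=1):
    """Return (pid, etime, args) for matching processes at least `min_days` old."""
    found = []
    for line in ps_lines:
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skips the header line and anything malformed
        pid, etime, args = fields
        old_enough = etime_to_seconds(etime) >= min_days * 86400
        if old_enough and any(name in args for name in names):
            found.append((int(pid), etime, args.strip()))
    return found
```

    This only reports candidates. Whether a given process is safe to kill still depends on what hacmp.out shows as the last step on each node.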

    In the meantime, HA might try to perform actions on your resource groups. (For example, if you kill the
    application script, then HA may try to recover by unmounting the filesystems and varying off the volume groups.)

    It really depends on what the exact cluster state is.

    All of this is, of course, without knowing your system or looking at your logs,
    so the information and advice could be very wrong.

    If it were my cluster, I would try, on all nodes, to manually stop all of my applications, manually unmount all of the
    filesystems, and manually vary off the volume groups. This way, you know that your data is not currently being
    accessed, and you can work a little more easily. Then, I would think that a reboot might be the easiest way to bring you to
    a clean state.
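
    The manual order described above can be sketched as a small dry-run planner in Python. It only builds the list of commands and does not run anything. The function name, mount points, and volume group names are hypothetical placeholders; umount and varyoffvg are the standard AIX commands, and the applications are assumed to be stopped already.

```python
def plan_manual_release(mount_points, volume_groups):
    """Return commands in a safe order: unmount filesystems, then vary off VGs.

    Nested mount points are unmounted before their parents, because a parent
    filesystem cannot be unmounted while a child is still mounted on it.
    """
    # Deeper paths sort first; paths of equal depth keep their given order.
    ordered = sorted(
        mount_points,
        key=lambda path: path.rstrip("/").count("/"),
        reverse=True,
    )
    commands = [f"umount {mount_point}" for mount_point in ordered]
    commands += [f"varyoffvg {vg}" for vg in volume_groups]
    return commands
```

    Print the result and review it before running anything by hand on each node.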

    Once you are in a clean state, you can examine your environment and logs in more detail to see what
    happened.

    Or you could call IBM support. They would love to help you and would be able to review the logs in more detail.

    Hope this helps,
    Casey