AIX

  • 1.  Hacmp: unmount of filesystem even application stop script returncode != 0

    Posted Sat March 21, 2009 05:59 AM

    Originally posted by: Ayaz_Anjum


    Folks,

    I am observing the following behaviour in HACMP.

    If the application stop script returns a nonzero return code, HACMP still goes on to unmount the filesystem. This behaviour is undesirable: if the application fails to stop, HACMP should abort the process of bringing the resource group offline and should not unmount the filesystems.

    We are using HACMP 5.4.1 with AIX 5.3.07. There are only application start and stop scripts, and no application monitoring scripts.

    I can think of commenting out fuser -c -k in the HACMP event scripts, but to me this does not look like a clean solution.

    Any suggestions?

    thanks


  • 2.  Re: Hacmp: unmount of filesystem even application stop script returncode != 0

    Posted Mon March 23, 2009 11:50 AM

    Originally posted by: Sprellster


    Why would you want the cluster to abort the resource group movement in this case? The whole purpose of the cluster is to move resources (sometimes forcefully) between machines if they become inoperable for some reason.


  • 3.  Re: Hacmp: unmount of filesystem even application stop script returncode !=

    Posted Mon March 23, 2009 01:17 PM

    Originally posted by: Casey_B


    Hello,
    I understand the choice to have manual intervention when your application stop script fails.
    (Meaning that if you really want PowerHA/HACMP to force the application down, you exit with a zero, and PowerHA/HACMP will do
    its best to continue on; otherwise you would like to give yourself a chance to handle an unexpected condition manually.)

    It would be best to try and code for all conditions in your stop script, though.
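
    As a rough illustration of the exit-code convention described above, here is a minimal sketch of a stop-script skeleton. Everything in it is an assumption made for the example: STOP_CMD, IS_RUNNING and STOP_TIMEOUT are hypothetical names, the default paths are made up, and nothing here is part of HACMP itself. It asks the application to stop, waits a bounded time, and returns zero only if the application is really down.

```shell
#!/bin/sh
# Hypothetical stop-script skeleton; none of these names come from HACMP.
# STOP_CMD     : command that asks the application to shut down cleanly
# IS_RUNNING   : command that exits 0 while the application is still up
# STOP_TIMEOUT : seconds to wait for a clean shutdown before giving up
STOP_CMD="${STOP_CMD:-/usr/local/bin/app_stop}"              # made-up path
IS_RUNNING="${IS_RUNNING:-/usr/local/bin/app_is_running}"    # made-up path
STOP_TIMEOUT="${STOP_TIMEOUT:-300}"

wait_for_stop() {
    # Poll IS_RUNNING once per second until it reports "down" or time runs out.
    # The variable is left unquoted on purpose so it may carry arguments.
    waited=0
    while $IS_RUNNING; do
        [ "$waited" -ge "$STOP_TIMEOUT" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

stop_application() {
    $STOP_CMD || echo "stop command reported an error" >&2
    if wait_for_stop; then
        return 0    # clean stop: zero tells the caller it may carry on
    fi
    echo "application still running after ${STOP_TIMEOUT}s" >&2
    return 1        # nonzero: the application could not be stopped
}
```

    A real stop script would call stop_application and exit with its return code. How the cluster reacts to a nonzero code depends on the HACMP level and the type of resource group move, which is what the rest of this thread is about.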

    More information about your environment is needed to give you an answer:

    What level of HACMP 5.4.1 are you running? (What service pack)
    When do you see the behaviour? With a fallover, or a resource group move when the releasing node will still be active, or a
    resource group move when the releasing node will not be running cluster services afterwards...

    There is an APAR IZ08308 in 5.4.1 that standardizes the behaviour between types of resource group moves.
    This may change the behaviour to what you want.

    Otherwise, I would agree that commenting out fuser would not be a clean workaround. You are exposing
    yourself to changes when you apply service packs.

    I will look further at this.

    Also, there is a new PowerHA/HACMP forum starting.
    There is not much information there currently, and the forum hasn't been listed on the main developerWorks
    forum page, but it is open for new topics, and I would like to duplicate the question/response on that
    forum once this is answered:

    http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1611

    Thanks,
    Casey


  • 4.  Re: Hacmp: unmount of filesystem even application stop script returncode !=

    Posted Wed March 25, 2009 04:46 AM

    Originally posted by: Ayaz_Anjum


    thanks,

    The reason this behaviour is undesirable is that there is a chance of data corruption in the case of databases. I have experienced this with our SAP application: the stop script failed to bring down the database and returned a non-zero code, and then HACMP failed over the filesystem, which resulted in an Oracle crash.

    Please see the following:
    http://www-01.ibm.com/support/docview.wss?uid=isg1IZ10028
    Apparently it's waiting for a DCR.

    Coding all scenarios in the application stop script is not easy, because there can be many reasons why, for example, Oracle is not shutting down. The best approach would be to leave it to the DBAs to figure out how to stop the database if the script fails to do so.

    We are using SP4.

    thanks, Ayaz


  • 5.  Re: Hacmp: unmount of filesystem even application stop script returncode !=

    Posted Wed March 25, 2009 09:27 AM

    Originally posted by: Casey_B


    Oracle should be pretty good about recovering from a forced exit.
    Most enterprise level databases have transaction logging to be able to recover
    the consistency of the database when killed.

    From personal experience, I seem to remember that db2 was pretty good at recovering after being killed.
    (In my clusters, if you didn't stop normally, you were killed, and shared memory removed, etc...)

    db2 needed to be started with a flag to use those logs, and recover...
    Maybe your Oracle start scripts aren't set up with the right flags to recover the database?

    Back to a workaround for your problem with the current design of HACMP....
    I still think editing the HACMP scripts is the wrong thing to do.

    Maybe, if your application stop script is going to fail....and fail in a way that you don't want
    any further processing to occur....

    Then maybe you send a page, email, etc, print a big message to the logs (The big message is important, so that you don't
    have a co-worker forget what you had configured)...and stop the script from completing.

    Maybe something like this:

    echo "ERROR ERROR, stopping script execution"
    read

    or maybe
    echo "ERROR ERROR....House is falling down"
    sleep 99999

    In the first case the script would wait for input on standard input that it will never get.
    In the second it would sleep for a very long time.
    Either way, the cluster would wait for the script and enter into "config too long".
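
    Put together, the "big message, then block" idea above might look something like the sketch below. This is only an illustration under stated assumptions: HOLD_LOG, HOLD_SECONDS and both function names are hypothetical, and the log path is made up. A page or email would be sent from announce_failure with whatever tool your site uses.

```shell
#!/bin/sh
# Hypothetical helper for a stop script: announce the failure loudly, then
# block so the cluster never sees the script finish.
# HOLD_LOG and HOLD_SECONDS are made-up names with made-up defaults.
HOLD_LOG="${HOLD_LOG:-/tmp/app_stop_hold.log}"
HOLD_SECONDS="${HOLD_SECONDS:-99999}"

announce_failure() {
    # Big banner, appended to the log and echoed to stderr, so a co-worker
    # reading the log sees why nothing is moving.
    {
        echo "##################################################"
        echo "ERROR ERROR: application stop failed on $(uname -n)"
        echo "Reason: $1"
        echo "Stop script is holding on purpose; manual action needed."
        echo "##################################################"
    } | tee -a "$HOLD_LOG" >&2
    # A site-specific page or email would go here.
}

hold_for_operator() {
    announce_failure "$1"
    # Block for a very long time instead of returning to the cluster.
    sleep "$HOLD_SECONDS"
}
```

    The stop script would call something like hold_for_operator "database did not shut down" at the point where it would otherwise have exited nonzero. The banner goes to both the log file and stderr so it is hard to miss.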

    Why do I say "maybe" so many times? :)

    Although I can understand making a choice for manual interventions...
    There are some real dangers with not continuing with the fallover.

    Even if the other node was able to start the application, and continue running...it wouldn't.
    - This means possible longer times to recovery.
    The application has not been killed, but is not running well.
    - With the application not killed, it could be accepting incoming connections, not able to write to the
    disk, and just losing data.

    You have to evaluate anything I say with understanding of your environment, and your application.

    Hope this helps
    Casey

    PS. Can I move some of this information into the PowerHA forum?


  • 6.  Re: Hacmp: unmount of filesystem even application stop script returncode !=

    Posted Wed March 25, 2009 03:37 PM

    Originally posted by: Ayaz_Anjum


    Sure Casey, move the thread to the HACMP forum.

    Well, I would say that capturing all the scenarios of application failure in the script, along with the corresponding recovery actions, is not easy, especially when system administration and application administration are with different teams. And this requires a fair knowledge of the application as well.

    thanks

    Ayaz