  • 1.  PowerHA 5.5 Cluster Status Stays UNSTABLE after Fallback

    Posted Thu July 30, 2009 12:26 PM

    Originally posted by: SystemAdmin


    Hello, and thanks for reading my post.

    I have a PowerHA 5.5 SP03 cluster with two nodes on AIX 6.1 TL2 SP4 in an Active/Passive configuration.

    Everything works just fine except that the cluster status remains UNSTABLE after a fallback.

    To correct this, I have to manually run the "Recover From Script Error" option in SMIT. Only after I do this does some information appear in the hacmp.out log.

    Here is an extract of the final part of hacmp.out:

    WARNING: Cluster Auto_Ambar has been running recovery program 'TE_JOIN_NODE_DEP_COMPLETE' for 180 seconds. Please check cluster status.
    WARNING: Cluster Auto_Ambar has been running recovery program 'TE_JOIN_NODE_DEP_COMPLETE' for 210 seconds. Please check cluster status.
    WARNING: Cluster Auto_Ambar has been running recovery program 'TE_JOIN_NODE_DEP_COMPLETE' for 240 seconds. Please check cluster status.
    WARNING: Cluster Auto_Ambar has been running recovery program 'TE_JOIN_NODE_DEP_COMPLETE' for 270 seconds. Please check cluster status.
    :check_for_site_up_complete+54 [ high = high ]
    :check_for_site_up_complete+54 version=1.4
    :check_for_site_up_complete+55 :check_for_site_up_complete+55 cl_get_path
    HA_DIR=es
    :check_for_site_up_complete+57 STATUS=0
    :check_for_site_up_complete+59 set +u
    :check_for_site_up_complete+61 [ ]
    :check_for_site_up_complete+72 exit 0
    config_too_long: Event 'TE_JOIN_NODE_DEP_COMPLETE' on Cluster Auto_Ambar Completed Successfully.

    Apparently something is hanging the check_for_site_up_complete event during fallbacks.

    Any help/clues appreciated.

    Thanks in Advance,
    Angel Aponte
    Venezuela


  • 2.  Re: PowerHA 5.5 Cluster Status Stays UNSTABLE after Fallback

    Posted Mon August 03, 2009 12:41 PM

    Originally posted by: Holgervk


    What does
    /usr/es/sbin/cluster/utilities/cllscustom
    report?


  • 4.  Re: PowerHA 5.5 Cluster Status Stays UNSTABLE after Fallback

    Posted Mon August 03, 2009 04:58 PM

    Originally posted by: Casey_B


    Hello Angel,

    My guess is that the error happens much earlier than the log entries you quoted.

    It looks like check_for_site_up_complete runs from start to completion and shows no sign of being hung.

    The end of the previous script, right before config_too_long occurs, might show what is happening.
    You will also have to look on all nodes. The cluster might be waiting for one node to complete a step of
    the event before allowing the others to continue.
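
    As a rough sketch of that first check, the function below prints the lines that come just before the first config_too_long warning in a log file. The function name, the default of 40 lines, and the warning text it searches for (taken from the extract quoted above) are assumptions for illustration; pass the path of your own hacmp.out, and run it on every node.

```shell
#!/bin/sh
# Sketch: show the lines that precede the first config_too_long warning.
# context_before_too_long is a hypothetical helper, not a PowerHA utility.
# $1 = path to hacmp.out, $2 = number of lines of context (default 40).

context_before_too_long() {
    log=$1
    count=${2:-40}

    # Line number of the first "has been running recovery program" warning.
    first=$(grep -n 'has been running recovery program' "$log" | head -n 1 | cut -d: -f1)
    if [ -z "$first" ]; then
        echo "no config_too_long warning found in $log" >&2
        return 1
    fi

    start=$((first - count))
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    end=$((first - 1))
    if [ "$end" -lt 1 ]; then
        # The warning is the first line of the file; nothing precedes it.
        return 0
    fi

    # Print only the context window.
    sed -n "${start},${end}p" "$log"
}

# Example call (the path is a placeholder):
# context_before_too_long /path/to/hacmp.out 60
```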

    You can look for the string "ERROR !!!" for one hint.
    You can also look for any exit command that is non-zero.
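
    Both searches can be sketched with grep. The function name is hypothetical, and the second pattern assumes that trace lines end in "exit N", as in the "exit 0" line of the extract above; adjust it if your trace output looks different.

```shell
#!/bin/sh
# Sketch: list candidate failure points in a hacmp.out file.
# scan_hacmp_out is a hypothetical helper; $1 is the path to the log.

scan_hacmp_out() {
    log=$1

    echo "== lines containing 'ERROR !!!' =="
    grep -n 'ERROR !!!' "$log" || echo "(none)"

    echo "== trace lines ending in a non-zero exit =="
    # Matches " exit 1", " exit 12", and so on, but not " exit 0".
    grep -n -E ' exit +[1-9][0-9]*[[:space:]]*$' "$log" || echo "(none)"
}

# Example call (the path is a placeholder):
# scan_hacmp_out /path/to/hacmp.out
```

    Each hit is prefixed with its line number, so you can open the log at that point and read the surrounding trace.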

    (In particular, look at your application start script and its associated monitor.
    If you don't have logging specific to your application start scripts, now would be a really
    good time to add it.)
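
    One way to add that logging, as a minimal sketch: wrap each command of the start script so that its output and return code go to a dedicated log file. The names APP_LOG, log_msg and run_logged and the default log path are all assumptions for illustration, not PowerHA conventions.

```shell
#!/bin/sh
# Sketch: logging helpers for an application start script.
# APP_LOG and its default location are assumptions; pick a path that
# exists on every node.
APP_LOG=${APP_LOG:-/tmp/app_start.log}

# Append a timestamped message, tagged with the script's PID, to the log.
log_msg() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') [$$] $*" >> "$APP_LOG"
}

# Run a command, capture its stdout/stderr in the log, and record its
# return code. The return code is passed back to the caller.
run_logged() {
    log_msg "START: $*"
    if "$@" >> "$APP_LOG" 2>&1; then
        rc=0
    else
        rc=$?
    fi
    log_msg "END rc=$rc: $*"
    return "$rc"
}

# Example use inside a start script (start_my_app is a placeholder):
# run_logged /path/to/start_my_app || exit 1
```

    With this in place, a start script that fails or hangs leaves a START line without a matching END line, or an END line with a non-zero rc, which is much easier to spot than the same failure buried in hacmp.out.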

    This would also be a good case to open with IBM support.
    Hope this helps,
    Casey