Automation with Power

Power Business Continuity and Automation

Connect, learn, and share your experiences using the business continuity and automation technologies and practices designed to ensure uninterrupted operations and rapid recovery for workloads running on IBM Power systems. 


  • 1.  Question on how disk heartbeat should work

    Posted Thu August 13, 2009 05:58 PM

    Originally posted by: Emittim3@Sirius


    All:
    I have a configuration with a DS3400 directly attached (no Fibre Channel switches) to two POWER6 p520 servers. For PowerHA I have a disk heartbeat network through a concurrent volume group, plus a second heartbeat network over TCP/IP. I've run through all sorts of failure-scenario testing, and everything works great except for one case. When I pull both storage cables from the active node (this is a two-node active/passive cluster), I/O hangs, and PowerHA eventually detects that the disk heartbeat network is down. However, it just sits there with hung I/O and never fails the resource group over from the disconnected node to the standby node. So my question is:

    Should PowerHA detect that the active node has lost connectivity to storage and fail the resource group over to the standby node (which does still have connectivity to storage)? Or is this type of storage connectivity loss something that can only be handled through a custom HA event?

    Thanks in advance!

    Cheers,
    Philip Greer


  • 2.  Re: Question on how disk heartbeat should work

    Posted Fri August 14, 2009 07:13 AM

    Originally posted by: j.gann


    Do you have a shared volume group configured? HACMP detects storage failure through loss-of-quorum error log entries (see "odmget errnotify").
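
    For example, a quick check (stanza contents vary by release, but the quorum-loss notification methods key on the LVM_SA_QUORCLOSE error label):

    # odmget errnotify | grep -p LVM_SA_QUORCLOSE

    If no stanza comes back, that node has no notification method registered for quorum loss.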

    Joachim Gann


  • 3.  Re: Question on how disk heartbeat should work

    Posted Fri August 14, 2009 07:18 AM

    Originally posted by: RosieK


    Hi Philip,

    Does your rootvg also reside on the DS3400 disks from which you removed access?

    Regards

    Rosie Killeen
    IBM PowerHA Specialist


  • 4.  Re: Question on how disk heartbeat should work

    Posted Mon August 17, 2009 10:49 AM

    Originally posted by: Emittim3@Sirius


    Hey all! Thanks for the responses.

    Yes, I do have a shared volume group (varied on concurrently on both systems).

    My root volume group is on two internal SAS drives (AIX LVM mirrored).

    Here's a synopsis:

    root@sallreno:/ # lsvg -o
    ezpickvg
    rootvg
    root@sallreno:/ # lspv
    hdisk0 00caf8f42e098549 rootvg active
    hdisk1 00caf8f4cd821fb1 None
    hdisk2 00caf8f4c6db0134 ezpickvg concurrent
    hdisk3 00caf8f4c6db021d ezpickvg concurrent
    hdisk4 00caf8f4c6db031c ezpickvg concurrent
    hdisk5 00caf8f4c6db03ad ezpickvg concurrent

    root@sallren1:/ # lsvg -o
    rootvg
    root@sallren1:/ # lspv
    hdisk0 00cc1ce444b00ac9 rootvg active
    hdisk1 00cc1ce4cdf06756 rootvg active
    hdisk2 00caf8f4c6db0134 ezpickvg concurrent
    hdisk3 00caf8f4c6db021d ezpickvg concurrent
    hdisk4 00caf8f4c6db031c ezpickvg concurrent
    hdisk5 00caf8f4c6db03ad ezpickvg concurrent

    So the above shows that the volume group ezpickvg is varied on concurrently on each system.

    Here are the heartbeat networks:

    root@sallreno:/usr/es/sbin/cluster/utilities # ./clhbs
    NETWORK:en2 192.168.0.1 0 0
    NETWORK:rhdisk2 255.255.10.1 87 87
    UPTIME:411032

    Here's the topology:
    root@sallreno:/usr/es/sbin/cluster/utilities # ./cltopinfo
    Cluster Name: ezpick
    Cluster Connection Authentication Mode: Standard
    Cluster Message Authentication Mode: None
    Cluster Message Encryption: None
    Use Persistent Labels for Communication: No
    There are 2 node(s) and 2 network(s) defined

    NODE sallren1:
    Network ether1
    ezreno 10.252.164.50
    sallren1 192.168.0.2
    Network net_diskhb_01
    sallren1_hdisk2_01 /dev/hdisk2

    NODE sallreno:
    Network ether1
    ezreno 10.252.164.50
    sallreno 192.168.0.1
    Network net_diskhb_01
    sallreno_hdisk2_01 /dev/hdisk2

    Resource Group ezpickrg
    Startup Policy Online On Home Node Only
    Fallover Policy Fallover To Next Priority Node In The List
    Fallback Policy Never Fallback
    Participating Nodes sallreno sallren1
    Service IP Label ezreno

    Total Heartbeats Missed: 87
    Cluster Topology Start Time: 08/12/2009 11:32:55

    Here's the resource group:
    root@sallreno:/usr/es/sbin/cluster/utilities # ./clshowres

    Resource Group Name ezpickrg
    Participating Node Name(s) sallreno sallren1
    Startup Policy Online On Home Node Only
    Fallover Policy Fallover To Next Priority Node In The List
    Fallback Policy Never Fallback
    Site Relationship ignore
    Dynamic Node Priority
    Service IP Label ezreno
    Filesystems ALL
    Filesystems Consistency Check logredo
    Filesystems Recovery Method parallel
    Filesystems/Directories to be exported (NFSv2/NFSv3) /u/easy/ep2/controller /u/diskless
    Filesystems/Directories to be exported (NFSv4)
    Filesystems to be NFS mounted
    Network For NFS Mount
    Filesystem/Directory for NFSv4 Stable Storage
    Volume Groups ezpickvg
    Concurrent Volume Groups
    Use forced varyon for volume groups, if necessary false
    Disks
    GMVG Replicated Resources
    GMD Replicated Resources
    PPRC Replicated Resources
    ERCMF Replicated Resources
    SVC PPRC Replicated Resources
    Connections Services
    Fast Connect Services
    Shared Tape Resources
    Application Servers ezpickas
    Highly Available Communication Links
    Primary Workload Manager Class
    Secondary Workload Manager Class
    Delayed Fallback Timer
    Miscellaneous Data
    Automatically Import Volume Groups false
    Inactive Takeover
    SSA Disk Fencing false
    Filesystems mounted before IP configured true
    WPAR Name
    Run Time Parameters:

    Node Name sallreno
    Debug Level high
    Format for hacmp.out Standard

    Node Name sallren1
    Debug Level high
    Format for hacmp.out Standard


  • 5.  Re: Question on how disk heartbeat should work

    Posted Mon August 17, 2009 01:28 PM

    Originally posted by: Casey_B


    Hello Philip,

    To start with, a clarifying statement: disk heartbeating is an additional communication path between nodes that removes the IP network as a single point of failure.

    It's good that you don't have your rootvg on the SAN; as Rosie mentioned, that can be problematic.

    You didn't answer Joachim's question, and in this case I think it is the most important one.

    AIX uses error notify objects to run commands when new entries appear in the error report.

    PowerHA uses those error notify objects to trigger a fallover when one of the volume groups loses quorum.
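
    A quick way to see whether that trigger ever fired during your cable-pull test is to check the error log for the quorum-loss label (assuming the standard LVM_SA_QUORCLOSE label here):

    # errpt -J LVM_SA_QUORCLOSE

    If nothing shows up after the test, LVM never declared quorum lost, so PowerHA had nothing to react to.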

    A peculiarity of LVM is that, in addition to quorum being enabled, you also need at least one mirrored logical volume on the disks; without a mirrored LV, the quorum-loss error is never logged.

    So take a look at "lsvg datavg" and "lsvg -l datavg" to see whether you have quorum enabled and also have a mirrored LV.
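
    For example, with the VG from this thread (commands only; the output format is release-dependent):

    # lsvg ezpickvg | grep -i QUORUM
    # lsvg -l ezpickvg

    In the first output, a QUORUM value of 1 (shown as "Disabled" on newer releases) means quorum checking is off; in the second, an LV whose PPs count is twice its LPs count has two copies, i.e. it is mirrored.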

    If this doesn't match your architecture or the design of the cluster, then don't worry; there are always several different ways to architect a system.

    You can use an application monitor to query some important files in your filesystem and cause a fallover if the files are inaccessible or not as you expect.

    Also, the application monitor for your application may itself help determine whether the disk has failed in some way. (For instance, you could have an application monitor that actually makes a connection to a database and counts rows in some tables...)
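
    Along those lines, a minimal sketch of the file-check variant (the probe file path is hypothetical, and this relies on PowerHA treating a non-zero exit from a custom monitor method as an application failure):

    #!/bin/ksh
    # Hypothetical PowerHA custom application monitor: prove the shared
    # filesystem is still readable. Exit 0 = healthy; non-zero tells
    # PowerHA to restart the application or fall the resource group over.
    TESTFILE=/u/easy/ep2/controller/.probe   # hypothetical file on the shared VG

    # Do the read in the background so hung I/O cannot hang the monitor itself.
    dd if="$TESTFILE" of=/dev/null bs=512 count=1 2>/dev/null &
    PID=$!
    sleep 10                     # give the read up to 10 seconds
    if kill -0 $PID 2>/dev/null; then
        kill -9 $PID             # dd is still blocked: I/O is hung
        exit 1
    fi
    wait $PID                    # reap dd and pick up its exit status
    exit $?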

    Hope this helps,
    Casey