  • 1.  JFS2 Mountguard & PowerHA

    Posted Thu September 18, 2014 03:18 PM

    Originally posted by: RussellAdams


    I've seen very basic material on JFS2 Mountguard that describes essentially what it does. PowerHA now depends on it as added protection against partitioned clusters.

    Has anyone seen details on the algorithm used by mount guard, and how HA leverages that specifically?


    #PowerHAforAIX
    #PowerHA-(Formerly-known-as-HACMP)-Technical-Forum


  • 2.  Re: JFS2 Mountguard & PowerHA

    Posted Sat September 20, 2014 02:48 AM

    Originally posted by: j.gann


    Maybe you mean this doc:

    http://www-01.ibm.com/support/docview.wss?uid=isg3T1018853

    From the semantics I suspect mountguard uses the s_state flag in the JFS2 superblock (indicating if FS is mounted/dirty/clean).

    Check with fsdb and jfs2 system header files if you're curious...

    Mind you: mountguard does NOT protect against partitioned clusters. An RG takeover will always mount with "-o noguard". Otherwise, how would a node taking over an RG from a crashed node be able to mount the shared (and guarded) FS?

    It merely prevents administrators from accidentally mounting an already mounted FS on a different node (good thing).

    regards
    Joachim Gann




  • 3.  Re: JFS2 Mountguard & PowerHA

    Posted Sat September 20, 2014 06:51 AM

    Originally posted by: RussellAdams


    That was my thought as well, that MG was simply a flag which helps prevent administrator level remounting.

    A flag wouldn't help in the case of cluster partitioning. HA is only setting it to prevent an admin problem.

    If only CAA hadn't eliminated disk heartbeating.

    Thanks.




  • 4.  Re: JFS2 Mountguard & PowerHA

    Posted Sun September 28, 2014 01:07 PM

    Originally posted by: POWERHAguy


    Actually, the repository disk does provide a disk-heartbeat-type function. JFS2 MG support, though backported to 6.1, wasn't introduced until the corresponding AIX levels provided it (6.1.7/7.1.1). It uses both s_flag and s_state: if s_flag is set, it then checks s_state.
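
    The s_flag/s_state check described above can be sketched as a small decision function. This is only an illustration of the logic as understood in this thread, not the actual JFS2 code: the constant names and values (FS_GUARD, STATE_CLEAN, STATE_MOUNTED) and the function mount_allowed are hypothetical, and treating anything other than "clean" as a refusal is an assumption.

```python
# Hypothetical constants: the real JFS2 superblock values are not given
# in this thread, so these names and numbers are placeholders.
FS_GUARD = 0x1        # assumed "mount guard enabled" bit in s_flag
STATE_CLEAN = 0       # assumed s_state for a cleanly unmounted file system
STATE_MOUNTED = 1     # assumed s_state for a file system that is in use


def mount_allowed(s_flag: int, s_state: int, override: bool = False) -> bool:
    """Return True if a mount would proceed in this simplified model."""
    if override:
        # An explicit override (such as the "-o noguard" mount mentioned
        # earlier in the thread) skips the guard check entirely.
        return True
    if not (s_flag & FS_GUARD):
        # Guard not enabled: s_state is never consulted.
        return True
    # Guard enabled: only a clean file system may be mounted (assumption).
    return s_state == STATE_CLEAN
```

    In this model an unguarded file system always mounts, a guarded one mounts only when its state looks clean, and an override wins in every case.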

    I thought it was meant to prevent a takeover from succeeding in the event of a split, because I'm pretty sure that's how it was explained to me. That was my impression and, worse, I have explained it to others that way. So if I'm mistaken, well, that just stinks, to put it politely.

    Also, AIX 6.1.8/7.1.2 added a similar feature at the VG level, which checks whether the VG is already varied on. But I've had it tell me I had to use -O on varyonvg to override because the VG was varied on somewhere else, when it wasn't. In 6.1.9/7.1.3 I don't see that as often, so I think it's better there.




  • 5.  Re: JFS2 Mountguard & PowerHA

    Posted Mon September 29, 2014 09:26 AM

    Originally posted by: RussellAdams


    "Actually the repository disk does provide a disk heartbeat type function."

    The documentation does indicate that if the other heartbeat methods (i.e., TCP/IP and SAN TME) fail, the repository disk will also be used for internode communication. That means that even if you don't have TME, you still have two heartbeat methods (i.e., IP and non-IP).

    If you know any details on how mountguard could prevent a split, I am very curious. On the other hand, if your JFS2 filesystems are on the same SAN as the repository, it's a moot point: if you can read the state flags, you should be able to communicate over the repository disk.




  • 6.  Re: JFS2 Mountguard & PowerHA

    Posted Mon September 29, 2014 10:18 AM

    Originally posted by: POWERHAguy


    Well, it doesn't PREVENT it from occurring; however, I thought it prevented a successful acquisition by the takeover node. I assumed it would fail on resource group acquisition. But if HA overrides it, that sort of defeats the purpose. Having said that, if a system crash occurs, I don't know whether it clears the s_flag and s_state or not. If it does, that easily explains why a normal takeover would work. If it doesn't, then the override would be required.

    I tried testing this over the weekend but had some difficulty causing a true split in my lab environment, meaning a state where the nodes can still talk to something else on the IP network, just not to each other. I did this on HA 6.1 SP12 without diskhb. Even with a netmon.cf configured, the results were catastrophic: I got a gs_child failure on gsclvmd and both nodes crashed... :(. I've done it twice, with the same results.




  • 7.  Re: JFS2 Mountguard & PowerHA

    Posted Mon September 29, 2014 11:48 AM

    Originally posted by: POWERHAguy


    Well when in doubt, go to the source. I contacted HA dev and asked them to review this post. Tom was kind enough to reply to me with the following details:

    • Mount guard is turned on, on all the AIX releases that support it.   However, it neither prevents partitioned clusters, nor does it save the user from their consequences:  The PowerHA code will always override mount guard when it goes to mount a file system.  This is needed to be able to do takeover.
    • Mount guard is overridden by running logredo prior to mount.   This is run against all jfslog and jfs2log logical volumes.  See /usr/es/sbin/cluster/events/get_disk_vg_fs.
    • Mount guard, at best, like the logically equivalent volume group varyon check, keeps the administrator out of trouble when they accidentally try to vary on/mount things that they should not.  However, as the override mechanisms become widely known, I suspect that people will just always use them (the way PowerHA does), and they will end up being like the SMIT warnings that everyone clicks through and ignores.
    • While CAA does provide heart beating through the repository disk, in general, that is used only when all the TCP networks have failed.   So, in general, it does not provide much protection against cluster partition, either.  "lscluster -i" will typically show dpcom as "restricted".
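
    To make the points above concrete, here is a toy model of why the guard stops an administrator's mistake but not a partitioned cluster. It is a sketch under stated assumptions, not PowerHA or JFS2 code: the class SharedFS, its fields, and the functions takeover and demo are hypothetical, and the only effect of logredo modeled here is the one described above, namely that a mount afterwards is no longer refused.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFS:
    """Toy stand-in for a guarded JFS2 file system on shared storage."""
    guarded: bool = True
    marked_in_use: bool = False                    # on-disk "mounted" marker
    mounted_on: set = field(default_factory=set)   # nodes that mounted it

    def mount(self, node: str) -> bool:
        # A guarded file system that looks in use refuses a plain mount.
        if self.guarded and self.marked_in_use:
            return False
        self.marked_in_use = True
        self.mounted_on.add(node)
        return True

    def logredo(self) -> None:
        # Modeled effect only: after log replay the marker no longer
        # blocks a mount. The real command does far more than this.
        self.marked_in_use = False


def takeover(fs: SharedFS, node: str) -> bool:
    """Hypothetical takeover step: replay the log, then mount."""
    fs.logredo()
    return fs.mount(node)


def demo() -> list:
    fs = SharedFS()
    fs.mount("nodeA")                  # normal operation on nodeA
    blocked = not fs.mount("nodeB")    # admin mistake on nodeB is refused
    assert blocked
    takeover(fs, "nodeB")              # split: nodeB wrongly thinks nodeA died
    return sorted(fs.mounted_on)


print(demo())  # prints ['nodeA', 'nodeB']
```

    Both nodes end up with the file system mounted, which is the partitioned-cluster outcome the guard cannot prevent once the takeover path always overrides it.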


    A customer who wants the best protection against a partitioned cluster should move to 7.1.3.1 and configure the appropriate split/merge options.

