IBMi - PowerHA approach for OS-interventions

  • 1.  IBMi - PowerHA approach for OS-interventions

    Posted Thu October 10, 2024 08:53 AM

    Hello all,

    I have to manage a new environment with PowerHA enabled, and I'm wondering whether my planned cluster operations are adequate for OS interventions such as full system backups, IPLs, and OS upgrades.

    Cluster setup:

    • 3 nodes in the same device domain:
        - PS partition
        - PT partition
        - Flash partition
    • CRGs:
        - Active: Dev CRG -> pointing to the IASP device (recovery domain set to the PS and PT partitions)
        - Inactive: Data CRG -> pointing to the FlashCopy node (inactive; the Lab Services toolkit, QZRDHASM, is used to perform the FlashCopy operation at IASP level)
        - Active: Peer CRG -> application ID: QZRDPWRHA.DLTPRFCLU

    Now, what's the best approach to cover OS operations such as a full backup, an IPL, or an OS upgrade?

    Which actions are needed at the cluster/PowerHA level on IBM i to prepare a partition for restricted state?

    Flash partition:

    Putting the Flash partition in restricted state to cover these OS-related interventions:

        On the Flash partition:
            ENDCRG of the Peer CRG
            ENDCLUNOD NODE(FlashCopy node)

        After the OS intervention:
            STRCLUNOD NODE(FlashCopy node)
            STRCRG of the Peer CRG

    PT partition:

    Putting the PT partition in restricted state to cover these OS-related interventions:

        On the PT partition:
            ENDCRG of the Peer CRG
            ENDCRG of the DEV CRG (IASP)
            ENDCLUNOD NODE(PT node)

        After the OS intervention:
            STRCLUNOD NODE(PT node)
            STRCRG of the DEV CRG (IASP)
            STRCRG of the Peer CRG
            Verify that replication is active and in sync (DSPSVCSSN SSN(xxxx) DEVDMN(Dev-domain))
            (no DETACH / REATTACH operation will be done!)

    PS partition:

    Putting the PS partition in restricted state to cover these OS-related interventions (after switching the IASP to the PT partition):

        On the PS partition, to switch production to PT:
            Verify that replication is active and in sync (DSPSVCSSN SSN(xxxx) DEVDMN(Dev-domain))
            and that 'Switchover reverse replication' is set to *YES
            CHGCRGPRI of the DEV CRG (IASP)
              • Production env gets opened on the PT partition.

        Then, on the PS partition:
            ENDCRG of the Peer CRG
            ENDCRG of the DEV CRG (IASP)
            ENDCLUNOD NODE(PS node)

        After the OS intervention:
            STRCLUNOD NODE(PS node)
            STRCRG of the DEV CRG (IASP)
            STRCRG of the Peer CRG
            Verify that replication is active and in sync (DSPSVCSSN SSN(xxxx) DEVDMN(Dev-domain))
            (no DETACH / REATTACH operation will be done!)
            CHGCRGPRI of the DEV CRG (IASP) back to the PS partition
              • Production env gets opened on the PS partition.

    Is this approach fine?

    Should I include an end/start operation of the Admin Domain?

    Can all these commands be executed on the partition concerned?

    Is my view of the cluster operations correct, or are additional actions required?

    Any recommendation is welcome.

    Thanks for your feedback,

    Jos



    ------------------------------
    Jos (Jozef) Thijs
    Kyndryl
    ------------------------------
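
For readers who want to keep the per-node sequence from post 1 as a reviewable runbook, here is a minimal Python sketch that only assembles the CL command strings in the order Jos proposed. It mirrors that order and says nothing about whether it is right for a given environment. The cluster name, CRG names and node names are hypothetical placeholders, and the CHGCRGPRI switch and the DSPSVCSSN replication check are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str            # cluster node name (placeholder, not from the thread)
    in_device_crg: bool  # True if the node is in the device CRG recovery domain


def pre_maintenance(cluster: str, node: Node, peer_crg: str, dev_crg: str) -> List[str]:
    """CL commands to issue before taking `node` to restricted state."""
    cmds = [f"ENDCRG CLUSTER({cluster}) CRG({peer_crg})"]
    if node.in_device_crg:
        # Only PS and PT are in the device CRG recovery domain in post 1.
        cmds.append(f"ENDCRG CLUSTER({cluster}) CRG({dev_crg})")
    cmds.append(f"ENDCLUNOD CLUSTER({cluster}) NODE({node.name})")
    return cmds


def post_maintenance(cluster: str, node: Node, peer_crg: str, dev_crg: str) -> List[str]:
    """CL commands to issue after the OS intervention, in the order from post 1."""
    cmds = [f"STRCLUNOD CLUSTER({cluster}) NODE({node.name})"]
    if node.in_device_crg:
        cmds.append(f"STRCRG CLUSTER({cluster}) CRG({dev_crg})")
    cmds.append(f"STRCRG CLUSTER({cluster}) CRG({peer_crg})")
    return cmds


if __name__ == "__main__":
    # MYCLU, PEERCRG, DEVCRG and PTNODE are made-up names for illustration.
    pt = Node("PTNODE", in_device_crg=True)
    for cmd in pre_maintenance("MYCLU", pt, "PEERCRG", "DEVCRG"):
        print(cmd)
    for cmd in post_maintenance("MYCLU", pt, "PEERCRG", "DEVCRG"):
        print(cmd)
```

Generating the list rather than running it keeps the sketch harmless: the output can be reviewed, pasted into a change record, or adapted once the replies below have been taken into account.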


  • 2.  RE: IBMi - PowerHA approach for OS-interventions

    Posted Fri October 11, 2024 01:30 PM

    Hi,

    These are some excellent questions. Your steps look fairly similar to what I would expect. Since every environment can be unique, I always remind everyone to keep in mind that some suggestions may vary from environment to environment.

    PowerHA Automated Management of the Administrative Domain

    The first thing that comes to mind is regarding the Peer CRG QZRDPWRHA.DLTPRFCLU. Over the years, several aspects of the IBM services toolkit have been incorporated directly into the PowerHA product. Starting in IBM i 7.4, PowerHA now includes policies for automating the management of the administrative domain. You can find out more about these policies here: https://helpsystemswiki.atlassian.net/wiki/spaces/IWT/pages/2148040707/Administrative+Domain+PowerHA+Policies

    IBM does include some documentation for cleanup: https://www.ibm.com/support/pages/node/7105341 .

    PowerHA Integrated FlashCopy Automation

    Similarly, you could look at moving the Data CRG that is used for FlashCopy management to native PowerHA commands. The PowerHA commands for FlashCopy (STRSVCSSN of type *FLASHCOPY, or CHGSVCSSN with the *RESUME option) include the following parameters:

    • Source ASP action (to quiesce)
    • Verification of replication state if the FlashCopy is at the target of Replication
    • Target ASP action (to automatically vary on the FlashCopy target)
    • Target exit program to submit a customized exit program

    Some additional information on this is available at the following page: https://helpsystemswiki.atlassian.net/wiki/spaces/IWT/pages/2269970433/7.5+HA+5.2.2+7.4+HA+4.8.2+and+7.2+HA+3.10+PTFs#Integrated-FlashCopy-Automation-Enhancements-7.5-HA-5.2.2-7.4-HA-4.8.2-7.2-HA-3.10

    If you did switch to native PowerHA for both of these, that would likely reduce the environment to just the device CRG, which may simplify some of your steps above.

    Disabling Automatic Failover

    I see in your environment you are ending the device CRG before performing maintenance, which is definitely best practice if you want to avoid accidental unplanned failovers. One other piece of protection you can put in place is to disable automatic failovers. This functionality was provided at IBM i 7.4 and previously required the lab services toolkit: https://helpsystemswiki.atlassian.net/wiki/spaces/IWT/pages/327778312/QCST_CRG_CANCEL_FAILOVER+PowerHA+policy#Example-2---Disabling-all-listed-automatic-failovers-for-a-CRG

    Regarding your final question of the administrative domain, typically, I recommend leaving the administrative domain active, even when doing maintenance on one of the nodes unless you're doing something where you know you're going to want to discard changes on a particular node.

    Hopefully, this information is useful.



    ------------------------------
    Thanks,
    Brian Nordland
    Director of Development at Fortra
    ------------------------------



  • 3.  RE: IBMi - PowerHA approach for OS-interventions

    Posted Fri October 11, 2024 02:41 PM

    Hi Jos,

    I think what you have laid out is "fine" but there is also a level of "it depends" with your environment.

    Depending on the level of IBM i you're operating at, QZRDPWRHA may no longer be in service, but that's a side topic.

    As for full system backups, whether you want to incorporate IASP data into that backup will determine certain actions.

    Ultimately if you're replicating between PS and PT and then taking a flashcopy, is the flashcopy occurring from PS or PT as the flashcopy source?
    Regardless, as long as you're flashing the IASP data, then the entire system save of the flash node is a good recovery point of the IASP data, valid as of the time of the flashcopy.  LIC and OS are subjective depending on how you roll out PTFs.

    When saving the PT, are you wanting IASP data saved also?  If so, then you'd have to plan for a "Detach" of the replication between PS and PT before the backup occurs.

    When saving the PS, is there really a need short of needing to recover SYSBAS?  If you're flashing your critical IASP data to a flashcopy node, then saving that IASP data is probably the most critical aspect.  The SYSBAS portion becomes less of a concern in a recovery effort since you could really pull SYSBAS from the flash node or the PT node at any point.  

    Every environment is different and customer needs are different.  Everything depends on the ultimate end goal you want to achieve for each system/node in the environment.

    To finish answering your questions, as long as there is 1 node active in the cluster/admin domain, there is no need to ENDCAD/STRCAD from the node performing the backup.

    I'm more than happy to try and provide further feedback if you have additional specific questions.

    Ben Rabe



    ------------------------------
    Ben Rabe
    ------------------------------



  • 4.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 22 days ago

    Brian and Ben, thanks for your valuable feedback.

    Another question ... related to the switching of the iASP ... 

    In this environment, I see 10,000 (and more) devices that are no longer reporting in SST. In fact, they are disk units (DMPxxxx) shown as unknown. I can perform a manual cleanup of all these failed/non-reporting devices in Service Tools, but is there no program or command that I can run regularly to clean up SST?

    Thanks,

    Kind regards,

    Jos



    ------------------------------
    Jos (Jozef) Thijs
    Kyndryl Belgium
    ------------------------------



  • 5.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 21 days ago

    Hello Jozef, there is no command in the base/native IBMi OS to perform this type of clean up easily at this time. 
    However, IBM Technology Expert Labs offers a tool library referred to as Smart Assist (library QZRDPWRHA).
    In their tool set, they have a command called QZRDPWRHA/CLRHDWRSC with the following Text Description "Clear non-reporting resources."

    Additionally, there is an Advanced Analysis macro -> iohridebug -removenonreportingdisks that can be used instead of having to do the manual removal from Hardware Service Manager.

    Hope this helps a little bit.



    ------------------------------
    Ben Rabe
    ------------------------------



  • 6.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 21 days ago

    Hello Jos

    You may want to try QMGTOOLS/RUNAA command (https://www.ibm.com/support/pages/qmgtools-run-aa-macros). There is an *IOHRIDEBUG value for REQUEST parameter. With -removenonreportingdisks for DATA parameter, it might perform what you are looking for.



    ------------------------------
    Marc Rauzier
    ------------------------------



  • 7.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 21 days ago

    @Marc, that's a good idea.....but:
    QMGTOOLS/RUNAA RQS(*IOHRIDEBUG) DATA('-removenonreportingdisks')
    Requested advanced analysis command IOHRIDEBUG is not supported.

    Seems we do not allow that to be performed at this time.



    ------------------------------
    Ben Rabe
    ------------------------------



  • 8.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 21 days ago

    I should add that QMGTOOLS/RUNAA macros are meant to be used for data collection and not performing tasks to the system, such as removing failed/non-reporting resources.
    We could consider in the future having some option that could do this, or the IDEA could be raised to request an OS command to run this interactively from a command line.
    https://ibm-data-and-ai.ideas.ibm.com/



    ------------------------------
    Ben Rabe
    ------------------------------



  • 9.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 21 days ago

    That's a good idea to raise an IDEA! At the same time, it could be interesting to request an OS command to reset disk multipaths. I remember losing tons of time doing it manually in the past. I have been retired for a couple of years now but, in the Kyndryl (or IBM SO/ITD/Infrastructure Services) context, using an SST user profile involves connecting to a shared password vault, requesting the password with a valid reason, connecting to SST, changing the password in both the vault and SST, then running the multipath resetter macro. And if your OS profile does not have *SERVICE special authority, you have to perform this procedure for another profile (such as QSECOFR) that has it. Hopefully, they have now found a way to change this password automatically from the vault.

    PS: wasn't there a RUNAA2 command in the past with more capabilities than RUNAA? Or is my memory playing tricks on me?

    (Jos, I hope that everything is running fine for you and the IBM i team within Kyndryl)



    ------------------------------
    Marc Rauzier
    ------------------------------



  • 10.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 20 days ago

    Marc, your memory does serve you well and there is a RUNAA2.  However, in order to run that you still have to qualify with an OS user id/pw/confirm pw as well as SST user id/pw/confirm pw.
    But I did test that too and it did allow me to run the macro.

    > QMGTOOLS/RUNAA2 OS400USR(BRABE) OS400PWD() OS400PWD2() SSTUSR(QSECOFR) SSTPWD() SSTPWD2() AAMACRO1(IOHRIDEBUG) AAOPT1('-removenonreportingdisks') 
      Request completed                                                        

    Upon completion I went out and checked and all of my failed/non-reporting resources were cleaned up and the only things left behind were resources with an Unknown status.

    As for your other issue with running Multipathresetter, that's an entirely different macro and I will just say that it depends on why you had to run it as well as what technologies are being used, but that macro is already called by default in PowerHA as needed and by PowerHA Tools.
    PowerHA tools also offers a command to run a reset from command line....and also, if there were a valid need to run it from a native OS command, an IDEA submission would be the next logical request.



    ------------------------------
    Ben Rabe
    ------------------------------
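
If the RUNAA2 invocation from post 10 is going to be run regularly, a small helper can keep the command text consistent. The following is only a sketch in Python that builds the string; the parameter names are copied from Ben's example above, the function name is made up, and the password parameters are intentionally left empty so that no secrets end up in a script.

```python
def build_runaa2(os_user: str, sst_user: str, macro: str, option: str) -> str:
    """Build the QMGTOOLS/RUNAA2 command string as shown in post 10.

    Password parameters are left empty on purpose; they must be supplied
    when the command is prompted or run, never stored in a script.
    """
    # CL string literals escape an apostrophe by doubling it.
    quoted = option.replace("'", "''")
    return (
        f"QMGTOOLS/RUNAA2 OS400USR({os_user}) OS400PWD() OS400PWD2() "
        f"SSTUSR({sst_user}) SSTPWD() SSTPWD2() "
        f"AAMACRO1({macro}) AAOPT1('{quoted}')"
    )


if __name__ == "__main__":
    print(build_runaa2("BRABE", "QSECOFR", "IOHRIDEBUG", "-removenonreportingdisks"))
```

The helper does not run anything on the system; it just reproduces the command text so it can be pasted at a command line and completed there.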



  • 11.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 20 days ago

    Thanks for confirming and testing RUNAA2. Maybe Jos will have the opportunity to use it (in conjunction with the CHGSSTUSR command to change the password, as Kyndryl requires every time a shared password is used).

    Regarding the need to reset disk multipaths, I have to call on my memory again!
    I remember now that, thanks to the PowerHA Tools that we set up a couple of years before my retirement for SVC-like storage devices, we could leverage the reset command they provide. But during our first tests and subsequent production installs, we were using a homemade tool to provide similar functions for FullSystem FlashCopy and FullSystem Replication with DS8k devices. This tool is as old as V5R4M5 and was still in use until V7R3. At each IPL following a FlashCopy or a role swap after a PPRC failover, we were stuck with those C6004508/C600450A SRC codes for 10-15 minutes, hoping that everything was fine. Only a multipath reset as soon as possible after this first IPL removed this long IPL step. Even further IPLs without a multipath reset did not remove it. There was always the same number of paths for each disk on the target as on the source, but my understanding was that the IPL step was not happy with the resource serial numbers or some other hardware difference. So it waited for the resources it had at the previous IPL (on the source system) to come back until it timed out, finally accepting those it had.

    Sorry for this long post but, at least, my memory had some work today :-)



    ------------------------------
    Marc Rauzier
    ------------------------------



  • 12.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 19 days ago

    Don't think it makes a difference, but I use the Power link for Power-related Ideas: https://ibm-power-systems.ideas.ibm.com/



    ------------------------------
    Glen Corneau
    ------------------------------



  • 13.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 21 days ago

    Hello Ben
    Bad luck and thanks for checking.



    ------------------------------
    Marc Rauzier
    ------------------------------



  • 14.  RE: IBMi - PowerHA approach for OS-interventions

    Posted 12 days ago

    Marc, Ben, 

    Thanks very much for this feedback. 

    In fact, I now have 3 options:

    • Within SST, using the Advanced Analysis macro IOHRIDEBUG -removenonreportingdisks
    • Using this IOHRIDEBUG macro through the RUNAA2 command in QMGTOOLS
    • Using the CLRHDWRSC command in the Smart Assist library QZRDPWRHA

    All 3 options work fine.

    Thx very much

    Jos



    ------------------------------
    Jos (Jozef) Thijs
    Kyndryl Belgium
    ------------------------------