DLPAR operations from HMC hang on 6.1 TL4

  • 1.  DLPAR operations from HMC hang on 6.1 TL4

    Posted Wed December 02, 2009 05:30 AM

    Originally posted by: BOZY


    I have been having a strange issue with DLPAR operations lately on AIX 6.1 TL4 (POWER6 9117-MMA machine). While performing test DLPAR operations on various LPARs (AIX and VIOS) from the HMC GUI, I noticed that on certain LPARs the operations hang. When I try to add or remove CPU or memory, the operation hangs for 5 minutes or so and then returns an RMC error. The actual DLPAR operation completes correctly and quickly (I checked the change in CPU/memory amount using topas_nmon), but after every such operation I have to reconfigure RMC with /usr/sbin/rsct/install/bin/recfgct before DLPAR operations are possible again.

    When performing CPU operations:

    DLPAR ADD Processing resources failed: HMCERRV3DLPAR020: AIX DLPAR operation 'a' on cpu resource failed without throwing exception. The request number is 20. The success number is 0.
    The AIX command is:
    drmgr -c cpu -a -q 20 -p ent_capacity -w 5 -d 1

    The AIX standard output is:
    Network interruption occurs while RMC is waiting for the execution of the command on the partition to finish.
    Either the partition has crashed, the operation has caused CPU starvation, or IBM.DRM has crashed in the middle of the operation.
    The operation could have completed successfully. (40007 (null))
    The AIX standard error is:

    The RMC return code is 0. The AIX return code is 1141.


    When performing memory operations:

    DLPAR REMOVE Memory resources failed: The operating system prevented all of the requested memory from being removed. Amount of memory removed: 0 MB of 768 MB. The detailed output of the OS operation follows:
    Network interruption occurs while RMC is waiting for the execution of the command on the partition to finish.
    Either the partition has crashed, the operation has caused CPU starvation, or IBM.DRM has crashed in the middle of the operation.
    The operation could have completed successfully. (40007 (null))


    After any DLPAR operation on an affected LPAR, the next DLPAR operation produces the following:

    DLPAR REMOVE Processing resources failed: HMCERRV3DLPAR005: RMC operation 'r' on cpu resource failed without throwing exception. The request number is 30. The success number is 0. The AIX standard output is:
    There is no active RMC session with the partition. (9117-MMA*06944E2*5)
    The AIX standard error is:

    The RMC return code is 1022. The AIX return code is 0.


    The LPAR state as seen from the HMC (after a DLPAR operation) looks as follows:

    hscpe@hmc2:~> lspartition -dlpar
    <#0> Partition:<5*9117-MMA*06944E2, dbsrv1-p6, 10.1.0.36>
    Active:<0>, OS:<AIX, 6.1, 6100-04-01-0944>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<785>
    <#1> Partition:<4*9117-MMA*06944E2, vios2, 10.1.0.33>
    Active:<0>, OS:<AIX, 6.1, 6100-04-01-0944>, DCaps:<0x0>, CmdCaps:<0x0, 0x1b>, PinnedMem:<551>
    ....
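
As far as I understand this output, a record showing Active:<0> together with DCaps:<0x0> is a partition the HMC currently has no usable RMC session with. A minimal sketch for picking those partitions out of a long listing follows. The function name inactive_rmc_partitions is made up for illustration, and the parsing assumes the two-line record layout shown above.

```shell
# Print the name of every partition whose record shows Active:<0>.
# Reads `lspartition -dlpar` output on standard input.
# Assumes each record is a "Partition:<id, name, ip>" line followed
# by a status line, as in the listing above.
inactive_rmc_partitions() {
    awk '
        /Partition:</ {
            # Strip everything around the <...> payload of the header.
            line = $0
            sub(/^.*Partition:</, "", line)
            sub(/>.*$/, "", line)
            # Payload is "id, name, ip"; keep the name if present.
            n = split(line, field, ", *")
            name = (n >= 2) ? field[2] : line
        }
        /Active:<0>/ { print name }
    '
}
```

On the HMC this could be fed with `lspartition -dlpar | inactive_rmc_partitions`, assuming awk is usable in the shell you have there; otherwise capture the output and filter it on another machine.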


    The affected LPAR's oslevel is 6100-04-01-0944, and lppchk -v/-l/-c produces no output. A VIO Server LPAR with ioslevel 2.1.2.10-FP-22 shows the same behavior. On the VIO Server, oslevel -s and lppchk show inconsistencies because some filesets are not at the correct level. I have no idea why, but it does not seem to cause the problem, as the second VIO Server, at exactly the same ioslevel, works fine. The HMC version is:
    hscpe@hmc2:~> lshmc -V
    "version= Version: 7
    Release: 3.5.0
    Service Pack: 0
    HMC Build level 20091112.1
    MH01195: Required fix for HMC V7R3.5.0 (10-16-2009)
    MH01197: Fix for HMC V7R3.5.0 (11-12-2009)
    ","base_version=V7R3.5.0
    "

    I noticed that all affected LPARs have one thing in common - LHEA ports attached, with Etherchannel (either standard or 802.3ad) enabled:

    $ lsdev | grep -E '(lhea|EtherC)'
    ent3 Available EtherChannel / IEEE 802.3ad Link Aggregation
    lhea0 Available Logical Host Ethernet Adapter (l-hea)
    lhea1 Available Logical Host Ethernet Adapter (l-hea)


    After some experimenting I found that (in my case) DLPAR operations start hanging after an Etherchannel is created over LHEA. If I just attach LHEA ports to the LPAR and assign an IP address, DLPAR works fine; after creating the Etherchannel, DLPAR hangs. Unconfiguring the Etherchannel again and restarting the LPAR does not solve the problem. Other IP communications work fine through the Etherchannel devices (I used ftp and NFS to check). Here are the Etherchannel device properties:


    # lsattr -El ent3
    adapter_names   ent0,ent1      EtherChannel Adapters                           True
    alt_addr        0x000000000000 Alternate EtherChannel Address                  True
    auto_recovery   yes            Enable automatic recovery after failover        True
    backup_adapter  NONE           Adapter used when whole channel fails           True
    hash_mode       default        Determines how outgoing adapter is chosen       True
    interval        long           Determines interval value for IEEE 802.3ad mode True
    mode            standard       EtherChannel mode of operation                  True
    netaddr         10.1.0.1       Address to ping                                 True
    noloss_failover yes            Enable lossless failover after ping failure     True
    num_retries     25             Times to retry ping before failing              True
    retry_time      1              Wait time (in seconds) between pings            True
    use_alt_addr    no             Enable Alternate EtherChannel Address           True
    use_jumbo_frame yes            Enable Gigabit Ethernet Jumbo Frames            True


    Can anyone try to confirm this behavior on their systems?
    I am currently in the process of finding out whether AIX 6.1 TL3 has the same problem.

    Alex


  • 2.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Wed December 02, 2009 10:06 AM

    Originally posted by: bassemir


    Interesting problem, and great job providing details. I might have some time to experiment and see what happens on one of my systems. In the meantime, here is a link to a developerWorks paper on DLPAR and RMC that I have found helpful in the past.

    http://www.ibm.com/developerworks/systems/articles/DLPARchecklist.html

    I took a quick scan through it and did not see an answer to your problem, but thought I would share it anyway.

    If I understand your post: without Etherchannel DLPAR works, after adding Etherchannel DLPAR fails, and after removing Etherchannel it still fails. I would focus on the IP configuration before and after the Etherchannel change: any changes to the IP config, the /etc/hosts file, etc. When it is failing, can you recover by running the rmcctl command?

    Sorry, no answers, just questions at this point.

    Rich


  • 3.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 02:07 AM

    Originally posted by: BOZY


    Thanks for the link, bassemir.
    Indeed, I ran through some of the procedures described in the checklist. Funnily enough, some checks assume root access to the HMC, which, I bet, is not too easy to arrange officially without breaking the warranty. I checked TL3 for this DLPAR problem and it's there too, of course.

    Now, I was looking again at my own post with all the details and stumbled over the use_jumbo_frame attribute in the Ethernet adapter properties. So, I brought the Ethernet adapter down and disabled jumbo frames on the Etherchannel (and on the single LHEA interfaces, which had been bundled into Etherchannels before). IT WORKED! It is not really the Etherchannel; it is the jumbo frame setting which breaks things, at least in my case.
    Now, I had a look at the Cisco switch that the LHEA interfaces are connected to and noticed that the global MTU setting was still 1500 bytes; the configuration had been changed (set system mtu) to accommodate 9000-byte frames, but the switch must be reloaded for the change to take effect. I cannot reload this switch, as there are non-redundant connections to servers on it, so there is still an experiment pending - possibly jumbo frames will be OK for RMC if the switch supports them. For the time being I applied "chdev -l ent3 -a use_jumbo_frame=no" to all affected LPARs. The jumbo frame setting on the LHEA port itself (in the HMC) does not break anything, so other LPARs not needing RMC can use it.

    That's about it. Possibly this could be included in the DLPAR FAQ, as it is not obvious, and there is no direct sign of RMC failing - all necessary subsystems/processes are in place.

    Alex
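
To check which adapters still have the setting on, the attribute can be read out of the lsattr listing. This is a small sketch, assuming the lsattr -El column layout from the first post (attribute name in the first column, value in the second). jumbo_frame_setting is a hypothetical helper name, not an AIX command.

```shell
# Print the current use_jumbo_frame value ("yes" or "no") found in
# `lsattr -El <adapter>` output supplied on standard input.
# Prints nothing if the listing has no such attribute.
jumbo_frame_setting() {
    awk '$1 == "use_jumbo_frame" { print $2 }'
}

# Example on an AIX LPAR (ent3 is the Etherchannel from this thread):
#   lsattr -El ent3 | jumbo_frame_setting
```

If it prints yes, the chdev command quoted in the post turns the setting off; as described there, the adapter was brought down before the change.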


  • 4.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 02:09 AM

    Originally posted by: BOZY


    Oh, Firefox 3.5.5 seems to be producing errors with developerWorks pages, but it eventually posted the message. My apologies for the double post...


  • 5.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 09:08 AM

    Originally posted by: bassemir


    Good work Bozy, thanks for the update... this is one way I learn.

    I seem to recall reading somewhere that if jumbo frames are used, they have to be configured on all adapters. But don't take that as fact. I am going to ask around and try to confirm. Over a year ago I was doing some Live Partition Mobility work and recall that the HMC also had to be configured for jumbo frames. But my memory is not my strongest skill and I will need to re-investigate. If I find something I will post it.

    Thanks for telling us what you found!!!

    Rich


  • 6.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Fri December 04, 2009 05:33 PM

    Originally posted by: bassemir


    I have been able to recreate your problem simply by enabling jumbo frames on my LPAR. I also found a reference in the PMR database that HMC 7.3.4 did not support jumbo frames. I wanted to confirm that with HMC development but have not been able to do that yet.

    If jumbo frames are needed I suppose one could always add a second ethernet port without jumbo frames to handle the DLPAR operations.

    Rich


  • 8.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Sat December 05, 2009 04:39 AM

    Originally posted by: BOZY


    The latest HMC release seems to have the same issue. I still fail to understand how frame size (Layer 2) affects the TCP connections that RMC uses; to my understanding, RMC is a sort of rsh/ssh-type protocol. I created one LPAR and enabled jumbo frames on it again to test various services/protocols over TCP and UDP - I could not find any problem at all.
    Creating a separate adapter over physical ports seems to be an interesting idea, thanks (possibly on a separate VLAN, too!). I will try it.

    Alex
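
On the question of how frame size can matter to TCP at all: each side advertises a maximum segment size (MSS) derived from its own MTU, and the connection uses the smaller of the two. Below is a rough sketch of that arithmetic, assuming IPv4 and no TCP options (20-byte IP header plus 20-byte TCP header). The function names are illustrative only.

```shell
# TCP MSS implied by an IPv4 MTU: the MTU minus a 20-byte IP header
# and a 20-byte TCP header (assumes no IP or TCP options).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Segment size two endpoints settle on: the smaller of the two
# MSS values each side derives from its own interface MTU.
negotiated_mss() {
    a=$(mss_for_mtu "$1")
    b=$(mss_for_mtu "$2")
    if [ "$a" -lt "$b" ]; then echo "$a"; else echo "$b"; fi
}

# Example: jumbo-frame LPAR (MTU 9000) talking to a standard host (MTU 1500)
#   negotiated_mss 9000 1500
```

This does not by itself explain the RMC failure, but it may be why ordinary TCP tests towards standard-MTU hosts pass even with jumbo frames enabled on the LPAR. UDP has no such negotiation, and as far as I know RMC also uses UDP (port 657), so large datagrams crossing a switch still limited to 1500 bytes might be worth testing separately.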


  • 10.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 02:43 AM

    Originally posted by: Montecarlo


    Hi Alex,
    Regarding your LPARs with Etherchannel on HEA: are these LPARs set as the promiscuous-mode LPARs for their HEA ports?
    http://publib.boulder.ibm.com/infocenter/powersys/v3r1m5/topic/iphb1/iphb1_vios_concepts_network_sea.htm
    Regards, Simon


  • 11.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 02:53 AM

    Originally posted by: BOZY


    No, I am avoiding both promiscuous mode and serving Ethernet through the VIO Server. I believe this is the whole idea behind LHEA.

    Alex


  • 12.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 03:09 AM

    Originally posted by: Montecarlo


    If I'm reading the documentation correctly, the vio server should be the promiscuous mode lpar for its HEA port when that port is used as the real adapter for an etherchannel.
    Regards, Simon


  • 13.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 05:52 AM

    Originally posted by: BOZY


    Well, my understanding of what they have written is that you need to put the LHEA in promiscuous mode on the VIO Server only if you will be providing virtual Ethernets to other LPARs through the VIO Server. What I am doing is allocating the same pair of LHEA ports to several LPARs (including the VIO Server) and letting the 'special' processor dedicated to the LHEA interface handle the network load.

    Alex


  • 14.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 03, 2009 01:05 PM

    Originally posted by: SystemAdmin


    For a DLPAR operation the server has to be online and on the network. The RMC connection goes through the TCP/IP network and not through the private connection between the HMC and the FSP.

    1. Please see if you get any output from the command # lsrsrc IBM.ManagementServer

    The expected result should be something like:
    Resource Persistent Attributes for IBM.ManagementServer
    resource 1:
    Name = "x.x.xx.x"
    Hostname = "x.x.xx.x"
    ManagerType = "HMC"
    LocalHostname = "10.x.x.x"
    ClusterTM = "9078-160"
    ClusterSNum = ""
    ActivePeerDomain = ""
    NodeNameList = {"abc"}

    2. See if the command # lssrc -a|grep rs
    lists the subsystem

    IBM.DRM rsct_rm 901270 active

    If IBM.DRM is shown as not operating, or does not show up at all, then
    issue:

    /usr/sbin/rsct/install/bin/uncfgct -n
    /usr/sbin/rsct/install/bin/cfgct

    install or reinstall csm.core and csm.client.
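
The second check can be scripted. This is a minimal sketch, assuming the lssrc -a line layout shown above (subsystem name first, status last). drm_is_active is a hypothetical helper name.

```shell
# Succeed (exit 0) if `lssrc -a` output on standard input lists the
# IBM.DRM subsystem with status "active"; fail (exit 1) otherwise,
# including when IBM.DRM does not appear at all.
drm_is_active() {
    awk '$1 == "IBM.DRM" && $NF == "active" { found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Example on the LPAR:
#   if lssrc -a | drm_is_active; then echo "IBM.DRM is active"; fi
```

A failing result would be the cue for the uncfgct/cfgct sequence above. Note that in this thread IBM.DRM was in place on the affected LPARs, so a passing check does not rule out the jumbo frame problem.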


  • 15.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu December 10, 2009 05:17 PM

    Originally posted by: bassemir


    I understand from an HMC developer that jumbo frame support on the HMC starts with HMC V7R350. If you have an HMC version prior to that, you might need two Ethernet ports in your LPAR: one with jumbo frames enabled, and a second without jumbo frames for the HMC to use for DLPAR operations.

    Rich


  • 16.  Re: DLPAR operations from HMC hang on 6.1 TL4

    Posted Thu June 02, 2011 06:33 PM

    Originally posted by: SteveShapiro


    I don't know if anyone will see this, but my question relates to the post from Dec 10, 2009; we are running into this problem in June 2011.
    It is really a follow-up question...
    If you have an HMC at a code level that supports jumbo frames and you set up a second public adapter with jumbo frames, how do you tell RMC in the LPAR with jumbo frames to talk to the interface on the HMC with jumbo frames? Does it need to be on a separate subnet or a separate VLAN? Is there a way to do this?

    Steve