Originally posted by: BOZY
I have been having a strange issue with DLPAR operations lately on AIX 6.1 TL4 (POWER6 9117-MMA machine). While performing test DLPAR operations on various LPARs (AIX and VIOS) from the HMC GUI, I noticed that the operations hang on certain LPARs. When I try to add or remove CPU or memory, the operation hangs for about 5 minutes and then returns an RMC error. The actual DLPAR operation completes correctly and quickly (I checked the change in CPU/memory amount using topas_nmon), but after every such operation I have to reconfigure RMC with /usr/sbin/rsct/install/bin/recfgct before DLPAR operations are possible again.
When performing CPU operations:
DLPAR ADD Processing resources failed: HMCERRV3DLPAR020: AIX DLPAR operation 'a' on cpu resource failed without throwing exception. The request number is 20. The success number is 0.
The AIX command is:
drmgr -c cpu -a -q 20 -p ent_capacity -w 5 -d 1
The AIX standard output is:
Network interruption occurs while RMC is waiting for the execution of the command on the partition to finish.
Either the partition has crashed, the operation has caused CPU starvation, or IBM.DRM has crashed in the middle of the operation.
The operation could have completed successfully. (40007 (null))
The AIX standard error is:
The RMC return code is 0. The AIX return code is 1141.
When performing memory operations:
DLPAR REMOVE Memory resources failed: The operating system prevented all of the requested memory from being removed. Amount of memory removed: 0 MB of 768 MB. The detailed output of the OS operation follows:
Network interruption occurs while RMC is waiting for the execution of the command on the partition to finish.
Either the partition has crashed, the operation has caused CPU starvation, or IBM.DRM has crashed in the middle of the operation.
The operation could have completed successfully. (40007 (null))
After any DLPAR operation on an affected LPAR, the next DLPAR operation produces the following:
DLPAR REMOVE Processing resources failed: HMCERRV3DLPAR005: RMC operation 'r' on cpu resource failed without throwing exception. The request number is 30. The success number is 0. The AIX standard output is:
There is no active RMC session with the partition. (9117-MMA*06944E2*5)
The AIX standard error is:
The RMC return code is 1022. The AIX return code is 0.
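When comparing many of these HMC messages, it can help to pull out the two return codes mechanically. The following is a minimal Python sketch that assumes the message wording shown in the errors above; the function names `return_codes` and `describe` and the labels they print are hypothetical, not HMC terminology.

```python
import re

# Pattern taken from the message wording quoted in this post.
RC_RE = re.compile(r"The RMC return code is (\d+)\. The AIX return code is (\d+)\.")


def return_codes(message):
    """Return (rmc_rc, aix_rc) parsed from an HMC DLPAR error message, or None."""
    match = RC_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe(message):
    """Give a rough, illustrative label for a message (labels are assumptions)."""
    codes = return_codes(message)
    if codes is None:
        return "no return codes found"
    rmc_rc, aix_rc = codes
    if rmc_rc != 0:
        return "RMC-level failure (rc %d)" % rmc_rc
    if aix_rc != 0:
        return "partition-side failure (AIX rc %d)" % aix_rc
    return "both return codes are 0"


if __name__ == "__main__":
    print(describe("The RMC return code is 0. The AIX return code is 1141."))
    print(describe("The RMC return code is 1022. The AIX return code is 0."))
```

In this post, the first pattern (RMC 0, AIX 1141) accompanies the "Network interruption" text, and the second (RMC 1022, AIX 0) accompanies "There is no active RMC session".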
The LPAR state as seen from the HMC (after a DLPAR operation) looks as follows:
hscpe@hmc2:~> lspartition -dlpar
<#0> Partition:<5*9117-MMA*06944E2, dbsrv1-p6, 10.1.0.36>
Active:<0>, OS:<AIX, 6.1, 6100-04-01-0944>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<785>
<#1> Partition:<4*9117-MMA*06944E2, vios2, 10.1.0.33>
Active:<0>, OS:<AIX, 6.1, 6100-04-01-0944>, DCaps:<0x0>, CmdCaps:<0x0, 0x1b>, PinnedMem:<551>
....
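To spot affected partitions quickly in a longer listing, the output above can be scanned for entries showing Active:<0> or DCaps:<0x0>. This is a rough Python sketch, assuming the two-line-per-partition layout shown here and assuming that a zero in either field marks a partition the HMC cannot currently run DLPAR against; the function names are hypothetical.

```python
import re

# Sample copied from the lspartition -dlpar output in this post.
SAMPLE = """\
<#0> Partition:<5*9117-MMA*06944E2, dbsrv1-p6, 10.1.0.36>
Active:<0>, OS:<AIX, 6.1, 6100-04-01-0944>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<785>
<#1> Partition:<4*9117-MMA*06944E2, vios2, 10.1.0.33>
Active:<0>, OS:<AIX, 6.1, 6100-04-01-0944>, DCaps:<0x0>, CmdCaps:<0x0, 0x1b>, PinnedMem:<551>
"""

PARTITION_RE = re.compile(r"Partition:<([^,>]+),\s*([^,>]*),\s*([^>]*)>")
ACTIVE_RE = re.compile(r"Active:<(\d+)>")
DCAPS_RE = re.compile(r"DCaps:<(0x[0-9a-fA-F]+)>")


def parse_lspartition(text):
    """Parse the listing into a list of dicts, one per partition."""
    partitions = []
    current = None
    for line in text.splitlines():
        header = PARTITION_RE.search(line)
        if header:
            current = {
                "id": header.group(1),
                "name": header.group(2).strip(),
                "ip": header.group(3).strip(),
            }
            partitions.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first partition header
        active = ACTIVE_RE.search(line)
        dcaps = DCAPS_RE.search(line)
        if active:
            current["active"] = int(active.group(1))
        if dcaps:
            current["dcaps"] = int(dcaps.group(1), 16)
    return partitions


def not_dlpar_ready(partitions):
    """Names of partitions with Active or DCaps equal to zero (assumed criterion)."""
    return [p["name"] for p in partitions
            if p.get("active", 0) == 0 or p.get("dcaps", 0) == 0]


if __name__ == "__main__":
    print(not_dlpar_ready(parse_lspartition(SAMPLE)))
```

The listing could be captured on the HMC and fed to `parse_lspartition` as plain text; both partitions shown here would be flagged.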
The affected LPAR's oslevel is 6100-04-01-0944, and lppchk -v/-l/-c produces no output. The same behavior is shown by a VIO Server LPAR with ioslevel 2.1.2.10-FP-22. On that VIO Server, oslevel -s and lppchk show inconsistencies because some filesets are not at the correct level. I have no idea why, but it does not seem to be the cause, since the second VIO Server, at exactly the same ioslevel, works fine. The HMC version is:
hscpe@hmc2:~> lshmc -V
"version= Version: 7
Release: 3.5.0
Service Pack: 0
HMC Build level 20091112.1
MH01195: Required fix for HMC V7R3.5.0 (10-16-2009)
MH01197: Fix for HMC V7R3.5.0 (11-12-2009)
","base_version=V7R3.5.0
"
I noticed that all affected LPARs have one thing in common: LHEA ports attached, with EtherChannel (either standard or 802.3ad) enabled:
$ lsdev | grep -E '(lhea|EtherC)'
ent3 Available EtherChannel / IEEE 802.3ad Link Aggregation
lhea0 Available Logical Host Ethernet Adapter (l-hea)
lhea1 Available Logical Host Ethernet Adapter (l-hea)
After some experimenting I found that (in my case) DLPAR operations start hanging once an EtherChannel is created over LHEA. If I just attach LHEA ports to the LPAR and assign an IP address to it, DLPAR works fine; after creating an EtherChannel, DLPAR hangs. Unconfiguring the EtherChannel again and restarting the LPAR does not solve the problem. Other IP communication works fine through the EtherChannel devices (I used ftp and NFS to check). Here are the EtherChannel device properties:
lsattr -El ent3
adapter_names ent0,ent1 EtherChannel Adapters True
alt_addr 0x000000000000 Alternate EtherChannel Address True
auto_recovery yes Enable automatic recovery after failover True
backup_adapter NONE Adapter used when whole channel fails True
hash_mode default Determines how outgoing adapter is chosen True
interval long Determines interval value for IEEE 802.3ad mode True
mode standard EtherChannel mode of operation True
netaddr 10.1.0.1 Address to ping True
noloss_failover yes Enable lossless failover after ping failure True
num_retries 25 Times to retry ping before failing True
retry_time 1 Wait time (in seconds) between pings True
use_alt_addr no Enable Alternate EtherChannel Address True
use_jumbo_frame yes Enable Gigabit Ethernet Jumbo Frames True
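To compare EtherChannel settings between a working and an affected LPAR, the lsattr listing can be reduced to attribute/value pairs. The sketch below is only illustrative: it assumes every line starts with an attribute name followed by a non-empty value, as in the output above, and the helper names are hypothetical. It does not determine whether a member adapter is backed by an LHEA port.

```python
# A few lines copied from the lsattr -El ent3 output in this post.
SAMPLE = """\
adapter_names ent0,ent1 EtherChannel Adapters True
backup_adapter NONE Adapter used when whole channel fails True
mode standard EtherChannel mode of operation True
use_jumbo_frame yes Enable Gigabit Ethernet Jumbo Frames True
"""


def parse_lsattr(text):
    """Map attribute name -> value, taking the first two whitespace-separated fields."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs


def member_adapters(attrs):
    """Split the adapter_names value into a list of adapter names."""
    names = attrs.get("adapter_names", "")
    return [name for name in names.split(",") if name]


if __name__ == "__main__":
    attrs = parse_lsattr(SAMPLE)
    print(attrs["mode"], member_adapters(attrs))
```

Running the same parser on the output from two LPARs and comparing the resulting dictionaries would show which attributes differ.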
Can anyone try to confirm this behavior on their systems?
I am currently finding out whether AIX 6.1 TL3 has the same problem.
Alex