Originally posted by: SystemAdmin
We got the following dump analysis results from IBM.
The product name has been replaced with "A product".
ENVIRONMENT:
9119 83E4F4C
bos.mp64:5.2.0.85
.
PROBLEM:
System hang.
Telnet, HMC console login failed.
Ping OK
.
ACTION TAKEN:
It seems that the system hang was caused by A product.
Refer to the following URL for the product info.
http://www.....
So many inetd threads are blocked by the A product kernel extension.
Those inetd threads were probably being forked in response to telnet connection requests.
.
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_5 machine with 1 available CPU(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
.
.
.
.
.
xmalloc debug: disabled
(0)> status
short read of 4063 bytes for coff
0 8802D 136 760FA 118 rmcd
0x1 is a vnode
(0)> f
pvthread+008800 STACK:
0000B104.unlockl+000004 ()
045F2AACA product:A product:045F2AAC+000000 () <-- A product SecureOS
045F2950A product:A product:045F2950+000000 ()
04601DE0A product:A product:04601DE0+000000 ()
0000379Csc_msr_2_point+000028 ()
Not a valid dump data area @ 2FF22500
(0)> lke 045F2A74
ADDRESS FILE FILESIZE FLAGS MODULE NAME
1 08089800 045F20E0 0003A520 00080262 /usr/lib/drivers/A product <-- A product SecureOS
<snip>
(0)> th -r | egrep -v wait | grep pvthread | wc -l
272 <-- # of WCPU threads, which are waiting for CPU
(0)> th -r | egrep inetd | wc -l
116 <-- 116 threads of 272 WCPU threads are inetd.
(0)> th -r
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
<snip>
pvthread+007600 118>getty RUN 07605B 03C 0 0
<snip>
pvthread+026E00 622>java RUN 26E0DD 03C 0 0
pvthread+02A900 681>inetd RUN 2A9053 03C 0 0
pvthread+02AF00 687>inetd RUN 2AF05F 03C 0 0
pvthread+02C700 711>inetd RUN 2C708F 03C 0 0
<snip>
(0)> f 622 <-- java
pvthread+026E00 STACK:
0016179Cklockl+0003C4 (000000000016179C, 80000000000010B2,
0000000022422282
??)
045F2A74A product:A product:045F2A74+000000 ()
045F2950A product:A product:045F2950+000000 ()
045FFC04A product:A product:045FFC04+000000 ()
0000379Csc_msr_2_point+000028 ()
Not a valid dump data area @ 36F8B064
(0)> f 681 <-- inetd
pvthread+02A900 STACK:
0016179Cklockl+0003C4 (000000000000B168, 80000000000090B2,
000000000000B100
??)
045F2A74A product:A product:045F2A74+000000 ()
045F2950A product:A product:045F2950+000000 ()
04601DE0A product:A product:04601DE0+000000 ()
0000379Csc_msr_2_point+000028 ()
Not a valid dump data area @ 2FF22610
<-- Almost all of the WCPU threads are blocked by A product.
.
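For anyone who wants to recheck the thread counts above outside kdb, here is a minimal Python sketch. It assumes the th -r output has been saved as plain text in the same layout as shown above. The names ROW, parse_rows and summarize are made up for illustration and are not part of kdb. It drops rows whose thread name is exactly "wait", which is slightly stricter than egrep -v wait (that drops any line containing the string).

```python
import re
from collections import Counter

# Matches one row of saved `th -r` output, for example:
#   pvthread+02A900 681>inetd RUN 2A9053 03C 0 0
# (assumed layout: slot address, slot number, one flag character, name, state)
ROW = re.compile(r"^(pvthread\+[0-9A-F]+)\s+(\d+)[>!*]\s*(\S+)\s+(\S+)")

def parse_rows(text):
    """Return (slot, slot_number, name, state) for every thread row found."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((m.group(1), int(m.group(2)), m.group(3), m.group(4)))
    return rows

def summarize(text):
    """Count listed threads, leaving out the per-CPU wait threads."""
    rows = [r for r in parse_rows(text) if r[2] != "wait"]
    by_name = Counter(name for _, _, name, _ in rows)
    return len(rows), by_name

# A few rows copied from the listing in this post.
sample = """\
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000200 2>wait RUN 002005 0FF 0 00000 0
pvthread+007600 118>getty RUN 07605B 03C 0 0
pvthread+026E00 622>java RUN 26E0DD 03C 0 0
pvthread+02A900 681>inetd RUN 2A9053 03C 0 0
pvthread+02AF00 687>inetd RUN 2AF05F 03C 0 0
"""

total, by_name = summarize(sample)
print(total, by_name["inetd"])  # prints: 4 2
```

Run against the full saved listing, total corresponds to the first count and by_name["inetd"] to the second, so the two numbers can be compared with what the analysis reports.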
BCGONG0610DUMP
IBM's dump analysis is as follows.
1) Dump analysis
- When the administrator took the system dump, A product was executing UNLOCK and 77 threads were in WCPU state, waiting for CPU resource.
- This means CPU resource was in extremely competitive demand.
- Looking at the stack of each of the 77 threads in WCPU state, we can see they were waiting for a LOCK from A product in order to use the CPU.
- With all this in mind, most of the threads in WCPU state were waiting for CPU LOCK under extreme contention, which caused the system hang.
- Since A product was holding the LOCK for most threads when the crash happened, support from the A product side is needed.
- This result is based on the analysis by IBM in the US.
I have the following questions about the above analysis.
1.
(0)> f
pvthread+008800 STACK:
0000B104.unlockl+000004 ()
045F2AACA product:A product:045F2AAC+000000 () <-- A product SecureOS
045F2950A product:A product:045F2950+000000 ()
04601DE0A product:A product:04601DE0+000000 ()
0000379Csc_msr_2_point+000028 ()
Not a valid dump data area @ 2FF22500
I'd like to know whether A product was doing something else after executing UNLOCK, or whether this is the moment when A product was executing UNLOCK.
2. You said 77 threads were in WCPU state. I want to know how you determined that they were in WCPU state.
When I execute th -r, all threads are shown in RUN state.
=============================================================================
(0)> th -r
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000100 1>init RUN 001003 03C 0 0
pvthread+000200 2>wait RUN 002005 0FF 0 00000 0
pvthread+000300 3!wait RUN 003007 0FF 0 00000 0
pvthread+000400 4!wait RUN 004009 0FF 0 00000 0
pvthread+000500 5!wait RUN 00500B 0FF 0 00000 0
pvthread+000600 6!wait RUN 00600D 0FF 0 00000 0
pvthread+000700 7!wait RUN 00700F 0FF 0 00000 0
pvthread+000800 8!wait RUN 008011 0FF 0 00000 0
pvthread+000900 9!wait RUN 009013 0FF 0 00000 0
pvthread+000A00 10!wait RUN 00A015 0FF 0 00000 0
pvthread+000B00 11!wait RUN 00B017 0FF 0 00000 0
pvthread+000C00 12!wait RUN 00C019 0FF 0 00000 0
pvthread+000D00 13!wait RUN 00D01B 0FF 0 00000 0
pvthread+000E00 14!wait RUN 00E01D 0FF 0 00000 0
pvthread+000F00 15!wait RUN 00F01F 0FF 0 00000 0
pvthread+001000 16!wait RUN 010021 0FF 0 00000 0
pvthread+001100 17!wait RUN 011023 0FF 0 00000 0
pvthread+001200 18!wait RUN 012025 0FF 0 00000 0
pvthread+001300 19!wait RUN 013027 0FF 0 00000 0
pvthread+001400 20!wait RUN 014029 0FF 0 00000 0
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001500 21!wait RUN 01502B 0FF 0 00000 0
pvthread+001600 22!wait RUN 01602D 0FF 0 00000 0
pvthread+001700 23!wait RUN 01702F 0FF 0 00000 0
pvthread+001800 24!wait RUN 018031 0FF 0 00000 0
pvthread+001900 25!wait RUN 019033 0FF 0 00000 0
pvthread+001A00 26!wait RUN 01A035 0FF 0 00000 0
pvthread+001B00 27!wait RUN 01B037 0FF 0 00000 0
pvthread+001C00 28!wait RUN 01C039 0FF 0 00000 0
pvthread+001D00 29!wait RUN 01D03B 0FF 0 00000 0
pvthread+001E00 30!wait RUN 01E03D 0FF 0 00000 0
pvthread+001F00 31!wait RUN 01F03F 0FF 0 00000 0
pvthread+002000 32!wait RUN 020041 0FF 0 00000 0
pvthread+002100 33!wait RUN 021043 0FF 0 00000 0
pvthread+002200 34!wait RUN 022045 0FF 0 00000 0
pvthread+002300 35!wait RUN 023047 0FF 0 00000 0
pvthread+002400 36!wait RUN 024049 0FF 0 00000 0
pvthread+002500 37!wait RUN 02504B 0FF 0 00000 0
pvthread+002600 38!wait RUN 02604D 0FF 0 00000 0
pvthread+002700 39!wait RUN 02704F 0FF 0 00000 0
pvthread+002800 40!wait RUN 028051 0FF 0 00000 0
pvthread+002900 41!wait RUN 029053 0FF 0 00000 0
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+002A00 42!wait RUN 02A055 0FF 0 00000 0
pvthread+002B00 43!wait RUN 02B057 0FF 0 00000 0
pvthread+002C00 44!wait RUN 02C059 0FF 0 00000 0
pvthread+002D00 45!wait RUN 02D05B 0FF 0 00000 0
pvthread+002E00 46!wait RUN 02E05D 0FF 0 00000 0
pvthread+002F00 47!wait RUN 02F05F 0FF 0 00000 0
pvthread+003000 48!wait RUN 030061 0FF 0 00000 0
pvthread+003100 49!wait RUN 031063 0FF 0 00000 0
pvthread+003200 50!wait RUN 032065 0FF 0 00000 0
pvthread+003300 51!wait RUN 033067 0FF 0 00000 0
pvthread+003400 52!wait RUN 034069 0FF 0 00000 0
pvthread+003500 53!wait RUN 03506B 0FF 0 00000 0
pvthread+003600 54!wait RUN 03606D 0FF 0 00000 0
pvthread+003700 55!wait RUN 03706F 0FF 0 00000 0
pvthread+003800 56!wait RUN 038071 0FF 0 00000 0
pvthread+003900 57!wait RUN 039073 0FF 0 00000 0
pvthread+003A00 58!wait RUN 03A075 0FF 0 00000 0
pvthread+003B00 59!wait RUN 03B077 0FF 0 00000 0
pvthread+003C00 60!wait RUN 03C079 0FF 0 00000 0
pvthread+003D00 61!wait RUN 03D07B 0FF 0 00000 0
pvthread+003E00 62!wait RUN 03E07D 0FF 0 00000 0
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+003F00 63!wait RUN 03F07F 0FF 0 00000 0
pvthread+004000 64!wait RUN 040081 0FF 0 00000 0
pvthread+004100 65!wait RUN 041083 0FF 0 00000 0
pvthread+004800 72>xmgc RUN 048091 03C 0 0
pvthread+005100 81>pilegc RUN 0510A3 03B 0 0
pvthread+005300 83>syncd RUN 05302D 03C 0 0
pvthread+005500 85>syncd RUN 0550FD 03C 0 0
pvthread+006300 99>syncd RUN 0630F3 03C 0 0
pvthread+006500 101>syncd RUN 065055 03C 0 0
pvthread+006600 102>syncd RUN 0660EB 03C 0 0
pvthread+006700 103>syncd RUN 0670DB 03C 0 0
pvthread+006800 104>syncd RUN 0680D9 03C 0 0
pvthread+006900 105>syncd RUN 0690D5 03C 0 0
pvthread+006A00 106>syncd RUN 06A0D5 03C 0 0
pvthread+006C00 108>syncd RUN 06C0D9 03C 0 0
pvthread+006D00 109>syncd RUN 06D0DB 03C 0 0
pvthread+006E00 110>syncd RUN 06E0DD 03C 0 0
pvthread+006F00 111>syncd RUN 06F0DF 03C 0 0
pvthread+007100 113>cron RUN 071013 03C 0 0 F10000E31CB80A00
pvthread+007D00 125>syslogd RUN 07D0FF 03C 0 0
pvthread+007F00 127>inetd RUN 07F0FF 03C 0 0
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+008000 128>xntpd RUN 080001 030 0 0
pvthread+008100 129>snmpdv3n RUN 081003 03C 0 0
pvthread+008200 130>hostmibd RUN 082011 03C 0 0
pvthread+008300 131>snmpmibd RUN 083011 03C 0 0
pvthread+008400 132>aixmibd RUN 08400B 03C 0 0
pvthread+008500 133>lcfd RUN 08501D 03C 0 0
pvthread+008700 135>qdaemon RUN 08701F 03C 0 0 F1000092A0049200
pvthread+008B00 139>A_Product-a RUN 08B053 03C 0 0
pvthread+008C00 140>A_Product-b RUN 08C099 03C 0 0
pvthread+008D00 141>A_Product-b RUN 08D029 03C 0 0
pvthread+008E00 142>A_Product-d RUN 08E01D 03C 0 0
pvthread+008F00 143>A_Product-e RUN 08F01F 03C 0 0
pvthread+009000 144>A_Product-f RUN 090021 03C 0 0
pvthread+009100 145>Spwatch RUN 091023 03C 0 0
pvthread+009200 146>getty RUN 09202B 03C 0 0
pvthread+009400 148*rmcd RUN 094033 027 0 0
pvthread+009500 149>Spwatch RUN 09502B 03C 0 0
pvthread+009600 150>A_Product-b RUN 09602F 03C 0 0
pvthread+009700 151>Spwatch RUN 09702F 03C 0 0
pvthread+009A00 154>Spagentd RUN 09A037 03C 0 0
pvthread+009B00 155>log_filt RUN 09B037 03C 0 0
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+009C00 156>Spagentd RUN 09C03B 03C 0 0
pvthread+009D00 157>rmcd RUN 09D03D 03C 0 0
pvthread+009E00 158>Spagentd RUN 09E041 03C 0 0
pvthread+009F00 159>A_Product-b RUN 09F03F 03C 0 0
pvthread+00A000 160>Spagentd RUN 0A0043 03C 0 0
pvthread+00A100 161>Spagentd RUN 0A1045 03C 0 0
pvthread+00A200 162>Spagentd RUN 0A2045 03C 0 0
pvthread+00A300 163>Spagentd RUN 0A3047 03C 0 0
pvthread+00A400 164>Spwatch RUN 0A404B 03C 0 0
pvthread+00CF00 207>IBM.CSMA RUN 0CF09F 03C 0 0
pvthread+00E800 232>IBM.DMSR RUN 0E80D3 03C 0 0
pvthread+012300 291>ns-httpd RUN 123067 03C 0 0
pvthread+012700 295>ats_agen RUN 127081 03C 0 0
pvthread+012900 297>sort RUN 12907F 03C 0 0
pvthread+012F00 303>java RUN 12F029 03C 0 0
pvthread+013000 304>ns-httpd RUN 13007D 03C 0 0
pvthread+013100 305>uxwdog RUN 131099 03C 0 0
pvthread+013200 306>java RUN 1320A3 03C 0 0
pvthread+013E00 318>ns-httpd RUN 13E085 03C 0 0
pvthread+014200 322>ns-httpd RUN 142089 03C 0 0
pvthread+014500 325>ns-httpd RUN 14508F 03C 0 0
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+014600 326>ns-httpd RUN 146091 03C 0 0
pvthread+014700 327>ns-httpd RUN 147093 03C 0 0
pvthread+014800 328>ns-httpd RUN 148095 03C 0 0
pvthread+014A00 330>ns-httpd RUN 14A099 03C 0 0
pvthread+017400 372>virusw RUN 1740EB 03C 0 0
pvthread+017900 377>java RUN 1790F7 03C 0 0
pvthread+017E00 382>java RUN 17E0FD 03C 0 0
pvthread+018000 384>java RUN 180001 03C 0 0
pvthread+018100 385>java RUN 181003 03C 0 0
pvthread+018200 386>java RUN 182005 03C 0 0
pvthread+018300 387>java RUN 183007 03C 0 0
pvthread+018D00 397>java RUN 18D0A5 03C 0 0
pvthread+019100 401>dm_ep_en RUN 19109F 03C 0 0
pvthread+01A400 420>getlvodm RUN 1A4007 03C 0 0
pvthread+01BC00 444>telnetd RUN 1BC095 03C 0 0
pvthread+01D600 470>ats_agen RUN 1D60ED 03C 0 0
=======================================================================================
3. Looking at the stacks, A product is locking the following 2 threads out of 141 threads.
pvthread+009400 148*rmcd RUN 094033 027 0 0
pvthread+01A400 420>getlvodm RUN 1A4007 03C 0 0
I'd like to know where the number "77"(threads) came from.
4. To my knowledge, the rmcd daemon uses a lot of resources. We had a similar experience in the past, in which the system hung while the rmcd daemon and A product were both running. I'm not sure, but it may have been caused by an interoperability problem with A product or by a lack of resources. In any case, what I want to know is whether the rmcd daemon itself has a problem.
Please answer the above 4 questions as soon as possible.
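For reference, this is roughly how the stacks can be checked for question 3. It is only a sketch. It assumes the output of f for each slot has been saved as text in the layout pasted above, and it treats a thread as waiting on the lock when its top frame is klockl and a later frame mentions the module. The function names split_stacks and waits_on_lock_under are hypothetical, and the string "A product" stands in for the real module name.

```python
import re

# Header line that kdb prints before each stack, e.g. "pvthread+02A900 STACK:"
HEADER = re.compile(r"^(pvthread\+[0-9A-F]+) STACK:")

def split_stacks(text):
    """Map each pvthread slot to the lines that follow its STACK: header."""
    stacks, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        m = HEADER.match(line)
        if m:
            current = m.group(1)
            stacks[current] = []
        elif current is not None and line and not line.startswith("(0)>"):
            stacks[current].append(line)
    return stacks

def waits_on_lock_under(frames, lock_func="klockl", module="A product"):
    """True if the top frame is lock_func and a deeper frame names the module."""
    if not frames or lock_func not in frames[0]:
        return False
    return any(module in frame for frame in frames[1:])

# Two stacks copied from this post: one in klockl, one in unlockl.
sample = """\
(0)> f 681
pvthread+02A900 STACK:
0016179Cklockl+0003C4 (000000000000B168, 80000000000090B2,
000000000000B100
??)
045F2A74A product:A product:045F2A74+000000 ()
045F2950A product:A product:045F2950+000000 ()
04601DE0A product:A product:04601DE0+000000 ()
0000379Csc_msr_2_point+000028 ()
Not a valid dump data area @ 2FF22610
(0)> f
pvthread+008800 STACK:
0000B104.unlockl+000004 ()
045F2AACA product:A product:045F2AAC+000000 ()
"""

result = {slot: waits_on_lock_under(frames)
          for slot, frames in split_stacks(sample).items()}
print(result)  # {'pvthread+02A900': True, 'pvthread+008800': False}
```

Counting the True entries over all saved stacks gives a number that can be compared directly with the "77" in the analysis.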