AIX

  • 1.  Kernel ....i need help

    Posted Mon December 18, 2006 05:13 AM

    Originally posted by: SystemAdmin


    We got the following dump analysis results from IBM.

    I have replaced the real product name with "A product".

    ENVIRONMENT:
    9119 83E4F4C
    bos.mp64:5.2.0.85
    .
    PROBLEM:
    System hang.
    Telnet, HMC console login failed.
    Ping OK
    .
    ACTION TAKEN:
    It seems that the system hang was caused by A product.
    Refer to the following URL for the product info.
    http://www.....
    Many inetd threads are blocked by the A product kernel extension.
    Those inetd threads were probably being forked in response to telnet connection requests.
    .
    (0)> stat
    SYSTEM_CONFIGURATION:
    CHRP_SMP_PCI POWER_PC POWER_5 machine with 1 available CPU(s) (64-bit registers)

    SYSTEM STATUS:
    sysname... AIX
    .
    .
    .
    .
    .
    xmalloc debug: disabled
    (0)> status
    short read of 4063 bytes for coff
    0 8802D 136 760FA 118 rmcd
    0x1 is a vnode
    (0)> f
    pvthread+008800 STACK:
    0000B104.unlockl+000004 ()
    045F2AACA product:A product:045F2AAC+000000 () <-- A product SecureOS
    045F2950A product:A product:045F2950+000000 ()
    04601DE0A product:A product:04601DE0+000000 ()
    0000379Csc_msr_2_point+000028 ()
    Not a valid dump data area @ 2FF22500
    (0)> lke 045F2A74
    ADDRESS FILE FILESIZE FLAGS MODULE NAME

    1 08089800 045F20E0 0003A520 00080262 /usr/lib/drivers/A product <-- A product SecureOS
    <snip>
    (0)> th -r | egrep -v wait | grep pvthread | wc -l
    272 <-- # of WCPU threads, which are waiting for CPU
    (0)> th -r | egrep inetd | wc -l
    116 <-- 116 threads of 272 WCPU threads are inetd.
    (0)> th -r
    SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
    <snip>
    pvthread+007600 118>getty RUN 07605B 03C 0 0
    <snip>
    pvthread+026E00 622>java RUN 26E0DD 03C 0 0
    pvthread+02A900 681>inetd RUN 2A9053 03C 0 0
    pvthread+02AF00 687>inetd RUN 2AF05F 03C 0 0
    pvthread+02C700 711>inetd RUN 2C708F 03C 0 0
    <snip>
    (0)> f 622 <-- java
    pvthread+026E00 STACK:
    0016179Cklockl+0003C4 (000000000016179C, 80000000000010B2,
    0000000022422282 ??)
    045F2A74A product:A product:045F2A74+000000 ()
    045F2950A product:A product:045F2950+000000 ()
    045FFC04A product:A product:045FFC04+000000 ()
    0000379Csc_msr_2_point+000028 ()
    Not a valid dump data area @ 36F8B064
    (0)> f 681 <-- inetd
    pvthread+02A900 STACK:
    0016179Cklockl+0003C4 (000000000000B168, 80000000000090B2,
    000000000000B100 ??)
    045F2A74A product:A product:045F2A74+000000 ()
    045F2950A product:A product:045F2950+000000 ()
    04601DE0A product:A product:04601DE0+000000 ()
    0000379Csc_msr_2_point+000028 ()
    Not a valid dump data area @ 2FF22610
    <-- Almost all of WCPU threads are blocked by A product.
    .
    BCGONG0610DUMP
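
    The lke lookup above can be cross-checked with plain arithmetic: a stack frame belongs to the kernel extension if its address lies between the module's load address and the load address plus its size. The following is a minimal Python sketch, assuming the FILE column (045F20E0) is the load address and FILESIZE (0003A520) is the size in bytes. The helper name in_module is hypothetical and is not part of kdb.

```python
def in_module(addr_hex, base_hex, size_hex):
    """Return True if addr lies in the half-open range [base, base + size)."""
    addr = int(addr_hex, 16)
    base = int(base_hex, 16)
    size = int(size_hex, 16)
    return base <= addr < base + size


# Values copied from the lke line above (assumed: FILE = load address,
# FILESIZE = size in bytes).
MODULE_BASE = "045F20E0"
MODULE_SIZE = "0003A520"

# Frame addresses copied from the stacks above; the last one is the klockl frame.
FRAMES = ["045F2AAC", "045F2A74", "045F2950", "04601DE0", "045FFC04", "0016179C"]

for frame in FRAMES:
    print(frame, in_module(frame, MODULE_BASE, MODULE_SIZE))
```

    Under that assumption, every A product frame in the stacks above falls inside the extension, while the klockl frame (0016179C) does not.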

    IBM's dump analysis is as follows.

    1) Dump analysis

    • When the administrator took the system dump, A product was executing UNLOCK and 77 threads were in the WCPU state, waiting for CPU resources.
    • This means CPU resources were under extremely heavy contention.
    • The stacks of the 77 threads in the WCPU state show that they were waiting for a LOCK from A product in order to use the CPU.
    • With all this in mind, most of the threads in the WCPU state were waiting for that LOCK under extreme contention, which caused the system to hang.
    • Since A product was holding the LOCK for most threads when the crash happened, support from the A product side is needed.
    • This result is based on the analysis by IBM in the US.



    I have the following questions about the above analysis.
    1.
    (0)> f
    pvthread+008800 STACK:
    0000B104.unlockl+000004 ()
    045F2AACA product:A product:045F2AAC+000000 () <-- A product SecureOS
    045F2950A product:A product:045F2950+000000 ()
    04601DE0A product:A product:04601DE0+000000 ()
    0000379Csc_msr_2_point+000028 ()
    Not a valid dump data area @ 2FF22500
    I'd like to know whether A product was doing something else after executing UNLOCK, or whether this stack shows the moment A product was executing UNLOCK.

    2. You said 77 threads were in the WCPU state. How did you determine that they were in the WCPU state?
    When I run th -r, every thread is shown in the RUN state.

    =============================================================================

    (0)> th -r
    SLOT NAME STATE TID PRI RQ CPUID CL WCHAN

    pvthread+000100 1>init RUN 001003 03C 0 0
    pvthread+000200 2>wait RUN 002005 0FF 0 00000 0
    pvthread+000300 3!wait RUN 003007 0FF 0 00000 0
    pvthread+000400 4!wait RUN 004009 0FF 0 00000 0
    pvthread+000500 5!wait RUN 00500B 0FF 0 00000 0
    pvthread+000600 6!wait RUN 00600D 0FF 0 00000 0
    pvthread+000700 7!wait RUN 00700F 0FF 0 00000 0
    pvthread+000800 8!wait RUN 008011 0FF 0 00000 0
    pvthread+000900 9!wait RUN 009013 0FF 0 00000 0
    pvthread+000A00 10!wait RUN 00A015 0FF 0 00000 0
    pvthread+000B00 11!wait RUN 00B017 0FF 0 00000 0
    pvthread+000C00 12!wait RUN 00C019 0FF 0 00000 0
    pvthread+000D00 13!wait RUN 00D01B 0FF 0 00000 0
    pvthread+000E00 14!wait RUN 00E01D 0FF 0 00000 0
    pvthread+000F00 15!wait RUN 00F01F 0FF 0 00000 0
    pvthread+001000 16!wait RUN 010021 0FF 0 00000 0
    pvthread+001100 17!wait RUN 011023 0FF 0 00000 0
    pvthread+001200 18!wait RUN 012025 0FF 0 00000 0
    pvthread+001300 19!wait RUN 013027 0FF 0 00000 0
    pvthread+001400 20!wait RUN 014029 0FF 0 00000 0
    pvthread+001500 21!wait RUN 01502B 0FF 0 00000 0
    pvthread+001600 22!wait RUN 01602D 0FF 0 00000 0
    pvthread+001700 23!wait RUN 01702F 0FF 0 00000 0
    pvthread+001800 24!wait RUN 018031 0FF 0 00000 0
    pvthread+001900 25!wait RUN 019033 0FF 0 00000 0
    pvthread+001A00 26!wait RUN 01A035 0FF 0 00000 0
    pvthread+001B00 27!wait RUN 01B037 0FF 0 00000 0
    pvthread+001C00 28!wait RUN 01C039 0FF 0 00000 0
    pvthread+001D00 29!wait RUN 01D03B 0FF 0 00000 0
    pvthread+001E00 30!wait RUN 01E03D 0FF 0 00000 0
    pvthread+001F00 31!wait RUN 01F03F 0FF 0 00000 0
    pvthread+002000 32!wait RUN 020041 0FF 0 00000 0
    pvthread+002100 33!wait RUN 021043 0FF 0 00000 0
    pvthread+002200 34!wait RUN 022045 0FF 0 00000 0
    pvthread+002300 35!wait RUN 023047 0FF 0 00000 0
    pvthread+002400 36!wait RUN 024049 0FF 0 00000 0
    pvthread+002500 37!wait RUN 02504B 0FF 0 00000 0
    pvthread+002600 38!wait RUN 02604D 0FF 0 00000 0
    pvthread+002700 39!wait RUN 02704F 0FF 0 00000 0
    pvthread+002800 40!wait RUN 028051 0FF 0 00000 0
    pvthread+002900 41!wait RUN 029053 0FF 0 00000 0
    pvthread+002A00 42!wait RUN 02A055 0FF 0 00000 0
    pvthread+002B00 43!wait RUN 02B057 0FF 0 00000 0
    pvthread+002C00 44!wait RUN 02C059 0FF 0 00000 0
    pvthread+002D00 45!wait RUN 02D05B 0FF 0 00000 0
    pvthread+002E00 46!wait RUN 02E05D 0FF 0 00000 0
    pvthread+002F00 47!wait RUN 02F05F 0FF 0 00000 0
    pvthread+003000 48!wait RUN 030061 0FF 0 00000 0
    pvthread+003100 49!wait RUN 031063 0FF 0 00000 0
    pvthread+003200 50!wait RUN 032065 0FF 0 00000 0
    pvthread+003300 51!wait RUN 033067 0FF 0 00000 0
    pvthread+003400 52!wait RUN 034069 0FF 0 00000 0
    pvthread+003500 53!wait RUN 03506B 0FF 0 00000 0
    pvthread+003600 54!wait RUN 03606D 0FF 0 00000 0
    pvthread+003700 55!wait RUN 03706F 0FF 0 00000 0
    pvthread+003800 56!wait RUN 038071 0FF 0 00000 0
    pvthread+003900 57!wait RUN 039073 0FF 0 00000 0
    pvthread+003A00 58!wait RUN 03A075 0FF 0 00000 0
    pvthread+003B00 59!wait RUN 03B077 0FF 0 00000 0
    pvthread+003C00 60!wait RUN 03C079 0FF 0 00000 0
    pvthread+003D00 61!wait RUN 03D07B 0FF 0 00000 0
    pvthread+003E00 62!wait RUN 03E07D 0FF 0 00000 0
    pvthread+003F00 63!wait RUN 03F07F 0FF 0 00000 0
    pvthread+004000 64!wait RUN 040081 0FF 0 00000 0
    pvthread+004100 65!wait RUN 041083 0FF 0 00000 0
    pvthread+004800 72>xmgc RUN 048091 03C 0 0
    pvthread+005100 81>pilegc RUN 0510A3 03B 0 0
    pvthread+005300 83>syncd RUN 05302D 03C 0 0
    pvthread+005500 85>syncd RUN 0550FD 03C 0 0
    pvthread+006300 99>syncd RUN 0630F3 03C 0 0
    pvthread+006500 101>syncd RUN 065055 03C 0 0
    pvthread+006600 102>syncd RUN 0660EB 03C 0 0
    pvthread+006700 103>syncd RUN 0670DB 03C 0 0
    pvthread+006800 104>syncd RUN 0680D9 03C 0 0
    pvthread+006900 105>syncd RUN 0690D5 03C 0 0
    pvthread+006A00 106>syncd RUN 06A0D5 03C 0 0
    pvthread+006C00 108>syncd RUN 06C0D9 03C 0 0
    pvthread+006D00 109>syncd RUN 06D0DB 03C 0 0
    pvthread+006E00 110>syncd RUN 06E0DD 03C 0 0
    pvthread+006F00 111>syncd RUN 06F0DF 03C 0 0
    pvthread+007100 113>cron RUN 071013 03C 0 0 F10000E31CB80A00
    pvthread+007D00 125>syslogd RUN 07D0FF 03C 0 0
    pvthread+007F00 127>inetd RUN 07F0FF 03C 0 0
    pvthread+008000 128>xntpd RUN 080001 030 0 0
    pvthread+008100 129>snmpdv3n RUN 081003 03C 0 0
    pvthread+008200 130>hostmibd RUN 082011 03C 0 0
    pvthread+008300 131>snmpmibd RUN 083011 03C 0 0
    pvthread+008400 132>aixmibd RUN 08400B 03C 0 0
    pvthread+008500 133>lcfd RUN 08501D 03C 0 0
    pvthread+008700 135>qdaemon RUN 08701F 03C 0 0 F1000092A0049200
    pvthread+008B00 139>A_Product-a RUN 08B053 03C 0 0
    pvthread+008C00 140>A_Product-b RUN 08C099 03C 0 0
    pvthread+008D00 141>A_Product-b RUN 08D029 03C 0 0
    pvthread+008E00 142>A_Product-d RUN 08E01D 03C 0 0
    pvthread+008F00 143>A_Product-e RUN 08F01F 03C 0 0
    pvthread+009000 144>A_Product-f RUN 090021 03C 0 0
    pvthread+009100 145>Spwatch RUN 091023 03C 0 0
    pvthread+009200 146>getty RUN 09202B 03C 0 0
    pvthread+009400 148*rmcd RUN 094033 027 0 0
    pvthread+009500 149>Spwatch RUN 09502B 03C 0 0
    pvthread+009600 150>A_Product-b RUN 09602F 03C 0 0
    pvthread+009700 151>Spwatch RUN 09702F 03C 0 0
    pvthread+009A00 154>Spagentd RUN 09A037 03C 0 0
    pvthread+009B00 155>log_filt RUN 09B037 03C 0 0
    pvthread+009C00 156>Spagentd RUN 09C03B 03C 0 0
    pvthread+009D00 157>rmcd RUN 09D03D 03C 0 0
    pvthread+009E00 158>Spagentd RUN 09E041 03C 0 0
    pvthread+009F00 159>A_Product-b RUN 09F03F 03C 0 0
    pvthread+00A000 160>Spagentd RUN 0A0043 03C 0 0
    pvthread+00A100 161>Spagentd RUN 0A1045 03C 0 0
    pvthread+00A200 162>Spagentd RUN 0A2045 03C 0 0
    pvthread+00A300 163>Spagentd RUN 0A3047 03C 0 0
    pvthread+00A400 164>Spwatch RUN 0A404B 03C 0 0
    pvthread+00CF00 207>IBM.CSMA RUN 0CF09F 03C 0 0
    pvthread+00E800 232>IBM.DMSR RUN 0E80D3 03C 0 0
    pvthread+012300 291>ns-httpd RUN 123067 03C 0 0
    pvthread+012700 295>ats_agen RUN 127081 03C 0 0
    pvthread+012900 297>sort RUN 12907F 03C 0 0
    pvthread+012F00 303>java RUN 12F029 03C 0 0
    pvthread+013000 304>ns-httpd RUN 13007D 03C 0 0
    pvthread+013100 305>uxwdog RUN 131099 03C 0 0
    pvthread+013200 306>java RUN 1320A3 03C 0 0
    pvthread+013E00 318>ns-httpd RUN 13E085 03C 0 0
    pvthread+014200 322>ns-httpd RUN 142089 03C 0 0
    pvthread+014500 325>ns-httpd RUN 14508F 03C 0 0
    pvthread+014600 326>ns-httpd RUN 146091 03C 0 0
    pvthread+014700 327>ns-httpd RUN 147093 03C 0 0
    pvthread+014800 328>ns-httpd RUN 148095 03C 0 0
    pvthread+014A00 330>ns-httpd RUN 14A099 03C 0 0
    pvthread+017400 372>virusw RUN 1740EB 03C 0 0
    pvthread+017900 377>java RUN 1790F7 03C 0 0
    pvthread+017E00 382>java RUN 17E0FD 03C 0 0
    pvthread+018000 384>java RUN 180001 03C 0 0
    pvthread+018100 385>java RUN 181003 03C 0 0
    pvthread+018200 386>java RUN 182005 03C 0 0
    pvthread+018300 387>java RUN 183007 03C 0 0
    pvthread+018D00 397>java RUN 18D0A5 03C 0 0
    pvthread+019100 401>dm_ep_en RUN 19109F 03C 0 0
    pvthread+01A400 420>getlvodm RUN 1A4007 03C 0 0
    pvthread+01BC00 444>telnetd RUN 1BC095 03C 0 0
    pvthread+01D600 470>ats_agen RUN 1D60ED 03C 0 0

    =======================================================================================

    3. Judging from the stacks, A product is locking the following 2 threads out of 141 threads.

    pvthread+009400 148*rmcd RUN 094033 027 0 0
    pvthread+01A400 420>getlvodm RUN 1A4007 03C 0 0

    I'd like to know where the number 77 (threads) came from.

    4. As far as I know, the rmcd daemon uses a lot of resources. We had a similar experience in the past, in which the system hung while the rmcd daemon and A product were both running. I'm not sure whether it was caused by an interoperability problem with A product or by a lack of resources. In any case, what I want to know is whether the rmcd daemon itself has a problem.
    Please answer the above 4 questions as soon as possible.
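
    Regarding questions 2 and 3: IBM's transcript counts WCPU threads by filtering the th -r listing (th -r | egrep -v wait | grep pvthread | wc -l). The same tally can be repeated offline on a saved copy of the listing. Below is a rough Python sketch, assuming each row has the form shown above (slot, index, a marker character, then the name). The function name runnable_by_name is made up for illustration, and unlike egrep -v wait it skips only rows whose name is exactly wait.

```python
import re
from collections import Counter

# One `th -r` row: slot, thread-table index, a marker character, name, state.
ROW = re.compile(r"^\s*(pvthread\+[0-9A-Fa-f]+)\s+(\d+)([>!*])(\S+)\s+(\S+)")


def runnable_by_name(text, skip=("wait",)):
    """Count listed threads per process name, skipping the names in `skip`."""
    counts = Counter()
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, blank line, <snip>, separator, etc.
        name = match.group(4)
        if name in skip:
            continue
        counts[name] += 1
    return counts


# A few rows copied from the listing above, used only as sample input.
sample = """\
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000200 2>wait RUN 002005 0FF 0 00000 0
pvthread+007F00 127>inetd RUN 07F0FF 03C 0 0
pvthread+009400 148*rmcd RUN 094033 027 0 0
pvthread+009D00 157>rmcd RUN 09D03D 03C 0 0
"""

counts = runnable_by_name(sample)
print(sum(counts.values()), dict(counts))  # prints: 3 {'inetd': 1, 'rmcd': 2}
```

    Running it over the full listing gives a total and a per-name breakdown that can be compared against the 77 in the summary and the 272 in the transcript.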


  • 2.  Re: Kernel ....i need help

    Posted Thu December 21, 2006 09:42 AM

    Originally posted by: MarkTaylor


    Erm .. I hate to say this because it sounds terse, but call IBM support and ask them to explain in greater detail. You have a support contract that you pay good money for. If you are not getting the support you require, then escalate or raise a customer complaint. What level of support dealt with your issue? Level 1 / 2 / 3?

    Rgds
    Mark Taylor


  • 3.  Re: Kernel ....i need help

    Posted Thu December 21, 2006 11:06 AM

    Originally posted by: SystemAdmin


    The dump was in good hands, so I would suggest re-opening your ticket. I'm pretty sure the same US guy (or someone else) can give you good answers.
    Btw, did you follow the instructions that the same dump needs to be examined further by the owners of the h....d kernel extension?