DS3524 not responsive

  • 1.  DS3524 not responsive

    Posted Mon November 27, 2023 02:11 PM

    Hello. I have a DS3524 with a single controller, connected to a server via an LSI SAS2 adapter.

    A few weeks ago the link between the DS3524 and the server dropped briefly. After some time the link was restored.

    Yesterday the link went down completely. I no longer see the disk in Windows. In DS Storage Manager the DS3524 shows with out-of-band status and, after a few minutes, changes to the Unresponsive state. Ping to the controller works, and I can connect to the controller via telnet. The command smcli -d -v shows the controller's IP addresses and the state Unresponsive.

    I tried switching the DS3524 off and on, but there is still no link.

    Is it possible to revive the DS3524?

    Many thanks!



    ------------------------------
    Andrew M
    ------------------------------


  • 2.  RE: DS3524 not responsive

    Posted Tue November 28, 2023 11:02 AM
    Edited by Andres Parada Wed November 29, 2023 05:03 PM

    Hello Andrew, 

    Given that the CTRL is currently unresponsive, we first need to know the LED status of this CTRL, which can be seen on the rear side. In addition, please connect to the CTRL via telnet and run the following commands:
    vdmShowDriveList
    evfShowOwnership
    rdacMgrShow
    cmgrShow
    evfShowAllVols
    excLogShow

    As soon as I get the results, I will check them and try to assist you.


    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 3.  RE: DS3524 not responsive

    Posted Fri December 01, 2023 01:43 AM

    Thanks for the answer!

    All of the listed commands are unknown on the controller.

    Only excLogShow works.

    The log is:

    ---- Log Entry #11 APR-27-2018 12:31:49 PM ----
    04/27/18-17:57:49 (IOSymbol2): PANIC: Invalid response sense data:0x110e4010 or replyMessage:0x0

    Stack Trace for
    Executing moduleShow(0,0,0,0,0,0,0,0,0,0) on controller A:

    MODULE NAME     MODULE ID  GROUP #    TEXT START DATA START  BSS START
    --------------- ---------- ---------- ---------- ---------- ----------
    RAID              0xebf788          3  0x5f26a60  0x80f4b08  0x81652d0
    RAID1            0x1477658          4  0x1477f20  0x1bc4408  0x1bdef78
    Debug            0x1ea44e0          5  0x2306620  0x24b24a0  0x24b5c38
    IOSymbol2:
    0x0026092c vxTaskEntry  +0x5c : vkiTask (0x11000468)
    0x0017152c vkiTask      +0xec : 0x05f7d6e4 ()
    0x05f7d880 iop::IoScheduleManager::srcOpTask(iop::IoScheduleManager::TaskControl *, scsi::Op *+0x1a0: cmd::CmdManager::process(scsi::Op *) ()
    0x01702c54 cmd::CmdManager::process(scsi::Op *)+0xf4 : 0x01a70c20 ()
    0x01a70c94 Thunk for (offset -4) ql::QlManager::~QlManager()+0x9634: 0x06994904 ()
    0x06994948 symrpc::SymbolManager::utmCmdHandler(scsi::Op *)+0x48 : symrpc::UtmService::handleCommand(scsi::Op *) ()
    0x069af038 symrpc::UtmService::handleCommand(scsi::Op *)+0x3f8: slbSendStatus ()
    0x05fd2a80 slbSendStatus+0x140: 0x05fda7e4 ()
    0x05fda980 normalIoStart+0x1a0: setChkCondOrResConflict(scsi::Op *) ()
    0x05fdcea4 setChkCondOrResConflict(scsi::Op *)+0x44 : htd::HtdItnCmdIoStart(scsi::Op *) ()
    0x05fc630c htd::HtdItnCmdIoStart(scsi::Op *)+0x4cc: 0x06051dc4 ()
    0x06051df0 sas::LtdItn::sendCmdComplete(scsi::Op *)+0x30 : sas::sasIoInSendStatus(sas::_CMD *, unsigned char *, int, unsigned char) ()
    0x06061530 sas::sasIoInSendStatus(sas::_CMD *, unsigned char *, int, unsigned char)+0x730: _vkiCmnErr__link ()
    0x0016c5e4 _vkiCmnErr   +0x104: 0x0016c820 (0x56a038, 0x7e00dc0, 0x21e07f0)
    0x0016cbd0 vkiLogShow   +0x570: sxCallback (0x28, 0x5cd33c)
    0x0015c790 sxCallback   +0x90 : 0x01488b44 ()
    0x01488be8 ddcAssertPanicCallback+0xa8 : ddc::DdcManager::ddcInterruptTriggerHandler() ()
    0x01488f9c ddc::DdcManager::ddcInterruptTriggerHandler()+0x23c: ddc::DdcLogMisc::logMisc(REBOOT_REASON) ()
    0x0148784c ddc::DdcLogMisc::logTaskSynopsisInfo(int)+0x12c: 0x014b5974 ()
    0x014b5974 scap::CaptureManager::captureData(const char *, int, bool)+0x6f4: _vkiPrintf__link ()
    0x0016abc4 _vkiPrintf   +0x64 : _vkiVPrintf (0x1af8aa4, 0x21e03d0)

    ---- Log Entry #12 NOV-21-2023 05:02:28 AM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #13 NOV-21-2023 05:02:28 AM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #14 NOV-22-2023 06:46:07 AM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #15 NOV-22-2023 06:46:07 AM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #16 NOV-22-2023 07:34:28 AM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #17 NOV-22-2023 07:34:29 AM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #18 NOV-22-2023 11:35:59 AM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #19 NOV-22-2023 11:35:59 AM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    value = 1 = 0x1

    All LEDs on the disks are off. They blink only when I release a disk.

    The rear side looks as shown in the picture. The power supply in the picture is not connected to the mains; the main power supply is the second PS!



    ------------------------------
    Andrew M
    ------------------------------



  • 4.  RE: DS3524 not responsive

    Posted Fri December 01, 2023 03:17 AM
    Edited by Mousa Hammad Fri December 01, 2023 05:15 AM

    Hello Andrew,
    The commands could not be run because the system did not finish the startup sequence and stopped with 0F on the LED display.
    LED status 0F means "Application Start"; this is one of the system startup checkpoints.

    In the excLogShow output I can see the message "ECC correctable error threshold exceeded" reported on 21 and 22 November.
    Please try to run the following command to resolve the issue:
    clearHardwareLockdown


    Also try to run this command, in case the system accepts it:
    ccmInvalidateCacheStoreData

    If the LED status still shows "0F", please check whether the 'Autoload Disable' option in the boot operation menu is enabled. It should be Off.
    You can check or change it by opening the boot operation menu with the command "M", then selecting options 12, 7, 0 in sequence in the boot menu to reach this option.
    We had a case a long time ago where 'Autoload Disable' had been changed for some unknown reason.
    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 5.  RE: DS3524 not responsive

    Posted Fri December 01, 2023 07:54 AM

    Here are the results of the commands:

    clearHardwareLockdown

    value = 0 = 0x0


    ccmInvalidateCacheStoreData

    C interp: unknown symbol name 'ccmInvalidateCacheStoreData'.


    M option -
    12
                    CHANGE HARDWARE CONFIGURATION MENU

        -------SOFTWARE SWITCH OPTION--------  --CURRENT--  --DEFAULT--
     1) Switch #1 (PCI Device Config Disable)      Default      Off
     2) Switch #2 (Manufacturing Diagnostics)      Default      Off
     3) Switch #3 (Invoke Boot Menu)               Default      Off
     4) Switch #4 (Continuous Diagnostics)         Default      Off

        ----------SOFTWARE OPTION------------  --CURRENT--
     5) Option #1 (Extensive Diagnostics)          Off
     6) Option #2 (Diagnostics Disable)            Off
     7) Option #3 (Autoload Disable)               Off
     8) Option #4 (Network Enable)                 Off     (NVSRAM Enabled)


    7
    Disk Array Controller - Model 2660

    Board Name:            LSI Logic RAID Controller
    OEM Designation:       LSI
    Board Serial Number:   SV22128343
    Board Part Number:     45233-06
    Schematic Number:      41211-02
    Manufacture Source:    V037846 3LCN01
    Manufacture Date:      05/27/2012
    Board Identifier:      2660
    Vendor Id:             IBM
    Product Id:            1746      FAStT
    Product Revision:      1070
    Ethernet Node Address: 0080E52F1B42
    Battery0 Installation: 04/23/2013
    Battery1 Installation: 12/19/2054
    Subsystem Name:

    Board date and time:   12/01/2023 02:49:04 Fri
    System date and time:  12/01/2023 11:17:22 Fri



    ------------------------------
    Andrew M
    ------------------------------



  • 6.  RE: DS3524 not responsive

    Posted Fri December 01, 2023 12:16 PM

    Hello Andrew,

    Thanks for providing the output. 'Autoload Disable' is set to Off, which is correct. You have already run the command "clearHardwareLockdown". Please now reboot the CTRL by running the command "sysReboot".

    If the issue persists, power cycle the system by switching it OFF and ON, then check again. If the problem still persists, we can do nothing more and the CTRL needs to be replaced.

    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 7.  RE: DS3524 not responsive

    Posted Thu December 07, 2023 10:16 AM

    Many thanks!!!

    It works! But only after a manual reboot via OFF-ON.

    But what happened to it?



    ------------------------------
    Andrew M
    ------------------------------



  • 8.  RE: DS3524 not responsive

    Posted Fri December 08, 2023 10:03 AM

    Hello Andrew,

    Glad to hear that the CTRL is now up and running. It seems that some flags were not set correctly, which blocked the startup sequence. A HW power cycle was required to clear this condition. This is an unusual case that we hit many years ago at another customer site.

    All the best, and have a great weekend.

    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 9.  RE: DS3524 not responsive

    Posted Fri December 08, 2023 12:23 PM

    Why do I see this in the log:

    "ECC correctable error threshold exceeded", reported on 21 and 22 November?

    I think the DS lost the link to the host after that.



    ------------------------------
    Andrew M
    ------------------------------



  • 10.  RE: DS3524 not responsive

    Posted Mon December 11, 2023 08:33 AM
    Edited by Mousa Hammad Mon December 11, 2023 08:33 AM

    Hello Andrew,

    We do not provide an RCA for EoS products. But from the excLogShow output we can see that an "ECC correctable error" was encountered on a DIMM, after which the CTRL was locked down. Once we unlocked the CTRL via the CLI, it seems a power cycle was still needed to bring the CTRL up.

    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 11.  RE: DS3524 not responsive

    Posted Tue April 02, 2024 05:45 AM

    Sorry to bother you.

    You said it may be caused by a DIMM error.

    Would it help to install a new DIMM module?



    ------------------------------
    Andrew M
    ------------------------------



  • 12.  RE: DS3524 not responsive

    Posted Tue April 02, 2024 06:19 AM

    Hello Andrew,

    If the system has run without issues since December last year and the ECC correctable errors were logged only on 21 and 22 November, there is no need to replace any part. If the problem occurs again, we need a CASD (Collect All Support Data) capture from the system to check the current situation.
    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 13.  RE: DS3524 not responsive

    Posted Tue April 02, 2024 09:28 AM

    No. Last week there were 3 failures.

    Is it possible to install a non-branded DIMM in the CTRL?



    ------------------------------
    Andrew M
    ------------------------------



  • 14.  RE: DS3524 not responsive

    Posted Wed April 03, 2024 04:37 AM

    Hello Andrew,

    Please provide the output of the following commands from the affected CTRL so we can check whether the issue still has the same trigger:

    loadDebug
    excLogShow
    hwLogShow

    We have never used a non-branded DIMM in this system, so I cannot answer that question.
    I will be on vacation from 04.04.2024 until 18.04.2024.
    Best regards, Mousa



    ------------------------------
    Mousa Hammad
    ------------------------------



  • 15.  RE: DS3524 not responsive

    Posted Wed April 03, 2024 01:17 PM

    loadDebug

    value = 1

    excLogShow

    ---- Log Entry #43 MAR-30-2024 01:52:41 PM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #44 MAR-30-2024 10:30:17 PM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #45 MAR-30-2024 10:30:17 PM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #46 APR-01-2024 11:16:21 PM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #47 APR-01-2024 11:16:21 PM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #48 APR-02-2024 12:18:42 AM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #49 APR-02-2024 12:18:42 AM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

    ---- Log Entry #50 APR-02-2024 12:52:27 AM ----
    ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 0 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 4 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 4 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 5 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 5 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 6 Bad TLP Count 1572864 exceeds threshold 16
    ERROR: Port 6 Bad DLLP Count 1073743408 exceeds threshold 16
    ERROR: Port 2/6 Rx Err Count 24 exceeds threshold 16

    ---- Log Entry #51 APR-02-2024 12:52:27 AM ----
    ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18


    hwLogShow

    -1
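
The Bad TLP, Bad DLLP and Rx Err values in the excLogShow entries are the same on every port and in every entry. A minimal sketch, using only the numbers copied from the log in this thread, prints them in hexadecimal so the repetition is easier to compare. The helper name as_hex is hypothetical, and the output says nothing about the root cause; it is only a different view of the same numbers.

```python
# Counter values copied verbatim from the excLogShow entries in this thread.
COUNTERS = {
    "Bad TLP Count": 1572864,
    "Bad DLLP Count": 1073743408,
    "Rx Err Count": 24,
}


def as_hex(value):
    """Format an integer as zero-padded 32-bit hexadecimal (hypothetical helper)."""
    return f"0x{value:08x}"


if __name__ == "__main__":
    for name, value in COUNTERS.items():
        # Decimal value as logged, followed by the same value in hex.
        print(f"{name:15} {value:>11} {as_hex(value)}")
```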



    ------------------------------
    Andrew M
    ------------------------------



  • 16.  RE: DS3524 not responsive

    Posted Thu April 04, 2024 07:50 AM

    Hello Andrew,
    As my colleague Mousa is on vacation, he asked me to respond here.
    We still see recent ECC entries in the output. The parts most likely causing this are the DIMMs. In the end it could also be the DIMM slots, but we will only know that after the DIMMs have been replaced.
    Kind regards, Sabine



    ------------------------------
    Sabine Gronert
    ------------------------------
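
Several replies in this thread depend on knowing on which days the "ECC correctable error threshold exceeded" message was logged. The following is a minimal sketch, assuming the excLogShow output has been saved as plain text with entry headers in the format shown in the thread. The function name ecc_entries_per_day and the SAMPLE excerpt are illustrative only; this is not an IBM tool.

```python
import re
from collections import Counter
from datetime import date

# Entry header as it appears in the thread, e.g.
# "---- Log Entry #43 MAR-30-2024 01:52:41 PM ----"
ENTRY_RE = re.compile(r"Log Entry #(\d+)\s+([A-Z]{3})-(\d{2})-(\d{4})\s")

# Explicit month table so parsing does not depend on the system locale.
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

ECC_TEXT = "ECC correctable error threshold exceeded"


def ecc_entries_per_day(log_text):
    """Count, per calendar day, the log entries that contain an ECC threshold error."""
    per_day = Counter()
    current_day = None
    counted = False  # ensures each entry is counted at most once
    for line in log_text.splitlines():
        match = ENTRY_RE.search(line)
        if match:
            _, month, day, year = match.groups()
            current_day = date(int(year), MONTHS[month], int(day))
            counted = False
        elif current_day and not counted and ECC_TEXT in line:
            per_day[current_day] += 1
            counted = True
    return per_day


# Short excerpt of the output posted in this thread (entry #44 is shortened).
SAMPLE = """\
---- Log Entry #43 MAR-30-2024 01:52:41 PM ----
ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

---- Log Entry #44 MAR-30-2024 10:30:17 PM ----
ERROR: Port 0 Bad TLP Count 1572864 exceeds threshold 16

---- Log Entry #45 MAR-30-2024 10:30:17 PM ----
ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18

---- Log Entry #47 APR-01-2024 11:16:21 PM ----
ERROR: Type-I Port 0 ECC correctable error threshold exceeded reg 0xf1a val 0x18
"""

if __name__ == "__main__":
    for day, count in sorted(ecc_entries_per_day(SAMPLE).items()):
        print(day.isoformat(), count)
```

Running the same function over a full capture shows at a glance whether the ECC entries are confined to a few days or keep recurring.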