FlashSystem 5000 Node RAM ECC errors are not reported in the event log

  • 1.  FlashSystem 5000 Node RAM ECC errors are not reported in the event log

    Posted Thu December 22, 2022 05:32 AM
    I don't understand why such important kernel errors from the message log are not passed to the event log!
    The customer only notices the problem when the node reboots.

    kernel: EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x516a89 offset:0xd40 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
    edac_monitor[2525]: Wrote 0x516a89 to /run/edac_monitor/mc/mc0/dimm0/last_ce_page

    kernel: EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x516a8b offset:0xf40 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)


    Something like that should at least be counted and reported once a threshold value has been reached.
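
    As a rough illustration of the counting idea, here is a minimal Python sketch that sums the correctable-error (CE) counts per DIMM from kernel EDAC lines like the two quoted above. The regular expression is derived only from those two sample lines, and the function name `count_correctable_errors` is hypothetical; this is not how the FlashSystem firmware handles these messages.

```python
import re
from collections import Counter

# Pattern derived from the two sample kernel lines in this post (an assumption:
# other kernels or platforms may format EDAC messages differently).
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE memory read error on (?P<dimm>\S+)"
)


def count_correctable_errors(lines):
    """Sum the CE counts reported per (memory controller, DIMM label)."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            key = (int(match["mc"]), match["dimm"])
            totals[key] += int(match["count"])
    return totals


# The two kernel messages from the post; the edac_monitor line is ignored.
SAMPLE = [
    "kernel: EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0x516a89 offset:0xd40 grain:32 syndrome:0x0 - OVERFLOW "
    "area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)",
    "edac_monitor[2525]: Wrote 0x516a89 to /run/edac_monitor/mc/mc0/dimm0/last_ce_page",
    "kernel: EDAC MC0: 3 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 "
    "(channel:0 slot:0 page:0x516a8b offset:0xf40 grain:32 syndrome:0x0 - OVERFLOW "
    "area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)",
]

if __name__ == "__main__":
    for (mc, dimm), total in count_correctable_errors(SAMPLE).items():
        print(f"MC{mc} {dimm}: {total} correctable errors")
```

    Each kernel message carries its own error count (3 in both samples), so the sketch adds those counts rather than counting lines.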

    ------------------------------
    Sebastian Besler vvbasti
    ------------------------------


  • 2.  RE: FlashSystem 5000 Node RAM ECC errors are not reported in the event log

    Posted Tue December 27, 2022 09:36 AM
    The short answer is that we do log an event when the threshold is reached.  However, the threshold is now 100,000 in 24 hours, so you are unlikely to hit it.
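
    To make the "threshold in 24 hours" idea concrete, here is a minimal Python sketch of a sliding-window counter. The class name `CeWindow` and its interface are hypothetical, and the sketch only illustrates the general technique; it says nothing about how the product actually implements its threshold.

```python
from collections import deque
from datetime import datetime, timedelta


class CeWindow:
    """Track correctable errors seen within a sliding time window (hypothetical helper)."""

    def __init__(self, threshold, window=timedelta(hours=24)):
        self.threshold = threshold
        self.window = window
        self.events = deque()  # (timestamp, count) pairs, oldest first
        self.total = 0

    def record(self, when, count):
        """Add `count` errors seen at `when`; return True if the threshold is reached."""
        self.events.append((when, count))
        self.total += count
        # Drop events that have aged out of the window.
        cutoff = when - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, old_count = self.events.popleft()
            self.total -= old_count
        return self.total >= self.threshold


if __name__ == "__main__":
    # 100,000 is the threshold quoted in this reply; the timestamp is illustrative.
    window = CeWindow(threshold=100_000)
    reached = window.record(datetime(2022, 12, 22, 5, 32), 3)
    print("threshold reached:", reached)
```

    With messages that each report 3 errors, as in the original post, it would take 33,334 such messages inside one 24-hour window to reach 100,000.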

    ------------------------------
    Tayfun Arli
    ------------------------------



  • 3.  RE: FlashSystem 5000 Node RAM ECC errors are not reported in the event log

    Posted Wed June 21, 2023 11:04 AM

    I know this post is a little old, but I've been struggling to find information on this, given the issues we have had this week.

    I've just had the unfortunate experience of having to recover from both controllers in our FlashSystem 5000 failing simultaneously due to ECC errors.  Physically, it sounded like one controller was making a whirring noise and going into a reboot loop, while the other just hard crashed.  However, according to the logs that IBM reviewed, there was nothing to indicate that the RAM had failed, likely because it hadn't reached that threshold, and no notification was provided to the customer.

    From what we've been able to determine after the event, the problem built up to the point that it corrupted all the data on the SAN, and the volumes were subsequently unrecoverable.

    The fact that something so critical can cause a storage appliance to literally lose data and go completely offline is certainly not a good look for IBM.  If anything can come out of this, it would be greatly appreciated if IBM could at least change the threshold so that this is logged and brought to someone's attention for replacement sooner; it might just help someone else out in the future.

    Certainly, from a customer point of view, I'm very disappointed with IBM and the fact that this wasn't reported to us sooner.  Fortunately we had backups, so we were able to recover from this event, but it has still wasted a week of our time and has certainly left me questioning whether we replace this unit with another IBM one or look elsewhere when we refresh our storage in the next 12 months.



    ------------------------------
    Michael McDonald
    ------------------------------