Cloud Global

30-sec SRE - Post-Incident Review

    IBM TechXchange Speaker
    Posted 20 days ago

    As SREs, we treat incidents not only as something to fix but also as an opportunity to learn. This requires us to dig deep into the incident, recognizing that there is rarely a single root cause. More typically, we see a "perfect storm" of multiple contributing factors that come together to produce the incident.

    When I studied Theoretical Medicine (in addition to Computer Science) at the Technical University of Munich, one of the highlights was attending the forensics class, where we discussed autopsies. It was amazing to witness the scrutiny and due diligence involved in searching for evidence and linking it together to understand what had happened.

    A post-incident review identifies why the incident happened and puts sufficient measures in place so that a similar incident does not recur. The postmortem process aims to improve your team's adaptive capacity, which is key to resilience. Postmortems should highlight what went well during an incident, in addition to describing what went wrong.
     


    ------------------------------
    Ingo Averdunk
    Distinguished Engineer
    IBM
    ------------------------------