Automated Remediation: Fail Fast, Fix Fast

By Mike Mallo posted Thu October 20, 2022 02:06 PM

September 29, 2022

A recurring question in the DevOps community is, “How do you fix the many incidents that happen during or right after a release?” The short answer: detect incidents fast and reduce the time spent on remediation.

Rolling out new software can be overwhelming, and releases are often cited as the cause of the majority of incidents at most companies. While developers work toward ever-shorter release cycles, the SRE’s focus is to ensure the stability and availability of the application. Why? Because your users expect both new features and a reliable service.

Meme discussing production deployment

But could causal artificial intelligence (automated problem remediation based on root causes) be the solution? Let’s say you have been practicing observability-driven development with a sophisticated observability platform: auto-discovery is implemented in your application, and you can observe your infrastructure and your programming languages on-premises or in the cloud. You are capturing metrics at one-second granularity and 100% of your end-user transactions, so for any specific customer issue you know what the customer experienced, and you are alerted when that experience is poor.

The starting point for every type of remediation is to observe and detect what’s wrong. The longer it takes your team to detect an issue, the longer it takes to begin the remediation process. That means that once automated AIOps is added, the difference between 1-second detection and detection in 10 seconds or more becomes enormous. If you’re not detecting failing code within 1 second, consider an observability tool with faster detection. Once you have that, you can see which part of your code isn’t working. Whew!
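To make the effect of detection granularity concrete, here is a minimal sketch. It assumes a simplified model in which a fault only becomes visible at the next metric sampling boundary; the function name and the millisecond values are illustrative and do not come from any particular observability product.

```python
def detection_delay_ms(fault_start_ms: int, granularity_ms: int) -> int:
    """Time between a fault starting and the next metric sample that can show it.

    Assumes (hypothetically) that samples are taken at exact multiples of
    granularity_ms and that a single sample is enough to detect the fault.
    """
    if granularity_ms <= 0:
        raise ValueError("granularity_ms must be positive")
    # Round the fault start up to the next sampling boundary (integer ceiling).
    next_sample_ms = -(-fault_start_ms // granularity_ms) * granularity_ms
    return next_sample_ms - fault_start_ms


# A fault that starts 3.2 seconds after a release goes live:
print(detection_delay_ms(3_200, 1_000))   # 800 ms with one-second metrics
print(detection_delay_ms(3_200, 10_000))  # 6800 ms with ten-second metrics
```

In this model the worst-case delay equals the sampling interval, and any automated remediation step can only start after that delay has passed.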

How do we achieve the second aspect of this equation?

An observability tool has many benefits, including tracking your most critical SLOs. It lets you release with confidence by catching poor-quality code before it reaches production. Using automation to check the health of your software as a new release rolls out allows you to decide whether to promote the canary. Can the same automation be applied to your remediation process? And why do you need to fix incidents fast?
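The canary decision described above can be sketched as a simple gate that compares canary health against the current baseline. This is only an illustration: the `HealthSample` shape, the `should_promote` function, and all threshold values are assumptions, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Health observed for one version over a window (hypothetical shape)."""
    requests: int
    errors: int
    p95_latency_ms: float


def should_promote(canary: HealthSample, baseline: HealthSample,
                   max_error_rate_delta: float = 0.01,
                   max_latency_ratio: float = 1.2,
                   min_requests: int = 100) -> bool:
    """Return True only if the canary is not meaningfully worse than baseline."""
    if canary.requests < min_requests:
        return False  # not enough traffic to judge the canary yet
    canary_rate = canary.errors / canary.requests
    baseline_rate = baseline.errors / baseline.requests if baseline.requests else 0.0
    if canary_rate - baseline_rate > max_error_rate_delta:
        return False  # error rate regressed beyond the allowed delta
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return False  # latency regressed beyond the allowed ratio
    return True


baseline = HealthSample(requests=10_000, errors=20, p95_latency_ms=180.0)
print(should_promote(HealthSample(500, 2, 190.0), baseline))   # True
print(should_promote(HealthSample(500, 25, 190.0), baseline))  # False
```

Comparing against a live baseline rather than fixed limits keeps the gate meaningful when overall traffic or latency shifts during the rollout.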

The cost of downtime keeps increasing. Surveys suggest that the average annual cost of unplanned application downtime for Fortune 1000 companies is $1.25 billion to $2.5 billion. Aside from the financial implications, broken SLOs lead to customer dissatisfaction. With social media and online reviews amplifying that dissatisfaction, it takes less than a minute for your customers to compare your service with others, and they might look elsewhere.

A tarnished brand reputation can discourage new prospects from trying your services. So what is automated problem remediation?

What is automated remediation?

definition of automated remediation

Automated remediation of an incident enables your team to fix the issue faster. It ranges from basic alerting mechanisms and logging to fully automated remediation. To benefit fully from automation, organizations should work their way up through these levels rather than jump straight to the top. Causal AI identifies and uses cause-and-effect relationships to go beyond correlation-based predictive models, toward AI systems that can prescribe actions more effectively and act more autonomously.

comparing manual incident response and automated incident response

Let’s start by reviewing the benefits of automated remediation:

Save time, improve security, ensure consistency, and log compliance continuously:

Increased efficiency: You no longer have to react and take action manually. The system takes actions based on past remediation, which frees your team for higher-value tasks and leads to faster MTTR. At the enterprise level especially, the time saved is significant.
Increased security: Vulnerabilities and problems are addressed immediately upon discovery, which prevents issues from escalating into incidents. For example, a deployment rollback happens automatically.
Consistency: Every action runs with exactly the same workflow, so organizations can be sure that the prescribed procedures are always followed correctly.
Continuous compliance logging: Real-time corrections are logged as proof that cloud environments stay compliant, rather than relying on periodic audits. This reduces risk for the business.

What are the requirements for rapid problem remediation that prevents downtime?

Recent reports suggest that the most critical issue during remediation is manual toil (lack of automation), including communication challenges such as using the right runbooks or reaching the right people.

You should get an observability application with automated remediation that detects problems, their underlying root causes, and their SLO impact across your full-stack production deployments.

We’ve identified these top five use cases for automated problem remediation:

Top five benefits of automated remediation

Feature flag settings: Observe application and service behaviors, identify error-causing feature flags, and switch them accordingly to keep environments stable.
Process restarts (for example, JVM memory leaks): Trigger a service restart or related actions for applications whose underlying bug fixes have been deprioritized or delayed.
Kubernetes resource adoption: Act on external, holistic, and customer-centric behavior observations, rather than only on internal parameters, and automatically roll out Custom Resource Definitions (CRDs) to designated environments.
Deployment and rollback: Trigger predefined rollback or roll-forward actions when a faulty deployment violates SLOs or consumes your error budget beyond the target.
Targeted notifications: Based on the auto-detected details of underlying root causes, keep your business and technical users, SREs, and Operations team updated on ongoing remediation actions, and escalate if the situation requires higher visibility.
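As a rough illustration of how several of these use cases fit together, the sketch below maps a detected root cause to a planned action and uses the remaining error budget to decide on a rollback. Everything here is hypothetical: the `Incident` fields, the root-cause labels, the action strings, and the 25% rollback threshold are assumptions made for the example, and the code only plans actions rather than calling any real platform API.

```python
from dataclasses import dataclass
from typing import List


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left in a window, between 0.0 and 1.0."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO or an empty window leaves no budget
    return max(0.0, 1.0 - failed_requests / allowed_failures)


@dataclass
class Incident:
    root_cause: str          # hypothetical labels, see plan_actions
    target: str              # affected flag, process, or deployment
    budget_remaining: float  # output of error_budget_remaining


def plan_actions(incident: Incident, rollback_below: float = 0.25) -> List[str]:
    """Return the remediation steps to run, in order, as plain strings."""
    actions: List[str] = []
    if incident.root_cause == "feature_flag":
        actions.append(f"disable_flag:{incident.target}")
    elif incident.root_cause == "memory_leak":
        actions.append(f"restart_process:{incident.target}")
    elif incident.root_cause == "bad_deployment":
        # Roll back only once the error budget is nearly exhausted.
        if incident.budget_remaining < rollback_below:
            actions.append(f"rollback:{incident.target}")
    else:
        # Unknown causes go to a human instead of being auto-fixed.
        actions.append(f"escalate:{incident.target}")
    # Every incident ends with a targeted notification.
    actions.append(f"notify:{incident.target}")
    return actions


budget = error_budget_remaining(slo_target=0.999, total_requests=100_000,
                                failed_requests=90)
print(plan_actions(Incident("bad_deployment", "checkout-service", budget)))
```

Keeping the plan as data makes it easy to log every decision for compliance and to review the rules before letting them run unattended.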

Check out our previous post on how Instana wants to help you resolve issues faster. If you are interested in helping us with our research in this area and potentially trying out some prototype solutions, please contact jeffh@ca.ibm.com. Curious to test-drive Instana’s fast incident detection capabilities? Sign up for an Instana trial right now and get the level of visibility and contextual information you need to solve incidents fast.
