Reducing MTTR with Instana and RNA workflows - watsonx granite

By Arthur De Magalhaes posted 28 days ago

  

AI: an SRE's best friend

Over the past 6 months Instana has been supercharged by AI, leveraging the latest granite models from Watsonx to solve really tough challenges that have taxed SREs for too long.  Probably the most widely used metric for IT incidents is MTTR - the mean time to repair an issue - and it translates directly into monetary cost.  For example, the amount of money your cell phone provider owes you back can depend on how long it takes them to restore service, as defined by the SLA (Service Level Agreement).

It's not uncommon to see SREs tasked with SLOs (Service Level Objectives) that have multiple 9's - e.g. an SLO of 99.999% availability means an error budget of only 5.26 minutes per year!   Often a human won't even respond within 5 minutes if an incident occurs in the middle of the night, so how can SREs keep this up?  Only with the help of AI and automation.
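To make the arithmetic behind that error budget concrete, here is a minimal Python sketch. The function name `error_budget_minutes` is just an illustrative choice, and the year is assumed to average 365.25 days, which is what yields the 5.26 minutes quoted above.

```python
# Average year length in minutes, assuming 365.25 days to account for leap years.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def error_budget_minutes(slo_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (1 - slo_percent / 100) * MINUTES_PER_YEAR


# Each extra 9 shrinks the budget by a factor of ten.
for slo in (99.9, 99.99, 99.999):
    print(f"{slo}% -> {error_budget_minutes(slo):.2f} minutes per year")
```

Running this for five 9's gives the 5.26 minutes mentioned above, which is less time than it takes most people to open a laptop at 3 a.m.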

AI Background

Starting the journey: AI generated automation with Instana

The latest private and public previews from Instana have explored themes such as probable root cause and incident summarization, which are proving to be key accelerators for the diagnostic and collaborative activities surrounding an IT incident.  Here we'll zoom into two parallel themes that leverage generative AI, essentially answering two questions:  what to do and how to do it.


In Instana we help SREs figure out the next steps - which can be deep diagnostics or remediation steps - via manual runbooks.  We provide a set of built-in manual runbooks that were generated with Watsonx and then curated by SMEs - e.g. our generated k8s runbooks were carefully tested and edited by k8s experts.   These are available as built-in actions within an issue or incident:

[Image: Manual built-in]
If there isn't a suitable built-in action you can switch to the live generation tab and edit the prompt that goes into Watsonx:
[Image: Manual generation]
These steps are always generated with our latest tested models and go through pre-processing and post-processing pipelines to ensure the best accuracy possible.  The goal is that during an incident these steps can be taken by SREs to restore service, and once operations are back to normal the SRE can then focus on generating and curating automations that can be used the next time a similar incident happens.
They can start this task by analyzing the built-in genAI automations - which were created and curated in a similar fashion to the manual runbooks - and come in the form of Bash scripts and Ansible playbooks:
[Image: Automation built-in]
If there isn't a built-in automation in the catalog, the SRE can pick an existing manual runbook and generate a brand new automation from one of its steps.  Currently this is available in a private preview for Bash and is coming soon for Ansible:
[Image: Automation generation]
That's awesome!  SREs now have different ways to generate and grow their automation portfolio.  The natural next step is to start stitching these together - for example, you may want to fetch some diagnostics from a failing Pod, restart the Pod, and then send all of the logs to a ticket for auditing purposes.  The steps shown above can definitely generate all of this for you, but how do you put this flow together?
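Before looking at tooling, it can help to see the bare shape of such a flow. The following is only a conceptual sketch in plain Python, not how Instana or RNA implement it: the three step functions are hypothetical placeholders that pass a shared context along, and `run_sequence` stops at the first failure so that a Pod is never restarted before its diagnostics were captured.

```python
from typing import Callable

# A step takes the shared context and returns an updated copy of it.
Step = Callable[[dict], dict]


def run_sequence(steps: list[tuple[str, Step]], context: dict) -> dict:
    """Run the steps in order, stopping at the first one that raises."""
    log = []
    for name, step in steps:
        try:
            context = step(context)
            log.append(f"{name}: ok")
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            break
    return {**context, "log": log}


# Hypothetical placeholder steps; real ones would call your own automations.
def fetch_diagnostics(ctx: dict) -> dict:
    return {**ctx, "diagnostics": f"diagnostics for {ctx['pod']}"}


def restart_pod(ctx: dict) -> dict:
    return {**ctx, "restarted": True}


def attach_logs_to_ticket(ctx: dict) -> dict:
    return {**ctx, "ticket_note": ctx["diagnostics"]}


result = run_sequence(
    [
        ("fetch diagnostics", fetch_diagnostics),
        ("restart pod", restart_pod),
        ("attach logs to ticket", attach_logs_to_ticket),
    ],
    {"pod": "example-pod"},
)
print("\n".join(result["log"]))
```

Writing and maintaining this kind of glue by hand for every incident type gets old quickly, which is where a dedicated workflow tool comes in.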

Orchestrating workflows with RNA

IBM Rapid Network Automation is a powerful API-driven tool that allows the authoring and execution of automation workflows.   The RNA and Instana teams have collaborated to create a workflow that can call any Instana automation from within another workflow in RNA!   This workflow has been published to the external automation repository and can be easily downloaded:

Once this workflow is imported into RNA it becomes part of the palette of automations available when authoring a new workflow:
The workflow takes in three simple parameters:  an auth key, the name of the automation and the FQDN of the Instana Agent where you want to run it.  For the auth key, you just need to enter the Instana subdomain and the API token to use and hit the Test Authentication button to ensure everything is correct:
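As a rough mental model of those three inputs, the sketch below bundles them into small Python data classes and runs a few sanity checks before anything is sent. The class and field names are hypothetical and are not RNA's actual schema. The `<subdomain>.instana.io` base URL assumes a SaaS tenant, while the `apiToken` Authorization header is the scheme Instana's REST API uses for API tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanaAuthKey:
    """Hypothetical stand-in for the auth key: Instana subdomain plus API token."""
    subdomain: str
    api_token: str

    def base_url(self) -> str:
        # Assumption: a SaaS tenant reachable at <subdomain>.instana.io.
        return f"https://{self.subdomain}.instana.io"

    def headers(self) -> dict[str, str]:
        # Instana API tokens are sent as "Authorization: apiToken <token>".
        return {"Authorization": f"apiToken {self.api_token}"}


@dataclass(frozen=True)
class AutomationCall:
    """The three workflow inputs: auth key, automation name, agent FQDN."""
    auth: InstanaAuthKey
    automation_name: str
    agent_fqdn: str

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the inputs look plausible."""
        problems = []
        if not self.auth.subdomain.strip():
            problems.append("subdomain is empty")
        if not self.auth.api_token.strip():
            problems.append("API token is empty")
        if not self.automation_name.strip():
            problems.append("automation name is empty")
        if "." not in self.agent_fqdn or " " in self.agent_fqdn:
            problems.append("agent host does not look like an FQDN")
        return problems


call = AutomationCall(
    auth=InstanaAuthKey("example-tenant", "my-api-token"),
    automation_name="Collect host diagnostics",
    agent_fqdn="host01.example.com",
)
print(call.validate() or "inputs look plausible")
```

A local check like this only catches typos; the Test Authentication button is still what confirms that the subdomain and token are actually accepted.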
Now you're ready to start innovating!   You can use the hundreds of APIs available natively with RNA together with your very own Instana automation catalog to create flows and sequences that will cut down hours of application downtime and human intervention.  
As a simple example, let's create a sequence that calls Instana's generated host diagnostic automations:
That's it.  When you hit run you'll see the log outputs right inside RNA, and you can even switch into Instana's automation history to double check that the automations actually ran:
You can follow this approach to author additional workflows that will help in future incidents, and then simply create an HTTP action inside Instana that triggers the workflow and associate it with any issues or incidents.
This integration joins the context-based genAI automations from Instana with the vast RNA API library - and puts you in the driver's seat of the orchestration via a powerful no-code UI experience.   We can't wait to see the amazing workflows you'll create!  

