Value of stories

By Keith Posner posted Fri September 10, 2021 01:09 PM

  

What are AIOps stories?

A customer recently asked me: “What advantages does the story concept in CloudPak for Watson AIOps provide over the auto-ticketing capabilities that my operations team already has in place?”

For one thing, I answered, stories provide much deeper insights than traditional ticketing systems. CloudPak for Watson AIOps can also be fully synchronised with your existing tools, so you can keep using your ticketing system and current processes while benefiting from those additional insights.

A story is a collection of insights drawn from different data sources such as logs, events, and alerts. Stories can help you build understanding and drive remediation. That definition alone does not convey the real value of stories, so let’s illustrate it with a typical use case.

Example: fan failure leads to application failures

An enterprise company runs a wide range of applications that are used across the company, and those applications are hosted on servers. On one of the server racks, a fan fails and sends some initial alerts to the monitoring system. Shortly afterwards, one of the key power supply units (PSUs) within a server overheats, which in turn causes the server to overheat. Eventually the server fails and the applications running on it become unresponsive. More alerts are received by the monitoring system, and customers begin to call the help desk complaining that their applications are unresponsive.

Automatic development of the story

CloudPak for Watson AIOps goes into action and starts building the story as soon as the first alerts are received from the fan.

Identify relationships between alerts

Because this kind of scenario has occurred before, CloudPak for Watson AIOps immediately tries to match the fan failure alerts to patterns that it identified in earlier scenarios. An optional holdoff period is started to avoid alerting SREs and Operations teams unnecessarily; it can also be switched off so that the story is created immediately. Within the next few moments, as PSU and server alerts come in, CloudPak for Watson AIOps automatically groups the alerts as a recognised temporal pattern.
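To make the holdoff and temporal grouping ideas concrete, here is a minimal Python sketch. It is not how CloudPak for Watson AIOps is implemented: the names Alert, group_alerts, and ready_to_notify, the fixed time window, and the default values are all assumptions made for illustration, and the product matches learned patterns rather than applying a simple window.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    resource: str      # hypothetical resource name, e.g. "fan" or "psu"
    timestamp: float   # seconds since the start of the scenario


def group_alerts(alerts, window_s=300.0):
    """Group alerts that arrive within window_s of the first alert in a group.

    Assumption: a fixed window stands in for a learned temporal pattern.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][0].timestamp <= window_s:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def ready_to_notify(group, now, holdoff_s=120.0):
    """Return True once the holdoff has elapsed since the group's first alert.

    Setting holdoff_s to 0 models switching the holdoff period off.
    """
    return now - group[0].timestamp >= holdoff_s


alerts = [Alert("fan", 0), Alert("psu", 45), Alert("server", 90), Alert("disk", 4000)]
groups = group_alerts(alerts)
print([[a.resource for a in g] for g in groups])
print(ready_to_notify(groups[0], now=60))
print(ready_to_notify(groups[0], now=60, holdoff_s=0))
```

In this sketch the fan, PSU, and server alerts land in one group, while the much later disk alert starts a new one. With the holdoff active, nobody is notified until the wait has elapsed; with it set to zero, the group is reported straight away.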

Runbooks associated with alerts are automatically executed

CloudPak for Watson AIOps looks for runbooks that it can automatically execute to try to resolve some of these alerts. A server reboot is automatically started but fails, and this is documented with the runbook.

Story is identified

At this point a story is created, containing all the relevant data, insights, and issue history. At the same time, CloudPak for Watson AIOps creates a ticket in ServiceNow and a notification in the SRE’s dedicated Slack channel[1], and synchronises information between all three tools. SREs and operators are alerted and can start working on the issue using their tool of choice: ServiceNow, the Slack or Microsoft Teams collaboration tool, or the CloudPak for Watson AIOps console. They can perform troubleshooting and resolution activities in their chosen tool, confident that all tools will synchronise and update automatically.

Story is created

The story has been assembled within minutes, and the relevant SRE or operations team has been notified. It consists of a wide range of insights, including the following:

  • Runbooks and other relevant automations associated with all alerts[2], automatically executed where possible, with execution results fully documented. Auto-ticketing might enable you to perform a single automation, either using a runbook associated with the alert on which the ticket is based, or by running automation steps within the ticket itself; the advantage of a story is that you can perform multiple automations, as each alert can have its own dedicated runbook.
  • A complete story topology displaying all resources involved in the story, from the fan that failed and generated the issue all the way through to the impacted applications. SREs can play back the topology in time, seeing where changes were made that might have generated the issue. Auto-ticketing might show an affected resource, by extracting the node value from the alert on which the ticket is based; the advantage of a story is that you can see the entire topology involved in the story, with alert and probable cause overlays, and you can play that topology back over time noting correlations between changes to resources and the onset of alerts.
  • Probable cause alerts identified based on an analysis of the underlying topology, the occurrence time of the alerts, and an event classification that enables Cloud Pak for Watson AIOps to pinpoint the probable cause alert as the initial fan failure. Auto-ticketing turns your alerts into tickets but your operators and SREs must still manually identify the root cause. Stories point them directly to the probable cause.
  • Similar incidents from the past listed with an indication of why those incidents are similar and a percentage score for level of similarity, to help SREs quickly identify the resolution action to take. Auto-ticketing might include the capability to list tickets that have occurred on the same resource in the past. Stories identify similar tickets based on a range of configurable factors; what’s more, the Similar incident AI algorithm is constantly learning from your library of resolved stories.
  • In addition to all of these advantages, stories also provide insights based on alert grouping using topology, co-occurrence, seasonality, and other factors to help you reconstruct the context in which the story occurred.
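As a rough illustration of the probable cause idea described in the list above, the following Python sketch combines two of the three factors mentioned: the topology and the occurrence time of the alerts. Event classification is left out. The function name probable_cause, the depends_on mapping, and the timestamps are hypothetical and are not taken from the product.

```python
def probable_cause(alert_times, depends_on):
    """Pick the alerting resource most likely to be the origin of the story.

    alert_times: dict mapping resource name -> time of its first alert.
    depends_on:  dict mapping resource name -> list of resources it depends on.
    Assumption: a fault propagates from a resource to whatever depends on it.
    """
    if not alert_times:
        raise ValueError("no alerts to analyse")
    alerting = set(alert_times)

    def has_alerting_upstream(resource, seen):
        # Walk the dependency chain looking for another alerting resource.
        for parent in depends_on.get(resource, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in alerting or has_alerting_upstream(parent, seen):
                return True
        return False

    # Candidates are alerting resources with no alerting resource upstream.
    roots = [r for r in alerting if not has_alerting_upstream(r, {r})]
    candidates = roots or list(alerting)
    # Among the candidates, prefer the earliest alert (name breaks ties).
    return min(candidates, key=lambda r: (alert_times[r], r))


depends_on = {"app": ["server"], "server": ["psu"], "psu": ["fan"]}
alert_times = {"fan": 0, "psu": 45, "server": 90, "app": 120}
print(probable_cause(alert_times, depends_on))
```

For the fan scenario, the app, server, and PSU alerts each have an alerting resource upstream of them, so only the fan remains as a candidate. A real analysis would also weigh event classification, which this sketch does not attempt.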

Fan is replaced and story is resolved

The SRE uses the range of insights provided to arrange a field engineer callout and have the fan replaced. The story is resolved. 

Epilogue: things to come

This short section describes key capabilities that are planned for future versions of CloudPak for Watson AIOps.

Shortly after this story was resolved, CloudPak for Watson AIOps identified a deeper underlying issue. By correlating the resolved fan story with earlier, similar stories, CloudPak for Watson AIOps used the Change risk AI algorithm to determine the following underlying issue:

95% of the time, when server fans are replaced with fan model X, within a year we encounter outages on critical business applications hosted on those servers.

Based on this insight, post-mortem teams can immediately begin exploring more reliable alternatives to this fan type. SREs and other strategic planners can now ensure that this fan model is not used again, thereby preventing this story from recurring.

[1] Dedicated Slack and Microsoft Teams channels are planned for a future release.

[2] Runbooks are planned for a future release.
