ChatOps - The Story behind a "Story"

By Nick Markey posted Mon December 06, 2021 04:09 PM

Co-Author : @Arturo Cabre

Watson AIOps version 3.2 delivered many key enhancements across the whole platform, both in the frontend and behind the scenes.  A major focus was consolidating the backend: data is now easier to process and maintain, and is synced across all user interfaces.  Another major focus, and a key driving factor for this iteration, was improving the overall user experience.  Sessions were spent identifying pain points that a user might run into and how they could be alleviated.  We'll focus on one of these key enhancements: revamping the ChatOps experience.

The Problem:
By its very nature, an error is unexpected and can result in costly downtime.  The slightest service outage can have rippling effects with a large blast radius, making it difficult to pinpoint and resolve core issues.  In addition, technological environments are becoming steadily more complex. As a result, typical SREs like John and Jane, through no fault of their own, are under-informed as they tackle an issue.

Resolving an error requires fast coordination, but in the midst of an outage, it can be difficult to track who's looking into what. When Jane is trying to deploy a fix to a resource, John could be restarting the same resource, undoing Jane's changes. Both had a sense of urgency to resolve the error, but without a coordinated approach, they exacerbated the issue.

There's also no easy way to uncover historical data on a resource and find out whether the same error has occurred before and whether it is likely to occur again. If it has occurred before, what paths were previously taken to resolve the issue?  (John and Jane had run into an issue like this in the past, but it had been months since they had resolved it. They searched through their notes to no avail and were back to square one.)

The Solution:

ChatOps in IBM Cloud Pak® for Watson AIOps helps ease the pain of all the challenges listed above. IBM Cloud Pak for Watson AIOps leverages AI to monitor the health of user environments; when an issue is detected, it provides ChatOps with the information it needs to construct a story. A story is simply an issue wrapped with metadata. ChatOps then alerts the SRE about the story through Slack or MS Teams.

With an intricate yet intuitive way of helping users handle issues, ChatOps helps identify and surface issues both reactively and proactively.  With ChatOps, users are alerted to critical errors in their environments in real time.  Within ChatOps, there is a balance between providing enough detail to understand the core issue and not bombarding users with information.  It is a hub for making progress towards resolving issues in an efficient, collaborative manner, using progressive disclosure and links out to additional resources found in IBM Cloud Pak for Watson AIOps. All the components that make up a story are there to equip an SRE with the right context to solve an issue swiftly and reliably. Derived insights surrounding the events are surfaced in these stories, along with the means to take ownership and action towards resolving the issue.

Core features in a ChatOps story
1. Metadata section
The metadata section shows a priority level, the story title, when the story was created, the impacted applications, and the status and ownership of the issue.  Users can modify these values from within ChatOps by editing the story using the buttons provided at the bottom of the message.
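To make the idea of "an issue wrapped with metadata" concrete, here is a minimal sketch of how the fields listed above could be modelled. The class name, field names, status values, and sample data are all hypothetical and chosen for illustration; this is not the schema used by Cloud Pak for Watson AIOps.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Story:
    """Hypothetical model of a story's metadata section (not the product's schema)."""
    title: str
    priority: int                       # assumption: 1 is the most urgent
    created_at: datetime
    impacted_applications: List[str] = field(default_factory=list)
    status: str = "unassigned"          # assumption: illustrative status name
    owner: Optional[str] = None

    def summary(self) -> str:
        """Render a one-line header similar in spirit to a chat message title."""
        owner = self.owner or "nobody"
        apps = ", ".join(self.impacted_applications) or "none"
        return (
            f"[P{self.priority}] {self.title} | status: {self.status} "
            f"| owner: {owner} | apps: {apps}"
        )


# Sample data is invented purely for illustration.
story = Story(
    title="Checkout latency spike",
    priority=1,
    created_at=datetime(2021, 12, 6, 16, 9, tzinfo=timezone.utc),
    impacted_applications=["checkout", "payments"],
)
print(story.summary())
```

Keeping the metadata in one small record is what makes it easy to show the same values, and edit them, from any interface.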

2. Probable cause section
The probable cause section displays the top 3 suspected alerts behind the current issue. An alert can be any suspicious activity within the user environment detected by Cloud Pak for Watson AIOps. AI analysis is then conducted on all alerts associated with the issue to determine the 3 most probable causes. In this section, the ranking and first occurrence of each alert are visible. More metadata for each alert can be seen by clicking the View alert details button.

3. Topology section
A direct link to a topological view within the Cloud Pak for Watson AIOps console is also provided from within ChatOps.  The topology displays a graphical view of a user environment along with the blast radius of the events that occurred.  

4. Alerts section
The alerts section shows all related alerts that comprise a single story.  To help home in on an issue and reduce noise, alerts are deduplicated, grouped, and assigned a severity based on machine learning.

More details for each alert can be viewed by clicking the View alerts button, which will spawn a modal.  Within the modal, the alerts have associated metadata such as frequency and last occurrence. Additional functionality is available depending on the type of alert. For example, log anomalies detected by Cloud Pak for Watson AIOps can be downloaded as well as viewed within the modal.
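As a rough illustration of how deduplication can produce the frequency and last occurrence values shown in the modal, the sketch below collapses repeated alerts that share a resource and message. This is a simple key-based approach chosen for clarity, not the machine learning that Cloud Pak for Watson AIOps applies; the class names, fields, and sample data are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class RawAlert:
    """Hypothetical incoming alert."""
    resource: str
    message: str
    timestamp: int  # seconds since epoch


@dataclass
class GroupedAlert:
    """Hypothetical deduplicated alert with the metadata shown in the modal."""
    resource: str
    message: str
    frequency: int
    last_occurrence: int


def deduplicate(alerts: Iterable[RawAlert]) -> List[GroupedAlert]:
    """Collapse alerts with the same (resource, message) into one entry."""
    grouped: Dict[Tuple[str, str], GroupedAlert] = {}
    for alert in alerts:
        key = (alert.resource, alert.message)
        if key not in grouped:
            grouped[key] = GroupedAlert(alert.resource, alert.message, 1, alert.timestamp)
        else:
            entry = grouped[key]
            entry.frequency += 1
            entry.last_occurrence = max(entry.last_occurrence, alert.timestamp)
    # dicts preserve insertion order, so groups come back in first-seen order
    return list(grouped.values())


# Sample data is invented purely for illustration.
raw = [
    RawAlert("db-1", "connection timeout", 100),
    RawAlert("web-2", "high CPU", 110),
    RawAlert("db-1", "connection timeout", 130),
]
result = deduplicate(raw)
for entry in result:
    print(entry.resource, entry.message, entry.frequency, entry.last_occurrence)
```

Three raw alerts become two entries here, which is the noise reduction the alerts section is aiming for.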

5. Recommended actions section
The recommended actions section presents users with past resolutions to the top 3 most similar issues. AI analysis is conducted on issues that have been resolved in the past to determine which course of action may be most relevant to the current issue. Additionally, ChatOps users can conduct their own analysis by clicking the search past resolution tickets button and searching across previous issues for alternative recommended actions.

6. Actions section
The last section offers a list of actions users can take to modify the story itself. 
  • Edit - This provides a way to modify the story title, description and priority.
  • Self-assign - An option to assign the user clicking the button as the owner of that story.
  • Mark as resolved - A way to move the story into a resolved state which will close the associated alerts and then the story.
  • Set to in-progress - Typically the first action taken on a story, this moves the story into an in-progress state while also claiming ownership.  When taking this action, there is also an option to move the story to a separate channel to avoid crowding the main triaging channel.  The user can also tag others to pull them into the other channel and tag them on the story thread.
When taking any of these actions, notes can be added as to why the specific action was taken.  These notes, along with what actions were taken on the story, will be posted in a reply thread, providing a paper trail and a way to track the lifecycle of a story.
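The way these actions change a story and leave a trail behind them can be sketched as a small state holder. The names, status values, and log format below are hypothetical and are not the product's implementation; closing the associated alerts on resolution is also left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StoryState:
    """Hypothetical story lifecycle with a reply-thread style audit trail."""
    status: str = "unassigned"          # assumption: illustrative status names
    owner: Optional[str] = None
    thread: List[str] = field(default_factory=list)

    def _log(self, user: str, action: str, note: str) -> None:
        # Every action is recorded, with the optional note explaining why.
        entry = f"{user}: {action}"
        if note:
            entry += f" ({note})"
        self.thread.append(entry)

    def self_assign(self, user: str, note: str = "") -> None:
        self.owner = user
        self._log(user, "self-assign", note)

    def set_in_progress(self, user: str, note: str = "") -> None:
        # Moving to in-progress also claims ownership, as described above.
        self.owner = user
        self.status = "in-progress"
        self._log(user, "set to in-progress", note)

    def mark_resolved(self, user: str, note: str = "") -> None:
        self.status = "resolved"
        self._log(user, "mark as resolved", note)


state = StoryState()
state.set_in_progress("jane", "deploying fix")
state.mark_resolved("jane")
print(state.status, state.owner)
for line in state.thread:
    print(line)
```

Because ownership and every action are visible in one place, John can see that Jane is already deploying a fix before he restarts the same resource.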

ChatOps provides a rich user experience with a user-centric approach to owning and resolving critical issues in a timely manner.  By using ChatOps, users can streamline their workflow and gain a common place to collaborate.  After issues are resolved, users can go back to them for future reference and see the story, as told through the lens of ChatOps.