To boldly go where no data insights have gone before…3.2

By Neil Boyette posted Tue November 23, 2021 10:41 AM

  
Authors: @Neil Boyette and @Isabell Sippli

The quest for operational insights – such as finding the root cause of an issue – is to some extent comparable to undertaking a space mission to explore the universe. Operational data can become very large, very quickly (not infinite like the Universe, though ;)), and just as if you were on an intergalactic mission, you need to have a clear goal and collect and correlate information during your journey.
 
Let’s explore the key phases (not to be confused with phasers, my fellow trekkers) in this data journey re-imagined as a space mission.

Phase 1: Fire up the Engines

AIOps solutions like Cloud Pak® for Watson AIOps provide the guidance system for your data spaceship – essential to help you navigate the vast amounts of unknown, alien information you will encounter. Without such a guidance system, reaching your final destination becomes harder and takes longer.
 
So please fasten your seatbelts while we commence our mission and walk you through what is needed to successfully navigate and understand the ever-expanding universe of operational data.

Phase 2: Prepare to engage

Before gathering any insights, you have to be aware of what is fueling your algorithms. You gain such awareness by collecting heterogeneous operational data. These should be your primary areas of interest:
  • Events: Records containing structured data that summarize key attributes of an occurrence on a managed entity. This entity might be an application resource, some part of that resource, or another key element associated with your network, services, or applications. An event may or may not indicate something anomalous; it is a point-in-time, immutable statement about the entity in question.
  • Logs: Primarily unstructured data when looking at the log message itself, but some semi-structured metadata is typically also available (for example, the time stamp or the resource that produced the log)
  • Metrics: Time series data that capture a data point about your monitored services at a defined time, like the CPU utilization of a Kubernetes pod
  • Topology: Spatial information about managed services and their relationships
  • Tickets: Semi-structured data about previous incidents, sometimes including root cause identification and resolution actions
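
To make the "point-in-time, immutable statement" idea a little more tangible, here is a minimal Python sketch of how an event and a metric sample could be modelled. This is only an illustration: the class names, field names, and sample values are our own assumptions and not the schema that Cloud Pak for Watson AIOps uses internally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


# Hypothetical record types for illustration only; not the product's schema.
@dataclass(frozen=True)  # frozen=True makes instances immutable after creation
class Event:
    """A point-in-time statement about a managed entity."""
    resource: str        # the managed entity the event refers to
    summary: str         # short description of what occurred
    severity: int        # assumed scale: higher means more severe
    timestamp: datetime  # when the occurrence was observed


@dataclass(frozen=True)
class MetricSample:
    """One data point of a time series, e.g. CPU utilization of a pod."""
    resource: str
    name: str
    value: float
    timestamp: datetime


observed_at = datetime(2021, 11, 23, 10, 41, tzinfo=timezone.utc)

event = Event(
    resource="pod/checkout",
    summary="Readiness probe failed",
    severity=4,
    timestamp=observed_at,
)

sample = MetricSample(
    resource="pod/checkout",
    name="cpu_utilization",
    value=0.93,
    timestamp=observed_at,
)
```

Because the dataclasses are frozen, any attempt to change `event.severity` after creation raises an error, which mirrors the idea that an event is a statement about one moment in time rather than a mutable status.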
 
Ideally, your AIOps solution has a means to collect this data automatically - like your spaceship has sensors to automatically gather information about the universe surrounding it.
 
With the release of Cloud Pak for Watson AIOps 3.2, a vast range of data sources is now supported straight out of the box. Events can be read from over 100 predefined sources, topology information is pulled together from various systems, like Kubernetes, and logs are ingested from popular log management solutions like Humio and Splunk. All you need to do is connect your sources to Cloud Pak for Watson AIOps – and it will set off happily collecting operational data for you. You can find the latest information about supported data sources here.

Phase 3: Jump to Warp Speed

All space missions (fictional and real) collect data and use it to make decisions - and that is exactly what Cloud Pak for Watson AIOps 3.2 does. Your data is pushed through a series of pipelines, such as:
  • Data Normalization - standardize the hundreds of different data schemas into a set of formats that each of the pipelines understands
  • Log Anomaly Detection - continuously look at all the different logs and spot anything out of the ordinary
  • Metric Anomaly Detection - continuously look at all the different KPIs being produced (not just CPU and memory but additional ones that you don’t always have time to look at manually, such as number of threads, heap size, etc.) and spot anything out of the ordinary
  • Data De-Duplication - if a sensor is repeatedly alerting on the same symptom, then combine them all into a single alert
  • Correlation - group related alerts together so that they can be viewed holistically rather than just individually. This is done through multiple approaches:
    • Temporal - group based on historic co-occurrence of alerts
    • Scope - group alerts that occur on the same resource
    • Topological - group based on how close the alerts' resources are to each other in the topology
  • Alert to Story processing - determine which of the alerts should be elevated to a story. This ensures that you, the captain, are only notified of the important issues and not any of the noise.
All of this cuts out the daily distraction and lets us focus on the mission at hand, which is to keep all the applications up and running.
You can find out more information on the available AI algorithms here.
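
For the curious crew members, two of the pipeline stages above (de-duplication and correlation) can be sketched in a few lines of Python. Treat this as a toy model under our own assumptions: the `Alert` class, the function names, the sample events, and the rule that "close" means "directly connected in the topology" are all hypothetical and are not the algorithms shipped in the product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert: one entry per (resource, symptom) pair."""
    resource: str
    symptom: str
    first_seen: float
    last_seen: float
    count: int = 1


def deduplicate(events):
    """Collapse repeated (resource, symptom) events into a single alert."""
    alerts = {}
    for resource, symptom, ts in events:
        key = (resource, symptom)
        if key in alerts:
            alert = alerts[key]
            alert.count += 1
            alert.first_seen = min(alert.first_seen, ts)
            alert.last_seen = max(alert.last_seen, ts)
        else:
            alerts[key] = Alert(resource, symptom, ts, ts)
    return list(alerts.values())


def correlate_by_scope(alerts):
    """Scope correlation: group alerts raised on the same resource."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.resource].append(alert)
    return dict(groups)


def correlate_by_topology(alerts, edges):
    """Topological correlation (simplified): group alerts whose resources
    are linked, directly or through other alerting resources."""
    alerting = {alert.resource for alert in alerts}
    adjacency = defaultdict(set)
    for left, right in edges:
        if left in alerting and right in alerting:
            adjacency[left].add(right)
            adjacency[right].add(left)
    groups, seen = [], set()
    for start in sorted(alerting):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over connected alerting resources
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append([a for a in alerts if a.resource in component])
    return groups


# Made-up sample data: (resource, symptom, timestamp in seconds)
events = [
    ("pod/cart", "high latency", 100.0),
    ("pod/cart", "high latency", 130.0),
    ("pod/cart", "restart", 140.0),
    ("db/orders", "connection refused", 135.0),
    ("node/worker-3", "disk full", 500.0),
]
edges = [("pod/cart", "db/orders")]

alerts = deduplicate(events)
by_scope = correlate_by_scope(alerts)
by_topology = correlate_by_topology(alerts, edges)
```

In this sample, the two "high latency" events on `pod/cart` collapse into one alert with a count of 2, and the topology grouping puts the `pod/cart` and `db/orders` alerts together while leaving `node/worker-3` on its own. Temporal correlation is left out because it needs historical co-occurrence data to learn from.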

Phase 4: Data, but not as you know it

Cloud Pak for Watson AIOps 3.2 has now given you the whole story, with all the context you need to understand the situation at hand. The last step, though, is often the hardest: now you have to solve the problem...
 
Where to start? Watson AIOps also assists you with this challenge. Using an alert classifier and an algorithm that determines the location and significance of the related resource in your tree of services, it recommends the probable cause within a set of previously correlated alerts.
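
As a rough illustration of how an alert type and a resource's position in the service tree could be combined into a ranking, here is a toy Python heuristic. It is our own simplification, not the product's probable cause algorithm: the weights, the `depends_on` map, and the scoring formula are all assumptions made for this example.

```python
def reaches(start, target, depends_on):
    """Return True if `start` depends on `target`, directly or transitively."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, []))
    return False


def rank_probable_causes(alerts, depends_on, type_weight):
    """Score each alert by (assumed) type weight times the number of other
    alerting resources that depend on its resource, plus one."""
    alerting = {resource for resource, _ in alerts}
    scored = []
    for resource, alert_type in alerts:
        impacted = sum(
            1
            for other in alerting
            if other != resource and reaches(other, resource, depends_on)
        )
        score = type_weight.get(alert_type, 1.0) * (1 + impacted)
        scored.append((score, resource, alert_type))
    return sorted(scored, reverse=True)


# Made-up service tree: web depends on api, api depends on db
depends_on = {"web": ["api"], "api": ["db"]}
# Hypothetical stand-in for an alert classifier: a weight per alert type
type_weight = {"disk_full": 2.0, "latency": 1.0}
alerts = [("web", "latency"), ("api", "latency"), ("db", "disk_full")]

ranked = rank_probable_causes(alerts, depends_on, type_weight)
```

With this sample data, the `db` alert comes out on top: it has the heaviest alert type and both other alerting services depend on it, which matches the intuition that a fault deep in the tree explains the symptoms above it.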
Once you know which problem to solve, Cloud Pak for Watson AIOps 3.2 will even suggest possible solutions. It mines your historical ticket data and not only proactively shows incidents similar to the current situation, but also extracts and summarizes the steps that were taken when the incident previously occurred. This saves precious time.
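
To give a feel for what "finding similar incidents" can mean, here is a deliberately simple Python sketch that compares a new problem description with past ticket summaries using word overlap (Jaccard similarity). This swaps in a basic technique for illustration; it is not how Cloud Pak for Watson AIOps mines tickets, and the ticket IDs, fields, and threshold are hypothetical.

```python
import re


def tokens(text):
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_tickets(description, tickets, top_n=3, threshold=0.2):
    """Return up to `top_n` (score, ticket) pairs at or above `threshold`."""
    query = tokens(description)
    scored = [(jaccard(query, tokens(t["summary"])), t) for t in tickets]
    scored = [(score, t) for score, t in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Made-up historical tickets
tickets = [
    {
        "id": "INC-1",
        "summary": "Checkout service high latency after database failover",
        "resolution": "Restarted connection pool",
    },
    {
        "id": "INC-2",
        "summary": "Disk full on logging node",
        "resolution": "Rotated logs",
    },
]

matches = similar_tickets("High latency on checkout service", tickets)
for score, ticket in matches:
    print(ticket["id"], round(score, 2), "->", ticket["resolution"])
```

Only the first ticket clears the threshold here, so its recorded resolution is the one surfaced to the operator. A real system would need something more robust than word overlap, but the flow (match the current situation, then show what fixed it last time) is the same.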

The Final Frontier

So now that we have explored how a typical mission through operational data and insights evolves, you will realize that this is only the start of the adventure. As data is ever-expanding, so are the potential problems that can emerge and that require bold, timely action. So ask yourself: Are you and your crew fully equipped for this undertaking?
Find out more about Cloud Pak for Watson AIOps 3.2 here.






Comments

Thu November 25, 2021 12:35 AM

Loved the analogy, a good summary of the technology that can help SREs sleep more!

Tue November 23, 2021 10:48 AM

@Neil Boyette and I had fun writing this - we hope you'll enjoy our little space mission, too!