AIOps

Why AIOps is better than hiring a data scientist

By Isabell Sippli posted Tue June 22, 2021 08:39 AM

  
 
The first thing many people do when approaching the AIOps trend is to look for a definition of AIOps.
Gartner has published one of the most popular definitions:
AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.
On seeing this definition, one might think: big data and machine learning … ok, I shall hire a couple of data scientists and let them work with my operations data. Maybe they can help me reduce service tickets, get better insights into my incidents, and maybe even apply some predictions to avoid outages?

Sounds simple?
Actually, it is not.
Let's explore a few reasons why doing AIOps correctly is not so simple when you need to start from scratch:

1. Applying ML to IT data - but what data? 
“Volume, variety and velocity” are the defining dimensions of data. While you might be able to deal with volume, handling variety and velocity is trickier. IT operations data is highly diverse and almost always includes logs, metrics, events, and tickets. If you’re very lucky, each data type comes from a single source, e.g. all log data is collected in a single place, or all event data is normalized in a system like Netcool OMNIbus. However, we rarely see this when working with large clients. Most often, data is spread over several silos and includes both structured and unstructured data. On top of this heterogeneity, IT data often arrives in large volumes. We frequently work with clients dealing with terabytes of logs each day, combined with tens of millions of events.
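To make the variety problem concrete, here is a minimal sketch of what normalizing two silos into one record type can involve. Everything in it is an assumption made for illustration: the `NormalizedEvent` fields, the 1 to 5 severity scale, and the field names of the two hypothetical sources (`ts`, `host`, `level`, `msg`, `opened_at`, `ci`, `priority`, `short_description`) do not come from any specific product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NormalizedEvent:
    """Hypothetical common format shared by all sources."""
    source: str
    timestamp: datetime  # always timezone-aware UTC
    resource: str        # lower-cased host or configuration item name
    severity: int        # assumed scale: 1 (info) to 5 (critical)
    summary: str


def from_monitoring_alert(raw: dict) -> NormalizedEvent:
    """Hypothetical source A: epoch seconds and a textual severity level."""
    levels = {"info": 1, "warning": 3, "critical": 5}
    return NormalizedEvent(
        source="monitoring",
        timestamp=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        resource=raw["host"].lower(),
        severity=levels.get(raw["level"].lower(), 2),  # unknown levels map to 2
        summary=raw["msg"].strip(),
    )


def from_ticket(raw: dict) -> NormalizedEvent:
    """Hypothetical source B: ISO 8601 timestamps and priorities P1 to P4."""
    priorities = {"P1": 5, "P2": 4, "P3": 3, "P4": 2}
    opened = datetime.fromisoformat(raw["opened_at"])
    return NormalizedEvent(
        source="ticketing",
        timestamp=opened.astimezone(timezone.utc),
        resource=raw["ci"].lower(),
        severity=priorities.get(raw["priority"], 2),  # unknown priorities map to 2
        summary=raw["short_description"].strip(),
    )
```

Even in this toy version, each source needs its own decisions about timestamps, naming, and severity mapping, and that effort grows with every additional silo.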

2. Applying ML to IT data - but what ML? 
Finding the right algorithms that perform at scale is hard. Before you can choose an algorithm, you have to identify what you are actually looking for. Are you trying to find anomalies in your logs? If so, what anomalies? Are you trying to find correlations for your events? If so, what correlations? What makes your system believe a correlation is useful and meaningful (a true positive)? Once you have found the right algorithm, you have to code the model, customize it to your data, and train and tune it.

3. Applying ML to IT data - but what environment? 
The majority of enterprises operate across a mixed set of environments: cloud native (on Kubernetes or OpenShift), traditional on VMs or bare metal, and mission-critical applications on Z. An added challenge is that these are spread across multiple data centers and clouds. Bringing all this information together is complicated, and validating that a given algorithm applies to specific data and runtimes is time-consuming.

4. Even if you overcome the above challenges - how do you present your findings?
IT Ops is all about speed, as every second counts in case of an outage. Regular reports might help you analyze the past, but you also want to get your insights to the team in charge while an incident is happening - be it a traditional IT Operations team, or your SREs.

Luckily, IBM is helping to solve these problems. With Cloud Pak for Watson AIOps, we’re addressing those challenges, and more. Here's how:

  1. Cloud Pak for Watson AIOps comes with a large number of out-of-the-box connectors, which normalize your heterogeneous data into a single store. It provides over 170 adapters & connectors to the most common IT systems worldwide. For those connectors, we have analyzed the data source, identified the most useful information to pull, and mapped it to our common format.
  2. Cloud Pak for Watson AIOps builds on more than 100 patents. We are a team of world-class data scientists and ML engineers who have identified the right algorithms and approaches per data type. This team is backed by groups of researchers in IBM Research who are always looking to the future, to see what innovations can be applied to AIOps.
 
  • Event correlation
    We are combining three approaches for you:
    (i) Using a modified version of FP-Growth (frequent pattern mining) that we apply to historic event data, we find sets of events that tend to co-occur. This algorithm trains automatically on your historic data.
    (ii) Through our application management component, we can find events that occur on nodes close to each other in the topology. If you want us to, we can use that proximity as a correlation approach.
    (iii) We search for similar values in event fields. You can configure which values to look for, and where, and we add that to our correlation.
    Finally, we can show how all three approaches interrelate, as you can see here:
  • Log anomaly detection
    We apply various types of NLP to create groups of templates. The templates in turn are used to extract features from the log message's text. These features are augmented by specific entities extracted from the log message's text, such as error codes, and by features extracted using language models. This combined set of features is then used to train a set of models for each of the components being monitored.
  • Metric anomaly detection
    We receive and analyze metric data from a range of sources, and then apply a set of time-series algorithms like Robust Bounds, Variant/Invariant, Finite Domain and Predominant Range to capture seasonality and significant trends, and to perform forecasting (see article here).
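To give a feel for what a bounds-based metric check involves, here is a minimal sketch that flags points outside a band built from the median and the median absolute deviation (MAD). This is a generic textbook technique chosen for illustration; it is not the product's Robust Bounds algorithm, and it handles neither seasonality nor trends. The function names and the threshold `k` are assumptions.

```python
from statistics import median


def robust_bounds(history, k=3.0):
    """Return (lower, upper) as median +/- k * scaled MAD of the history."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    spread = 1.4826 * mad
    return center - k * spread, center + k * spread


def find_anomalies(history, new_points, k=3.0):
    """Return (index, value) for each new point outside the robust band."""
    lower, upper = robust_bounds(history, k)
    return [(i, x) for i, x in enumerate(new_points) if x < lower or x > upper]


if __name__ == "__main__":
    cpu_history = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
    print(find_anomalies(cpu_history, [10, 30, 11, 2]))
```

Median and MAD are used instead of mean and standard deviation so that a few past outliers in the history do not widen the band. A production approach would also need to deal with daily or weekly patterns, which is where the real difficulty lies.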

    For all of the above, we have tested our algorithms at scale, and on typical, industry-specific data sets.
    Additionally, explainability is at the forefront of our work. This is critical, as without good explainability there isn't a clear way to establish trust with your IT Operations team, or your SREs.

    In the following image, you can see how we present historic evidence to explain how we have identified that the two depicted events co-occur. Each purple bar in the timeline marks one occasion on which the pair of events occurred.
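The idea of co-occurring events backed by historic evidence can be sketched in a few lines. The example below is a deliberately simplified stand-in: it counts pairs of event types that fall into the same fixed time window, whereas FP-Growth mines frequent sets of any size far more efficiently, and the modified version used in the product is not described here. The window size, the support threshold, and the event names are assumptions.

```python
from collections import Counter
from itertools import combinations


def window_transactions(events, window_seconds=300):
    """Group (timestamp_seconds, event_type) tuples into fixed-width windows."""
    buckets = {}
    for ts, event_type in events:
        buckets.setdefault(int(ts // window_seconds), set()).add(event_type)
    return list(buckets.values())


def frequent_pairs(transactions, min_support=0.5):
    """Return {pair: support} for pairs seen in at least min_support of windows."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


if __name__ == "__main__":
    history = [
        (0, "link_down"), (10, "bgp_flap"),
        (310, "link_down"), (320, "bgp_flap"), (330, "disk_full"),
        (620, "link_down"), (640, "bgp_flap"),
        (900, "cpu_high"),
    ]
    print(frequent_pairs(window_transactions(history)))
```

The windows in which both events of a pair appear are exactly the kind of historic evidence that can be shown to an operator to justify a correlation.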


  3. We put the insights at your teams' fingertips. We can feed insights into a chat tool of your choice, allow for forensic analysis through a web-based console, feed them to an ITSM system such as ServiceNow, or allow you to integrate them into any custom application using our APIs.

To summarize
We encourage you to build on all the hard work we have done to make Cloud Pak for Watson AIOps an indispensable tool in the operations space, and thereby save your data scientists' precious time, allowing them to focus on your actual business problems.