
The AI in IBM AIOps – AI Algorithms in Focus

By Georg Ember posted Mon February 26, 2024 11:51 AM



Putting AI into the hands of every CIO

Many of our clients and business partners ask what the AI capabilities of IBM’s AIOps solution actually are. As the name “AIOps” implies, it must include AI capabilities for IT operations. In this blog post we cover the key AI capabilities of IBM’s premier AIOps solution, IBM Cloud Pak® for AIOps.

Almost every vendor claims to have integrated AI capabilities into their products, but how much AI is really there, and how powerful are these capabilities in real IT life? More importantly, how much additional benefit do you gain for IT operations, compared with the potentially large system resources (CPU, RAM, and disk) needed to implement these AI capabilities?

First of all, why is there a real need for AI tools to help system administrators and Site Reliability Engineers (SREs) in their daily business? Because, when managing huge on-prem IT and cloud farms, many SREs and system admins are “flooded” by dozens of important IT events per second, and they need to quickly determine what is really business critical and what is not. If an outage of a component occurs and an event is generated, the so-called “blast radius” of that single IT event should not be underestimated: a single component failure can have a huge impact (blast radius) on an IT system’s behavior.

I always tell my clients: in IT event management, a flood of data is better than too little data. The more data you get, the more you know, the more you can analyze, and the more you can predictively avoid in the future. Input data to the Event Manager, a key component of a modern AIOps solution, is therefore an important resource. Curated data from many sources enables analytics algorithms to find correlations that are too difficult for humans to isolate.

IBM AIOps leverages your data to build models that IBM Cloud Pak® for AIOps can use to learn about patterns in your data. IBM Cloud Pak for AIOps can then apply these patterns to provide insights to your Operations and SRE teams.

AI algorithms analyze different kinds of data and produce AI models, which automate research, reducing the need for manual work in solving problems. These unsupervised algorithms enable the use of unlabeled data for training AI models without any human intervention. IBM Cloud Pak® for AIOps training generates models for a number of different AI algorithms. You can train the models for each of these AI algorithms independently of each other.

But which AI capabilities really matter?  IBM AIOps has a few AI capabilities that are unmatched in the IT industry:

Noise reduction:  

Noise reduction for IT events is the task of significantly reducing the mass of redundant events and summarizing them into a subset of the really important events. This happens through event grouping and entity linking: the AI Manager, the key component for applying AI capabilities, groups incoming events by criteria such as topology, time, seasonality, and scope, collecting events that are related or connected, occur in a specific short time window, or share the same scope or location.
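The temporal part of this grouping can be sketched in a few lines. The event fields and the time window below are illustrative assumptions, not the product’s schema or defaults:

```python
from dataclasses import dataclass

# Hypothetical event record; field names are illustrative only.
@dataclass
class Event:
    ts: float        # epoch seconds
    resource: str
    message: str

def group_by_time_window(events, window_s=120):
    """Group events whose timestamps fall within `window_s` seconds
    of the previous event in the same group (temporal correlation)."""
    groups = []
    for ev in sorted(events, key=lambda e: e.ts):
        if groups and ev.ts - groups[-1][-1].ts <= window_s:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups

events = [
    Event(0, "db01", "disk latency high"),
    Event(30, "app01", "response time degraded"),
    Event(600, "net01", "link flap"),
]
groups = group_by_time_window(events)
# The first two events fall into one group; the third starts a new one.
```

A real AIOps engine combines several such criteria (topology, scope, seasonality) rather than time alone, but the principle of collapsing many raw events into a few groups is the same.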

Event Seasonality:

A special AI capability is the recognition of event seasonality, where events tend to occur in a regular pattern, at a regular time, or as chronically repeating issues. This analysis usually spans weeks to months and helps detect repeating issues and their seasonality. Once the AI Manager has filtered, grouped, and reduced the event data to a subset of events, another analysis task well known in AIOps follows: anomaly detection.
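A minimal sketch of the seasonality idea: if most occurrences of an event type cluster in the same hour of day, the event is a seasonality candidate. The thresholds below are illustrative assumptions, not the product’s defaults:

```python
from collections import Counter

def is_seasonal(hours, min_occurrences=10, concentration=0.8):
    """Flag an event type as seasonal if most of its occurrences
    (hour-of-day values, 0-23) cluster in a single hour.
    Thresholds are illustrative, not the product's defaults."""
    if len(hours) < min_occurrences:
        return False
    _, top_count = Counter(hours).most_common(1)[0]
    return top_count / len(hours) >= concentration

# A backup job that fails almost every night at 02:00:
nightly = [2] * 18 + [3, 14]          # 18 of 20 occurrences at hour 2
random_noise = [1, 5, 9, 13, 17, 21, 2, 6, 10, 14]

print(is_seasonal(nightly))       # highly concentrated -> seasonal
print(is_seasonal(random_noise))  # spread out -> not seasonal
```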

Anomaly Detection:

Anomaly detection is the task of analyzing mass data and finding data outliers. Event data is the foundation for any successful automated solution. An event turns into an alert when an issue is detected from the event. You need both historical and real-time data to understand the past and predict what is most likely to happen in the future.

AI and machine learning (ML) are especially efficient at identifying data anomalies, that is, events and activities in a data set that stand out enough from historical data to suggest a potential problem. These outliers are called anomalous events. Anomaly detection can identify problems even when they have never been seen before, and without explicit alert configuration for every condition.

Anomaly detection relies on AI algorithms. A trending algorithm monitors a single key performance indicator (KPI) by comparing its current behavior to its past. If the score grows anomalously large, the algorithm raises an issue. A cohesive algorithm looks at a group of KPIs expected to behave similarly and raises issues if the behavior of one or more changes. This approach provides more insight than simply monitoring raw metrics and can act as an early warning signal for the health of components and services.
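The trending idea can be illustrated with a simple z-score check: score how far the current KPI value is from its own history. The window and threshold are illustrative assumptions, not the product’s algorithm:

```python
import statistics

def trending_anomaly(history, current, threshold=3.0):
    """Compare the current KPI value to its own past using a z-score.
    A score above `threshold` standard deviations flags an anomaly.
    Window and threshold are illustrative choices."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    z = abs(current - mean) / stdev
    return z > threshold

latencies_ms = [20, 22, 19, 21, 20, 23, 18, 22, 21, 20]
print(trending_anomaly(latencies_ms, 21))   # within normal range
print(trending_anomaly(latencies_ms, 95))   # far outside -> anomaly
```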

There are two types of AI algorithms in IBM AIOps:

Trainable AI algorithms: These algorithms must be trained on certain data types before they can create deployable models. They include change risk, log anomaly detection-natural language, metric anomaly detection, similar tickets, and temporal grouping. These algorithms are listed as tiles. You can set up the training for each algorithm separately.

The following picture is a screenshot of the AI Model Management graphical user interface in the current Cloud Pak for AIOps Version 4.4 and shows the trainable AI algorithms:

Figure 1: Trainable AI Algorithms in the AI model management UI


Pre-trained AI algorithms: These algorithms don’t require trained models to derive insights and are enabled by default. They include probable cause grouping, scope-based grouping, topological grouping, and log anomaly detection-statistical baseline. Again, each of these algorithms is represented by a tile.

The following picture is a screenshot of the AI Model Management graphical user interface in the current Cloud Pak for AIOps Version 4.4 and shows the pre-trained AI algorithms:

Figure 2: Pre-trained AI Algorithms in the AI model management UI


Why and when to use AI algorithms?

IBM AIOps can monitor and detect significant deviations between the actual value of the KPI of interest versus what the machine learning model predicts.


So where does it make sense to look for deviations and data anomalies?

In IBM AIOps, three major areas are worth looking at:

- Metric Anomalies

- Log Anomalies

- Topology Anomalies (Blast Radius and Fault Localization)

Metric Anomalies:

Metric anomalies can be detected from performance KPIs. A monitoring or observability tool such as IBM Instana, Dynatrace, or Prometheus sends performance metrics to the IBM AI Manager, and these metrics are stored in a local Cassandra database. AIOps ingests the time-series performance metric data and applies several different ML algorithms to learn what “normal” looks like for each metric.

Metric anomaly detection first learns normal patterns of metric behavior by analyzing metric values at regular intervals. If that behavior significantly changes, it raises anomalies in the form of alerts.

After a training period of 7 days to 14 days, it will perform automatic baselining for each metric stream and then continue to track each one.  If Metric Anomaly Detection sees any metric stream stray outside of its normal range, it will generate an alert to notify operations that there may be a problem. These anomaly events, as they’re known, are then correlated with other events, and can add additional insights into any developing problem.
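Automatic baselining can be sketched as learning a normal range per metric stream, here split by hour of day so that regular daily cycles do not trigger false alerts. The data model and the width factor `k` are illustrative assumptions:

```python
from collections import defaultdict
import statistics

def build_baseline(samples, k=3.0):
    """Learn a normal range per hour-of-day from (hour, value) samples,
    so daily cycles don't trigger false alerts. `k` is illustrative."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    baseline = {}
    for hour, values in by_hour.items():
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values) or 1.0   # avoid zero width
        baseline[hour] = (mean - k * spread, mean + k * spread)
    return baseline

def out_of_range(baseline, hour, value):
    low, high = baseline.get(hour, (float("-inf"), float("inf")))
    return not (low <= value <= high)

# CPU % is normally ~30 during the day and ~80 during nightly batch:
samples = [(10, v) for v in (29, 31, 30, 28, 32)] + \
          [(2, v) for v in (79, 81, 80, 78, 82)]
baseline = build_baseline(samples)
print(out_of_range(baseline, 2, 80))   # 80% at 02:00 is normal
print(out_of_range(baseline, 10, 80))  # 80% at 10:00 is anomalous
```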

The AI algorithm generates alerts when it detects anomalous behavior in your performance metrics. Typical metric anomalies are latencies in disk I/O, network throughput, or application response time, and are usually caused by congestion of CPU, memory, or I/O. Metric anomaly detection reveals bottlenecks in current systems, in typical application tasks such as garbage collection or database access times, and also predicts trends towards shortages of system resources or upcoming performance issues.

The following screenshot of the Cloud Pak for AIOps Alert Console GUI shows a few anomalies:

Figure 3: Metric anomalies in the Alert console UI


Metric anomalies are important for predictively detecting trends in system usage and deviations from “typical” system and application performance. The goal is to detect a shortage of system and application resources BEFORE it occurs and a performance issue happens.

Metric anomaly models can be retrained periodically. Metric anomaly detection is composed of a set of unsupervised learning algorithms. These algorithms learn normal patterns of metric behavior by analyzing metric values at regular intervals. Then, metric anomaly detection raises anomalies in the form of alerts when that behavior significantly changes.

Models are produced when at least 7 days of data are present in the system, and training is completed. The algorithm can use up to 14 days of data to learn. After models are trained, you can be alerted to problems before services or applications are impacted.

For details about the AI algorithms used in metric anomaly detection, refer to the IBM Cloud Pak for AIOps documentation.

Log Anomalies:

Log anomaly detection is a set of learning AI algorithms that takes large amounts of log data and trains on it to learn what is normal behaviour for a given component. IBM Cloud Pak® for AIOps processes incoming logs from log management systems like ELK, Humio or Splunk, as part of the log anomaly detection process. Before training can occur, logs are pulled from the log management system, and are stored in Elasticsearch (part of IBM AIOps) for deep analysis by the AI algorithms.

There are two log anomaly detection AI algorithms, one based on golden signals and a statistical baseline and one based on natural language processing, each of which can run independently of the other. Golden signals are metrics that, according to the SRE definitions, are key to identifying a bottleneck: latency, traffic, errors, and saturation.

The AI algorithms use natural language processing (NLP) and a statistical baseline to find patterns and analyse their frequency. After sufficient data has been acquired and training has completed, Cloud Pak for AIOps continues to monitor the incoming log data for anything anomalous.

If both algorithms are enabled, then any log anomalies discovered by both will be reconciled, so that only one alert is generated. In this case, the severity of the combined alert will be equal to the highest severity of the two alerts.
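The reconciliation step can be sketched as a merge keyed on the anomaly’s identity, keeping the highest severity. The dictionary fields below are illustrative assumptions, not the product’s alert schema:

```python
def reconcile(alerts):
    """Merge alerts that describe the same anomaly (same resource and
    signature) into one, keeping the highest severity. Field names
    are illustrative, not the product's schema."""
    merged = {}
    for alert in alerts:
        key = (alert["resource"], alert["signature"])
        if key not in merged or alert["severity"] > merged[key]["severity"]:
            merged[key] = alert
    return list(merged.values())

alerts = [
    {"resource": "app01", "signature": "log-anomaly", "severity": 3,
     "source": "statistical-baseline"},
    {"resource": "app01", "signature": "log-anomaly", "severity": 5,
     "source": "natural-language"},
]
result = reconcile(alerts)
# One alert survives, carrying the higher severity (5).
```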

Log anomaly detection based on golden signals and metric anomaly detection work together to continually analyze incoming data and generate alerts when current activity differs from normal activity. Here, the ideal training window for the models is about 7 to 14 days.

Log anomaly detection with natural language processing is an unsupervised learning algorithm that takes large amounts of log data and trains on it to learn what is normal behaviour for a given resource. It uses natural language processing of log messages to find patterns and analyses their frequency. The data volume for training starts at about 10,000 lines of messages, ideally up to 100,000 lines. When selecting longer date ranges, consider that processing may take more time and need more system resources in an AIOps system.
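One common building block of log anomaly detection is template extraction: strip the variable parts of each message so that lines differing only in parameters collapse into one pattern, then flag patterns that occur rarely. The normalization rules and threshold below are illustrative assumptions, not the product’s algorithm:

```python
import re
from collections import Counter

def to_template(line):
    """Normalize variable parts (hex IDs, numbers) so that messages
    differing only in parameters map to the same template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<NUM>", line)

def rare_templates(lines, max_count=1):
    """Flag templates that occur at most `max_count` times: rare
    messages are candidate anomalies. Threshold is illustrative."""
    counts = Counter(to_template(l) for l in lines)
    return {t for t, c in counts.items() if c <= max_count}

log = [
    "GET /api/v1/users 200 12ms",
    "GET /api/v1/users 200 15ms",
    "GET /api/v1/users 200 11ms",
    "OutOfMemoryError in worker 7",
]
print(rare_templates(log))
# Only the OutOfMemoryError template is rare.
```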

Anomalies found in system or application logs generate events and can be displayed as alerts, within the context of an incident, in the AIOps Alert Viewer. The goal of detecting an anomaly in a log is to create an event and send it for further processing to the AI Manager, which applies policies to turn an alert into an incident, based on conditions defined in the policy. Analyzing logs with AI algorithms is a supportive task that identifies and generates events from the log data whenever an anomaly is detected in the logs.

Topology Anomalies (Blast Radius and Fault Localization):

A topology is a graphical representation of an application (service), a resource group, or an individual resource. When you render a topology, it displays all of the constituent elements and relationships that make up your topology, and then lets you refine and manipulate the view. You can use the interactive topology tools to drill into individual resource details and status, enable a timeline to view changes over time, and more. You can also use this view to triage issues when an incident occurs.

Anomalies in a given system topology become visible when a component fails: the failing component, the one for which an alert has been logged, is marked in red on the topology map. The topology map also shows the potential blast radius, that is, which other components might be affected by the failing part. The Topology Manager of IBM AIOps determines the full scope of affected components using a vertex-weighted topology graph traversal and a reasoning engine that understands the meaning of the topology relationships. The blast radius is determined via directional dependency analysis of the related components that interact with the localized source of the issue. For a given time in the past (normally up to 7 days), you can identify changes and anomalies in a topology by examining the time window and checking the delta view; the Topology Manager shows the topology changes for the time frame you selected.
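A toy model of directional dependency analysis: walk the dependency graph outward from the failing component to collect everything that transitively depends on it. This sketch illustrates the blast-radius idea only; the product’s traversal is vertex-weighted and driven by a reasoning engine:

```python
from collections import deque

def blast_radius(deps, failed):
    """Walk the dependency graph from the failing component to find
    everything that (transitively) depends on it. `deps` maps each
    component to the components that depend on it."""
    seen, queue = {failed}, deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in deps.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}

# db01 fails; app01 and app02 use it; web01 uses app01.
deps = {
    "db01": ["app01", "app02"],
    "app01": ["web01"],
}
print(sorted(blast_radius(deps, "db01")))  # ['app01', 'app02', 'web01']
```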

Different topologies over time can then be stitched together and enhanced with custom icons and tooling to provide a visualisation of the connected environment, regardless of where the topology data came from. Alerts are overlaid on the topology, providing visual context for a developing problem.

For more information on viewing topologies, refer to the IBM Cloud Pak for AIOps documentation.

After effective event correlation and automation have done their work, the next step is to calculate the most likely probable causes from what’s left. Finding the most likely probable cause is often known as finding the root cause of an issue. 

Probable cause localisation is about finding the most likely source of an issue within a topology, taking all relationships and connected components into account. The AI algorithms analyse the topology graphs to understand the meaning of the topology relationships, checking the connectivity of the resources and the resource details (configuration items). The AI Manager combines all derived event data and topology data through statistical comparison, calculates a probability for each probable cause, and suggests which component and area to look at. It then combines all derived data into an incident and provides recommendations in the incident view on how to solve the issue.

Another element is also taken into account when trying to find the reason for an incident: similar-incident analysis, also known as similar-ticket analysis.

Similar ticket analysis (also called Incident Similarity)

When an incident occurs, it can be helpful to review details of similar incidents to help determine a resolution. Similar tickets is an unsupervised learning algorithm that aggregates information about similar messages, anomalies, and events for a component or a service. The idea behind finding similar tickets in an IT service management system like ServiceNow is to find the top-k ranked similar incidents from the past for a given problem description. This helps you understand the current issue and shows previous successful actions to resolve it. If the incident description is well documented in the ITSM system, the AI algorithms use natural language processing to perform entity-action extraction and action-sequence mining, to summarize what was done to resolve the issue.
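The top-k ranking idea can be sketched with a bag-of-words cosine similarity. Real systems use NLP models rather than plain word overlap; everything here is an illustrative assumption:

```python
import math
from collections import Counter

def similarity(a, b):
    """Cosine similarity of two ticket descriptions as bags of words.
    Real systems use NLP embeddings; this word-overlap sketch only
    illustrates the top-k ranking idea."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def top_k_similar(query, tickets, k=2):
    """Return the k historical tickets most similar to the query."""
    ranked = sorted(tickets, key=lambda t: similarity(query, t), reverse=True)
    return ranked[:k]

history = [
    "database connection pool exhausted on app server",
    "disk full on backup volume",
    "connection pool exhausted after deployment",
]
print(top_k_similar("connection pool exhausted on database", history))
```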

To train this AI algorithm, you need to provide incident ticket information from ServiceNow, as IBM Cloud Pak for AIOps has a direct, out-of-the-box connection to ServiceNow. There is no set minimum for how much data to provide; the AI will train on whatever amount is available. The more tickets that can be consumed, however, the better. Any data collected for historical training will be available for use.

When an IBM AIOps system is integrated with an IT service management system like ServiceNow, you gain even more capabilities for in-depth analysis of incident tickets. Change risk analysis is an evolution of the incident-similarity use case.

Change risk is an unsupervised learning algorithm that takes historical data from tickets and helps you determine how likely it is that a given change would cause a problem. This determination is based on how successfully similar changes were deployed in the past. Using the assessment score provided by change risk, you can determine how safe it is to proceed with the change. Training this AI algorithm helps you ensure that risky changes to a service are assessed before deployment.
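A minimal sketch of a change-risk score: look up similar past changes and report their failure rate. The similarity test (word overlap) and the data shape are illustrative assumptions, not the product’s model:

```python
def change_risk(change_desc, history):
    """Score a planned change by the failure rate of similar past
    changes. `history` holds (description, succeeded) pairs; word
    overlap stands in for real similarity. Purely illustrative."""
    words = set(change_desc.lower().split())
    similar = [ok for desc, ok in history
               if len(words & set(desc.lower().split())) >= 2]
    if not similar:
        return 0.0    # no evidence, no computed risk
    return 1 - sum(similar) / len(similar)

history = [
    ("upgrade database driver", True),
    ("upgrade database server", False),
    ("rotate tls certificates", True),
]
print(change_risk("upgrade database cluster", history))  # 0.5
```

Here two similar past changes are found, one of which failed, so the assumed risk score is 0.5; the certificate rotation is unrelated and ignored.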

The following table shows a summary of use cases that can be addressed by IBM AIOps:

Figure 4: Summary table of IBM Cloud Pak for AIOps use cases based on AI algorithms


Summary:

IBM Cloud Pak for AIOps has a wealth of features that provide huge value to operations, as well as significantly extend any existing traditional on-premise IBM Netcool Operations Insight deployment. The key capabilities based on AI algorithms are:

Event Correlation:

IBM AIOps automatically analyses all events that flow through the system, looking for events that conspicuously always occur together. It also leverages the topology to correlate events based on the connectedness of the underlying resources, and groups events that share the same scope. The correlation engine in AIOps runs these correlation methods simultaneously and merges groups together where they overlap, which can happen when events are members of more than one group. This is what makes the event correlation capability of AIOps so powerful.

Event Seasonality:

An embedded machine learning (ML) algorithm processes all events that pass through AIOps, looking at when each event occurs. If IBM AIOps finds a perceivable pattern in the timing of an event’s occurrences, the event is marked as “seasonal”.

Anomaly detection:

Masses of IT event data document and record failures, issues, or trends that lead IT systems towards an outage. To exploit this mostly unstructured data, AI algorithms with their machine learning capabilities are the first choice for identifying issues and correlating them to avoid future incidents. In this sense, AI algorithms can also be seen as predictive algorithms that learn from the past (the training data) and apply predictions to avoid failures in the future.

Probable-cause analysis:

After effective event correlation and automation have done their work, the next step is to calculate the most likely probable causes from what’s left. AIOps does this by applying a graph analysis algorithm to the underlying topology that relates to the events, and then performing a keyword analysis of the related events. The probable cause analysis is customisable, so certain keywords can be earmarked for special treatment when they come up.

For an IT administrator or SRE, AI algorithms that analyze and correlate IT event data are an undoubtedly valuable set of tools that help avoid upcoming issues and failures; foreseeing and avoiding incidents should be a primary focus of every IT department. The question is not why or when to use an AIOps solution, but rather how and where to apply the AI-based tools to stabilize the IT landscape. Every IT system or application, due to its nature and complexity, will run into issues sooner or later, so it’s up to you to get tremendous help from the AI tools in IBM AIOps.

In AIOps we trust!

Authors:

Georg Ember

Malte Menkhoff