Demystifying Cloud Pak for AIOps: A comprehensive introduction - Part 1

By Ricardo Olivieri posted Sun November 19, 2023 06:26 PM


Authors: Ricardo Olivieri, Isabell Sippli, Matthew Duggan


In this two-part article, we describe the need to leverage artificial intelligence (AI) and machine learning (ML) as part of an organization’s incident and service management processes. We then introduce Cloud Pak for AIOps (CP4AIOps) and outline the value and benefits it provides to operations and service delivery teams in disciplines such as IT and network management.

Part 1 presents a brief history of the challenges and complexities in delivering infrastructure, applications, and services over the last two decades. This first part also defines CP4AIOps as a highly scalable event and incident management platform, which leverages policies and automations as mechanisms for quickly diagnosing and resolving incidents. Part 2 delves into the topic of topology and inventory information and the value that such data provide in CP4AIOps. Also in Part 2, the capabilities in CP4AIOps’ AI engine for anomaly detection, alert correlation, risk assessment of change requests, and identification of similar tickets are described.


In the early 2000s, monolithic enterprise applications/services deployed on application servers such as WebSphere Application Server or WebLogic were pervasive in the software industry. Such an application consisted of the front-end component (WAR module) and the supporting backend component (EJBs), all packaged together in one deployable unit, the EAR file. Although such applications were relatively static, management of their health and performance was still challenging due to scale, a myriad of complex and disparate systems, and relatively primitive management and collaboration systems.

Infrastructure and technical service management (such as VLANs) was and remains an important management discipline in its own right, irrespective of applications. Infrastructure was predominantly physical, subject to long procurement times and stringent change control processes, and managed by highly siloed teams, typically split along ‘IT’ (compute), network, and storage disciplines. Each team and their associated infrastructure had their own standards, management processes, and operational support systems. These teams often had less-than-ideal collaboration due to system limitations, organisational factors, and operational processes, which often resulted in a “finger pointing” culture, something that’s still true in many organizations today. It’s this diversity that led to the birth of the ‘Manager of Managers’ (MoM) and wildly successful Operations Support Systems (OSS) tools like Netcool OMNIbus - the aim being consolidated event/alert management. Customers also inevitably have proprietary data that needs to be factored into the management platform, whether it’s simple event enrichment or complex use of data from their configuration management database (CMDB) or proprietary data stores.

As years passed, non-monolithic and increasingly automated architectural patterns emerged to simplify the maintenance of applications and to accelerate the development and deployment cycles. These patterns came into existence to satisfy the needs and requirements of Agile development and DevOps practices. They aim to shorten the time windows for the development and delivery of new features and capabilities. At the same time, the number of artifacts that comprise a software solution or application grew as organizations adopted these new architectural patterns. Also, newer types of systems such as mobile apps, IoT, containerization platforms, distributed apps, and others drove complexity up. Long gone are the days when monitoring the deployment of a monolithic application and/or infrastructure was good enough to ensure reliability of the systems that support the operations of an enterprise.

The traditionally siloed management and delivery of separate infrastructure domains necessarily began to blur as virtualisation and containerisation technologies resulted in the commoditisation of what had been specialised hardware from vendors such as Cisco or Nokia. This ultimately led to the ability to run highly specialised workloads such as Virtualised Network Functions as applications on the likes of Red Hat OpenShift and OpenStack, as customers sought to reduce the procurement time of differentiating services as well as to gain operational efficiencies. This seismic shift in how infrastructure can be delivered is a forcing function for improved collaboration amongst teams who typically don’t work closely together. For example, the network team now has to be aware of the implications of running what they see as a standard network function on a compute or container platform - the net result being that much greater and more timely operational visibility and collaboration are needed.

Today’s IT environments and deployment architectures introduce greater complexity for managing and operating the systems in those environments. Hybrid environments are real - encompassing both traditional and new architectures, as well as on-premises and cloud environments. Besides the actual environments that are delivering business services, there is also organic growth of tooling to manage such environments. Organizations often use different point monitoring solutions, driven either by the need for specialized monitoring (like a network or storage monitor) or by siloed internal organisations in which individual groups make decisions tailored to their specific needs. Central IT teams, by contrast, have the bigger, cross-organisational picture in mind and need holistic oversight of their managed environment.

The increasing use of closed-loop orchestration aims to expedite and simplify the delivery and management of complex applications/services with a view to improving efficiency, such as automatically scaling a workload in response to customer demand. This necessarily results in additional elements and aspects that need to be managed, as the operations and application/service delivery teams now need to understand intent as well as the actual state of the environment.

In this context, there is a need to drive manageability up to par with complexity. Intelligent mechanisms that leverage AI/ML in IT operations are needed to address growing challenges. Solutions are needed that seamlessly integrate into client environments, without requiring them to re-instrument or change their underlying tooling landscape.

For all the focus on virtualisation and container platforms, there are still large amounts of physical compute, network, and storage infrastructure that need to be managed and that use established and traditional techniques such as SNMP. That presents additional challenges and opportunities, as such infrastructure has physical characteristics like temperature, location, power supplies and voltages, which are useful data for incident management and planning use-cases.

The following figure depicts the scope of what organizations have to manage and how rapidly it changes. The high degree of volatility depicted adds pressure to operations teams: any tribal knowledge held about the composition of the environment can be diluted because the infrastructure itself can be unpredictable and increasingly opaque. This puts emphasis on providing good context to alerts, logs, and metrics through the use of topology and inventory data.

What is CP4AIOps?

CP4AIOps is a product offering from IBM that operations teams and Site Reliability Engineers (SREs) use to collaborate, minimize outages and incidents in the environments they manage and support, and to quickly resolve them when they occur. You can think of CP4AIOps as your co-pilot for proactive problem determination, remediation and avoidance. CP4AIOps also helps with other use-cases, such as forensic analysis of incidents after-the-fact, maintenance planning, ‘what if?’ analysis, and serving as an authority on which orchestration systems can base their decision making.

CP4AIOps simplifies the complexity of IT operations and accelerates incident resolution by automating and improving end-to-end IT operations across domains (e.g., network, storage, infrastructure, applications, virtualization platforms) at scale with efficiency and resiliency. CP4AIOps uses artificial intelligence (AI), machine learning (ML), heuristics, and modern technologies such as graph databases to analyze data from diverse sources and domains to help IT organizations automate and accelerate the detection, diagnosis, and resolution of IT incidents. CP4AIOps aims to reduce the burden on operations teams by consolidating disparate data into actionable and context-rich insights while mitigating gaps in tribal knowledge.

CP4AIOps can ingest heterogeneous operational data, which can be structured, semi-structured, and unstructured. This includes data types such as events, logs, performance metrics, tickets, topology/inventory and configuration data. Ingesting and analyzing across management domains and data types allows CP4AIOps to add insights and identify patterns and anomalies that may indicate an impending incident. The operational data that CP4AIOps ingests can come from many different environments and sources, such as on-premises, cloud, VMs, containerized solutions, infrastructure systems, networks, and applications among many others, including ones that are proprietary in nature.

CP4AIOps is a powerful platform that helps IT organizations to improve the efficiency of their incident management process by reducing the number and impact of incidents, while dealing with heterogeneous, imperfect data. Crucially, CP4AIOps still provides value in the event that ideal data is not available, such as receiving events about resources not known in the topology inventory.

At the core of CP4AIOps, there are:

  1. A highly scalable event and incident management platform.

  2. A remediation solution to quickly resolve incidents at hand.

  3. A flexible topology & inventory manager to manage heterogeneous application, infrastructure, network and storage entities in near-real-time whilst keeping history of how the environment has changed.

  4. An AI engine that reduces the need for manual, rule-driven approaches to, for example, detecting log and metric anomalies, event correlations, and similar incidents.

The remainder of this blog post will detail the core areas of CP4AIOps.

A highly scalable event and incident management platform

CP4AIOps relies on existing observability and monitoring tools (e.g., Instana, Dynatrace, New Relic, VMware vCenter, ServiceNow), on devices directly, and on proprietary systems to provide operational data. Therefore, using CP4AIOps, you can ingest and correlate operational data from the different IT tools already deployed in an IT environment. For example, let’s say that an IT organization uses:

  • ServiceNow as their IT Service Management (ITSM) tool.

  • A network monitoring tool (e.g., ITNM or SevOne).

  • An infrastructure observability tool.

  • An application observability tool.

  • Log aggregators (e.g., ELK, Splunk, etc.) for centralization and easy access to network, infrastructure, and application logs.

  • Proprietary data sources that hold data about applications/services and infrastructure.

Then that IT organization can use CP4AIOps to ingest, correlate, and add insights to:

  • Their ticketing data (e.g., ServiceNow incidents and change request records).

  • Events, metrics, and topological/inventory information from their observability and monitoring tools (for infrastructure, application and network layers).

  • Logs from the network, infrastructure, and application layers.

  • Data from their data sources, augmented with information that’s proprietary to them, with a view to making CP4AIOps fit established operational practices.

CP4AIOps uses machine learning and heuristics to automatically classify and correlate events from different sources. This reduces noise by deduplicating instances of the same event, providing context (i.e., what is the alert about technically and which applications/services does it relate to?) and by grouping related alerts together. For a group of alerts, CP4AIOps also identifies the possible root cause alert(s) within the group. For this identification, Natural Language Processing (NLP) is applied to every incoming event to analyze and inspect its textual description. This results in a classification of the event that falls into one of the pre-defined golden signals categories (e.g., Error, Latency, Saturation, and Traffic). Based on this classification along with the topological information, CP4AIOps computes the probable cause within a group of related alerts. This helps IT teams to quickly identify and understand the possible cause of IT problems.
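To make the noise-reduction steps above concrete, here is a minimal Python sketch of deduplication followed by classification into golden signals. It is an illustration only, not how CP4AIOps implements these steps: the keyword lists, field names (`resource`, `summary`), and function names are all assumptions, and plain keyword matching stands in for the NLP classification the product applies.

```python
from collections import OrderedDict

# Hypothetical keyword lists: a crude stand-in for NLP-based classification.
GOLDEN_SIGNAL_KEYWORDS = {
    "Error": ("error", "failed", "exception"),
    "Latency": ("latency", "slow", "timeout"),
    "Saturation": ("full", "exhausted", "utilization"),
    "Traffic": ("requests", "throughput", "connections"),
}


def classify(summary):
    """Return the first golden signal whose keywords appear in the summary."""
    text = summary.lower()
    for signal, keywords in GOLDEN_SIGNAL_KEYWORDS.items():
        if any(word in text for word in keywords):
            return signal
    return "Unclassified"


def deduplicate(events):
    """Collapse events sharing (resource, summary) into one alert with a count."""
    alerts = OrderedDict()
    for event in events:
        key = (event["resource"], event["summary"])
        if key not in alerts:
            # First occurrence: create the alert and classify it once.
            alerts[key] = {**event, "count": 0, "signal": classify(event["summary"])}
        alerts[key]["count"] += 1
    return list(alerts.values())


# Hypothetical incoming events; the first two are duplicates.
events = [
    {"resource": "db-01", "summary": "Disk almost full"},
    {"resource": "db-01", "summary": "Disk almost full"},
    {"resource": "web-01", "summary": "Request timeout talking to db-01"},
]
alerts = deduplicate(events)
for alert in alerts:
    print(alert["resource"], alert["signal"], alert["count"])
```

In this sketch, three raw events become two alerts, each carrying an occurrence count and a golden signal category that a later correlation step could use.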

CP4AIOps helps to identify potential incidents early and prevent them by identifying anomalies in logs and metrics. In many cases, before fatal problems occur and business applications are impacted, deviations from normal behavior patterns are reflected in logs and/or performance metrics. By identifying anomalies and correlating them with other operational data, CP4AIOps can help to prevent service disruptions and downtime in your IT environments.
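As a rough illustration of what "deviation from normal behavior" can mean for a metric, the sketch below flags values that lie far from a baseline using a simple z-score. This is a generic textbook technique chosen for brevity, not the algorithm CP4AIOps uses; the function name, the threshold of three standard deviations, and the sample latency values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(baseline, recent, threshold=3.0):
    """Return recent values more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [value for value in recent if value != mu]
    return [value for value in recent if abs(value - mu) / sigma > threshold]


# Hypothetical response-time samples in milliseconds.
baseline_latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
recent_latency_ms = [101, 99, 180]

anomalies = find_anomalies(baseline_latency_ms, recent_latency_ms)
print(anomalies)
```

Here the baseline has a mean of 100 ms and a sample standard deviation of 2 ms, so only the 180 ms reading is flagged. An early warning of this kind only becomes useful once it is correlated with other operational data, as described above.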

A quick introduction to policies

Policies are a powerful construct in CP4AIOps that allow operations teams to adapt alert processing to their business requirements, encapsulating conditions and actions under one single cohesive entity. A policy definition states the conditions that should be met for triggering a set of actions.

There are two ways for policies to come to life:

  1. You can define policies in CP4AIOps to address different concerns such as suppressing alerts, grouping alerts together, triggering the creation of incidents based on the attributes of alerts, triggering the execution of runbooks to remediate known IT problems, and/or invoking Netcool Impact policies that, say, enrich alerts or notify downstream tools or systems when certain alerts are seen.

  2. The CP4AIOps AI engine automatically generates policies based on what it learns about the environment.

Below is an example of a user-defined policy that groups alerts if they originate from “Prometheus” and have the “DBAdmins” team associated with them. The outcome of this policy is that whenever such alerts have the same resource name, they are grouped together (note that you can use arbitrary alert fields of your choice to define a policy).
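The condition-plus-action shape of such a grouping policy can be sketched in a few lines of Python. This is a conceptual model only: policies in CP4AIOps are defined in the product itself, and the field names (`source`, `team`, `resource`, `id`) and sample alerts here are assumptions made for illustration.

```python
from collections import defaultdict


def matches_policy(alert):
    """Condition: the alert comes from Prometheus and belongs to the DBAdmins team."""
    return alert.get("source") == "Prometheus" and alert.get("team") == "DBAdmins"


def group_alerts(alerts):
    """Action: group the ids of matching alerts by resource name."""
    groups = defaultdict(list)
    for alert in alerts:
        if matches_policy(alert):
            groups[alert["resource"]].append(alert["id"])
    return dict(groups)


# Hypothetical alerts; only the first two satisfy both conditions.
alerts = [
    {"id": 1, "source": "Prometheus", "team": "DBAdmins", "resource": "db-01"},
    {"id": 2, "source": "Prometheus", "team": "DBAdmins", "resource": "db-01"},
    {"id": 3, "source": "Prometheus", "team": "WebOps", "resource": "db-01"},
    {"id": 4, "source": "Instana", "team": "DBAdmins", "resource": "db-02"},
]
groups = group_alerts(alerts)
print(groups)
```

Alerts 3 and 4 each fail one of the two conditions, so they are left out of the group even though alert 3 shares the resource name.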

There could also be a need in IT environments to not take action on certain alerts. For example, depending on the nature of an IT environment, alerts initiated from, say, hosts that share a common name pattern (such as starting with the prefix test) could be unwanted and should be suppressed. This may be the business policy in the IT organization because these hosts undergo constant changes that trigger benign warnings and notifications all the time. If such alerts are not suppressed, they generate unwelcome noise and distract operations teams. To avoid this situation, a suppression policy can be defined in CP4AIOps to suppress alerts that do not require action because they come from resources that change constantly and do not warrant attention. Suppressed alerts are still present in CP4AIOps and can be viewed, but they are filtered out of the view by default. Also, suppressed alerts are not taken into account when creating incidents or for triggering runbook automations. The image below shows a suppression policy for a low severity alert from a test application resource named “shop_test_app”.
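The key behavior described above, flagging alerts as suppressed rather than discarding them, can be modeled with a short Python sketch. The `host` field, the `suppressed` flag, and the function names are assumptions for illustration and do not reflect how CP4AIOps stores or evaluates policies.

```python
def is_suppressed(alert, prefix="test"):
    """Condition: the originating host name starts with the given prefix."""
    return alert["host"].startswith(prefix)


def apply_suppression(alerts):
    """Flag matching alerts instead of deleting them, so they remain viewable."""
    return [{**alert, "suppressed": is_suppressed(alert)} for alert in alerts]


def default_view(alerts):
    """The default view hides suppressed alerts; they still exist in the full list."""
    return [alert for alert in alerts if not alert["suppressed"]]


# Hypothetical alerts from test and production hosts.
incoming = [
    {"host": "test-web-01", "summary": "Config reload warning"},
    {"host": "prod-db-01", "summary": "Replication lag high"},
    {"host": "test-db-02", "summary": "Restart detected"},
]
flagged = apply_suppression(incoming)
visible = default_view(flagged)
print(len(flagged), [alert["host"] for alert in visible])
```

All three alerts are retained, but only the production alert appears in the default view. Downstream steps such as incident creation would operate on the unsuppressed subset.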

The next image below shows a policy that was generated by CP4AIOps’ own AI engine. This policy automatically correlates alerts that are known to co-occur.
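One simple way to picture how co-occurrence could be learned from alert history is to count how often pairs of alert types fall into the same time window. The sketch below does this with fixed 60-second windows; it is a deliberately naive illustration, not the learning method CP4AIOps uses, and the alert type names, window size, and minimum count are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(alert_log, window_seconds=60):
    """Count how often each pair of alert types appears in the same fixed time window."""
    windows = {}
    for timestamp, alert_type in alert_log:
        windows.setdefault(timestamp // window_seconds, set()).add(alert_type)
    pair_counts = Counter()
    for types in windows.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(types), 2):
            pair_counts[pair] += 1
    return pair_counts


def correlated_pairs(alert_log, min_count=2):
    """Pairs seen together in at least `min_count` windows are correlation candidates."""
    counts = co_occurrence_counts(alert_log)
    return [pair for pair, count in counts.items() if count >= min_count]


# Hypothetical history of (timestamp in seconds, alert type).
alert_log = [
    (5, "db_connection_lost"), (20, "api_5xx_spike"),
    (70, "disk_latency_high"),
    (130, "db_connection_lost"), (150, "api_5xx_spike"),
    (200, "api_5xx_spike"), (210, "db_connection_lost"),
]
pairs = correlated_pairs(alert_log)
print(pairs)
```

The two alert types that repeatedly appear in the same window are reported as a candidate pair, while the isolated disk latency alert is not paired with anything.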

A remediation solution to quickly diagnose and resolve incidents at hand

IT and networking environments are complex and the number of events increases with their complexity. Furthermore, the tribal knowledge that operations teams typically had about physical infrastructure rapidly becomes diluted in modern volatile environments, which puts the emphasis on increased automation to help operations staff diagnose and resolve issues. The need to diagnose and fix IT problems quickly without disrupting applications and systems is pervasive across all IT organizations. Using runbook automations in CP4AIOps, you can record and automate manual steps and recipes for known and repeatable problems so that they are run consistently across an organization. You can think of runbooks as automated solutions that are triggered when certain properties in incoming events are seen. Procedures that do not require human interaction are excellent candidates for the implementation of runbooks, which increases the productivity and effectiveness of IT operations processes.

An example of a runbook automation that opens a GitHub issue is captured in the image below. You can easily integrate with any external or internal REST API - adding your custom actions as an automated response to an alert.
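For readers curious about what such a REST call involves, the Python sketch below builds (but deliberately does not send) a request to GitHub's REST endpoint for creating an issue. It is not a CP4AIOps runbook definition; the alert fields, the `example-org`/`example-repo` names, and the token are placeholders.

```python
import json
import urllib.request


def build_issue_request(alert, owner, repo, token):
    """Build a POST request for GitHub's "create an issue" REST endpoint."""
    payload = {
        "title": f"[{alert['severity']}] {alert['summary']}",
        "body": f"Resource: {alert['resource']}\nRaised automatically from an alert.",
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical alert and placeholder repository details.
alert = {"severity": "critical", "summary": "Disk almost full", "resource": "db-01"}
request = build_issue_request(alert, "example-org", "example-repo", "PLACEHOLDER_TOKEN")

# Sending is intentionally left out; urllib.request.urlopen(request) would perform the call.
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the action easy to inspect and test before it is wired to real alerts.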

You can also use the same automated action as a right click action for your alert, as CP4AIOps allows you to tailor and configure the actions your operations team can run:


We hope we've provided you with a broad understanding of CP4AIOps in this first of a two-part article. In this first part, we described the complexity that exists in today’s IT and networking environments and the advantage of leveraging AI and ML to decrease the mean time to identify, diagnose, and resolve problems across domains. We then introduced CP4AIOps, outlined the value and benefits it provides, and covered the insights gained from ingesting and correlating heterogeneous operational data from disparate data sources. Finally, we also illustrated how policies and automations are an integral part of our incident management platform and discussed the significance of such mechanisms for diagnosing and resolving incidents. We encourage you to continue with Part 2 of this two-part article!