AIOps on IBM Z

AIOps on IBM Z is a group that brings together IT professionals to share their knowledge and expertise on leveraging AI-driven intelligence and IT Operations in order to accelerate decisions to maintain resiliency through the use of AIOps on IBM Z

What on earth is AIOps

By Juergen Holtz posted Tue July 11, 2023 10:03 AM

  
Occasionally, friends or new colleagues ask me what I do at IBM. When I tell them that I work in AIOps, I often get puzzled looks as they try to make sense of that made-up word. I then usually explain it the way I do in this short article, which helps them understand what operating a mainframe means and how it can affect them, even if they have nothing to do with IT. So, let me explain it to you as well.

The mainframe is a highly available, highly resilient computation platform optimized for high throughput and high performance, made to run the world's most critical workloads such as core banking systems, credit card payments, airline reservations and more. These workloads are expected to be always up and running, 24x7.

Operations, or Ops for short, means ensuring that the services this platform is expected to provide are delivered to their consumers with the required availability (e.g., no downtime) and quality (e.g., sub-second response time). Consumers are first and foremost the lines of business that use these services within their applications. But in the end, it is also you and I, as clients who want to withdraw money or order a book, who depend on them.

Unfortunately, the world is not perfect, and every system is vulnerable to disruptions: Hardware can break, software can have bugs, workload demand can be unexpectedly high, and administrators can make configuration errors. These are just some of the most common causes of disruptions. When any of them happens, we speak of an incident, and operations is then responsible for minimizing its impact and resolving it as quickly as possible.

In a previous blog post, I wrote about the four phases of an incident management process and the related tool-chain, and I want to briefly revisit them here using the following picture:

[Figure: Incident Management Toolchain and the IBM Z AIOps Framework: Detect, Decide and Act]

As illustrated, incident management is about the interaction of many different, specialized capabilities in a way that enables an operations team to collaboratively detect any disruption, to analyze the disruption and to identify the underlying root cause, to decide what needs to be done by whom, and eventually, to act and resolve the incident.

There is no single tool that can handle all these capabilities. Therefore, I believe that the best way to deal with this challenge is to enable our own tools to integrate well with each other and to offer integration points from wherever you start to wherever you might want to go on this journey. One important aspect is the focus on open standards such as OpenTelemetry and OpenAPI REST APIs, which also enable our clients to use the best tool for a given job in their heterogeneous systems management landscape. This integration is not limited to the IBM Z tool-chain but explicitly includes IBM Instana® Observability and IBM Cloud Pak® for AIOps, as both tools are essential for observing applications running in the hybrid cloud, that is, across different platforms, partly on premises and partly in private or public clouds.

Very likely, there is also no single person who has all the skills required to handle an incident alone. In addition, our clients often face demographic and post-pandemic challenges: experienced people are reaching retirement age, and others are working more and more from home instead of on site. I think this is establishing a trend, a new way to work and to operate the IT landscape using chat tools in which different people collaborate on incidents. Beyond that, people can access the necessary tools in said tool-chain directly from within the chat window, so that the chat delivers an integrated user experience.

While I've spent much time on the Ops part of the word so far, let's now look at what the AI part means. AI, as you've already guessed, stands for Artificial Intelligence, and the idea behind the word AIOps is to pair Operations with Artificial Intelligence and to use this exciting technology to make an operator's life easier. I can think of different use cases where AI can make a difference. In the previous paragraph I talked about chat, and in fact chat, driven by Natural Language Processing, is one of the most popular areas where AI shines.

The underlying large language models have been trained on billions of publicly accessible documents and hence are good at making sense of written sentences, understanding the user's intent, and even predicting responses word by word. As we move forward on our AIOps journey, we are looking into enabling users to have conversations that feel very much like conversations among humans. At the same time, we are aware of the risks that this technology brings. According to a 2023 IBM Institute for Business Value AI Global pulse survey, 84% of business leaders see at least one trust-related issue as a roadblock to Generative AI. So, as we adopt this new technology, we want to make sure that we provide trusted AI that is responsible and governed.

Aside from large language models, there are other uses of AI. In recent years, a lot of attention has gone to anomaly detection in different areas. Some algorithms are based on statistical learning of metrics, with the goal of establishing a baseline and tolerance limits for a certain degree of deviation from that baseline. Actual values above or below these limits can then be considered anomalies. Other algorithms focus on anomalies in the arrival of messages, for instance messages in the z/OS OPERLOG. The goal of these algorithms is to detect messages that, for instance, haven't been seen before or that appear more often than in the past. These are indicators of something abnormal. Yet another set of algorithms correlates different sources of data to detect hidden patterns between them, which again could indicate anomalous behavior.
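
To make the first two ideas a bit more tangible, here is a minimal Python sketch. It is only an illustration under simple assumptions and not how IBM Z Anomaly Analytics or any other product works internally: the baseline is taken as the mean of past values, the limits as a multiple `k` of the standard deviation around it, and the message check simply compares counts in two time windows of equal length. All function names, parameters, thresholds and message IDs are made up for this example.

```python
from collections import Counter
from statistics import mean, stdev


def baseline_limits(history, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations.

    Assumption for this sketch: 'history' holds past values of one metric
    that are considered normal.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_metric_anomaly(value, history, k=3.0):
    """Flag a value that falls below or above the limits of the baseline."""
    lower, upper = baseline_limits(history, k)
    return value < lower or value > upper


def message_anomalies(past_ids, current_ids, ratio=3.0):
    """Compare message IDs seen in a past and a current window of equal length.

    An ID is reported as 'new' if it never appeared in the past window and as
    'frequent' if it appears more than 'ratio' times as often as before.
    """
    past = Counter(past_ids)
    current = Counter(current_ids)
    findings = {}
    for msg_id, count in current.items():
        if msg_id not in past:
            findings[msg_id] = "new"
        elif count > ratio * past[msg_id]:
            findings[msg_id] = "frequent"
    return findings


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    response_times = [100, 102, 98, 101, 99, 100, 103, 97]
    print(baseline_limits(response_times))
    print(is_metric_anomaly(120, response_times))

    # Hypothetical message IDs, not real z/OS messages.
    past_window = ["MSG0001I", "MSG0001I", "MSG0002I"]
    current_window = ["MSG0001I"] * 7 + ["MSG0002I", "MSG0003E"]
    print(message_anomalies(past_window, current_window))
```

Real algorithms have to cope with things this sketch ignores, such as seasonality (a Monday morning looks different from a Sunday night) and baselines that drift over time.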

Tools such as IBM Z Anomaly Analytics or IBM Workload Interaction Navigator can then help to detect anomalies and identify root causes more quickly, and thus contribute to the overall goal of the incident management process: resolving incidents as quickly as possible.

These were just a few use cases, and I am sure more will come in the future. To summarize, I want to leave you with an impression of the many tools that play a role in AIOps, as illustrated below.

[Figure: IBM Z AIOps Overview]

I hope this gives you a few insights into what we do in AIOps for IBM Z and what guides our agenda. I would like to end with an invitation to join our upcoming webinar on September 12, where you will hear from Luke de Kansky, WW IBM Z Software Business Unit Executive, and myself about what we are going to deliver in the second half of this year. I look forward to seeing you there.
