Recently, I had the opportunity to participate in a joint mainframe study between IBM and a large financial services company in Europe. The study also explored the customer's service management processes, with the goal of figuring out where and how these processes could be optimized and further automated.
The challenge I faced was to quickly get an overview of the individual service management processes, and also of the tools that support people in these processes in their day-to-day jobs. Now, you probably rightly assume that the author of this blog should be very familiar with all this. But to be honest, as soon as you start to dig a little deeper, it becomes very obvious where the gaps are. So, let me share with you how I approached filling in the blank spots.
One of the service management processes that I looked at in more detail was Incident Management. As you know, Incident Management is about restoring normal service operation as quickly as possible once an incident has been detected. It helps to ensure that service level objectives and availability targets are met. Conceptually, the tooling required to support this process is depicted in the following picture:
The process requires tools that monitor the individual services to detect issues. This is all about monitoring, tracing, and logging. Once an incident has been detected, the process requires tools to create a ticket, to analyze the data, and to reduce it to bite-sized pieces for the Site Reliability Engineer (SRE) to work with. Then, there are tools that assist with planning tasks such as notifying the appropriate specialist and suggesting possible solutions. Finally, the process requires tools that put all this into action through collaboration and knowledge sharing, with the help of chat tools, bots, and automated runbooks.
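To make the flow of these tool categories a bit more tangible, here is a minimal Python sketch of the stages described above: detect an issue, turn it into a ticket with a condensed summary, pick a specialist and suggestions, and post a notification. Everything in it is an assumption made for illustration (the `Incident` and `Ticket` classes, the `ON_CALL` and `RUNBOOK_HINTS` tables, the metric names, and the thresholds); it does not represent the API of any IBM product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical lookup tables; real tooling would get these from
# an on-call schedule and a knowledge base.
ON_CALL: Dict[str, str] = {"payments": "alice", "batch": "bob"}
RUNBOOK_HINTS: Dict[str, List[str]] = {
    "cpu_percent": ["Check for looping jobs", "Review workload priorities"],
}


@dataclass
class Incident:
    service: str
    metric: str
    value: float
    threshold: float


@dataclass
class Ticket:
    incident: Incident
    summary: str
    assignee: str
    suggestions: List[str] = field(default_factory=list)


def detect(samples: List[Tuple[str, str, float]],
           thresholds: Dict[str, float]) -> List[Incident]:
    """Monitoring: flag every sample whose value exceeds its threshold."""
    incidents = []
    for service, metric, value in samples:
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            incidents.append(Incident(service, metric, value, limit))
    return incidents


def decide(incident: Incident) -> Ticket:
    """Ticketing and planning: summarize, pick a specialist, suggest fixes."""
    summary = (f"{incident.service}: {incident.metric}={incident.value} "
               f"exceeds {incident.threshold}")
    assignee = ON_CALL.get(incident.service, "duty-manager")
    suggestions = list(RUNBOOK_HINTS.get(incident.metric, []))
    return Ticket(incident, summary, assignee, suggestions)


def act(ticket: Ticket) -> str:
    """Collaboration: build the message a chat bot might post."""
    return f"@{ticket.assignee} {ticket.summary}"


samples = [("payments", "cpu_percent", 97.0), ("batch", "cpu_percent", 40.0)]
tickets = [decide(i) for i in detect(samples, {"cpu_percent": 90.0})]
for t in tickets:
    print(act(t))
```

In practice each of these functions stands in for a whole category of tools, which is exactly why an overview of which product covers which stage is useful.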
So far, so good. But how do the products in the IBM zSystems AIOps portfolio fit into this picture? And how can I learn a bit more about those products that are not necessarily part of my own day job?
Luckily, this learning journey wasn't so difficult. I used the AIOps framework that we have created around the areas Detect, Decide, and Act. The framework highlights the different capabilities that are required to successfully operate your mainframe systems in the context of your hybrid cloud environment. Please refer to Sanjay Chandru's blog for an introduction to this framework, and to the subsequent blogs it references, which discuss the three areas in more detail.
On top of the framework, we have also created the AIOps on IBM zSystems Handbook, which organizes the products in the IBM zSystems AIOps portfolio along Detect, Decide, and Act. The overview fits on a single page, as depicted in the following picture:
For instance, if you are interested in finding out more about Collaborative incident remediation, you can click on the color-coded hand icon next to this heading to jump to a more detailed page introducing IBM Z ChatOps and Service Management Unite. That page briefly covers the challenges, the new capabilities you might consider to address them, and what IBM can offer in that space. And if that's not enough, you can also follow a link to a blog with more details about this section.
So, after studying the handbook and getting a better sense of all of our products' capabilities, I was able to complete the IBM view of the Incident Management tooling overview easily. The result is shown below: