Let’s create systems that can fix themselves: Can we create AIOps monitoring from AIOps analytics?

Let’s create systems that can fix themselves

admin sleeping while system log shows a warning, critical, then healthy!

Can we create AIOPs monitoring from AIOPs analytics

I loved watching the US Open tennis championships this past weekend. Great tennis for sure, but also great new IBM advertisements!! The new IBM Let’s create ad theme resonates well with me, since my team and I, in product development, get to create products that customers use to create solutions for their business needs – so awesome!!

Credit: unsplash.com

Here’s an older Let’s create ad that is exactly what my team is trying to develop. We call it “Detect, Decide, Act”, which matches this IT Automation ad.

I have several create blogs in mind. In this blog, I take a quick look at creating an IT monitoring ability, with emerging analytics solutions, based on the OMEGAMON Data Provider (ODP), that you may have seen me and others blog about.

What are you thinking about creating, to solve one of your business needs? Comment below, or contact me, and I might try to create around your idea and blog about it. Let’s create together!!

Part 1: An ODP Detect success story

For another project, I’ve been glancing at a few ODP z/OS metrics in Grafana daily. I got a funny surprise after a weekend maintenance window: an obvious shift in behavior!!

In one Grafana screen, the CP and zIIPs swapped usage ratios (green and blue), kinda crazy!

In the other screen, TCB jumped up (total_tcb_percent in blue), from 10% to 100%, a 10x increase, also crazy!!

time series metric line graph with big shift upward

This prompted me to look at CPU in the TEP. When I opened the TEP, I quickly saw the ALNWEBS address space was at 100% TCB.

TEP histogram showing huge TCB for ALNWEBS others normal

I notified the system owner and moved on, recommending that the weekend patch activity be reviewed carefully. I felt great that ODP and Grafana had rendered a significant behavior change so easily. ODP+Grafana allowed basic Detect, and the TEP allowed deeper debug, for Decide and Act. I knew the burden now was on me to get AI to watch these graphs, not me! Also, I knew AI should do some correlation with weekend patch activity, not the system admin!!
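As a first step toward having software watch these graphs instead of a person, a shift like the total_tcb_percent jump can be flagged with a simple before/after comparison. The Python sketch below is only an illustration under assumptions: the function name find_level_shift, the window size, the min_jump threshold, and the sample values are all made up here, and it reads a plain list of numbers rather than calling any ODP, Grafana, or Elasticsearch API. It is basic statistics, not AI.

```python
from statistics import mean


def find_level_shift(samples, window=6, min_jump=20.0):
    """Return (index, jump) for the largest before/after shift, or None.

    For each candidate index, compare the mean of the `window` samples
    before it with the mean of the `window` samples starting at it.
    `window` and `min_jump` are illustrative defaults, not tuned values.
    """
    best_index, best_jump = None, 0.0
    for i in range(window, len(samples) - window + 1):
        jump = mean(samples[i:i + window]) - mean(samples[i - window:i])
        if abs(jump) > abs(best_jump):
            best_index, best_jump = i, jump
    # Ignore small wiggles: only report shifts of at least `min_jump` points.
    if best_index is None or abs(best_jump) < min_jump:
        return None
    return best_index, best_jump


if __name__ == "__main__":
    # Made-up TCB percentages: roughly 10% before the weekend, 100% after.
    tcb_percent = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
                   100, 99, 100, 100, 98, 100, 100, 99]
    print(find_level_shift(tcb_percent))
```

On the made-up series, the largest before/after jump lands on the first sample near 100%. A real detector would also have to cope with daily cycles and noisy metrics, which this sketch ignores.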

Part 2: An ODP Detect and Isolate success story

A couple of weeks later, I got the inspiration to see what ODP+Grafana or ODP+Kibana alone could do, without using the TEP. I dreamed of doing some kind of automatic time correlation, blindly across all metrics, curious about which metrics move together, maybe in some magic ratios, etc. Instead, I ended up pursuing an experiment to see if ODP plus the provided starter Kibana dashboards could help me learn more about this behavior shift and isolate it to the 100% TCB situation for the ALNWEBS jobname.

I quickly jumped to the ODP z/OS starter Kibana dashboards, specifically the Address Space CPU Utilization. I was a bit disappointed that ALNWEBS wasn’t prominent, like on the TEP. I wasn’t thinking too much at this point. I don’t think I appreciated that this dashboard was CPU, and TCB wasn’t really rendered.

Kibana histogram of top ten CPU jobs, ALNWEBS lost in the mix

My brain thinks best in the time domain. I started looking for the starter visualizations behind these dashboards, and I was excited to find there already was a starter time series CPU Top Ten Jobname viz that I could start hacking on, even if I wasn’t seeing it in the starter dashboards.

I had a notion to go find the “seam” where the shift happened, and zoom in time wise. I think it was dumb luck that I zoomed in time wise.

I could start to see ALNWEBS prominently, so maybe I was catching on: the CPU view could surface ALNWEBS dramatically, like how TCB was prominent in the TEP.

Below are two views of the seam. The second view shows ALNWEBS more clearly, and the first view starts to hint that occasionally there are other jobnames that come and go, that have decent CPU footprint.

time series line graph showing the start of the larger CPU ALNWEBS

time series line graph showing the start of the larger CPU ALNWEBS zoom in

I was still scratching my head some, why the first starter dashboard view wasn’t dramatic, and started to realize that other large CPU users are out there, come and go, and can easily push ALNWEBS further down the Top Ten list, or even off the Top Ten list. Below shows other decent sized jobnames, and ALNWEBS is nowhere to be seen.

time series line graph showing large CPU jobs but no ALNWEBS

At this point, I was wise enough to look at only 30 minutes, when the system was relatively quiet, and now ALNWEBS was dramatically exposed. I concluded that ODP+Kibana dashboards could have the same effectiveness as TEP!
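The window-size effect described above can be reproduced with a few lines of Python. This is a hedged sketch, not how the starter Kibana visualization is actually built: it assumes a Top N ranked by summed CPU over the selected time range, and the top_jobs and build_synthetic_samples names, the BATCHA/BATCHB/BATCHC/IDLEJOB jobnames, and every number in the data are invented for illustration. Only ALNWEBS comes from the story above.

```python
from collections import defaultdict


def top_jobs(samples, start, end, n=10):
    """Rank jobnames by summed CPU percent for minutes in [start, end).

    `samples` is an iterable of (minute, jobname, cpu_percent) tuples.
    Ranking by the sum is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for minute, jobname, cpu_percent in samples:
        if start <= minute < end:
            totals[jobname] += cpu_percent
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def build_synthetic_samples():
    """Made-up data: 600 minutes, with big batch jobs that stop at minute 540."""
    samples = []
    for minute in range(600):
        samples.append((minute, "ALNWEBS", 8.0))   # steady but modest
        samples.append((minute, "IDLEJOB", 1.0))   # background noise
        if minute < 540:
            for job in ("BATCHA", "BATCHB", "BATCHC"):
                samples.append((minute, job, 30.0))  # large, then gone
    return samples


if __name__ == "__main__":
    data = build_synthetic_samples()
    print("long window :", top_jobs(data, 0, 600, n=3))
    print("quiet window:", top_jobs(data, 570, 600, n=3))
```

With a Top 3 over the whole range, the batch jobs fill the list and ALNWEBS drops off it; over the last quiet 30 minutes, ALNWEBS is at the top. The same reasoning applies to a Top Ten on a busier system.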

Kibana histogram of Top Ten jobs zoomed in to 30 minutes, no other big jobs, ALNWEBS is huge now

Lessons Learned, Conclusions

I started with a week-long view, in the first starter Kibana dashboard above. I thought I was cool, dramatically showing the shift in the time graph at the bottom of that dashboard, like my Grafana views earlier. But in this case, it wasn’t obvious that ALNWEBS itself had a huge shift.

I think my analytics bias is for longer periods of time, while the TEP is tuned for the monitoring domain, so I learned I should be looking at short, recent periods, to emulate basic function of TEP summary graphs.

In this particular case, nothing was dramatically wrong. No one noticed any change in the performance of this system, but it became obvious that a little CPU “virus” crept onto this system that weekend. This virus isn’t big. We saw how it can easily hide in the shadows, behind other valid jobs. It would be a great victory if we had uncovered some actual malicious code that was new to this system, lurking in the background. One of my sayings is: the more you look, the more you know. In my hacking on visualizations, I learned more about ALNWEBS, as well as other things that come and go on this system, and that a stranger could be lurking in the crowd without anyone noticing, if folks were only tracing some high-level system SLA type metrics… :-o

I think I’m also learning that if we want to have the richness of TEP in Kibana dashboards, we’d want to port more of the TEP default tables, histograms, time graphs, and meters to Kibana dashboards. In this case, I think a breakdown of CPU into TCB, SRB, etc. would be very telling!

I admit, I was a little dubious of a marriage of the monitoring and analytics worlds, adding forms of monitoring to the ODP stack (ODP streaming into an ELK warehouse), but now I’m much more bullish on this strategy! In prior blogs, I’ve hinted that all kinds of messages, metrics, expert advice articles, trouble tickets, PFA results, and CICS PA data are already streamable into ELK, or could be, and then broader correlations could be done, to fully get to Detect, Decide, and Act AIOps!!

In short, Let’s create a combination monitoring and analytics solution, using the ODP stack!! This would be a great first step toward creating systems that can fix themselves!!

Resources

=> Subscribe to my YouTube channel, and other cool AIOps YouTube channels!! https://www.youtube.com/user/drdavew00/featured

=> Want to compare notes or hints? Hit me up on social media: drdavew00!

=> Comment below, let's start a good discussion on this topic!

=> Find more blogs and other great resources relating to the OMEGAMON Data Provider here.

AIOps on IBM Z - Group home
