
Be prepared with time series trending ML of OMEGAMON metrics

Sys admins preparing for Cat 5 storm

Disaster Preparedness: the value of forecasting

I was in Boy Scouts growing up, and their motto was “Be Prepared”. So too, when I present AIOps machine learning and AI to mainframe customers, the words “predict” and “forecast” soon pop up. That makes sense: a good forecast is important in so many industries, whether it’s forecasting weather for agriculture, or predicting customers’ holiday buying so that retailers can place orders with their wholesalers months in advance. Car manufacturers need forecasts to decide how much capacity to tool up in their factories. Forecasting is critical for all of us. I was really excited when IBM acquired The Weather Company, as forecasting, along with other environmental data, is so important to consumers and businesses alike. The Weather Company not only makes millions of local weather predictions hourly, but also predicts pollens, molds, ozone action days, virus levels, pests, and much more. With today’s application of probability, statistics, ML, and DL, prediction and forecasting have become a critical discipline, a giant leap from the days of crystal balls and Tarot cards!!




IBM zSystems Outage Avoidance

IBM zSystems administrators and SMEs tell me they want forecasts, at least for resource exhaustion. I call this use case “outage avoidance”. I came to IBM zSystems six years ago, and I’ve personally experienced outages of CICS region storage, CICS Max Task, Db2 MAX DBATs, max data set VTOCs, zFS 100%, JES2 JQEs, JOEs, SPOOL at 100%, and max TCP sockets. The list goes on and on. I’ve even seen some of these in customer data. What’s your list of IBM Z outages that you wish you could be more prepared for, so you’re ready to weather the storm? Nobody wants to be the poor soul walking to town to buy more fuel for their vehicle.



IBM Z Outage Avoidance techniques

ML is all about generating “rules” from the data, rather than interviewing SMEs and writing classical software.


For the outage avoidance use case, we want the artificial brain to learn three things from the data:

  • How to spot upward or downward inflections. I call an upward inflection a hockey stick.
  • What an “outage” looks like.
  • How to extrapolate an on-going ramp to the outage condition.


Below is an example of how an artificial brain can learn what MAXDBAT is set to in Db2 from the data, rather than probing Db2 installation parameters. The yellow data points below are Db2 ACTIVE_DBATs, and the MAXDBAT ceiling is obvious: the maximum of this time series, 150, keeps being hit and is never exceeded. There are definite, obvious flat spots in the yellow ACTIVE_DBATs dataset, simple for the machine to train on!!
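As a rough illustration of learning a ceiling from the data alone, here is a minimal Python sketch. It is not the OMEGAMON implementation; the function name `infer_ceiling`, the `min_hits` parameter, and the sample values are all made up for the example. The idea is simply that a maximum which is hit again and again, and never exceeded, is probably a configured limit.

```python
def infer_ceiling(samples, min_hits=3):
    """Guess a configured ceiling from a metric's history.

    Returns the series maximum if it was reached at least `min_hits`
    times (the "flat spots"), otherwise None. Both the function and
    the `min_hits` default are illustrative assumptions.
    """
    if not samples:
        return None
    peak = max(samples)
    hits = sum(1 for value in samples if value == peak)
    return peak if hits >= min_hits else None


# Made-up ACTIVE_DBATs samples that repeatedly flatten out at 150.
active_dbats = [90, 120, 150, 150, 150, 140, 150, 150, 110, 150]
learned_ceiling = infer_ceiling(active_dbats)
print("Learned ceiling:", learned_ceiling)

# A series that only touches its maximum once gives no confident ceiling.
no_ceiling = infer_ceiling([10, 20, 30, 40])
print("Learned ceiling:", no_ceiling)
```

A single spike to a new maximum would not count as a ceiling here, which is why the sketch asks for several hits before it trusts the value.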

Copyright IBM 2022

Here’s a diagram, and actual IBM Z data, showing an upward inflection and on-going ramp



One technique is to apply a linear regression “fit” to the on-going ramp and extrapolate it, to determine the future time when the outage condition will be reached.
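The linear regression technique can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not my production code or anything shipped in OMEGAMON: the names `fit_line` and `forecast_landfall`, the `min_r2` quality cutoff of 0.9, and the sample data are hypothetical. It fits a least-squares line to the ramp, checks how good the fit is (R squared), and solves for the time at which the line reaches the ceiling.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    ss_tot = sum((v - mean_v) ** 2 for v in values)
    ss_res = sum((v - (slope * t + intercept)) ** 2
                 for t, v in zip(times, values))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r_squared


def forecast_landfall(times, values, ceiling, min_r2=0.9):
    """Time at which the fitted ramp reaches `ceiling`, or None.

    Returns None when the trend is flat or falling, or when the fit is
    too poor to trust (`min_r2` is an assumed cutoff, tune to taste).
    """
    slope, intercept, r_squared = fit_line(times, values)
    if slope <= 0 or r_squared < min_r2:
        return None
    return (ceiling - intercept) / slope


# Made-up ramp: ACTIVE_DBATs climbing 10 per interval toward a ceiling of 150.
minutes = [0, 1, 2, 3, 4]
active_dbats = [100, 110, 120, 130, 140]
landfall = forecast_landfall(minutes, active_dbats, ceiling=150)
print("Forecast landfall at minute:", landfall)
```

Checking the quality of the fit matters: a noisy metric with no real trend can still produce a positive slope, and we don't want to forecast landfall from that.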



Statistical guardbanding is too slow

Most current AIOps efforts establish a baseline of “normal”, and then alert when metrics are statistically far from the mean, maybe 5 sigma (beyond the 99.9999th percentile of a normal distribution). In my studies of mainframe data, this approach is too late: there’s little notice between 5 sigma and a resource outage. Alerting when the “hockey stick” inflection first starts is a much better approach and gives more advance notice, and if the up trend has a good linear regression “fit”, we can also forecast “landfall”. I’ve seen my linear regression fit code forecast the future time when Db2 MAX DBAT will occur!! Isn’t that what we really want from AIOps: forecast landfall well ahead, so the mainframe storm shutters can be lowered well in advance!!
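To make the comparison concrete, here is a small Python sketch on synthetic data. Everything in it is an assumption for illustration: the series is made up (a flat, slightly noisy baseline followed by a steady ramp), and the window size of 10 samples and the slope threshold of 0.3 are arbitrary picks, not values from OMEGAMON. It reports the first sample where a 5 sigma guardband fires and the first sample where the slope of a trailing window says a hockey stick has started.

```python
import statistics


def window_slope(values):
    """Least-squares slope of `values` against sample index 0..n-1."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    return sxy / sxx


def first_sigma_alert(series, baseline_len, n_sigma=5.0):
    """First index after the baseline that exceeds mean + n_sigma * stdev."""
    base = series[:baseline_len]
    limit = statistics.mean(base) + n_sigma * statistics.pstdev(base)
    for i in range(baseline_len, len(series)):
        if series[i] > limit:
            return i
    return None


def first_inflection_alert(series, baseline_len, window=10, min_slope=0.3):
    """First index where the trailing-window slope exceeds `min_slope`.

    Assumes baseline_len >= window so every window is full.
    """
    for i in range(baseline_len, len(series)):
        if window_slope(series[i - window + 1:i + 1]) > min_slope:
            return i
    return None


# Synthetic metric: 60 samples of noisy baseline around 50, then a ramp
# of 0.5 per sample. The noise pattern is deterministic on purpose.
NOISE = [0, 2, -2, 1, -1]
series = [50 + NOISE[i % 5] + 0.5 * max(0, i - 59) for i in range(160)]

sigma_idx = first_sigma_alert(series, baseline_len=60)
inflection_idx = first_inflection_alert(series, baseline_len=60)
print("5 sigma alert at sample:", sigma_idx)
print("Inflection alert at sample:", inflection_idx)
```

On this toy series the inflection alert fires several samples before the 5 sigma alert. How much earlier it fires on real data depends on the noise, the window, and the threshold, so treat the numbers here as a demonstration of the idea and nothing more.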


IBM Z mainframes regularly weather all kinds of storms

Often for hurricanes, residents need 24-48 hours of landfall notice to be able to evacuate. Mainframes can’t evacuate; they need to roll down the storm shutters and ride out the Black Friday Cat 5 storm!! Maybe they turn on some more processors temporarily, using the processor on demand feature. ;-)

ML-Enhanced OMEGAMON Situation update

I’ve been working on this “outage avoidance” use case for several years, and IBM is now working to add time trending to OMEGAMON Situations. ML-enhanced situations were released for OMEGAMON for Db2 PE (OMPE) in fall 2021, where the machine learns the threshold, which I call self-calibration or Adaptive Thresholds. Isn’t this the type of thing that ML should be used for? We can train artificial brains to watch for inflections, upward or downward, in OMEGAMON metrics. Artificial brains can learn important trends themselves, or they can pick up the right thresholds from z/OS and IBM Z middleware parameter datasets and configuration files.

The best part:  AI means taking action

All this ML is cool, but it’s really about reacting, which is what AI is all about: taking action, like self-driving cars, which evaluate the situation many times per second and create driving instructions for the car every few milliseconds!! Many customers tell me they have playbooks or automation to act when OMEGAMON situations trigger. This is the best part. Over time, I'd like to see IBM develop a recommendation engine based on many, many factors, like messages, metrics, traces, prior tickets, documentation, and more!!


drdavew00 YouTube channel

=>  A video on my YouTube channel, showing more about the concepts of ML-Enhanced OMEGAMON situations:

=> Subscribe to my YouTube channel, and other cool YouTube channels!! 
=> Want to compare notes or hints? Hit me up on social media: drdavew00!
=> Comment below, let's start a good discussion on this topic!
=> You can find more blogs and other great resources relating to the OMEGAMON Data Provider here.


Thu July 28, 2022 01:43 PM

Hi @Dave Willoughby

Fantastic blog + I like the idea of a recommendation engine !