AIOps on IBM Z

OMEGAMON + SA = Self Driving Mainframes

[Image: social tile showing a driver with his feet up while the car/mainframe drives itself. © IBM 2022]

Self-Driving Cars
[Image: enhanced view of a driver in a Tesla, hands off the wheel, while the computer performs object detection]

 

I’m about to put enough solar on my house in Austin, TX, to power the house and an electric car: zero footprint and no worry about gas prices, here we come!  A big fat tax credit ain’t bad either!!  I’m struggling with the electric car choice.  To test a Tesla, I rented one in Texas and drove it to New York and back.  I loved all its advanced features: following-distance-based cruise control driven by computer vision from the eight cameras, object avoidance, and auto-steering, with many of these features implemented via deep-learning AI.  Tesla’s self-steering works extremely well on clearly striped interstates in the daytime, but throw in some rain or glare from a setting sun straight ahead, and things get interesting.  I’ve experienced phantom braking, phantom wiper wipes, phantom traffic-signal recognition while parked facing a convenience store, and so on.  Even if self-driving cars are still a bit elusive (they need to work in every possible condition), I’ve established my mantra: Self-Driving Mainframes!!  I think this vision is at hand, and I’m on a tear to get it done!!

 

Best kept secret

[Image: woman shushing, finger to lips. Source: www.pexels.com/search/shh/]

A few months ago, I moved from an analytics group to the ITSM Monitoring and System Automation (SA) team.  I’ve been developing use cases and integrations for the OMEGAMON Data Provider (ODP).  You can see my results so far in my previous blogs.  Go to the Resources links below to see what I’ve been up to, and how great ODP data is to work with: self-describing, backed by years of curation, and ready for ML and AI.  I’ve long felt that the IBM ITSM suites have all the building blocks for AI, and the closer I look, the stronger this belief gets!!!  I recently expanded my scope beyond ODP and teamed up with my colleague Jürgen Holtz to develop OMEGAMON+SA use cases.  As Jürgen helps me learn SA, I’m totally amazed!  The more I look, the more amazing features I find in these two products, features most folks don’t seem to know about or remember.  It seems these products are some of IBM’s best kept secrets!!

 

AI for Novices

[Image: chart showing how ML, DL, and AI relate to each other]

 

My team has been using the Detect, Decide, Act metaphor to structure the AI journey and conversation.  Here are a couple of my definitions.  ML is “learning”, but more like extracting the “rules” from the data (numbers, words, pictures, sounds, etc.).  This might mimic human rote learning.  As kids, we learn our first language not from a grammar book; kids still learn the rules of grammar by constantly being fed data, well-formed sentences, over and over again, from our parents.  Machines and people learn the same way!!  For both humans and machines, it’s an optimization problem.  After hearing enough sentences, both start to correlate, probabilistically, what past tense looks like, for example.  Even if some of the sentences are broken and don’t match the rule, we still know what past tense looks like, since most of the data matches the rule we extracted!!  Once you let go of deterministic, human-created rules and think in terms of probabilities, you get how machines and humans “learn”!!  Pretty simple actually.  Sorry if I ruin the mystique!  The hard part is that the world is a huge composite of many, many rules.  Distinguishing a zebra from a horse, from an ambulance, from a delivery truck involves a complex set of shapes, colors, lighting angles, backgrounds, motions, other objects moving in front or behind: a virtually unlimited number of what practitioners call features, which can be thought of as dimensions.  Now it doesn’t sound so simple, does it?  Again, machine learning is about optimizing across likely hundreds of features -- mind blowing!!
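To make the “extracting rules from noisy data” idea concrete, here’s a tiny, hypothetical sketch in Python.  The verb list is invented toy data (not a real corpus); it just shows how a probabilistic rule like “past tense ends in -ed” falls out of counting examples, even when some examples (like “ran”) break the rule.

```python
from collections import Counter

# Toy "rote learning": estimate the rule "past tense ends in -ed"
# purely from example data, the way kids (and machines) extract
# grammar from repeated, imperfect input.
examples = [
    ("walked", True), ("jumped", True), ("played", True),
    ("cooked", True), ("ran", True),   # "ran" breaks the rule (noise)
    ("walk", False), ("jump", False), ("play", False), ("run", False),
]

counts = Counter()
for word, is_past in examples:
    if is_past:
        counts["past"] += 1
        if word.endswith("ed"):
            counts["past_and_ed"] += 1

# P(ends in -ed | past tense): high, but not 1.0, because the data is noisy.
p_rule = counts["past_and_ed"] / counts["past"]
print(p_rule)  # 4 of 5 past-tense examples match the rule -> 0.8
```

The learned “rule” is just a probability, and the broken example lowers it without destroying it, which is exactly the point: most of the data matches, so the rule survives.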

 

That’s ML.  Now AI merely means taking action, given a set of circumstances and the rules that were learned during training.  The first language-driven action we likely learn is to stop what we are doing when our parents shout “STOP!!!”.  That same sentence works whether we are throwing food or hitting our baby brother.  AI is still a probabilistic optimization problem, and we learn how to act based on probabilities, too.  If a kid is hitting her brother and throwing her brother’s food, she probably gets that she needs to stop both, even if mom wasn’t specific in her input data/wording (stopping both behaviors is the high-probability action to take!  And the “reward” increases the “weight” of that action being appropriate, as mom stops shouting ;-)  )  Optimizing rewards is a branch of ML and AI called Reinforcement Learning (RL).
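The “reward increases the weight of an action” idea can be sketched in a few lines of Python.  The actions and the reward function below are invented for illustration; exploration is driven by optimistic initialization (start every action’s value high so each gets tried at least once), a standard bandit-style RL trick.

```python
# Minimal reinforcement-learning sketch: the agent learns which action
# makes the reward appear (mom stops shouting) by nudging a learned
# "weight" toward each observed reward.
actions = ["stop_hitting", "stop_throwing", "stop_both"]
value = {a: 1.0 for a in actions}   # optimistic initial "weights"
counts = {a: 0 for a in actions}

def reward(action):
    # Hypothetical environment: only stopping both behaviors works.
    return 1.0 if action == "stop_both" else 0.0

for step in range(20):
    # Greedy choice: pick the action with the highest learned value.
    a = max(actions, key=lambda x: value[x])
    counts[a] += 1
    # Incremental average: nudge the value toward the observed reward.
    value[a] += (reward(a) - value[a]) / counts[a]

print(max(actions, key=lambda x: value[x]))  # -> stop_both
```

After a few tries, the unrewarded actions sink to zero and “stop both” keeps its weight, so the agent settles on the high-probability action, the same conclusion the kid reaches.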

 

Make it Real:  Tried by Dave!!  

[Image: © IBM 2022]

I’m not a fan of just waving my hands and describing good intentions.  I want to go into the lab and make sure it’s real.  Jürgen Holtz and I did just that.  We used a semi-realistic workload that cycles up and down, as I’ve seen many customers’ workloads do across their business day when they share their data with me.  I drive this workload to CICS Short-on-Storage (SOS) at times.  I know customers struggle with this; I once found the one SOS event after a customer sent me data for their hundreds of CICS regions.  Off topic, but IBM Z Log and Data Analytics (LDA) instantly found the SOS once the right query was formed, very cool!!  Maybe I should blog about LDA sometime, as I’m a huge fan!

Jürgen and I decided to have an OMEGAMON for CICS XE Situation issue a WTO to SYSLOG, and then have a System Automation (SA) policy react to the WTO, automating the action an operator would likely take for our CICS scenario.

 

My scenario is this: a customer applied an update to an application over the weekend and introduced a slow, steady memory leak that escaped testing, since the leak is so slow.  Once development understood their test escape, they definitely improved their test process, but they also gave operations a workaround to use while they worked on the fix.  Since the CICS regions are part of a CICSplex, it’s fine to recycle a region when it reaches 95% CICS EDSA storage.  Operations set up the Situation and the SA policy, and like magic, the mainframe is driving itself, totally hands off, even in the midst of a pernicious memory leak!  If we bake things like this into the product, before we know it, we won’t need operators at all!  Self-driving mainframes are close at hand, using all the building blocks available in our products!!

  

[Image: artificial brain, circuits superimposed on a human head outline]

  

But for this vision to be fully realized, I’ll need to develop some artificial brains that capture all our operator habits.  I’m planning to get a bunch of metric, trace, and log data from mainframes, as well as all the operator actions.  Based on that data, I can start to build an artificial brain that picks the actions an operator would likely take, automatically!!  How cool is that!!  Please let me know when you are ready to share the data I crave.  Before I do that, I plan to work through OMEGAMON+SA scenarios that make sense.  For now, the artificial brains are SA-created NetView Automation Tables (ATs), which capture a lot of mainframe wisdom: a modest artificial brain.

 

From the Lab:

 

Besides a controllable, customer-realistic workload in the lab, which I hinted at above, the main things to demonstrate in the lab are as follows:

  • Setting up one or more OMEGAMON for CICS XE Situations that use ZOSWTO, to WTO to SYSLOG
  • Setting up a System Automation policy to react to the WTO in SYSLOG

I created a Warning/Yellow Situation for 90% EDSA, and a “Decide Now” Red Situation for 95% EDSA.  Notice I've used [1,2] (routing code, descriptor code) to create the Decide Now Red Situation.
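The decide step behind these two Situations is simple threshold logic.  Here’s a minimal Python sketch of it; the function name and return labels are mine, but the 90%/95% thresholds and the “safe to recycle because the region is in a CICSplex” reasoning come straight from the scenario above.

```python
# Sketch of the "Decide" step: warn at 90% EDSA, recycle at 95%.
WARN_PCT = 90.0
RECYCLE_PCT = 95.0

def decide(edsa_used_pct):
    """Map an EDSA utilization sample to an operator-style action."""
    if edsa_used_pct >= RECYCLE_PCT:
        return "RECYCLE_REGION"  # safe, because the region is in a CICSplex
    if edsa_used_pct >= WARN_PCT:
        return "WARN"
    return "OK"

print(decide(96.2))  # -> RECYCLE_REGION
print(decide(91.0))  # -> WARN
print(decide(42.0))  # -> OK
```

In the real setup, of course, OMEGAMON does the detecting and SA does the acting; this just spells out the tiny decision in between.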

Here’s the two Situations and their matching WTOs in SYSLOG:

[Screenshot: Warning situation]

 



[Screenshot: Decide Now situation]

 
From the TEP, we see that the situations have occurred.

[Screenshot: situations fired in the TEP]

 
and the ZOSWTO messages are in SYSLOG

[Screenshot: WTO messages in SYSLOG]



I'm still working with our host admins to set up the automation.  Below is a sample row in the NetView Automation Table, where RSTCICS is an SA script:

*
IF MSGID = 'KO4104I' & TOKEN(10) > 95
   THEN EXEC(CMD('RSTCICS') ROUTE(ONE %AOFOPGSSOPER%));
*
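For readers who don’t speak Automation Table, here’s a rough Python analogue of that row: match on the message ID, compare blank-delimited token 10 against 95, and fire the restart script.  Only the KO4104I message ID and the token-10 rule come from the AT row above; the sample message text and the function itself are invented for illustration.

```python
# Hypothetical Python analogue of the NetView AT row:
#   IF MSGID = 'KO4104I' & TOKEN(10) > 95 THEN EXEC(CMD('RSTCICS') ...)
def should_restart(syslog_line, msgid="KO4104I", token_index=10, threshold=95):
    tokens = syslog_line.split()
    if not tokens or tokens[0] != msgid:
        return False          # wrong message ID: no match
    if len(tokens) < token_index:
        return False          # message too short to have that token
    try:
        # TOKEN(n) is 1-based in the AT, so index n-1 here.
        return float(tokens[token_index - 1]) > threshold
    except ValueError:
        return False          # token wasn't numeric

# Invented sample WTO: token 10 happens to be the EDSA percentage.
line = "KO4104I CICS EDSA usage high: region CICSPRD1 EDSA at 96 percent"
print(should_restart(line))  # -> True
```

When the condition matches, the AT routes the RSTCICS command to an operator task; here the True return would stand in for that “act” step.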

Conclusion

I feel that self-driving mainframes are at hand.  Products like OMEGAMON and System Automation provide the building blocks; we just need to create some artificial brains here and there.  While we talk about Detect, Decide, Act, many situations can get immediate, unattended action if we have all the smarts we need in our artificial brain.  No need for the humans to decide the next action!  :-o

Resources


=> Subscribe to my YouTube channel, and other cool AIOps YouTube channels!!  https://www.youtube.com/user/drdavew00/featured
 
=> Want to compare notes or swap hints?  Hit me up on social media:  drdavew00!
 
=> Comment below, let's start a good discussion on this topic!
 
=> More blogs and other great resources relating to the OMEGAMON Data Provider can be found here.