
UPDATE! A recent storm taught us we need to be more prepared!

  
Mainframes holding back a door with a twister storming outside

Copyright 2023 IBM Corporation
 

Introduction

Last July, I blogged about being prepared for mainframe “storms”, using time series forecasting.  We were lucky, or unlucky, depending on who you talk to, to have a real storm over the New Year’s holiday, and it taught me a bit more about being prepared.  Even if we had great time series forecasting software available, I learned there are many other aspects to being prepared!   If you haven’t read my prior forecasting blog, you can catch up here:  https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/dave-willoughby/2022/07/27/be-prepared-with-time-series-trending-ml-of-omegam

 

Tried by Dave

Cartoon Dave in lab coat goggles in front of nerdy chalk board


As we are reusing the prior forecasting blog’s concepts, we’ll dive right into what I tried in the lab!   #triedbydave


Surprise storm
 

I had a pleasant surprise recently.  I started looking at SYSLOG, as streamed into ELK via the Common Data Provider (CDP).  I immediately saw an interesting message “storm” during the New Year’s holiday.  This was on an internal system that has a simulated, real-life workload, where, for example, one of the main CICSplexes has a million transactions every five minutes.

ELK Discover showing a histogram of normal message volume, followed by a surge of messages at year end


I thought for sure I could use this real-world example as a proof point for my time series forecasting concept, where the intent is to monitor OMEGAMON Data Provider (ODP) metrics to give warnings before the storm hits (again, see my blog above for the concept).

It turns out I couldn’t find any ODP metric or CDP message progression that would have foretold this storm.  So this blog is about what I learned along the way, and the additional preparations we might need to predict this kind of storm.

Three strikes

It turns out that I didn’t have my CDP running when the message storm started, so I wasn’t able to see the initial set of messages that might have shown a trend.  Strike 1 on having forewarning.

I cheated and talked to the system owner, and found out that a tape library had failed, causing Db2 and IMS log archiving to start failing.  I don’t know IMS well, so I tried very hard to find forewarning in the Db2 Archive Log: an upward trend toward 100% full.  I learned that OMEGAMON for Db2 (OMPE) doesn’t trace Db2 Archive Log usage (percent full).  I did find it’s easy to track the underlying archive storage pool or dataset to exhaustion, but we haven’t set up OMEGAMON for Storage yet, sigh.  I think this was my best chance, but Strike 2 on having forewarning.

I wondered if the Db2 Active Log might get stuck at 100% if Db2 wasn’t able to offload Active Logs to the Archive Log, but we were running the ODP Db2 Log Details attribute group, which shows Active Log usage, and nothing there foretold the storm either.   Strike 3 on having forewarning, I’m out!!

 

Balance
 

I don’t know if there would have been a clear forecast signal if we had coverage in the areas above, but my hunch is there would have been some forewarning in SYSLOG and in the Db2 Archive Log storage pool.  From a study I saw years back, I remember a conclusion that running out of permanent storage (DASD, HDD) is the leading form of resource outage.  I’m thinking we are overdue for monitoring permanent storage.  Part of our journey ahead will be about balance:  it’s quite impractical to track the thousands of metrics in ODP (I’ve heard 18,000 metrics are available), so we’ll need to balance coverage with processing and storage costs, but I think that’s best left for another time.


Storm zone

Db2 Distributed Data dashboard showing surge in Db2 DBATs
Db2 Buffer Pool dashboard showing surge in BP0, BP32K
IMS Lock Waiters surge
z/OS zIIP usage goes to zero




I then turned my attention to the current ODP metric starter dashboards, to see how this issue showed up in them.  I found “storm damage” in several dashboards: Db2 Distributed Data, Db2 Buffer Pools, IMS Home, and LPAR Clusters (above).  Doing this manual scanning of dashboards made me think of the Workload Interaction Correlator (zWIC) capability in our zAIOps portfolio, which is great at time-domain metric correlation, finding metrics that change synchronously (I sketch the basic idea in code a bit further below).  Also, the storms I show in these metric graphs would have easily been flagged by our zAIOps Anomaly Analytics (zAA) product, as would the SYSLOG storm shown far above.  In many cases, the storm zone shown above has two surges in metrics.  The owner of this system did several interventions during the overall storm, which I think explains the separate surges; or the last surge might be some catch-up after the tape library came back online.  I also see MQ-related VTOC-full messages in SYSLOG, but only late in the storm period (below).
 

histogram of surge in VTOC SYSLOG messages
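
Since I mentioned time-domain metric correlation, here’s a tiny Python sketch of the general idea: flag pairs of metric series whose per-interval changes move together.  To be clear, this is only an illustration of the concept, not how zWIC is actually implemented, and the correlation threshold is an arbitrary choice.

```python
# A minimal sketch of time-domain metric correlation (not zWIC's actual
# algorithm): find pairs of metric series whose changes are synchronized.
import numpy as np

def synchronous_pairs(series: dict, threshold: float = 0.9):
    """series: metric name -> equal-length numpy array of samples, in time order.
    Returns (metric_a, metric_b, correlation) for strongly correlated pairs."""
    names = list(series)
    diffs = {n: np.diff(series[n]) for n in names}  # correlate changes, not levels
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if np.std(diffs[a]) == 0 or np.std(diffs[b]) == 0:
                continue  # a flat series carries no correlation signal
            r = np.corrcoef(diffs[a], diffs[b])[0, 1]
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs
```

Correlating the per-interval changes, rather than the raw levels, keeps cumulative counters and metrics with very different baselines from dominating the result.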



Metric tricks


Before I reached the conclusions above, that these Db2 attributes/metrics wouldn’t show me forecast potential, I ventured into creating my own ELK Discover visualizations for Db2 log statistics (kd5-statlog), still hoping to find a forecast signal.  Here are some lessons learned from my meandering.  We might be getting extra nerdy now! 

 

Delta metrics to the rescue

I had looked at the value of Write_NOWAIT_Requests in the Db2 log statistics and thought I had found a cool surge coincident with the message storm, but after a while I started to remember that Db2 metrics like these are raw counters that count events cumulatively, only resetting at a Db2 restart or LPAR re-IPL.  The tell-tale right triangles, plus counts reaching into the billions and then dropping to zero at re-IPL, are the giveaways (below left).  At first, I used the difference setting in the ELK visualization tool, but it’s not smart enough to handle the drop to zero cleanly, causing huge negative spikes at the Db2 restart transition (below center).  Then I remembered I’ve seen “Delta” and “Rate” columns in the equivalent TEP tables and when using the TEPS REST API, and sure enough, the ODP data, which matches exactly, has these Delta and Rate “columns” as well.  When I switched to Delta, I got a valid view, phew!  It’s great that OMEGAMON provides this amount of data curation/reduction out of the box!  (below right)

Line graph showing false surge in Db2 metrics
Line graph showing false surge in Db2 metrics, helped some with ELK differencing
Line graph showing false surge in Db2 metrics, DELTA version of metrics gives a valid view
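
If you don’t have the curated Delta/Rate columns handy, here’s a rough Python sketch of the same idea: turning a cumulative counter into per-interval deltas while treating the drop to zero at a restart or re-IPL as a reset rather than a huge negative spike.  This is purely illustrative; as noted above, ODP already ships the DELTA and RATE values out of the box.

```python
# A minimal sketch (assuming a plain list of (timestamp, cumulative_count)
# samples in time order) of deriving deltas from a raw Db2 event counter.
def counter_to_deltas(samples):
    deltas = []
    for (t_prev, v_prev), (t, v) in zip(samples, samples[1:]):
        if v < v_prev:
            # Counter reset (Db2 restart / LPAR re-IPL): a naive difference
            # would produce a big negative spike, so count from zero instead.
            deltas.append((t, v))
        else:
            deltas.append((t, v - v_prev))
    return deltas
```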

 

Table styles

To go even nerdier, I saw for the first time that there are several styles of ODP attribute labeling and grouping when it comes to providing metrics to be displayed in tables in various UIs, like the TEP.  Some attribute groups have generic DELTA, RATE, VALUE, and SEQUENCE attributes, where the actual attribute/metric names are held in the generic SEQUENCE attribute.  So in the ELK visualization, I pulled in the top SEQUENCE enumerations and kept increasing the number of top enumerations until the list stopped growing, so I knew I had all of them in the visualization (a small query sketch of an alternative follows below).  I show a capture of the STATLOG group, where you can see that GSTQXST, STATLOKC, and STATQXST use these generic labels, like SEQUENCE.  (below, left)
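
By the way, instead of growing the top-N by hand, you could ask Elasticsearch directly how many distinct SEQUENCE values exist and then pull them all in one terms aggregation.  The sketch below assumes the 8.x Python client, and the index pattern and field name are my own placeholders, so adjust them to your CDP/ODP mapping.

```python
# Hypothetical index pattern and field name - check your own ODP mapping
# (the metric name may be indexed as SEQUENCE or SEQUENCE.keyword).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "odp-kd5-statlog-*"
FIELD = "SEQUENCE.keyword"

# How many distinct metric names are there? (cardinality is approximate,
# which is fine at this scale)
count = es.search(
    index=INDEX, size=0,
    aggs={"distinct": {"cardinality": {"field": FIELD}}},
)["aggregations"]["distinct"]["value"]

# Fetch all of them at once instead of increasing top-N until it stops growing.
buckets = es.search(
    index=INDEX, size=0,
    aggs={"all_seqs": {"terms": {"field": FIELD, "size": max(count, 1)}}},
)["aggregations"]["all_seqs"]["buckets"]
print([b["key"] for b in buckets])
```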

I then found DP_SRM_EDX and DP_SRM_LOX to be similar, but I’ll call them semi-generic.  The attribute/metric name enumeration is in a semi-generic _IDENT tag, like EDM_IDENT or LOG_IDENT, and then there are semi-generic _TOTAL, _DELTA, and _RATE attributes, like EDM_TOTAL, EDM_DELTA, and EDM_RATE.  The X in these attribute/metric group names seems to indicate they carry some extended-precision values.  Also, I looked at these via the TEPS REST API, as I didn’t have them in my ODP streaming.  (not shown)

As a contrast, I saw DP_SRM_EDM, DP_SRM_LOG, and DP_SRM_SUB, which do the enumeration in the labels themselves, with no generic labeling: for example READ_ACTLOG_RATE, TOTAL_READS_ACTLOG, and DELTA_READS_ACTLOG, or READ_OUTBUF_RATE, TOTAL_READS_OUTBUF, and DELTA_READS_OUTBUF.  I show DELTA_READS_ACTLOG, and some other DELTA_ attributes/metrics, below.  (below right)

Line graph of generic table style metrics
Line graph of non-generic table style metrics

To look at things another way, ODP and the TEPS REST API emit a separate JSON document for each row of these “tables” for the first two styles above, where each row includes value, delta, and rate “columns”, and all the rows share the same timestamp, origin_node, resource name, etc.  The third style emits only one flat JSON document, with all the table rows and columns completely flattened into a single list.
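
To make the three styles a bit more concrete, here’s how I think about normalizing them to plain (metric name, delta) pairs in Python.  The document layouts and field names here are simplified from the examples above, so treat this as a sketch and check your own ODP or TEPS REST API output before relying on it.

```python
# A simplified sketch of normalizing the three ODP table styles to
# (metric_name, delta) pairs - field names are illustrative, not a schema.
def normalize(doc: dict, group: str):
    if group in ("GSTQXST", "STATLOKC", "STATQXST"):
        # Style 1: fully generic - one document per row; the metric name is in
        # the SEQUENCE attribute, the values in VALUE / DELTA / RATE.
        yield doc["SEQUENCE"], doc["DELTA"]
    elif group in ("DP_SRM_EDX", "DP_SRM_LOX"):
        # Style 2: semi-generic - the metric name is in an *_IDENT attribute,
        # the values in the matching *_TOTAL / *_DELTA / *_RATE attributes.
        prefix = "EDM" if group == "DP_SRM_EDX" else "LOG"
        yield doc[prefix + "_IDENT"], doc[prefix + "_DELTA"]
    elif group in ("DP_SRM_EDM", "DP_SRM_LOG", "DP_SRM_SUB"):
        # Style 3: flat - one document holds the whole table, and the metric
        # name is baked into each attribute name (e.g. DELTA_READS_ACTLOG).
        for field, value in doc.items():
            if field.startswith("DELTA_"):
                yield field, value
```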

In summary, for these Db2 event counters, it’s best to use the “delta” or “rate” version for analytics: either the DELTA field for each SEQUENCE enumeration, the _DELTA field for each _IDENT, or all the DELTA_ attributes.  I know it’s hard to explain, but I wanted to help others that might stray into this area.  Seems I need a table to explain the table styles!!

 

Storage/memory metrics

While we are here, we might talk about the storage/memory metrics.  Memory often seems to be ever increasing, like the counters above, but on a moment’s thought, basic memory metrics are a case where we wouldn’t consider deltas or rates, as a subsystem usually ramps up to its given total memory footprint and is very stable after that.  One aspect of Db2 memory I’ve seen in the past is that there’s a significant “warm up” period.  I’ve seen Db2 memory take several hours to ramp up to a fairly stable size, and I’ve even seen Db2 memory grow a bit more over the next week or so before becoming super stable.

 

Mental model for metrics

Prometheus definitions might help in thinking about the various categories of metrics discussed above: https://prometheus.io/docs/concepts/metric_types/ (though their naming seems to relate to the most obvious visualization for a given type of metric).  From the table style discussion above, Prometheus would say that VALUE or TOTAL metrics are COUNTERS, and the RATE and DELTA metrics are GAUGES.  But to me, raw cumulative counters feel unnormalized, with limited analytics value.  Also, within gauges, I feel there are subtypes.  When we focused on SYSLOG messages and logs, I viewed these as events, where totals and deltas are whole numbers/integers, and for exception-type events there can be long droughts with no events.  These event gauges may not have obvious upper bounds, while gauges like CPU, memory, storage, and queue percent have obvious 100% upper bounds, and these kinds of resource-usage gauges are rarely exact whole numbers.  So, in short, maybe there are exception event gauges and relatively continuous event gauges, both without obvious upper bounds, as well as relatively continuous resource-usage gauges, either with or without obvious upper bounds.
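
To make that mental model a little more tangible, here’s a crude Python heuristic along those lines: guess whether a series behaves like a cumulative counter, a bounded resource-usage gauge, or an event-style gauge.  It’s just a sketch of the categories above, the thresholds are arbitrary, and it isn’t anything our products actually do.

```python
# A rough classifier for the metric categories discussed above (illustrative only).
def classify(values):
    """values: a metric series in time order (list of numbers)."""
    non_decreasing = all(cur >= prev for prev, cur in zip(values, values[1:]))
    if non_decreasing and values[-1] > values[0]:
        return "cumulative counter: use the DELTA/RATE form for analytics"
    if min(values) >= 0 and max(values) <= 100:
        return "bounded gauge: e.g. percent of CPU, memory, or log space used"
    if all(float(v).is_integer() for v in values):
        return "event gauge: whole-number deltas, possibly long droughts at zero"
    return "continuous gauge without an obvious upper bound"
```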


Conclusions
 

While my hope for a forecasting proof point was dashed, I learned a lot about our current coverage and about how some attribute/metric groups can be tricky.  ODP has opened a lot of opportunities for many users.  I’ve been blogging about ODP for a while, and hopefully this blog and the prior ones can be helpful to others.  Just today I talked to someone using ODP who is getting into various details, like I did in this blog.  After reading one of my blogs, he said he was going to read all my blogs, since he loved the first one he read. 

One final thought:  Having mainframe metrics, messages, and other kinds of mainframe data in a modern, powerful analytics stack like the ELK stack I use is extremely powerful for many, many use cases.  I have to say that when I came to the mainframe world, I was going bonkers trying to find things in SYSLOG via 3270 ISPF SDSF, trying to use find, sort, etc.  As a long-time UNIX guy, I just wanted to be able to grep through SYSLOG and pipe the results to other UNIX tools.  I did write a REXX program to pull SYSLOG into USS zFS so I could grep it, but that was too cumbersome.  I have to say, with all the query and visualization capabilities in ELK, I now have tools much better than those UNIX tools!  
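
For fellow UNIX refugees, here’s roughly what that grep looks like against ELK from Python.  I’m using the Elasticsearch 8.x client here, and the index pattern and field names are assumptions for illustration, so match them to whatever your CDP SYSLOG stream actually writes.

```python
# A grep-like SYSLOG search sketch - index pattern and field names are
# placeholders, not the actual CDP schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="cdp-syslog-*",                         # hypothetical index pattern
    query={"match_phrase": {"message": "VTOC"}},  # "grep VTOC"
    sort=[{"@timestamp": "desc"}],
    size=50,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])
```

And unlike my old REXX-to-zFS workaround, the results can be piped straight into the rest of the Python ecosystem.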


Resources

=> Subscribe to my YouTube channel, and other cool AIOps YouTube channels!!  https://www.youtube.com/user/drdavew00/featured
 
=> Want to compare notes or share hints?  Hit me up on social media (drdavew00) or email: dwilloug@us.ibm.com!
 
=> Comment below, let's start a good discussion on this topic!
 
=> For more blogs and other great resources relating to the OMEGAMON Data Provider, you can find them here.