Introducing AI/ML for JVM Monitoring with IBM Z OMEGAMON AI for JVM, 6.1

  

Available on September 8th, 2023, IBM Z OMEGAMON AI for JVM, 6.1 delivers new machine learning (ML) capabilities in a standalone offering that includes the new OMEGAMON AI Insights 1.1. It will also be available to suite customers in IBM Z Service Management Suite 3.1 and IBM Z Monitoring Suite 2.1. The AI integration streams key performance indicators (KPIs) for Java Virtual Machines (JVMs) to the AI Insights model, which learns the time-based, seasonal variation patterns in these metrics. The model is then used to identify anomalies in behavior when KPIs fall outside the previously established normal seasonal variations.

Installation

The standalone version of OMEGAMON AI for JVM is installed with SMP/E and includes the FMIDs listed in Table 1.

FMID      Description
HKJJ610   OMEGAMON AI for JVM, 6.1 base component
HKOB750   OMNIMON Base 7.5 including enhanced 3270UI
HKDS630   IBM Tivoli Monitoring, 6.3.0
HIZD320   IBM Discovery Library Adapter 3.2
HRKD560   OMEGAMON Integration Monitor DE, 1.1
HKOA110   OMEGAMON Data Provider, 1.1

Table 1: List of FMIDs delivered with IBM Z OMEGAMON AI for JVM, 6.1

The Bill of Materials includes the usual distributed platform components for IBM Tivoli Monitoring (ITM), such as the TEP and IZSME, and also includes the OMEGAMON AI Insights component, which is installed on a Red Hat Enterprise Linux 8.6.x s390x (zLinux) system. In addition, customers must install an ELK stack (Elasticsearch, Logstash and Kibana) on the zLinux system. The ELK stack is not included in the package.

Architecture

OMEGAMON AI for JVM works in the same way as it did in V5.5: a JVM Tool Interface (JVMTI) agent and a Java agent send monitoring data via cross-memory services to the OMEGAMON for JVM Collector and on to the ITM infrastructure. OMEGAMON Data Provider (ODP) streams this data to the ELK stack, using the Persistent Data Store (PDS) V2 component of ITMz, which stores short-term history. In this release of OMEGAMON AI Insights, three ITM attribute groups are curated for monitoring JVMs: Java Heap and Garbage Collection (GCSUMM); JVM CPU (CPU); and z/OS Connect EE API Provider summary (ZCSUMM).

Configuration

To begin collecting data for ML training, you must enable history collection for the GCSUMM, CPU and ZCSUMM tables. This can be done in either the TEP or the enhanced 3270UI. In addition, ODP and a Zowe cross-memory service must be installed and configured, and the KAYOPEN member of the RKANPARU data set for the OMEGAMON AI for JVM Runtime Environment (RTE) must be configured. This member contains YAML text that controls which tables are routed to ODP:

broker:              
  name: ZWESIS_JVM   
collections:         
  - product: kjj     
    table: GCSUMM    
    interval: 1      
    destination:     
      - open         
      - pds          
  - product: kjj     
    table: ZCSUMM    
    interval: 1      
    destination:     
      - open         
      - pds          
  - product: kjj     
    table: CPU       
    interval: 1      
    destination:     
      - open         
      - pds          
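
Because the three collection stanzas differ only in the table name, the member text can be generated rather than typed by hand. The following Python is a minimal sketch under that assumption; the helper name kayopen_yaml is hypothetical and is not part of ODP or OMEGAMON. It only reproduces the keys and values shown in the member above.

```python
def kayopen_yaml(tables, broker="ZWESIS_JVM", product="kjj", interval=1,
                 destinations=("open", "pds")):
    """Return YAML text in the same shape as the KAYOPEN member shown above.

    Hypothetical helper: every key and default value is copied from the
    example member; nothing here is an ODP API.
    """
    lines = ["broker:", f"  name: {broker}", "collections:"]
    for table in tables:
        lines.append(f"  - product: {product}")
        lines.append(f"    table: {table}")
        lines.append(f"    interval: {interval}")
        lines.append("    destination:")
        lines.extend(f"      - {dest}" for dest in destinations)
    return "\n".join(lines) + "\n"


# The three attribute groups that must be routed for ML training.
text = kayopen_yaml(["GCSUMM", "ZCSUMM", "CPU"])
print(text)
```

The generated text still has to be placed in the KAYOPEN member of RKANPARU by whatever means you normally use to edit RTE members.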

 
After configuration, the OMEGAMON for JVM TEMA must be restarted.

The OMEGAMON for JVM AI model is a per-job model, unlike OMEGAMON AI for Networks and OMEGAMON AI for z/OS, which are LPAR-based. This is because every JVM is unique: there is no "one-size-fits-all" JVM configuration that is optimal. Training data should be collected for at least two weeks from nominally well-tuned JVMs. At least basic tuning should already have been applied to the JVMs selected for training, since a poorly tuned JVM will train the model to expect poor KPIs. The model learns the patterns of garbage collection, CPU utilization and z/OS Connect EE API response times as they vary throughout the days and weeks. The objective is for OMEGAMON AI Insights to spot when behavior falls outside the forecast based on the training data.
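
Since the model is trained per job, it can help to check how much history each job has before relying on it. The Python below is a minimal sketch of such a check, assuming you have exported (job name, timestamp) pairs from the collected history; the function name, the job names and the data shape are hypothetical, and this is not how AI Insights itself decides that training is complete.

```python
from datetime import datetime, timedelta

# Minimum training period recommended in the text: two weeks.
MIN_TRAINING = timedelta(days=14)


def training_ready(samples, minimum=MIN_TRAINING):
    """Map each job name to True if its samples span at least `minimum`.

    samples: iterable of (job_name, datetime) pairs (hypothetical export).
    Only the span between first and last sample is checked, not gaps.
    """
    first, last = {}, {}
    for job, ts in samples:
        if job not in first or ts < first[job]:
            first[job] = ts
        if job not in last or ts > last[job]:
            last[job] = ts
    return {job: last[job] - first[job] >= minimum for job in first}


start = datetime(2023, 9, 8)
samples = [
    ("JVMJOB1", start), ("JVMJOB1", start + timedelta(days=15)),
    ("JVMJOB2", start), ("JVMJOB2", start + timedelta(days=3)),
]
ready = training_ready(samples)
print(ready)  # {'JVMJOB1': True, 'JVMJOB2': False}
```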

Dashboards and Alerts

Unlike traditional performance analysis based on fixed thresholds, OMEGAMON AI Insights effectively varies the thresholds based on prior knowledge of the application. The dashboards provided with AI Insights show the upper and lower bounds of the forecast as a grey overlay band, with the sample data displayed as a line plot. When the JVM is behaving as expected, the sample plot falls within this band with only occasional lapses.

A misbehaving JVM will generally exceed the forecast values, as shown below:

In this example, garbage collection rates have exceeded the forecast for an extended period. Further analysis using other dashboards in OMEGAMON AI Insights, or a deep dive into the data provided by the OMEGAMON for JVM workspaces, revealed the problem. In this case, the cause was a failed back-end CICS server for a z/OS Connect EE instance.
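
The idea of a forecast with upper and lower bounds that vary by time of week, and of tolerating occasional lapses while flagging an extended excursion, can be illustrated with a deliberately simple Python sketch. This is not the AI Insights model: it assumes a plain mean plus or minus a multiple of the standard deviation per hour of the week, and the function names, the width of 3.0 and the run length of 3 are all hypothetical choices.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_bands(training, width=3.0):
    """Learn a (lower, upper) band for each hour of the week.

    training: iterable of (hour_of_week, value), hour_of_week in 0..167.
    """
    buckets = defaultdict(list)
    for hour, value in training:
        buckets[hour].append(value)
    bands = {}
    for hour, values in buckets.items():
        centre, spread = mean(values), pstdev(values)
        bands[hour] = (centre - width * spread, centre + width * spread)
    return bands


def sustained_anomaly(samples, bands, min_run=3):
    """True if `min_run` consecutive samples fall outside their band.

    A single sample outside the band resets on the next normal sample,
    so occasional lapses are tolerated.
    """
    run = 0
    for hour, value in samples:
        lower, upper = bands[hour]
        run = run + 1 if not (lower <= value <= upper) else 0
        if run >= min_run:
            return True
    return False


# Toy training data: a GC rate that rises through hours 9, 10 and 11.
training = [(9, v) for v in (10, 12, 11, 9)]
training += [(10, v) for v in (20, 22, 21, 19)]
training += [(11, v) for v in (30, 32, 31, 29)]
bands = learn_bands(training)

one_lapse = sustained_anomaly([(9, 11), (10, 30), (11, 31)], bands)
extended = sustained_anomaly([(9, 25), (10, 40), (11, 50)], bands)
print(one_lapse, extended)  # False True
```

A value of 30 would breach a fixed threshold tuned for hour 9, yet it is normal at hour 11; bucketing by hour of week is what lets the effective threshold move with the workload.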

OMEGAMON AI Insights can be configured to send alerts based on rules. See the IBM Z OMEGAMON AI Insights documentation in IBM Documentation.

Other enhancements in this release

Some features that were deprecated in OMEGAMON for JVM 5.5 are now removed from 6.1. The use of the Java Health Center as a monitoring data source for the OMEGAMON for JVM Java agent is no longer supported. The Health Center agent and monitoring API are no longer shipped with OMEGAMON for JVM, 6.1. For most JVMs, configuring a JVM for monitoring now requires only two parameters to be added to the JVM startup options:

-agentpath:/rtehome/rtename/kan/bin/IBM/libkjjagent_64.so

-javaagent:/rtehome/rtename/kan/bin/IBM/kjj.jar

One exception is that z/OS Connect EE servers require an additional option:

-Xbootclasspath/a:/rtehome/rtename/kan/bin/IBM/kjjboot.jar
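
When many JVMs need these options, the list can be assembled programmatically. The Python below is a minimal sketch; the function name is hypothetical, and the rtehome and rtename arguments stand for the placeholders in the paths above, which you must replace with your own RTE location.

```python
def monitoring_options(rte_home, rte_name, zos_connect=False):
    """Build the JVM startup options listed above for one RTE.

    Hypothetical helper: it only concatenates the paths shown in the
    text; it does not validate that the files exist.
    """
    base = f"{rte_home}/{rte_name}/kan/bin/IBM"
    options = [
        f"-agentpath:{base}/libkjjagent_64.so",
        f"-javaagent:{base}/kjj.jar",
    ]
    if zos_connect:
        # z/OS Connect EE servers need the extra boot class path entry.
        options.append(f"-Xbootclasspath/a:{base}/kjjboot.jar")
    return options


opts = monitoring_options("/rtehome", "rtename", zos_connect=True)
print("\n".join(opts))
```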

This release also adds some useful new attributes to help identify Java heap problems. The default concurrent generational garbage collection (GC) policy endeavors to limit the number of stop-the-world global GCs. Faster, cheaper nursery collections can usually reclaim enough memory to resolve an allocation failure for small, short-lived objects. When the tenure area becomes full of longer-lived and larger objects, the total amount of occupied heap after garbage collection can appear worryingly high. Indeed, an ITM situation may fire based on this metric. In reality, the next global garbage collection will likely reclaim large areas of memory from the tenure area, reducing the occupancy to much more reasonable levels.

In V6.1, a new attribute is introduced to surface the heap occupancy after global GC. A new product-provided situation that uses this attribute as a measure of heap occupancy will eliminate the occasional false-positive alert that could occur in prior releases. There is also a new attribute for the maximum heap size limit (defined by the -Xmx JVM option), plus an attribute caption change from "Heap Size" to "Committed Heap Size" to clearly identify it as the current amount of memory reserved for heap use, which is a value between the initial and maximum heap sizes specified by the -Xms and -Xmx JVM options.
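
The difference between alerting on occupancy after any GC and alerting on occupancy after the last global GC can be shown with a small Python sketch. The event format, function names, heap figures and the 90% threshold are all assumptions made for illustration; they are not the product's attribute names or the definition of the product-provided situation.

```python
def occupancy_after_last_global(gc_events):
    """Return heap used (MB) after the most recent global GC, or None.

    gc_events: ordered list of (kind, used_after_mb), where kind is
    'nursery' or 'global' (hypothetical event format).
    """
    for kind, used in reversed(gc_events):
        if kind == "global":
            return used
    return None


def occupancy_alert(gc_events, max_heap_mb, threshold=0.9):
    """Alert only on occupancy measured after a global GC."""
    used = occupancy_after_last_global(gc_events)
    return used is not None and used / max_heap_mb >= threshold


max_heap_mb = 1024  # stands in for the -Xmx limit
# The tenure area fills up between global collections.
events = [("global", 300), ("nursery", 700), ("nursery", 950)]

naive = events[-1][1] / max_heap_mb >= 0.9       # looks at any GC
improved = occupancy_alert(events, max_heap_mb)  # looks at global GC only
print(naive, improved)  # True False
```

The naive check fires on the 950 MB reading taken after a nursery collection, while the check based on the last global GC sees only 300 MB in use and stays quiet.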

For more information about this new release, see: IBM Z OMEGAMON AI for JVM
