
Introducing OMEGAMON AI Insights

  

AI is coming, AI is coming! And it has arrived within several of the IBM Z OMEGAMON agents. OMEGAMON AI Insights is a new feature that can be added to OMEGAMON agent offerings to establish performance baselines and then detect anomalies against those baselines. The feature is being delivered in stages. At this time, it has been added to IBM Z OMEGAMON AI for z/OS V6.1, IBM Z OMEGAMON AI for Networks V6.1 and IBM Z OMEGAMON AI for JVM V6.1. Each of those new versions of the agents is also available as announced in the following letter. 

This initial offering of OMEGAMON AI Insights uses a curated list of key performance indicators (KPIs) that are streamed from the OMEGAMON agents via OMEGAMON Data Provider. The streamed KPIs are used to train models, predict expected behavior and alert when anomalies are found in the execution of the business environment. 

Architecture

AI modeling begins with the OMEGAMON Data Provider streaming a set of metrics into an Elastic Logstash server. The data is then stored in an Elastic Elasticsearch database, where the OMEGAMON AI Insights feature can access it. The first use of the data is to train the model for a particular KPI. It takes about two weeks of production workload to train the model and establish a baseline of "normal performance". New data captured after that is compared to the baseline that was established for the KPI. Anything outside the baseline is written back to the Elastic Elasticsearch database as an anomaly. Based on the rules a customer sets up for the particular KPI, an alert can be triggered that sends an email to notify interested parties of the anomalies detected. An Elastic Kibana dashboard is available for viewing these results.
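The train-then-compare flow described above can be illustrated with a small sketch. This is not the model that OMEGAMON AI Insights uses, which this post does not describe. It assumes a deliberately simple stand-in: a per-hour-of-day band of mean plus or minus k standard deviations. The function names, the (hour, value) sample shape and the value of k are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def train_baseline(samples, k=3.0):
    """Build a per-hour baseline band from training samples.

    samples: iterable of (hour_of_day, value) pairs (hypothetical shape).
    Returns {hour_of_day: (low, high)} where the band is mean +/- k * stdev.
    This simple band is an assumption, not the product's actual model.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    baseline = {}
    for hour, values in by_hour.items():
        centre = mean(values)
        # stdev needs at least two points; fall back to a zero-width band.
        spread = stdev(values) if len(values) > 1 else 0.0
        baseline[hour] = (centre - k * spread, centre + k * spread)
    return baseline


def find_anomalies(baseline, new_samples):
    """Return the new (hour_of_day, value) pairs that fall outside the band."""
    anomalies = []
    for hour, value in new_samples:
        if hour not in baseline:
            continue  # no baseline for this hour yet, so nothing to compare
        low, high = baseline[hour]
        if value < low or value > high:
            anomalies.append((hour, value))
    return anomalies


# Two weeks of synthetic hourly training data (14 days x 24 hours).
training = [(hour, 100 + hour + (day % 3)) for day in range(14) for hour in range(24)]
baseline = train_baseline(training)

# A normal reading, an excessive reading and an unusually low reading at 09:00.
print(find_anomalies(baseline, [(9, 110), (9, 150), (9, 50)]))
```

The check is two-sided on purpose: a value far below the band is flagged in the same way as a value far above it.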

OMEGAMON AI Insights is a Python program running within a Linux for z image. It will leverage any Elastic image. Installation instructions include how to deploy the Elastic infrastructure within Linux for z. Customers have the choice to modify their existing Elastic infrastructure to service OMEGAMON AI Insights as well. 

Training the model

The goal of training is to create a new forecast model or update an existing one. The model granularity varies with the forecast set up for each KPI. For example, z/OS models for MSU utilization may be built for each service class within a Parallel Sysplex or for an individual LPAR. Because billing for MSUs occurs across a Parallel Sysplex, the sponsor users we worked with have chosen to check MSU utilization hourly across their Sysplex. For Networks, the model might cover the volume of data traffic within individual LPARs, checked hourly. These are customer choices.
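To make the idea of granularity concrete, here is a minimal sketch of rolling LPAR-level MSU samples up to one value per Sysplex per hour. The record layout (the keys timestamp, sysplex, lpar and msu) and the function name are assumptions made for illustration, not the format that OMEGAMON Data Provider emits. The sketch averages the samples for each LPAR within the hour and then sums across the LPARs in the Sysplex.

```python
from collections import defaultdict
from datetime import datetime


def hourly_sysplex_msu(samples):
    """Roll LPAR-level samples up to one MSU figure per (sysplex, hour).

    samples: iterable of dicts with hypothetical keys
             'timestamp' (datetime), 'sysplex', 'lpar' and 'msu'.
    Returns {(sysplex, hour_start): total_msu}.
    """
    # Collect every sample for each LPAR within each clock hour.
    per_lpar = defaultdict(list)
    for s in samples:
        hour_start = s["timestamp"].replace(minute=0, second=0, microsecond=0)
        per_lpar[(s["sysplex"], s["lpar"], hour_start)].append(s["msu"])

    # Average within each LPAR, then add the LPAR averages across the Sysplex.
    totals = defaultdict(float)
    for (sysplex, _lpar, hour_start), values in per_lpar.items():
        totals[(sysplex, hour_start)] += sum(values) / len(values)
    return dict(totals)


samples = [
    {"timestamp": datetime(2024, 1, 1, 10, 5), "sysplex": "PLEX1", "lpar": "LPARA", "msu": 100.0},
    {"timestamp": datetime(2024, 1, 1, 10, 35), "sysplex": "PLEX1", "lpar": "LPARA", "msu": 120.0},
    {"timestamp": datetime(2024, 1, 1, 10, 10), "sysplex": "PLEX1", "lpar": "LPARB", "msu": 50.0},
]
print(hourly_sysplex_msu(samples))
```

Changing the grouping key (for example, to service class or to a single LPAR) changes the granularity of the series that a model would be trained on.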

Predicting anomalies based on model execution

When SMF data collected from a sponsor customer was tested against our CPU consumption model, it showed that the customer had been using an excessive amount of CPU for several weeks compared to their baseline. This could happen for a variety of reasons:

  1. New workload was added to the environment, which requires an update to the model via new training.
  2. There was an error in deploying a new application to the environment. 

In this particular case, the cause was a missed deployment step for a new application. This resulted in over 60,000 euros in excess licensing fees for that time period. Had OMEGAMON AI Insights been deployed in real time, it would have triggered an alert within 60 minutes of the mistaken application deployment, and much of the excess software license fees could have been avoided. 

For network transmission rates, there may be normal batch jobs that transmit data routinely. But if the algorithm detects excessive data transmission, it could be related to: 

  1. a special one time task that is approved
  2. an attempt to steal data

In this case, additional information is provided about the job requesting the excessive data so that an investigation can occur. Conversely, a lack of data at a normal peak time might be a sign of a network connectivity issue. That may also warrant investigation to ensure network service levels can be achieved. 

Alerting

When, how often and to whom an alert is sent can be set up separately for each KPI's deployment model. In some cases, such as the excessive CPU or network behaviors described above, a business might want to be alerted immediately upon the first anomaly, but then wait for a few more anomalies before a subsequent alert. In other cases, such as Java virtual machine anomalies, excess utilization might resolve itself quickly, so it might be the fifth to tenth anomaly detected in a row that warrants the first alert. At this time, Elastic Kibana is used to set up the alert infrastructure. From there, when an anomaly worthy of alerting is detected, an email is sent out from the OMEGAMON AI Insights code. 
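The two alerting styles described above can be expressed as one small policy. The sketch below is illustrative only: in the product the rules are set up through Elastic Kibana, and the class name, field names and thresholds here are hypothetical. The first alert fires once a chosen number of consecutive anomalies has been seen, and repeat alerts fire after a chosen number of further consecutive anomalies.

```python
from dataclasses import dataclass


@dataclass
class AlertPolicy:
    """Hypothetical per-KPI alert rule, for illustration only."""
    first_alert_after: int = 1  # consecutive anomalies needed for the first alert
    repeat_every: int = 3       # further consecutive anomalies between repeat alerts


def alerts_for(flags, policy):
    """Return the positions in `flags` at which an alert would be sent.

    flags: iterable of booleans, one per check interval, True meaning an anomaly.
    A normal reading resets the streak of consecutive anomalies.
    """
    fired = []
    streak = 0
    for position, is_anomaly in enumerate(flags):
        if not is_anomaly:
            streak = 0
            continue
        streak += 1
        if streak < policy.first_alert_after:
            continue
        # Fire on the threshold itself, then on every repeat_every-th anomaly after it.
        if (streak - policy.first_alert_after) % policy.repeat_every == 0:
            fired.append(position)
    return fired


# CPU or network style: alert on the first anomaly, then every third one after it.
cpu_policy = AlertPolicy(first_alert_after=1, repeat_every=3)
print(alerts_for([True, True, True, True, True, False, True], cpu_policy))

# JVM style: short spikes are ignored; alert only on the fifth anomaly in a row.
jvm_policy = AlertPolicy(first_alert_after=5, repeat_every=5)
print(alerts_for([True] * 4 + [False] + [True] * 6, jvm_policy))
```

Keeping the policy separate from the detection step lets each KPI use the same anomaly stream with a different tolerance for noise.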

Complementary to existing OMEGAMON Situations

Suppose an anomaly is detected and alerted via OMEGAMON AI Insights. In many respects, this can be regarded as a "situation" that has occurred, but it is a new one. Once the root cause is determined using the existing OMEGAMON user interfaces, a situation can be created, with some automation, to try to prevent the problem from recurring. Alerts can be set up for that situation too. The goal is for OMEGAMON AI Insights to predict the possibility of the situation before it occurs and send out an alert. But there is nothing wrong with having both alerts if they help improve the availability of the operational environment. 

OMEGAMON AI Insights Dashboards

These dashboards consist of an upper half that shows the baseline as a gray area and anomalies as red dots. The lower half of the dashboard contains other OMEGAMON attributes that are intended to help in getting to the root cause of an issue. Traditional OMEGAMON user interface workspaces can also be used for deeper dives into the performance metrics at the time of an alerted anomaly. 


IBM Z OMEGAMON AI for z/OS Dashboard
IBM Z OMEGAMON AI for JVM Dashboard
IBM Z OMEGAMON AI for Networks Dashboard

A plea for customer production performance data

A model is only as good as the data used to build it. The development team has been using SMF data gathered from sponsor customers. The team needs more data across a greater variety of SMF records to continue to rapidly evolve the use of AI/ML across other agents and other KPIs. The data received is used only for this purpose, and a summary report is provided back to the customer after analysis. The SMF data requested contains no personally identifiable data and only a minimal amount of customer-specific data, which can be redacted without affecting the model. If you would like to share production data with us to help expedite our development, click here to send an email making that request and we will get back to you quickly. 

Summary

This is another step in the journey toward increased use of Artificial Intelligence and Machine Learning within OMEGAMON monitoring offerings. The documentation for OMEGAMON AI Insights can be found here.

We'll be including updates and usage scenarios within the Master Blog for OMEGAMON Data Provider. Don't hesitate to look there for additional information. 

#CICS 
#db2z/os
#IBMMQ
#IBMZ
#IBMZOS
#IBMAI
#IMS
#Instana
#jvm
#OMEGAMON

#IBMChampion