AIOps: Monitoring and Observability


Comprehensive performance monitoring and observability of your IBM Z environment. Solutions include OMEGAMON, Service Management Suite for z/OS, and IBM Z Application Performance Management Connect


Runbook - z/OS Use Case #1 - z/OS Sysplex and Service Class CPU Divergence

By Fabien Gautreault posted Mon February 16, 2026 09:48 AM

  

Discover the runbook for the IBM Z OMEGAMON AI for z/OS CPU Time use case, where predictive AI and machine learning detect and alert on significant divergence that may reveal resource constraints, helping you protect performance and control related costs.


OMEGAMON AI Insights GA Version 2.2.0 - November 14, 2025

(10 min read)



Find all episodes of the podcast here: IBM Z® OMEGAMON® product demos!  

Purpose & Scope

Purpose: Quickly determine if CPU divergence is due to volume surge, goal pressure, capacity/capping, or specific address spaces/jobs.

Data:

  • SMF 70 (LPAR CPU capacity & capping)
  • SMF 72 (WLM service class performance & delays)
  • SMF 30 (address space/job CPU consumers)

Tools: Your analytics platform, OMEGAMON AI/Web UI, SMF Records.

When to Use

You received a notification, or noticed, that CPU time for a z/OS service class or workload you monitor has diverged significantly, is impacting performance, and risks increasing related costs.

Expected Outcomes

Identify whether the cause is volume‑driven, capacity‑driven, tied to a specific address space or job, or wider, and who to dispatch to for further investigation, optimization, rebalancing or scheduling.

  • Site Reliability Engineer - Cross system overview, trend charts, confirm baseline & scope (Steps 1-4)
  • WLM Owner - PI, importance, classification/periods, donors/receivers (Steps 2, 3)
  • Capacity Planning - LPAR weights/caps, 4HRA MSU, zIIP provisioning (Steps 1, 3)
  • Subsystem and Application Teams - top consumers & offload (Steps 3, 4)




Context: As a Site Reliability Engineer (SRE), you received a notification of an anomaly detected on a Sysplex and Service Class of High importance for an abnormal CPU Divergence.


Step 1 - Scope Definition - Define “who diverged” and “from what” (by LPAR, workload type, service class)

We first need to pinpoint the scope of the divergence (by LPAR, workload type and service class) and define the baseline we’re diverging from.

At the subsystem/system level, SMF 70 and 72 consolidate CPU and capacity by LPAR and workload, which helps you identify the LPARs hosting most of the impacted service class work and whether an increase is isolated to one member or group‑wide.

Visibility by member or group matters because CPU usage alone is misleading unless you compare it to capacity. An LPAR at 90% busy under a soft cap may be throttled even if the CPC has spare cycles.
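The point about comparing usage to capacity can be sketched in a few lines. This is a minimal illustration, not how OMEGAMON AI computes it: the record layout, the field names and the 5% headroom threshold are all assumptions, and the values are presumed to have been extracted from SMF 70 data beforehand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LparSample:
    # Hypothetical, already-extracted values (illustrative names, not SMF 70 field names).
    name: str
    cpu_busy_pct: float                    # LPAR CPU busy, in percent
    msu_4hra: float                        # four-hour rolling average, in MSU
    defined_capacity_msu: Optional[float]  # soft cap in MSU, None if no cap is defined


def cap_headroom_pct(sample: LparSample) -> Optional[float]:
    """Remaining room under the soft cap, as a percentage of the cap."""
    if sample.defined_capacity_msu is None:
        return None
    return 100.0 * (1.0 - sample.msu_4hra / sample.defined_capacity_msu)


def flag_capping_risk(samples: List[LparSample], headroom_threshold_pct: float = 5.0) -> List[str]:
    """Names of LPARs whose 4HRA is within the threshold of their soft cap."""
    flagged = []
    for s in samples:
        headroom = cap_headroom_pct(s)
        if headroom is not None and headroom <= headroom_threshold_pct:
            flagged.append(s.name)
    return flagged


# Two LPARs are equally busy, but only the capped one is at risk of throttling.
samples = [
    LparSample("LPAR1", 90.0, 392.0, 400.0),
    LparSample("LPAR2", 90.0, 250.0, None),
    LparSample("LPAR3", 60.0, 300.0, 400.0),
]
print(flag_capping_risk(samples))
```

LPAR1 and LPAR2 show the same busy percentage, yet only LPAR1 is close to its cap, which is why busy percentage alone is not enough to decide whether capacity planning needs to be involved.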

Dispatch: SRE validates with IBM Z OMEGAMON AI for z/OS or SMF 70/72 data; if a capacity constraint is suspected, involve Capacity Planning for weight/cap adjustments.

Reference: IBM OMEGAMON AI for z/OS - System CPU Utilization Workspace

Example:

Machine learning algorithms compute the baseline from the seasonality and patterns of previous weeks.
Comparing activity by LPAR, workload and service class helps detect imbalances.
  • Confirmed: Load balancing active, LPARs taking the STC workload evenly, *CHIN as top jobs, no capping.
  • Baseline: Previous weeks, same day/time.
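To make the "previous weeks, same day/time" idea concrete, here is a deliberately simple stand-in. It uses a plain median over past observations in the same weekday and hour slot, which is not the machine learning model the product uses; the tuple layout and the sample values are assumptions for illustration only.

```python
from statistics import median


def baseline_same_slot(history, weekday, hour):
    """Median CPU seconds over past observations in the same weekday/hour slot.

    `history` is a list of (weekday, hour, cpu_seconds) tuples (hypothetical layout).
    """
    values = [cpu for (wd, hr, cpu) in history if wd == weekday and hr == hour]
    if not values:
        raise ValueError("no history for this weekday/hour slot")
    return median(values)


def divergence_pct(current, baseline):
    """Signed divergence of the current value from the baseline, in percent."""
    return 100.0 * (current - baseline) / baseline


# Three previous Mondays at 09:00, plus unrelated slots that must be ignored.
history = [
    (0, 9, 1000.0),
    (0, 9, 1100.0),
    (0, 9, 1050.0),
    (0, 10, 2000.0),
    (1, 9, 900.0),
]
base = baseline_same_slot(history, weekday=0, hour=9)
div = divergence_pct(1365.0, base)
print(base, div)
```

A median is used here only because it is robust to a single unusual week; a real model would also account for trend and seasonality.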




Step 2 - WLM Service Class Goal Attainment & Delay Anatomy

What matters to an SRE is whether goals are met and, if not, what the suspected cause is, so that triage is faster and the issue is dispatched to the right people.

This can be assessed by looking at several KPIs:

  • PI: PI > 1.1 sustained means WLM cannot give the service class enough resources to meet its goal
  • Importance: When CPU is constrained, WLM favors higher importance work
  • Delays: Whether there is a performance impact, and whether it is expected given the service class importance
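The "sustained" part of the PI check can be illustrated as follows. This is a sketch under assumptions: the PI values are presumed to come from consecutive reporting intervals, and requiring three consecutive intervals is an arbitrary choice, not a product setting.

```python
def pi_sustained_above(pi_series, threshold=1.1, min_consecutive=3):
    """True if PI stays above the threshold for at least `min_consecutive` intervals in a row."""
    run = 0
    for pi in pi_series:
        # Extend the current run while above the threshold, otherwise reset it.
        run = run + 1 if pi > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A sustained miss versus isolated spikes (hypothetical interval values).
print(pi_sustained_above([0.9, 1.2, 1.3, 1.15, 1.0]))
print(pi_sustained_above([1.2, 1.0, 1.2, 1.0]))
```

Counting consecutive intervals avoids reacting to a single noisy interval while still catching a goal that is missed over a longer period.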

At this stage we do not have delays but can look at:

  • Response Time: If this increases significantly vs baseline -> performance degradation

  • Volume vs Response Time:

    • If transaction count and response time increase proportionally, likely volume-driven.
    • If transaction count is stable but response time increases, likely resource contention (CPU, I/O, or capping)

  • CPU vs Elapsed Time: If CPU time increases but elapsed time increases even more -> indicates waiting
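The triage rules above can be written down as a small decision function. It is a rough sketch: the 10% tolerance used to decide what counts as "stable" or "proportional" is an assumption, as are the function names and the returned labels.

```python
def pct_change(current, baseline):
    """Signed change from baseline, in percent."""
    return 100.0 * (current - baseline) / baseline


def classify(tx_now, tx_base, rt_now, rt_base, tolerance_pct=10.0):
    """Rough triage from transaction count and response time versus baseline."""
    tx_delta = pct_change(tx_now, tx_base)
    rt_delta = pct_change(rt_now, rt_base)
    if rt_delta <= tolerance_pct:
        return "no significant response time degradation"
    if abs(tx_delta) <= tolerance_pct:
        return "likely resource contention"
    if tx_delta > tolerance_pct and abs(rt_delta - tx_delta) <= tolerance_pct:
        return "likely volume-driven"
    return "mixed signals, review manually"


def indicates_waiting(cpu_now, cpu_base, elapsed_now, elapsed_base):
    """True if CPU time grew but elapsed time grew even more."""
    cpu_delta = pct_change(cpu_now, cpu_base)
    elapsed_delta = pct_change(elapsed_now, elapsed_base)
    return cpu_delta > 0 and elapsed_delta > cpu_delta


print(classify(1300, 1000, 0.26, 0.20))        # volume and response time both up about 30%
print(classify(1000, 1000, 0.30, 0.20))        # stable volume, slower responses
print(indicates_waiting(120, 100, 200, 100))   # CPU +20%, elapsed +100%
```

The "mixed signals" branch is there on purpose: when the numbers do not fit either pattern cleanly, a human should look at the delay breakdown rather than trust a label.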

Dispatch: If goal pressure is suspected, involve the WLM Policy Owner for goal or distribution adjustments.

Reference: IBM OMEGAMON AI for z/OS - Address Space Bottlenecks in Service Class Period workspace

Example:

Looking at PI to see whether the service class is served with the necessary resources
Looking at delays to see whether there is a performance impact beyond the CPU consumption divergence

  • Confirmed: High importance service class, missing goal by 10%, volume driven.




Step 3 - zIIP Offload & “zIIP on CP”

In this step we look at zIIP CPU vs capacity to understand whether zIIP-eligible offload failed, and why.

If zIIP-eligible time executed on CP increases significantly compared to normal, it could indicate an offload failure.

Correlate with workload to identify service classes or jobs with high zIIP-eligible work (Db2 DDF, Java).

If zIIP work is in a low-importance service class, WLM may deprioritize zIIP dispatch.
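One way to quantify "zIIP-eligible time executed on CP" is as a share of all zIIP-eligible time, compared against the same share in the baseline period. The sketch below assumes both times are already available in seconds; the function names and the 10-point increase threshold are illustrative assumptions.

```python
def ziip_on_cp_share(ziip_time_s, ziip_on_cp_time_s):
    """Fraction of zIIP-eligible time that ran on general purpose CPs instead of zIIPs."""
    total = ziip_time_s + ziip_on_cp_time_s
    if total == 0:
        return 0.0
    return ziip_on_cp_time_s / total


def offload_suspect(current_share, baseline_share, min_increase=0.10):
    """True if the on-CP share rose by at least `min_increase` (absolute) versus baseline."""
    return current_share - baseline_share >= min_increase


baseline_share = ziip_on_cp_share(900.0, 100.0)   # 10% of eligible time on CP
current_share = ziip_on_cp_share(650.0, 350.0)    # 35% of eligible time on CP
print(baseline_share, current_share, offload_suspect(current_share, baseline_share))
```

Comparing shares rather than raw seconds keeps the check meaningful when overall volume changes between the baseline and the current period.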

Dispatch: 

  • If a zIIP offload failure or capacity shortage is suspected, involve the Capacity Planning/WLM team for adjustments
  • If specific jobs dominate zIIP on CP, engage Subsystem teams for workload tuning

Reference: IBM OMEGAMON AI for z/OS - WLM Service Class/Report Period attributes for Sysplex

Example:

Looking at zIIP to see whether zIIP-eligible workload is abnormally run on CP

  • Confirmed: No zIIP offload failure




Step 4 - Identify Top Consumers (Address Spaces / Jobs)

In this last step we need to identify if the overconsumption comes from a particular job or address space to escalate.
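Ranking jobs by how much more CPU they used than in the baseline period is one simple way to surface candidates for escalation. The sketch below assumes per-job CPU seconds have already been aggregated (for example from SMF 30 data); the job names and values are made up for illustration.

```python
def top_contributors(current, baseline, top_n=3):
    """Jobs ranked by CPU seconds above baseline, largest increase first.

    `current` and `baseline` map job name to CPU seconds (hypothetical inputs).
    Jobs absent from the baseline count as entirely new consumption.
    """
    deltas = {job: cpu - baseline.get(job, 0.0) for job, cpu in current.items()}
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


baseline = {"MQ1ACHIN": 400.0, "CICSPRD1": 300.0, "BATCHJ01": 200.0}
current = {"MQ1ACHIN": 700.0, "CICSPRD1": 320.0, "BATCHJ01": 190.0, "NEWJOB01": 50.0}
for job, delta in top_contributors(current, baseline):
    print(job, delta)
```

Ranking on the delta rather than on absolute CPU keeps a large but steady consumer from hiding the job that actually changed.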

Dispatch: 

  • If subsystem address spaces dominate, engage Db2/CICS/MQ System Programmers for tuning or workload redistribution.
  • If specific batch jobs dominate, engage Application Team for optimization or rescheduling.

Reference: IBM OMEGAMON AI for z/OS - Address Space CPU Utilization attributes

Example:

Looking at job composition to see whether there are outliers or jobs out of the ordinary
Filtering on the list of jobs that are the main contributors to the divergence

  • Confirmed: Job composition similar, *CHIN top jobs heavy contributors (MQ Channel Initiators or CICS Transaction Gateway)
  • Conclusion: Dispatch to CICS/MQ System Programmer for investigation.




What Next?

The investigation for a Site Reliability Engineer would stop here. Subject Matter Experts for each subsystem and application would then take over, using OMEGAMON Web UI dashboards and more detailed information to isolate the root cause and apply corrective measures, whether policy tuning, capacity adjustments, or application optimization.

In this case MQ or CICS System Programmers could investigate, but it looks like a one-off event, so dispatch with a lower severity and keep track of it in case it happens again.

Looking at the broader picture to better understand the impact and dispatch with the right severity

Fortunately, this is not a big event and no performance issue is involved. The beauty of AI is that it can analyze it all, while you remain in control of what you want to be alerted on, what matters to your shop, what should be automated and what a human should review.

But there is still room for optimization and cost reduction, whatever your pricing model. Here that is 300 MSU of overconsumption. Under Tailored Fit Pricing (TFP), at a hypothetical discounted growth rate of $30/MSU, that is about $10k.

Here there is another 300 MSU on DDF workload, 50 on CICS and 35 on IMS, so roughly another $10k, and these are just a few examples of the diversity of things you can detect on z/OS workloads.
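The arithmetic behind these rounded figures is straightforward. The sketch below uses the hypothetical rate of $30 per MSU quoted in the text; it is not an actual price, and the function name is illustrative.

```python
RATE_USD_PER_MSU = 30.0  # hypothetical discounted growth rate quoted in the text


def overconsumption_cost(msu, rate_usd_per_msu=RATE_USD_PER_MSU):
    """Cost of MSU consumed above baseline at a flat per-MSU rate."""
    return msu * rate_usd_per_msu


first_case = overconsumption_cost(300)             # the z/OS case above
second_case = overconsumption_cost(300 + 50 + 35)  # DDF + CICS + IMS
print(first_case, second_case)
```

At this rate the two cases come to $9,000 and $11,550 respectively, which is where the "about $10k" approximations come from.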

Based on a real and authorized customer dataset, the AI models would have detected small divergences, saving thousands of CPU seconds of overconsumption...




We want to hear from you!

Have you faced hidden performance issues? Curious how AI could help?
👉 Share your story on IBM Idea portal or request a demo today.

📖 Read how OMEGAMON AI helps you solve problems before they impact the end user experience

🛠️ Explore the product: IBM Z OMEGAMON AI Insights official documentation and release note

🎥 See other Product Runbooks for Db2, CICS or z/OS.


#monitoring, #ArtificialIntelligence(AI), #IBMZ, #OMEGAMON, #zOS, #AnomalyDetection, #IBMAI, #OMEGAMONAIInsights

@JOE WINTERTON, @Ash Mahay, @Jim Porell, @Anna Murray, @Fabien Gautreault
