Discover the runbook for the IBM Z OMEGAMON AI for z/OS CPU Time use case, where predictive AI and machine learning detect and alert on significant divergence, revealing potential resource constraints so you can protect performance and control related costs.
OMEGAMON AI Insights GA Version 2.2.0 - November 14, 2025
(10 min read)
Find all episodes of the podcast here: IBM Z® OMEGAMON® product demos!
Purpose & Scope
Purpose: Quickly determine if CPU divergence is due to volume surge, goal pressure, capacity/capping, or specific address spaces/jobs.
Data:
- SMF 70 (LPAR CPU capacity & capping)
- SMF 72 (WLM service class performance & delays)
- SMF 30 (address space/job CPU consumers)
Tools: Your analytics platform, OMEGAMON AI/Web UI, SMF Records.
When to Use
You received a notification, or noticed, that CPU Time for a z/OS Service Class or workload you monitor has diverged significantly, is impacting performance, and risks increasing related costs.
Expected Outcomes
Identify whether the cause is volume‑driven or capacity‑driven, whether it is limited to a specific address space or job or is wider, and who to dispatch to for further investigation, optimization, rebalancing or scheduling.
- Site Reliability Engineer - Cross system overview, trend charts, confirm baseline & scope (Steps 1-4)
- WLM Owner - PI, importance, classification/periods, donors/receivers (Steps 2, 3)
- Capacity Planning - LPAR weights/caps, 4HRA MSU, zIIP provisioning (Steps 1, 3)
- Subsystem and Application Teams - top consumers & offload (Steps 3, 4)
Context: As a Site Reliability Engineer (SRE), you received a notification that an abnormal CPU divergence was detected on a Sysplex for a Service Class of High importance.
Step 1 - Scope Definition - Define “who diverged” and “from what” (by LPAR, workload type, service class)
We first need to pinpoint the scope of the divergence (by LPAR, workload type and service class) and define the baseline we are diverging from.
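As a rough illustration of "who diverged from what", the sketch below scores the current CPU time of each (LPAR, service class) pair against its own history using a simple z-score. This is an assumption made for illustration only: it is not the predictive model OMEGAMON AI uses, and the LPAR names, service class name, sample values and the threshold of 3 are all hypothetical.

```python
from statistics import mean, stdev

def divergence_score(baseline, current):
    """Return how many sample standard deviations `current` is from the baseline mean."""
    centre = mean(baseline)
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any change at all is an unbounded divergence.
        if current == centre:
            return 0.0
        return float("inf") if current > centre else float("-inf")
    return (current - centre) / spread

# Hypothetical CPU-time history and current value per (LPAR, service class).
samples = {
    ("LPAR1", "ONLINEHI"): ([100, 102, 98, 101, 99], 130),
    ("LPAR2", "ONLINEHI"): ([80, 82, 79, 81, 78], 81),
}

THRESHOLD = 3  # illustrative cut-off, not a product default

diverged = [
    key for key, (baseline, current) in samples.items()
    if abs(divergence_score(baseline, current)) > THRESHOLD
]
print(diverged)
```

With this made-up data only the first pair exceeds the threshold, which is the kind of "isolated to one member" answer Step 1 is looking for.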
At the subsystem/system level, SMF 70 and 72 consolidate CPU and capacity by LPAR and workload, which helps you identify the LPARs hosting most of the impacted service class work and whether an increase is isolated to one member or group‑wide.
Visibility by member or group matters because CPU usage alone is misleading unless you compare it to capacity. An LPAR at 90% busy under a soft cap may be throttled even if the CPC has spare cycles.
Dispatch: SRE validates with IBM Z OMEGAMON AI for z/OS or SMF 70/72 data; if a capacity constraint is suspected, involve capacity planning for weight/cap adjustments.
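The usage-versus-capacity comparison can be sketched as follows. The record layout, field names and both thresholds are hypothetical and chosen only to mirror the example in the text (an LPAR near its soft cap while the CPC still has spare cycles); real values would come from SMF 70 data or the OMEGAMON workspace.

```python
from dataclasses import dataclass

@dataclass
class LparSample:
    name: str
    used_msu: float      # consumption of the LPAR (hypothetical field)
    cap_msu: float       # soft cap / defined capacity of the LPAR (hypothetical field)
    cpc_busy_pct: float  # utilization of the whole CPC, 0-100 (hypothetical field)

def capacity_pressure(sample, lpar_threshold=0.9, cpc_threshold=80.0):
    """Classify an LPAR by comparing its usage to its own cap and to CPC headroom."""
    ratio = sample.used_msu / sample.cap_msu
    if ratio < lpar_threshold:
        return "ok"
    # The LPAR is close to its cap: distinguish a local cap from a full machine.
    if sample.cpc_busy_pct < cpc_threshold:
        return "cap-constrained"   # throttled although the CPC has spare cycles
    return "cpc-constrained"       # the whole CPC is busy

print(capacity_pressure(LparSample("LPAR1", 460, 500, 55)))
```

A "cap-constrained" result is the case where involving capacity planning for weight or cap adjustments makes sense.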
Reference: IBM OMEGAMON AI for z/OS - System CPU Utilization Workspace
Example:
Step 2 - WLM Service Class Goal Attainment & Delay Anatomy
What matters to an SRE is whether the goals are met and, if not, what the suspected cause is, so that triage is faster and the issue is dispatched to the right people.
This can be assessed by looking at several KPIs:
- PI: a sustained PI > 1.1 means WLM cannot give the work enough resources to meet its goal
- Importance: When CPU is constrained, WLM favors higher importance work
- Delays: Whether there is a performance impact and whether it is expected given the service class importance
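The PI check above can be sketched in a few lines. The two formulas are the usual WLM definitions (goal velocity over achieved velocity, and actual response time over goal response time); the 1.1 threshold comes from the list above, while the requirement of three consecutive intervals to count as "sustained" is an assumption to adjust for your shop.

```python
def velocity_pi(goal_velocity, achieved_velocity):
    """Performance index for an execution velocity goal."""
    return goal_velocity / achieved_velocity

def response_time_pi(actual_seconds, goal_seconds):
    """Performance index for a response time goal."""
    return actual_seconds / goal_seconds

def sustained_miss(pi_series, threshold=1.1, min_intervals=3):
    """True if PI stays above the threshold for min_intervals consecutive intervals."""
    run = 0
    for pi in pi_series:
        run = run + 1 if pi > threshold else 0
        if run >= min_intervals:
            return True
    return False

# Hypothetical PI values per reporting interval for one service class period.
print(sustained_miss([1.0, 1.2, 1.15, 1.3]))  # three consecutive intervals above 1.1
print(sustained_miss([1.2, 1.0, 1.2, 1.0]))   # spikes only, never sustained
```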
At this stage we do not have delays but can look at:
Dispatch: if goal pressure is suspected, involve the WLM Policy Owner for goal or distribution adjustments.
Reference: IBM OMEGAMON AI for z/OS - Address Space Bottlenecks in Service Class Period workspace
Example:
- Confirmed: High importance service class, missing goal by 10%, volume driven.
Step 3 - zIIP Offload & “zIIP on CP”
In this step we compare zIIP CPU usage against zIIP capacity to understand whether zIIP-eligible offload failed and why.
If zIIP-eligible time executed on CP increases significantly compared to normal, it could indicate an offload failure.
Correlate with workload data to identify service classes or jobs with high zIIP-eligible work (Db2 DDF, Java).
If zIIP work is in a low-importance service class, WLM may deprioritize zIIP dispatch.
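One way to make "increases significantly compared to normal" concrete is to track the share of zIIP-eligible time that ran on CP and compare it with a baseline share. The sketch below assumes you already have the two CPU-time totals per interval; the function names, the doubling factor and the 5% floor are hypothetical tuning choices, not product settings.

```python
def ziip_on_cp_share(ziip_seconds, ziip_on_cp_seconds):
    """Fraction of zIIP-eligible CPU time that actually executed on a general CP."""
    eligible = ziip_seconds + ziip_on_cp_seconds
    return ziip_on_cp_seconds / eligible if eligible else 0.0

def offload_suspect(current_share, baseline_share, factor=2.0, floor=0.05):
    """Flag a possible offload failure when the share is material and well above normal."""
    return current_share >= floor and current_share > factor * baseline_share

# Hypothetical interval: 700s ran on zIIP, 300s of eligible work spilled to CP.
current = ziip_on_cp_share(700, 300)
print(current, offload_suspect(current, baseline_share=0.04))
```

The floor avoids flagging service classes whose spill to CP is negligible in absolute terms even when it has doubled.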
Dispatch:
- If a zIIP offload failure or a zIIP capacity shortfall is suspected, involve the Capacity Planning/WLM team for adjustments
- If specific jobs dominate zIIP on CP, engage Subsystem teams for workload tuning
Reference: IBM OMEGAMON AI for z/OS - WLM Service Class/Report Period attributes for Sysplex
Example:
- Confirmed: No zIIP offload failure
Step 4 - Identify Top Consumers (Address Spaces / Jobs)
In this last step we identify whether the overconsumption comes from a particular job or address space, so that we know where to escalate.
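A minimal sketch of this ranking, assuming the SMF 30 data has already been reduced to (job name, CPU seconds) pairs. The job names and values below are invented for illustration, and no SMF parsing is shown.

```python
from collections import Counter

def top_consumers(records, n=3):
    """Sum CPU seconds per job and return the n largest with their share of the total."""
    totals = Counter()
    for job_name, cpu_seconds in records:
        totals[job_name] += cpu_seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (job_name, cpu_seconds, cpu_seconds / grand_total)
        for job_name, cpu_seconds in totals.most_common(n)
    ]

# Hypothetical interval records: (job name, CPU seconds).
records = [
    ("MQ1CHIN", 120.0),
    ("DB2DIST", 60.0),
    ("MQ1CHIN", 30.0),
    ("BATCH01", 90.0),
]
for job_name, cpu_seconds, share in top_consumers(records, n=2):
    print(f"{job_name}: {cpu_seconds:.0f}s ({share:.0%})")
```

Running the same ranking on a baseline interval and comparing the two lists shows whether the job mix changed or the usual jobs simply consumed more.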
Dispatch:
- If subsystem address spaces dominate, engage Db2/CICS/MQ System Programmers for tuning or workload redistribution.
- If specific batch jobs dominate, engage Application Team for optimization or rescheduling.
Reference: IBM OMEGAMON AI for z/OS - Address Space CPU Utilization attributes
Example:
- Confirmed: Job composition is similar to the baseline; the *CHIN jobs at the top of the list are heavy contributors (MQ Channel Initiators or CICS Transaction Gateway)
- Conclusion: Dispatch to CICS/MQ System Programmer for investigation.
What Next?
For a Site Reliability Engineer the investigation would stop here. Subject Matter Experts for each subsystem and application would take over, using OMEGAMON Web UI dashboards with more detailed information to isolate the root cause and apply corrective measures, whether policy tuning, capacity adjustments, or application optimization.
In this case MQ or CICS System Programmers would probably investigate, but it looks like a one-off event, so dispatch it with a lower impact and keep track in case it happens again.
Fortunately this is not a big event and no performance issue is involved. The beauty of AI is that it can analyze it all, while you remain in control of what you want to be alerted on, what matters to your shop, what should be automated and what a human should review.
There is still room for optimization and cost reduction, whatever your pricing model: here that is 300 MSU of overconsumption. Under Tailored Fit Pricing (TFP), at a hypothetical discounted growth rate of $30/MSU, that is about $10k.
Here is another 300 MSU on the DDF workload, 50 on CICS and 35 on IMS, so roughly another $10k, and these are just a few examples of the diversity of things you can detect on z/OS workloads.
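The cost estimates quoted above are simple multiplications, spelled out here with the same hypothetical rate of 30 dollars per MSU. The exact products are 9,000 and 11,550 dollars, which the text rounds to about 10k each.

```python
RATE_PER_MSU = 30  # hypothetical discounted growth rate from the text, in dollars

def overconsumption_cost(msu_by_workload, rate=RATE_PER_MSU):
    """Total cost of the listed MSU overconsumption at a flat rate per MSU."""
    return sum(msu_by_workload.values()) * rate

first_case = {"CPU Time use case": 300}
second_case = {"DDF": 300, "CICS": 50, "IMS": 35}

print(overconsumption_cost(first_case))   # 300 MSU at the flat rate
print(overconsumption_cost(second_case))  # 385 MSU at the flat rate
```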
