AIOps: Monitoring and Observability


Comprehensive performance monitoring and observability of your IBM Z environment. Solutions include OMEGAMON, Service Management Suite for z/OS, and IBM Z Application Performance Management Connect


Runbook - CICS Use Case #1 - CICS Region Transactional Response Time Divergence

By Fabien Gautreault posted Wed March 04, 2026 06:26 AM

  

Discover the runbook for the IBM Z OMEGAMON AI for CICS Transactional Response Time use case, where predictive AI and machine learning detect and alert on significant divergence that reveals potential resource constraints, helping to protect the performance of the transactional workload.


OMEGAMON AI Insights GA Version 2.2.0 - November 14, 2025

(10 min read)



Find all episodes of the podcast here: IBM Z® OMEGAMON® product demos!  

Purpose & Scope

Purpose: Rapidly determine the root cause when the average transaction response time in a CICS region has increased significantly over the last hour compared to its historical baseline.

Scope:

  • SMF 110 Performance Class (transaction-level timing: CPU, dispatch, suspend).
  • CICS Statistics (SMF 110 subtype 2) for region-level health.

Tools: Your analytics platform, OMEGAMON AI/Web UI, SMF Records, CICS PA.

When to Use

You received a notification, or noticed, that the response time of one or more CICS regions has diverged significantly and is impacting the performance of the transactional workload.

Expected Outcomes

As CICS supports critical business, quickly identify whether the cause is volume‑driven, outlier‑driven (a small set of transactions skewing the average), or systemic. Assess the blast radius and decide who to dispatch to for further analysis.

  • Site Reliability Engineer - Cross system overview, trend charts, first‑pass analysis.
  • Application Team - Volume driven, workload shift or unplanned spike, MaxTask pressure.
  • CICS Systems Programmer - Broad resource contention, suspend/dispatch delays, subsystem latency.
  • Subsystem Team - Db2/MQ RMI delays, file string shortages, abends surge.




Context: As a Site Reliability Engineer (SRE), you received one notification of anomalies detected on several CICS regions simultaneously for an abnormal Transactional Response Time Divergence.


Step 1 - Scope Definition - Define “who diverged” and “from what” (by regions)

We first need to pinpoint the scope of the divergence (by region) and define the baseline we’re diverging from.

  • Check average response time trend - Compare the last 2 hours against the historical same day-of-week and same hour baseline

  • Identify high‑volume vs high‑impact drivers

    • If the number of transactions increased significantly, it is volume‑driven.
    • If volume is stable but response time increases, it is likely CPU, dispatch, I/O or RMI waits.
  • Assess skew contributors - A handful of failing or long‑running transactions can inflate the average. Checking the distribution, transaction mix or dictionary would require a CICS SME.
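
The checks above can be sketched as a small first-pass classifier. This is only an illustration of the reasoning, not how OMEGAMON AI computes its anomalies: the function name `classify_divergence`, the input shapes and all three thresholds are assumptions made for this example, and the baseline sample is assumed to come from the same day-of-week and same hour in previous weeks.

```python
from statistics import mean, median


def classify_divergence(current_rt, baseline_rt, current_volume, baseline_volume,
                        rt_ratio=1.5, volume_ratio=1.3, skew_ratio=3.0):
    """Rough first-pass label for one CICS region.

    current_rt / baseline_rt: per-transaction response times (seconds)
        for the current window and for the baseline window.
    current_volume / baseline_volume: transaction counts for the same windows.
    The three ratios are illustrative assumptions, not product defaults.
    """
    cur_avg = mean(current_rt)
    base_avg = mean(baseline_rt)

    # Step 1: has the average really diverged from the baseline?
    if cur_avg < rt_ratio * base_avg:
        return "no significant divergence"

    # Step 2: a clear rise in transaction count points to a volume-driven event.
    if current_volume >= volume_ratio * baseline_volume:
        return "volume-driven"

    # Step 3: a mean far above the median means a few slow transactions
    # are skewing the average.
    if cur_avg >= skew_ratio * median(current_rt):
        return "outlier-driven"

    # Volume is stable and the slowdown is broad: likely CPU, dispatch,
    # I/O or RMI waits.
    return "systemic"
```

A label of "volume-driven" would go to the Application Team, while "outlier-driven" or "systemic" would go to the CICS Systems Programmer, following the dispatch rules of this step.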

Dispatch: 

  • Volume driven - Application Team to validate workload shifts or unplanned spikes.
  • Volume remains stable or more details needed - CICS Systems Programmer, as degradation is likely internal (dispatch, suspend, resource contention).

Reference: IBM Z OMEGAMON AI for CICS - Response Time Analysis

Example:

  • Confirmed: Response time anomalies on several regions at the same time, volume driven - 1 notification received for more than 15 anomalies - confirm the impact and involve the SWAT team
  • Baseline: Each region has a different baseline, and the divergence is very high compared to the previous week




Extraordinary Step - Size the blast radius

The Blast Radius refers to the extent of a performance "contagion". We first need to establish how large the performance impact of this event is and, if warranted, escalate to a higher severity and to the SWAT team.

Dispatch: 

  • SWAT team or CICS System Programmer depending on impact

Reference: ---

Example:

  • Confirmed: Impact on 2 of 3 LPARs and more than 15 regions - Involve the SWAT team and escalate to a critical severity event for immediate action
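
Sizing the blast radius comes down to counting how many regions and LPARs the anomalies touch. Here is a minimal sketch, assuming the notification can be reduced to (LPAR, region) pairs. The function name `blast_radius` and both escalation thresholds are hypothetical and should be replaced by your own severity policy.

```python
from collections import defaultdict


def blast_radius(anomalies, total_lpars, region_threshold=10, lpar_share=0.5):
    """Summarise the spread of an event.

    anomalies: iterable of (lpar, region) pairs flagged as anomalous.
    total_lpars: number of LPARs in the monitored environment.
    region_threshold / lpar_share: illustrative escalation thresholds.
    """
    regions_by_lpar = defaultdict(set)
    for lpar, region in anomalies:
        regions_by_lpar[lpar].add(region)

    lpars = len(regions_by_lpar)
    regions = sum(len(names) for names in regions_by_lpar.values())

    # Escalate only when the event is both wide (many regions)
    # and spread over a large share of the LPARs.
    critical = regions >= region_threshold and lpars / total_lpars >= lpar_share
    return {
        "lpars": lpars,
        "regions": regions,
        "severity": "critical" if critical else "standard",
    }
```

With the figures from this example (more than 15 regions on 2 of 3 LPARs), these assumed thresholds would return a critical severity, which matches the decision to involve the SWAT team.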




Step 2 - Gathering clues

At this point there is not much the SRE can do alone other than look for more symptoms.

When 15 regions across 2 out of 3 LPARs are affected, you aren't looking at a coding bug in one program but at a shared resource or infrastructure bottleneck that is common to those 15 regions but absent or healthy on the 3rd LPAR.
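
This elimination reasoning can be expressed as simple set arithmetic. The sketch below assumes you have some inventory mapping each region to the shared resources it depends on. The `dependencies` dictionary, its resource names and the function `shared_suspects` are all hypothetical. It only covers the "absent on the healthy LPAR" case; a resource that is present but healthy there still needs a manual check.

```python
def shared_suspects(dependencies, affected, healthy):
    """Return resources common to every affected region and unused by healthy ones.

    dependencies: dict mapping region name -> set of shared resource names.
    affected / healthy: lists of region names.
    """
    if not affected:
        return set()
    # Resources that every affected region depends on.
    common = set.intersection(*(set(dependencies[r]) for r in affected))
    # Resources that at least one healthy region also depends on.
    used_by_healthy = set().union(*(dependencies[r] for r in healthy))
    return common - used_by_healthy


# Hypothetical inventory, for illustration only.
dependencies = {
    "CICSA1": {"DB2-GROUP-A", "MQ-QMGR-1", "ROUTER-EAST"},
    "CICSA2": {"DB2-GROUP-A", "ROUTER-EAST"},
    "CICSC1": {"DB2-GROUP-A", "ROUTER-WEST"},
}
suspects = shared_suspects(dependencies, ["CICSA1", "CICSA2"], ["CICSC1"])
```

Whatever remains in `suspects` is a starting list for the SMEs, not a root cause.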

Thousands of abends follow the response time spike, on the 2 affected LPARs only.

Most time is spent waiting on first dispatch, before Db2 RMI or FC Read waits start kicking in.

Several regions are hitting the MaxTask limit.




What Next?

The investigation for a Site Reliability Engineer stops here. Subject Matter Experts for each subsystem and application take over, using the OMEGAMON Web UI dashboards to look at more detailed information, isolate the root cause and apply corrective measures quickly:

  • Capacity Planning / Network Team (to find the source of the traffic surge)
  • CICS SME (to suppress non-critical dumps and stabilize the regions, and to investigate with OMEGAMON AI for CICS, for example by looking at the standard deviation of CPU and response time)
  • Application Dev / DevOps (to stop the retry loop at the source, as thousands of abends over an hour suggest one)

---

Based on a real and authorized customer dataset, the AI models would have detected real performance anomalies before they became outages.

The AI models continuously learn each region’s normal behavior and surface only true deviations, eliminating the fatigue and noise created by static thresholds.

This gives SREs earlier warning, clearer context, and drastically faster triage, which directly reduces Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

For customers, the return is simple: less downtime, faster recovery, fewer false alerts, and better use of expert time, at enterprise scale.

OMEGAMON AI does not replace SMEs. It amplifies their impact by giving them clean, high‑quality signals instead of raw data streams, so they spend their time solving real problems, not finding them.




We want to hear from you!

Have you faced hidden performance issues? Curious how AI could help?
👉 Share your story on IBM Idea portal or request a demo today.

📖 Read how OMEGAMON AI makes it possible to solve problems before they impact the end-user experience

🛠️ Explore the product: IBM Z OMEGAMON AI Insights official documentation and release notes

🎥 See other Product Runbooks for Db2, CICS or z/OS.


#monitoring, #ArtificialIntelligence(AI), #IBMZ, #OMEGAMON, #CICS, #AnomalyDetection, #IBMAI, #OMEGAMONAIInsights

@Mick Harris, @John Hancy, @Aleksandr Charcikov, @Ezriel Gross, @Ash Mahay, @Jim Porell, @Anna Murray, @Fabien Gautreault
