AIOps: Monitoring and Observability

Comprehensive performance monitoring and observability of your IBM Z environment. Solutions include OMEGAMON, Service Management Suite for z/OS, and IBM Z Application Performance Management Connect

Runbook - Db2 Use Case #1 - DSG Group and Connection type CPU Divergence

By Fabien Gautreault posted yesterday

  

Discover the runbook for the IBM Z OMEGAMON AI for Db2 CPU Time use case, where predictive AI and machine learning detect and alert on significant divergence that reveals potential resource constraints or unexpected workload shifts, helping protect the performance of the transactional workload.


OMEGAMON AI Insights GA Version 2.2.0 - November 14, 2025

(10 min read)



Purpose & Scope

Purpose: Rapidly determine the root cause when a Db2 data sharing group or member shows CPU divergence for CICS and/or DDF connections, sustained over a window of 30–120 minutes or more, versus the historical baseline for the same day and time.

Data:

  • SMF 101 (Db2 Accounting) with Classes 1/2/3; include DDF zIIP accounting.
  • SMF 100 (Db2 Statistics) per member & connection type.

Tools: Your analytics platform, OMEGAMON AI/Web UI, SMF Records, Db2 Accounting & Statistics.

When to Use

You received a notification, or noticed, that Db2 CPU time has significantly diverged for the workloads you monitor, is impacting performance, and risks disrupting the transactional workload.

Expected Outcomes

Identify whether the cause is volume‑driven or contention‑driven (locks/latches), inside or outside Db2, and member skew or group‑wide. Then determine who to dispatch to for further analysis (GBP/CF stress, zIIP offload loss, application commit/rollback churn, and so on).

  • Site Reliability Engineer — Cross‑system overview, trend charts, first‑pass analysis (Steps 1–4).
  • Db2 Systems Programmer — Routing imbalance, IRLM dispatch, buffer pools, etc. (Steps 2, 4, Next).
  • Application DBA / CICS or DDF App Team — Application activity surge, commit discipline, SQL/package strategy, etc. (Steps 2–4, Next).
  • Capacity/WLM/CF Specialist — Service class/importance, zIIP entitlement/offload (Step 2, Next).




Context: As a Site Reliability Engineer (SRE), you received a notification of an anomaly detected on a Data Sharing Group and Connection Type DDF for an abnormal CPU Divergence.


Step 1 — Scope Definition - Define “who diverged” and “from what” (by member & connection type)

We first need to pinpoint the scope of the divergence (by Db2 member and connection type) and define the baseline we’re diverging from.

At the subsystem/system level, SMF 100 Statistics consolidate CPU and wait time by connection type, which helps you see member‑to‑member skew and whether an increase is isolated to one member or group‑wide.

Visibility by member matters (to exclude a single hot member, misrouted workload, or GBP/CF locality effects). Comparing activity by member and connection type will help detect imbalances. 

Dispatch: SRE validates SMF 101/100 data; if routing imbalance suspected, involve Db2 Systems Programmer.

Reference: IBM Db2 for z/OS - Statistic traces

Example:

  • Confirmed: Both members diverged, CICS and DDF impacted.
  • Baseline: Previous weeks same day/time.
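The scoping logic of this step can be sketched in a few lines of Python. This is a minimal illustration, not the OMEGAMON AI model: the member names (DB2A, DB2B), the sample CPU values, and the 1.5x threshold are all assumptions chosen for the example, and in practice the inputs would be aggregated from SMF 100/101 data.

```python
from statistics import mean

# Hypothetical CPU seconds per interval, keyed by (member, connection type).
# Baseline values stand for the same day/time in previous weeks.
baseline = {
    ("DB2A", "DDF"): [100.0, 110.0, 105.0],
    ("DB2A", "CICS"): [200.0, 190.0, 210.0],
    ("DB2B", "DDF"): [95.0, 100.0, 105.0],
    ("DB2B", "CICS"): [180.0, 200.0, 190.0],
}
current = {
    ("DB2A", "DDF"): 210.0,
    ("DB2A", "CICS"): 390.0,
    ("DB2B", "DDF"): 205.0,
    ("DB2B", "CICS"): 385.0,
}


def divergence_ratios(baseline, current):
    """Return current CPU divided by mean baseline CPU for each key."""
    return {key: current[key] / mean(values) for key, values in baseline.items()}


def classify_scope(ratios, threshold=1.5):
    """Label the divergence as group-wide, member skew, or none.

    The threshold is an illustrative assumption, not a product default.
    """
    members = {member for member, _ in ratios}
    diverged = {member for (member, _), ratio in ratios.items() if ratio >= threshold}
    if not diverged:
        return "no divergence"
    return "group-wide" if diverged == members else "member skew"


ratios = divergence_ratios(baseline, current)
scope = classify_scope(ratios)
print(scope)
for key, ratio in sorted(ratios.items()):
    print(key, round(ratio, 2))
```

With these sample values every member and connection type runs at roughly twice its baseline, which corresponds to the "both members diverged" finding above.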




Step 2 — CPU Breakdown - Quantify the increase and split CPU classes

We need to know where the extra CPU is spent. For DDF, also check whether zIIP offload or WLM policies changed:

  • Is it inside Db2 execution (Class 2)?
  • Or outside Db2 (Class 1 growing faster than Class 2)?

This tells us if the root cause is SQL/Db2 internals or application/transaction logic outside Db2.

Dispatch: If Class 1 surge is app-driven, involve CICS/DDF Application Team

Reference: Investigating Class 2 CPU Times

Example:

  • Confirmed: Class 1 is drifting away from Class 2 → the extra time is spent outside Db2, although both classes are increasing.
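As a rough sketch of this breakdown, the snippet below splits Class 1 CPU into the part spent inside Db2 (Class 2) and the remainder spent outside Db2, then compares the growth of each part against the baseline. The numbers are invented for illustration, and the sketch deliberately ignores zIIP time, which may need to be added depending on how your accounting reports present specialty engine CPU.

```python
def cpu_breakdown(class1_cpu, class2_cpu):
    """Split Class 1 CPU into time inside Db2 (Class 2) and time outside Db2."""
    return {"in_db2": class2_cpu, "outside_db2": class1_cpu - class2_cpu}


def growth(before, after):
    """Return the absolute increase of each component."""
    return {name: after[name] - before[name] for name in before}


# Hypothetical CPU seconds for the same interval (baseline vs. current).
baseline = cpu_breakdown(class1_cpu=1000.0, class2_cpu=700.0)
current = cpu_breakdown(class1_cpu=2300.0, class2_cpu=1300.0)

delta = growth(baseline, current)
dominant = max(delta, key=delta.get)

print(delta)
print("largest increase:", dominant)
```

Here both components grow, but the time outside Db2 grows more, which is the pattern described in the example above.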




Step 3 — Volume vs Efficiency - Check transaction volume vs. CPU per transaction

We need to separate volume effect (more transactions) from efficiency effect (same volume but more CPU per unit).

  • If transaction volume increased, CPU rise might be expected (though still worth checking).
  • If CPU per transaction increased, that’s a strong indicator of SQL plan changes, data access path drift, or application logic loops.

Dispatch: If no SQL regression is found, no immediate SQL tuning is needed; focus on concurrency and resource contention.

Reference: Db2 Accounting and Response Times

Example:

  • Confirmed: Transaction count doubled compared to previous Fridays and CPU surge correlated → volume surge primary driver.
  • Confirmed: CPU per transaction stable → no SQL regression.
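One way to separate the two effects is a simple decomposition of the CPU change into a volume term and an efficiency term. The sketch below assumes you already have transaction counts and total CPU seconds for the baseline and current intervals; the figures are hypothetical.

```python
def decompose_cpu_change(base_txn, base_cpu, cur_txn, cur_cpu):
    """Split the CPU increase into a volume effect and an efficiency effect.

    volume effect     = extra transactions at the baseline CPU per transaction
    efficiency effect = current transactions times the change in CPU per transaction
    The two terms add up exactly to cur_cpu - base_cpu.
    """
    base_cpu_per_txn = base_cpu / base_txn
    cur_cpu_per_txn = cur_cpu / cur_txn
    volume_effect = (cur_txn - base_txn) * base_cpu_per_txn
    efficiency_effect = cur_txn * (cur_cpu_per_txn - base_cpu_per_txn)
    return volume_effect, efficiency_effect


# Hypothetical values: transaction count doubled, CPU per transaction almost flat.
volume_effect, efficiency_effect = decompose_cpu_change(
    base_txn=50_000, base_cpu=1_000.0, cur_txn=100_000, cur_cpu=2_050.0
)

print(round(volume_effect, 1), round(efficiency_effect, 1))
```

When the volume term dominates, as in this sample, the surge is mostly explained by more transactions. A large efficiency term would instead point toward access path changes or application logic.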




Step 4 — Suspension & Not Accounted Time - Investigate the drivers of suspension and not accounted time

We need to see if the extra waits are due to Coupling Facility (GBP), lock contention, or I/O bottlenecks.

Suspected Causes:

  • IRLM lock contention (commit/rollback churn).
  • Buffer pool latch contention (I/O increase).

Dispatch: Db2 Systems Programmer for IRLM priority and latch analysis.

Reference: Investigating Class 3 Suspension Time

Example:

  • Confirmed: Lock/latch spike correlates with Not Accounted Time
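The correlation check in this step can be approximated as follows. Not accounted time is commonly derived as Class 2 elapsed time minus Class 2 CPU time minus Class 3 suspension time; the sketch uses that approximation and a plain Pearson correlation. The per-interval samples are invented, and the column layout is an assumption made for this example.

```python
from math import sqrt


def not_accounted(class2_elapsed, class2_cpu, class3_suspension):
    """Approximate not accounted time from Class 2 and Class 3 values."""
    return class2_elapsed - class2_cpu - class3_suspension


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-interval samples, in seconds:
# (class2_elapsed, class2_cpu, class3_total_suspension, lock_latch_suspension)
intervals = [
    (100.0, 40.0, 50.0, 5.0),
    (120.0, 45.0, 60.0, 8.0),
    (200.0, 60.0, 100.0, 30.0),
    (260.0, 70.0, 130.0, 50.0),
]

nat = [not_accounted(elapsed, cpu, susp) for elapsed, cpu, susp, _ in intervals]
lock_latch = [ll for *_, ll in intervals]
r = pearson(nat, lock_latch)

print(nat)
print(round(r, 3))
```

A coefficient close to 1 only indicates that the two series move together; it supports, but does not prove, the lock/latch hypothesis that the Db2 Systems Programmer would then verify.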




What Next?

For a Site Reliability Engineer, the investigation would stop here. Db2 and application Subject Matter Experts would take over, using the OMEGAMON Web UI Db2 dashboards to look at packages, application metrics, buffer pools, CF stress, and so on.

Further Lock/Latch analysis would reveal a runaway SYSLH200 dynamic SQL package with heavy commit/rollback.

This dynamic SQL flood, and the possible deadlocks, show that the transaction volume itself was unusual and anomalous in this case.

After rolling back to the previous version, the Application DBA and Developer would find that the cause was a new JDBC call with a 9-way join and a bad access path.

Highlighted by the AI models 3 days before!

Based on a real and authorized customer dataset, the AI models would have detected the divergence 3 days before the obvious spike, saving 45k CPU seconds of overconsumption.




We want to hear from you!

Have you faced hidden performance issues? Curious how AI could help?
👉 Share your story on IBM Idea portal or request a demo today.

📖 Read how OMEGAMON AI makes it possible to solve problems before they impact the end user experience

🛠️ Explore the product: IBM Z OMEGAMON AI Insights official documentation and release notes


#monitoring, #ArtificialIntelligence(AI), #IBMZ, #OMEGAMON, #Db2, #AnomalyDetection

@Matthias Tschaffler, @Ash Mahay, @Jim Porell, @Anna Murray, @Fabien Gautreault
