Written by Todd Havekost on November 12, 2025
Countless high-volume business-critical applications depend on the reliability, scalability, and responsiveness delivered by today’s mainframe environments. IMS subsystems often service workloads with thousands of transactions per second or more. Just as an automobile accident can snarl up freeway traffic for a long time, erratic transaction performance for high-volume workloads can have a ripple effect, causing outages or service delays that negatively impact customer satisfaction and business growth. As a result, effective performance management of high-volume transaction workloads is an essential aspect of a high-functioning mainframe environment.
This blog presents concepts designed to enhance IMS performance management in your environment. Numerous scenarios will be presented, illustrated with reports from IBM Z IntelliMagic Vision.
Leveling Up from Guesswork to Substantiated Conclusions
Experienced mainframe performance analysts appreciate the rich measurement data produced by z/OS system and subsystem components. As you grow your skills and experience, you will become progressively less reliant on guesses and assumptions, and more proficient at reaching conclusions that are substantiated by objective Z measurement data.
“The lay of the land”
An effective performance analyst will quickly develop a strong awareness of key aspects of the overall workload. You will be able to answer questions such as:
- What are your highest volume transactions?
  - Are there correlations between transaction volumes and user complaints about their response times?
- What are your top CPU-consuming transactions?
  - Those transactions may be more sensitive to capacity constraints.
  - They are also likely to be first in line for scrutiny when CPU reduction initiatives are launched.
Figure 1 provides example “Top 10 Volume” (left) and “Top 10 CPU” (right) views for an IMS workload. Note that the identity and sequence of transactions are often quite different between the two lists.
Figure 1: Top 10 IMS Transaction Volume and CPU views.
Views like these of the top workload drivers provide a strong foundation for analysis, focusing attention on core workloads and protecting against side analyses that deliver minimal value.
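If your IMS transaction-level data has been exported to a tabular form, producing top-10 lists like these yourself is straightforward. The sketch below uses pandas against an assumed CSV export with hypothetical column names (tran_code, count, cpu_seconds); it is not the IntelliMagic Vision report, just an illustration of the aggregation behind such views.

```python
# Minimal sketch: rank IMS transactions by total volume and by total CPU.
# "ims_transactions.csv" and its columns are assumptions for illustration.
import pandas as pd

df = pd.read_csv("ims_transactions.csv")  # assumed columns: tran_code, count, cpu_seconds

top_by_volume = (df.groupby("tran_code")["count"]
                   .sum()
                   .sort_values(ascending=False)
                   .head(10))

top_by_cpu = (df.groupby("tran_code")["cpu_seconds"]
                .sum()
                .sort_values(ascending=False)
                .head(10))

print("Top 10 by volume:\n", top_by_volume)
print("Top 10 by CPU:\n", top_by_cpu)
```

The two rankings can then be compared directly to see which high-volume transactions are (or are not) also heavy CPU consumers.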
Understanding Baseline, ‘Normal’ Behavior
Another important element of performance management is understanding “normal” behavior, i.e., baseline response time profiles and activity levels.
The first chart in Figure 2 shows that transaction VNML had a response time spike around 12:30 AM during the selected day, which on its own might seem to warrant additional analysis. But the second chart with a time-of-day profile across the past two weeks shows the response time commonly runs much higher during that early morning interval. We may still choose to investigate the factors leading to those higher response times, but being armed with an understanding of the common profile ensures we don’t waste analysis time assuming this was a unique occurrence on the originally selected day.
Figure 2: IMS Transaction Response Times.
While transaction response times are a key metric (especially because they are visible to users), activity metrics can also be very helpful in establishing baselines and identifying deviations from that normal behavior. The first chart in Figure 3, covering a two-week period, shows a relatively consistent rate of IMS transaction processing across weekday prime shift hours, but two days with significantly higher volumes in the evening hours. Noticing that those days were both Fridays raises the possibility that the workload profile may differ by day of the week. The second chart confirms that hypothesis, showing a significantly higher baseline activity level on Friday evenings compared to other weekdays.
Figure 3: IMS Transaction Volumes by Date and Day of Week.
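A minimal sketch of how such baseline profiles could be derived from interval-level data follows, assuming a CSV export with hypothetical columns (timestamp, tran_code, count, avg_response_ms). The hourly median gives a robust time-of-day baseline for a transaction like VNML, and a weekday pivot surfaces day-of-week patterns such as the Friday evening volume.

```python
# Sketch of building "normal" baselines from two weeks of interval data.
# File name and columns are assumptions for illustration.
import pandas as pd

df = pd.read_csv("ims_intervals.csv", parse_dates=["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.day_name()

# Time-of-day response time profile for VNML across the full two weeks
vnml = df[df["tran_code"] == "VNML"]
tod_profile = vnml.groupby("hour")["avg_response_ms"].median()

# Day-of-week volume profile: is Friday evening really different?
dow_profile = df.pivot_table(index="hour", columns="weekday",
                             values="count", aggfunc="sum")

print(tod_profile)
print(dow_profile[["Friday", "Wednesday"]])
```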
Isolating anomalous behavior
When tasked with investigating a slow response incident, a systematic approach to isolating abnormal behavior enhances the problem determination process. This type of analysis is aided by tooling that enables you to rapidly answer the questions that inevitably arise at each step in the process.
Consider the example in chart 1 of Figure 4 where the IMS DWWI transaction has a response time spike at 4:15 PM on the selected day. A systematic approach to determining the scope of an issue contributes to speedy and effective root cause analysis by immediately eliminating some possible causes and (utilizing law enforcement terminology) suggesting an initial set of “persons of interest.”
In this scenario, a first question might be “did the slowdown occur across all IMS subsystems or just a subset?” Chart 2 indicates all IMS subsystems were impacted. Slow response times commonly result from CPU constraints on a CPC or z/OS system, which would point to further analysis of RMF metrics such as hardware CPU utilization (from RMF 70 data) and the WLM performance index for the service class where the IMS workload executes (RMF 72.3). But armed with the configuration understanding that these IMS subsystems reside across multiple CPCs and systems, chart 2 immediately rules out that scenario by showing significant increases in elapsed time across all the serving IMS subsystems.
Instead, this information points our attention toward shared resources that could generate sysplex-wide impact. This commonly involves infrastructure elements involved in data access such as disk subsystems, databases, database managers, or serialization mechanisms associated with data.
Figure 4: Problem isolation process for transaction response time.
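The scope question itself lends itself to a simple tabulation. The sketch below, again against a hypothetical interval-level export (assumed columns: timestamp, ims_id, tran_code, avg_elapsed_ms), pivots DWWI elapsed times by IMS subsystem so that a spike appearing in every column (suggesting a shared resource) is immediately distinguishable from a spike in just one.

```python
# Sketch of a scope check: was the 4:15 PM spike confined to one IMS
# subsystem or visible on all of them? File name and columns are assumptions.
import pandas as pd

df = pd.read_csv("ims_intervals.csv", parse_dates=["timestamp"])
dwwi = df[df["tran_code"] == "DWWI"]

# Elapsed time per subsystem per interval; a rise in every column suggests
# a shared resource, a rise in one column points at that subsystem.
by_subsystem = dwwi.pivot_table(index="timestamp", columns="ims_id",
                                values="avg_elapsed_ms", aggfunc="mean")
print(by_subsystem.loc["2025-11-12 16:00":"2025-11-12 16:30"])  # assumed date
```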
The IMS transaction data indicates the Program Type for transaction DWWI is CICS DBCTL, the facility that enables CICS transactions to access IMS DL/I databases. This application involves CICS transaction DWWI performing SQL calls to access Db2 data as well as IMS transaction DWWI issuing calls to IMS DL/I databases. Chart 3, ‘CICS Transaction Response Time,’ indicates a spike in CICS response time at the same 4:15 PM interval. The CICS timing buckets indicate a condition where the transaction was dispatched (no CICS wait conditions were recorded) but was unable to execute for some reason not visible to the CICS region.
Knowing that this CICS transaction utilizes Db2 SQL calls to access data, we turn to the (comprehensive) Db2 accounting data for Connection Type CICS Call Attach and Correlation ID DWWI (chart 4). There we find huge delays at the time in question due to Local Lock Contention (in orange) and Global Contention for L-Locks (in pink). Since the unit of work cannot be committed until all data accesses have completed for both IMS DL/I (not visibly delayed) and Db2 (seriously delayed), the overall response time for IMS transaction DWWI reports the spike we have been analyzing.
Identifying that the transaction elapsed time was seriously delayed across all IMS subsystems pointed us to look at shared resources, and indeed we found data access was delayed due to serialization contention.
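For readers who want to see the shape of that drill down outside a tool, here is a sketch of filtering Db2 accounting-derived data down to the CICS call attach work for correlation ID DWWI and breaking out the suspension components; the file and column names are assumptions for illustration.

```python
# Sketch of the Db2 side of the drill down: isolate accounting records for
# the CICS call attach work running as DWWI and examine suspension times.
import pandas as pd

acct = pd.read_csv("db2_accounting.csv", parse_dates=["timestamp"])
dwwi = acct[(acct["connection_type"] == "CICS Call Attach") &
            (acct["correlation_id"] == "DWWI")]

suspensions = (dwwi.groupby("timestamp")[["local_lock_wait_ms",
                                          "global_llock_wait_ms",
                                          "sync_io_wait_ms"]]
                   .mean())

# The intervals with the largest lock waits should line up with the
# 4:15 PM response time spike seen on the IMS and CICS sides.
print(suspensions.sort_values("local_lock_wait_ms", ascending=False).head())
```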
Note that tooling that supports dynamic navigation, context-sensitive drill downs, and integrated visibility for components across the Z platform enables the types of problem isolation shown in this section to be completed rapidly and in an intuitive manner. Ultimate identification of root cause will often continue to rely on expertise provided by experienced analysts.
Optimizing performance and efficiency
Having a good handle on transaction response times and profiles is a solid foundation, but the contributions of an effective performance analyst can extend to a much broader scope. In this final section we will introduce considerations of efficiency, utilizing health checks to get out in front of potential problems, and workload balancing and distribution with its sister discipline of designing for high availability.
Transaction workloads are often primary CPU consumers, and since mainframe expenses commonly correlate to CPU consumption (at least over the longer term), performance analysts can make important contributions to cost-reduction initiatives through identifying CPU efficiency opportunities. Early in this article the importance of familiarity with top 10 lists of transactions by volume and CPU consumption was highlighted. Identifying sizable changes in CPU per transaction (particularly increases) for these top transactions should be an important focus area, especially in connection with application release implementations.
Figure 5 compares the CPU time per transaction for the highest volume IMS transaction across two Mondays. If a new release of that application was implemented over the intervening weekend, this view confirms no material change in its CPU profile. Identifying any sizable jump in CPU per transaction soon after implementation for high volume transactions can provide a jump start for important follow-on analysis. (Ideally, large CPU increases would be proactively identified in a development environment prior to Production implementation, but unless your site has mature DevOps processes, that can be challenging to achieve.)
Figure 5: Comparison of IMS CPU per transaction across weeks.
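The underlying comparison is easy to reproduce from interval data. The sketch below (hypothetical columns: timestamp, tran_code, count, cpu_seconds) computes CPU per transaction by hour for two Mondays on either side of an assumed release weekend and reports the percentage change.

```python
# Sketch: compare CPU per transaction for one transaction across two Mondays.
# File name, columns, transaction code, and dates are illustrative assumptions.
import pandas as pd

df = pd.read_csv("ims_intervals.csv", parse_dates=["timestamp"])
tran = df[df["tran_code"] == "VNML"].copy()
tran["cpu_per_tran_ms"] = tran["cpu_seconds"] * 1000 / tran["count"]
tran["hour"] = tran["timestamp"].dt.hour

before = tran[tran["timestamp"].dt.date == pd.Timestamp("2025-11-03").date()]
after = tran[tran["timestamp"].dt.date == pd.Timestamp("2025-11-10").date()]

comparison = pd.DataFrame({
    "before": before.groupby("hour")["cpu_per_tran_ms"].mean(),
    "after": after.groupby("hour")["cpu_per_tran_ms"].mean(),
})
comparison["pct_change"] = (comparison["after"] / comparison["before"] - 1) * 100
print(comparison)
```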
Tooling that delivers programmatic solutions to identify significant changes can expand the scope of this type of analysis. In Figure 6, several key IMS transaction metrics for the current day are compared to the prior 30 days, and changes that exceed 2 standard deviations are highlighted. Note the statistically significant increase in elapsed time per transaction in the top row of the report and the significant CPU per transaction and elapsed time increases for the transaction in the bottom row.
Figure 6: Programmatic change detection for key IMS transaction metrics.
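A bare-bones version of that change detection, assuming daily per-transaction summaries with hypothetical column names, might compare today's value for each metric against the prior 30-day mean and flag anything more than 2 standard deviations away:

```python
# Sketch of simple statistical change detection for key IMS transaction metrics.
# "ims_daily_by_tran.csv" and its columns are assumptions for illustration.
import pandas as pd

daily = pd.read_csv("ims_daily_by_tran.csv", parse_dates=["date"])
today = daily["date"].max()
history = daily[daily["date"] < today]
current = daily[daily["date"] == today].set_index("tran_code")

flags = []
for metric in ["cpu_per_tran_ms", "elapsed_per_tran_ms"]:
    stats = history.groupby("tran_code")[metric].agg(["mean", "std"])
    z = (current[metric] - stats["mean"]) / stats["std"]
    flagged = z[z.abs() > 2]  # more than 2 standard deviations from the mean
    for tran, score in flagged.items():
        flags.append((tran, metric, round(score, 1)))

print(flags)
```

In practice a robust implementation would also guard against tiny standard deviations and low-volume transactions, but the core idea is this simple.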
Continuing the themes of becoming more proactive and of leveraging tooling to enhance performance management, another big step toward maturity is replacing after-the-fact diagnosis of slow response incidents with proactive actions that avoid them entirely. In most cases when there is a major outage, a subsequent post-mortem discovers one or more key metrics that had started showing signs of stress days before the outage occurred. This highlights the potential value of automated assessments of key metrics to identify potential risks to availability and performance. There are too many metrics and too many components across the z/OS infrastructure to succeed at that task with a manual approach.
Figure 7 shows a health assessment of several key IMS pool-related metrics: two buffer pool hit ratios (where high is good) and two pool utilizations (where high is bad). The thresholds used here are based on the recommendations IBM IMS experts use when they are called upon to perform manual health checks for customers. The yellow icon indicates the exception is occurring at a warning level, but only during a limited number of measurement intervals. The capability to have ongoing automated health assessments (rather than having to wait for IBM experts to be available to perform health checks every year or two) can help sites manage availability and performance in a more proactive manner.
Figure 7: Health assessment of IMS pool metrics.
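The mechanics of such a health check are straightforward once the pool metrics are available at the interval level. The following sketch applies illustrative warning thresholds (not IBM's published values) to assumed pool metric columns and reports how often each one was breached.

```python
# Sketch of a threshold-based health check over IMS pool metrics, in the
# spirit of Figure 7. Thresholds, file name, and columns are assumptions.
import pandas as pd

pools = pd.read_csv("ims_pool_metrics.csv", parse_dates=["timestamp"])

checks = [
    # (column, direction, warning threshold): "lt" means low is bad
    # (hit ratios), "gt" means high is bad (pool utilizations).
    ("osam_buffer_hit_pct", "lt", 90.0),
    ("vsam_buffer_hit_pct", "lt", 90.0),
    ("psb_pool_util_pct",   "gt", 70.0),
    ("dmb_pool_util_pct",   "gt", 70.0),
]

for ims_id, grp in pools.groupby("ims_id"):
    for column, direction, threshold in checks:
        bad = grp[column] < threshold if direction == "lt" else grp[column] > threshold
        pct_intervals = 100 * bad.mean()  # share of intervals breaching the threshold
        if pct_intervals > 0:
            print(f"{ims_id}: {column} breached its warning threshold "
                  f"in {pct_intervals:.0f}% of intervals")
```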
From this point, a next analytical step might involve generating time charts of these metrics for a selected IMS subsystem (Figure 8) that can serve as a launching point for further analysis of one of the metrics.
Figure 8: Time charts of IMS pool metrics.
One important aspect of infrastructure and application design to deliver high availability is to distribute processing of that application across multiple hardware and software components. Ensuring the workload is distributed across those components in a relatively balanced manner is required to realize the benefits of that design, namely, that the loss of any single hardware or software component will impact only a minor percentage of the overall workload.
An example of workload balance analysis for an IMS workload appears in Figure 9. Chart 1 shows that the intended design of relative balance across six IMS subsystems is not being realized. Instead, a disproportionate share of the CPU activity is taking place on a single IMS subsystem (PRDA).
Figure 9: Analysis of IMS workload balance.
Drilling down on Region Type for the PRDA subsystem (chart 2) indicates most of the CPU consumption is driven by CICS DBCTL, the interface which enables CICS transactions to access IMS DL/I databases. Displaying the CICS DBCTL CPU consumption across all six subsystems (chart 3) confirms the suspicion that almost all of it is occurring on PRDA. Analyzing the work executing in IMS message processing regions (MPRs) (chart 4) shows the intended relatively balanced pattern. This analysis equips the infrastructure team to take the necessary actions to distribute the CICS DBCTL workload more evenly across the IMS subsystems.
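The balance analysis in Figure 9 boils down to computing each subsystem's share of CPU, both overall and by region type. A sketch against a hypothetical region-level export (assumed columns: ims_id, region_type, cpu_seconds) follows.

```python
# Sketch of a workload balance check across IMS subsystems.
# File name, column names, and region type labels are assumptions.
import pandas as pd

df = pd.read_csv("ims_region_cpu.csv")

# Each subsystem's share of total CPU; a balanced six-way design
# would sit near 17% each.
overall_share = (df.groupby("ims_id")["cpu_seconds"].sum()
                   .pipe(lambda s: 100 * s / s.sum())
                   .sort_values(ascending=False))
print(overall_share)

# Share of CPU by region type, to see which work type drives the skew.
by_region_type = df.pivot_table(index="ims_id", columns="region_type",
                                values="cpu_seconds", aggfunc="sum")
share_by_type = 100 * by_region_type / by_region_type.sum()
print(share_by_type[["CICS DBCTL", "MPR"]])
```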
Visualization based on current measurement data can also be very helpful in managing the hundreds (or more) of hardware and software components and the multitude of connections between them. A solid high availability design may have been initially implemented for a given business application. But in the intervening years, has configuration drift occurred that invalidates that design?
One such topology view including CPCs, systems, and IMS, CICS, Db2, and MQ subsystems appears in Figure 10. As long as the IMS workload is relatively evenly distributed across the six IMS subsystems displayed here, this diagram validates that a solid high availability design for IMS is currently in place, one in which at least 50% of the workload will continue to be processed despite the loss of any single component up or down the stack.
Figure 10: Subsystem and system topology.
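The 50% property itself can also be checked programmatically from a component-to-workload mapping. The toy sketch below uses an invented six-subsystem topology and illustrative workload shares to test whether losing any single IMS subsystem, z/OS system, or CPC would take out more than half of the workload.

```python
# Toy sketch: validate the "lose any single component, keep at least 50%
# of the workload" property. The topology and shares are invented examples.
topology = [
    # (ims_id, z/OS system, CPC, workload share %)
    ("PRDA", "SYS1", "CPC1", 17), ("PRDB", "SYS1", "CPC1", 16),
    ("PRDC", "SYS2", "CPC1", 17), ("PRDD", "SYS3", "CPC2", 17),
    ("PRDE", "SYS4", "CPC2", 16), ("PRDF", "SYS4", "CPC2", 17),
]

for level, name in [(0, "IMS subsystem"), (1, "system"), (2, "CPC")]:
    shares = {}
    for row in topology:
        shares[row[level]] = shares.get(row[level], 0) + row[3]
    worst = max(shares, key=shares.get)
    verdict = "OK" if shares[worst] <= 50 else "HA design at risk"
    print(f"Losing {name} {worst} drops {shares[worst]}% of the workload -> {verdict}")
```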
Enhancing IMS Performance and Availability through SMF Data
The article Thirty years of DNA forensics: How DNA has revolutionized criminal investigations captures the fascinating story of how DNA technology was first used to apprehend a murderer, in 1987. It goes on to say that “DNA profiling has become the gold standard in forensic science since that first case 30 years ago.” As indicated in the article title, DNA is a great example of how leveraging insights provided by data can inform and indeed “revolutionize” an entire discipline.
The rich measurement data produced by Z system components is a great strength of the platform and provides excellent raw material for the practice of performance management. This blog has shown how that data can be applied to enhance performance and availability for IMS.