Written by Todd Havekost on May 31, 2021.
IBM's introduction of the Enterprise Consumption (EC) solution within Tailored Fit Pricing (TFP) brings with it a fundamental change in how sites should approach managing and reducing software expense. Instead of being driven only by CPU usage during the peak rolling four-hour average (R4HA) interval each month, as under the current R4HA model, software under EC is charged based on the total accumulated CPU consumption for the entire month (at a much lower unit price).
The fact that all general purpose CPU consumption is chargeable under EC brings far-reaching changes in how one approaches CPU consumption and expense reduction. A multitude of considerations that were not relevant under the R4HA model are now entirely relevant, because the benefits of CPU reduction initiatives extend to all workloads that run at any time.
Previous articles in this series have focused on approaches to reporting under EC (‘Reporting on Tailored Fit Pricing Enterprise Consumption’ in Tuning Letter 2020 No. 3) and potential opportunities for reducing CPU that are often within the control of infrastructure teams (‘Potential Infrastructure CPU Reduction Opportunities in an Enterprise Consumption’ in Tuning Letter 2020 No. 4). This article will explore potential tuning opportunities involving analysis at the application and individual job levels.
Like its predecessors, this is by design a “pick and choose” article that includes a wide-ranging set of tuning actions that have achieved significant CPU reductions in real-life customer environments. We don’t expect that every idea will be applicable to every site. Our goal is to provide a menu of ideas, and we hope that each reader will find some that are appealing and applicable in their environment. This is a far-from-comprehensive list, but we hope that these ideas will suggest other related adaptations. Also, while the primary audience of this article is staff at sites that have already adopted EC, these tuning ideas can also benefit readers seeking to reduce CPU for any reason, including hardware capacity constraints, peak R4HA reduction, batch window constraints, and so on.
The realization that “all CPU is in scope” could generate a nearly unlimited number of opportunities, resulting in a state of paralysis. Analysts involved in leading CPU reduction efforts will have to focus limited staff resources on the top potential savings opportunities. In addition to awareness of common types of optimization opportunities as discussed in these articles, key success factors also include visibility into top CPU consumers (so you know where to focus your attention), and early notification of significant CPU increases (so that you can ‘nip the problem in the bud’).
A “table of contents” follows, with links to the various job-level or application-level CPU reduction ideas in this article. You may want to scan the article sequentially or skip ahead to topics of immediate interest:
- ‘Write or Rewrite Programs in Java’
- ‘Reduce Job Abends’
- ‘Compile / Recompile Programs With Newer ARCH values’ (this is included in this category because recompiling with a different compiler version or different ARCH level involves application team testing)
- ‘Proactively Identify CPU Surges by Applications’
- ‘Database Tuning’
- ‘Non-database Data Set Tuning’
- ‘DFSMShsm Optimization’
- ‘Sort Tuning’
Write or Rewrite Programs in Java
Since Java programs are eligible to execute on zIIP engines, and work that executes on zIIPs does not incur software expense, writing new programs or rewriting high CPU-consuming programs in Java can present a significant opportunity to reduce MSU consumption. (And we want to mention that since zIIP engines always run at full speed, this delivers an additional performance benefit on sub-capacity models, where general-purpose central processors (GCPs) run at a lower speed.)
Great progress has been made in addressing many of the factors that in the past have inhibited leveraging Java on z/OS. IBM has been architecting significant Java performance and efficiency improvements into IBM Z hardware and software for many models and versions now, with an average 20% throughput improvement just from z14 to z15 (for more information, see SHARE in Fort Worth 2020 Session 26542, Java on IBM z15 Update, by Rahil Shah). Additionally, IBM middleware continues to enhance Java support and interoperability with existing applications (see SHARE Virtual Summit 2021 Session 28608, Java Update for CICS TS V5.6, by Phil Wakelin).
One site that we work closely with has had great success developing new RESTful APIs to z/OS resources in Java as part of their overall mainframe modernization journey. As a result of this initiative in which they leveraged z/OS Connect as their primary RESTful API provider, they now have a 1700 MSU workload executing on zIIPs that would otherwise have been generating software expense on GCPs.
In another example, Mario Bezzi kindly provided us with a Java version of our RMFPACK utility that compresses SMF data for input into IntelliMagic Vision. After converting to the Java version, general purpose CPU for one customer decreased from 11 CPU minutes to 4 CPU seconds. Another customer was only sending SMF data every several hours in order
to keep the RMFPACK CPU out of their peak R4HA interval during prime shift. Now thanks to Mario's Java version of the utility, that CPU is insignificant and no longer limits them from processing and sending their SMF data more frequently.
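If you want to experiment with running batch work in Java, the JCL side is straightforward. Here is a minimal, hypothetical sketch using BPXBATCH (the paths, class name, and heap size are purely illustrative; many sites prefer the JZOS batch launcher instead, which provides tighter integration with JES and traditional data sets):

//* Hypothetical: run a Java program as a batch job step via BPXBATCH.
//* All paths and the class name are illustrative only.
//JAVASTEP EXEC PGM=BPXBATCH,REGION=0M,
//   PARM='SH java -Xmx512m -cp /u/appl/myapp.jar com.example.Main'
//STDOUT   DD PATH='/u/appl/logs/myapp.out',
//            PATHOPTS=(OWRONLY,OCREAT,OTRUNC),PATHMODE=SIRWXU
//STDERR   DD PATH='/u/appl/logs/myapp.err',
//            PATHOPTS=(OWRONLY,OCREAT,OTRUNC),PATHMODE=SIRWXU

Remember that the objective is zIIP offload, so it is worth verifying with the SMF type 30 data that the bulk of such a step’s CPU time is actually being recorded as zIIP time.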
Reduce Job Abends
Another consideration of the “all CPU is chargeable” change brought about by EC is that jobs that abend and have to be rerun now represent a real financial cost. With an R4HA-based pricing model, there is no direct impact on your software bill if the job is rerun outside the peak R4HA. In fact, if your peak R4HA is during the prime shift, you can rerun the job as many times as you like without impacting your bill, as long as the reruns happen during the batch window. With an accumulated consumption model, every rerun contributes something to your bill, so this is a topic that we expect will receive heightened attention as more sites transition to TFP.
The cost saving benefit of avoiding job reruns provides additional impetus to the traditional rationale for wanting to reduce abends; namely, avoiding elongating batch job elapsed times and potential impacts on online transaction availability. And the delays and costs are not limited to the impact of re-running the job; there is also the elapsed time to investigate and address the cause of the failure. Other impacts include delaying downstream work that is dependent on the failed job, manual processes such as report deliveries that can’t occur until the job ends successfully, and pushing batch CPU consumption into the prime shift and increasing the peak R4HA.
Figure 1 illustrates the type of report that can shed light on potential opportunities for CPU savings, sorted by abend code.
Figure 1 - Resource consumption by abended jobs by Completion Code (© IntelliMagic Vision)
Further analysis by job name within a given abend code (as shown in Figure 2) can then identify abends that are potentially avoidable. For example, repeated instances of job cancellations accompanied by high CPU consumption may reflect looping conditions that warrant investigation and potentially stricter controls. Another common category is x37 abends, caused by a lack of free disk space or very high levels of fragmentation.
Figure 2 - Resource consumption for abend '0222' jobs by address space name (© IntelliMagic Vision)
Compile / Recompile Programs With Newer ARCH values
Since IBM Z machine cycle speeds have been more or less flat since the zEC12, and are likely to remain that way for the foreseeable future, increases in capacity delivered by each new processor generation are increasingly dependent on other factors, including exploitation of instructions added to the architecture for each new processor model. Over recent generations in particular, the IBM chip designers have been working closely with the compiler developers to identify frequently used high level language statements and add new CPC instructions to optimize the speed of those statements. This is illustrated in Figure 3. Each time a new set of hardware instructions is added, that becomes a new ‘architecture level’ - the architecture level (ARCH) for each CPC since the z9 is shown in the table in this figure.
Figure 3 - New instructions by CPC generation
For the COBOL compiler, the historical default has been ARCH(7), which corresponds to a z9 processor that was introduced in 2006 and withdrawn from marketing in 2010 - hardly bleeding edge! As you can see in the chart above, if a program is compiled for the z9 architecture level, it only has access to about 60% of the total Z instruction set, and it is unable to exploit the most recent instructions that were designed specifically to optimize the performance of high level language programs. Hopefully none of our readers are still running applications on a z9 or older CPC. But if your programs are still compiled with ARCH(7) (or no ARCH level at all), they are still using the same instructions that they did on a z9, meaning that you are very likely missing out on significant efficiencies.
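For reference, the ARCH level is simply a compiler option, so the change itself is small (the testing effort, discussed later in this section, is the real work). A minimal sketch of a recompile, assuming IBM’s supplied IGYWCL compile-and-link procedure and illustrative data set names:

//* Recompile with a higher architecture level (here ARCH(12) = z14).
//* Only raise ARCH once ALL CPCs that could run the program,
//* including any DR machines, are at or above that level.
//COMPILE EXEC IGYWCL,
//   PARM.COBOL='ARCH(12),OPT(2)'
//COBOL.SYSIN  DD DISP=SHR,DSN=APPL.SOURCE(MYPROG)
//LKED.SYSLMOD DD DISP=SHR,DSN=APPL.LOADLIB(MYPROG)

The disaster recovery caveat in the comments is important - the ARCH value must reflect the oldest CPC that the program could ever run on, or you risk operation exceptions on the older hardware.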
Explaining the increased importance of remaining current on compilers and ARCH level settings, IBM’s David Hutton says “hardware/software synergy is essential in this post-generational-frequency-bump era of computing” (from David’s SHARE in Phoenix 2019 Session 24517, IBM Z Microprocessor Optimization Primer).
For one example, in ARCH 11 and lower levels, packed decimal arithmetic can only be performed using in-memory data, or by converting the data back and forth to Decimal Floating Point (DFP). In ARCH 12 (supported by z14 and higher), the new vector packed decimal facility enables the compiler to perform native packed decimal arithmetic on data in registers. This provides the performance advantages of using registers instead of memory, while eliminating the overhead of converting data back and forth between packed decimal and DFP.
What does this mean to your applications? A good IBM answer - it depends. If you have a COBOL program that does nothing but packed decimal arithmetic, it would do the same work with about 65% less CPU time if you compile it with ARCH(12) rather than ARCH(11). But we don’t imagine that you have many programs that do nothing but packed decimal arithmetic. You probably have other programs that will use exactly the same CPU time with ARCH(12) as they did with ARCH(11) because they don’t use any instructions that were enhanced by Architecture Level 12. More than likely, your programs use a mix of COBOL instructions that have been enhanced by some of the new architecture levels, and some instructions that have not. So the overall performance benefit that you are likely to experience is somewhere between these two extremes.
Based on feedback from clients of ours that have moved from the old COBOL V4 to COBOL V6 and are now using ARCH(12) or ARCH(13):
- Some programs have seen a 50% reduction in CPU and elapsed times.
- Some have seen CPU times drop from minutes down into seconds.
- And some programs are consuming about the same amount of CPU time with COBOL V6 and ARCH(12) as they were with the old COBOL V4.
- The overall average CPU savings seems to be around the 20% range.
- Average elapsed time reductions are in the same ballpark.
Probably the most important takeaway from this is that average savings number of 20%. That value is also consistent with the improvements shown in Table 1. The values in that table are from the IBM Enterprise COBOL for z/OS, V6.3 Performance Tuning Guide, SC27-9202, and are based on IBM internal performance benchmarks. That document also noted that when moving from ARCH(8) directly to ARCH(13), the average performance benefit was 23%.
Table 1 - COBOL ARCH average improvements (© IBM Corp)
Tom Ross (‘Captain COBOL’) has some very interesting SHARE presentations that discuss the performance benefits of different architecture levels - a good one is SHARE in Fort Worth 2020 Session 26535, Reduce costs or improve performance with cutting edge COBOL offerings on z15. It does a good job of explaining why some programs benefit more than others. It also provides some more comparative performance numbers from IBM internal testing; however, the numbers only go back as far as ARCH 10, and it isn’t always easy to separate the benefit of the new instructions from the benefit of running on a faster CPC.
While the potential savings that can be achieved from recompiling programs with the highest ARCH value that is supported by all your CPCs are both real and substantial, we don’t want to understate the migration effort of moving from COBOL V4 to COBOL V6. This is not as simple as just recompiling your programs, and the compile-time options that you use can make the migration easier or harder. A good place to start would be the free migration workshop that Tom references in his presentation. You should also review ‘COBOL V6 Compiler Tips’ in Tuning Letter 2018 No. 4.
Note: The COBOL migration portal was recently updated to make it easier to use and to add more material and information. Even if you visited it before, we recommend that you go back and see if there is additional information that would be relevant for your situation.
Proactively Identify CPU Surges by Applications
As mentioned earlier in this series, having automated processes in place to provide early notification of significant changes in CPU consumption is a key success factor in effectively managing an EC environment. This is also very applicable for the address space and application focus of this article.
Figure 4 shows a situation where the CPU consumed by a production batch job on a given day increased to a level more than two standard deviations higher than its average over the previous 30 days.
Figure 4 - Job CPU Time Change Detection (© IntelliMagic Vision)
A next logical question might be to see if other production jobs in that application also experienced sizable increases. As is likely the case in most environments, job name standards can be leveraged to answer that question by examining other jobs with a JCA* prefix. As you can see in Figure 5, other jobs also experienced a CPU jump.
Figure 5 - Job CPU Time Change Detection report (© IntelliMagic Vision)
Another data point to help determine whether this may be worth investigating is whether this was a one-time bounce, or represents a sustained pattern. The view over the reference period in Figure 6 reflects an almost 10x increase that has now been sustained over several days and thus may warrant attention.
Figure 6 - Total CPU Usage Patterns (© IntelliMagic Vision)
Database Tuning
We find that the overwhelming majority of customer data is in Db2 these days. There is still a little VSAM and IMS, but in the clients that we work with (large and small), most production jobs and data use Db2. The problem with that, for me as a z/OS person, is that the SMF type 30 records provide very little information about batch jobs that use Db2. You can’t even see the name of the program that is being executed, or the total number of I/Os that were actually executed by, or on behalf of, that job.
Therefore, when looking for candidates for potential tuning, one of the things we look at is the long term trend of CPU time and elapsed time for jobs that use a significant amount of CPU time.
I know very little about Db2, but it seems to be reasonably common that the behavior of a program can change subtly over time. It might be because of logic changes in the application, or possibly because of workload changes. But whatever the reason might be, the data access pattern of the program is now different. This can result in Db2 not accessing the data as efficiently as it might have in the past, potentially resulting in a significant increase in CPU time. However, because these increases can happen relatively slowly over time, they can go unnoticed, especially if the jobs are not on the batch critical path.
A couple of examples that we recently encountered illustrate the savings that can be achieved if someone can find the time to do a little investigating.
We actually reported on the first one in a user experience called ‘Really Fine Tuning’ in Tuning Letter 2020 No. 4. In that case, the logic in the program had shifted subtly over the years, and the volume of data in some of the tables had increased a lot more than expected. So they determined that a new index was needed on one of the tables. Adding that index reduced the daily CPU time from around 5 hours to under 15 seconds. All credit for this improvement is due to the DBA that investigated the issue and identified what needed to be done to address it. I certainly think that this was a good investment of that person’s time.
We found a somewhat similar situation at another client. In that case, when we looked back over a year’s worth of SMF type 30 information for the job, we could clearly see a trend of slowly growing CPU time. When the DBA investigated, they found that the application had a table that was basically a log of transactions that just built and built over time, to the point where it contained over a year’s worth of information, even though the application only actually required a maximum of 30 days’ worth of records. It is very easy to see how this could be overlooked when the application was new and that table was tiny. And by the time the table had grown large enough to be a performance concern, the application had reached steady state and people’s attention had naturally shifted to other work. Adding a weekly job to delete old and unneeded rows from that table resulted in the CPU time for that step dropping from 35 minutes to just 5 minutes. Once again, all credit for this improvement goes to the DBA who looked into the application behavior and discovered the reason for the CPU time growth.
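As an illustration of that kind of housekeeping job (all of the subsystem, plan, and table names here are hypothetical - the actual fix at that site may well have been implemented differently), a weekly purge step could run the SQL through the IBM-supplied DSNTEP2 sample program:

//* Hypothetical weekly purge of rows more than 30 days old.
//* Subsystem, plan, and table names are illustrative; the
//* DSNTEP2 plan name varies by installation.
//PURGE    EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSIN  DD *
  DSN SYSTEM(DB2P)
  RUN PROGRAM(DSNTEP2) PLAN(DSNTEP12)
  END
//SYSIN    DD *
  DELETE FROM APPL.TRAN_LOG
    WHERE TRAN_DATE < CURRENT DATE - 30 DAYS;
/*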
In another study that we were involved in, one of the DBAs came to me after the project had been running for about a week. I was expecting to be told off because of all my annoying questions. Instead, she said to me “Wow, this is really fantastic. I normally spend my whole time fighting fires, but because you guys are asking all these questions I now have an excuse to go in and fix all of these issues.”
Previously, when everyone was using R4HA-based pricing options, it was difficult for technicians to justify the time to investigate and address issues like these, especially if the peak R4HA interval was during the prime shift. However, now that more sites are shifting to accumulated-consumption-based billing, it should be easier to cost-justify tuning exercises such as these. Not only do they result in better service for the end user, and save the company money on the software bills, this work is technically interesting and rewarding, so it is a real win-win.
I’ll finish this section with one last example. In this case, the CPU time had been gradually increasing over time. Then a workload change caused a much steeper increase in CPU time. Then another application change resulted in a huge drop in CPU time, even far below the steady-state CPU consumption before the first change. This is shown in Figure 7. The columns on the right side of the chart show the CPU time after the application change - it is difficult to see them because they are so small, but take my word for it that they do actually contain non-zero values. How fundamental was the program redesign that delivered such impressive results? The developer changed the sequence of table names on a SELECT statement! That’s it - can you imagine such a small change having such a dramatic impact?
Figure 7 - Application program CPU usage over time
The reason for reporting these real world experiences is not to turn you into database or SQL experts. Rather, the objective is to illustrate the savings that can be achieved with a little application and/or database fine tuning. I should also stress that some of the other Db2 programs that we looked at were already as highly tuned as they could be without a major redesign. So don’t think that we are saying that every Db2 program has huge potential for impressive tuning results. But in an EC environment, if you can save a few minutes from a job that runs every day, that can add up to a worthwhile capacity and software cost saving over a year.
Non-database Data Set Tuning
When we perform a general system tuning/MSU reduction project for a client, one of the things we always examine is the SMF type 30 records, looking for the ‘big hitters’ - the jobs that use the most CPU time or perform very high numbers of I/Os. We often find that one goes with the other - every I/O requires some amount of CPU time to initiate it and process the response, and programs that are very I/O intensive don’t get as much value from the Level 1 and Level 2 caches, which further increases their CPU time.
Depending on the type of data set that is the target of all the I/Os, there are different ways to help. For example, for sequential data sets, you can do the following (a combined sketch appears after this list):
- Ensure that the data set is using a system-determined blocksize.
- Use zEDC compression to reduce the number of bytes that need to be transferred back and forth to the disk subsystem.
- Consider using data set striping for very large data sets.
- Use the most efficient programs possible - for example, ICEGENER or SYNCGENR, rather than IEBGENER.
- If feasible, consider the use of pipes to send the sequential output of one program directly to the program that will then read that output.
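Tying a few of those items together, here is a minimal sketch of a copy step (data set names are illustrative) that uses ICEGENER in place of IEBGENER and requests a system-determined blocksize:

//* ICEGENER uses the same DD names as IEBGENER.
//COPY     EXEC PGM=ICEGENER
//SYSPRINT DD SYSOUT=*
//SYSUT1   DD DISP=SHR,DSN=PROD.DAILY.EXTRACT
//* BLKSIZE=0 (or omitting BLKSIZE) requests a system-determined
//* blocksize. zEDC compression would typically be requested via
//* an SMS data class set up for it, e.g. DATACLAS=ZEDC (the data
//* class name is illustrative).
//SYSUT2   DD DSN=PROD.DAILY.COPY,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(100,100),RLSE),
//            RECFM=FB,LRECL=80,BLKSIZE=0
//SYSIN    DD DUMMY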
These and other options were discussed in ‘Spring Clean Your Data Sets With DCOLLECT and a Little SMF’ in Tuning Letter 2020 No. 4.
If you find very busy PDS or PDSE data sets, investigate the use of LLA, VLF, and/or PDSE member caching. In Tuning Letter 2020 No. 4, we had a user experience titled ‘IWS Workload Automation Programming Language Tip’ that described how adding a load library to LNKLST reduced the CPU time for the step by 5%, the elapsed time by 50%, and the number of I/Os by 99%. A 5% reduction in CPU time might not sound like something to get too excited about, however the change eliminated nearly one million I/Os, so that had a
knock-on benefit on other jobs that no longer had to compete with those I/Os.
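For reference, adding a library to the LNKLST is controlled through the PROGxx parmlib member (or dynamically, with the equivalent SETPROG LNKLST operator commands). A sketch with an illustrative library name follows; keep in mind that link list libraries carry APF and system integrity implications, so changes like this need appropriate review:

/* PROGxx: define a new LNKLST set based on the current one, */
/* add the application load library, and activate it.        */
/* The library name is illustrative.                         */
LNKLST DEFINE   NAME(LNKLST01) COPYFROM(CURRENT)
LNKLST ADD      NAME(LNKLST01) DSN(APPL.PROD.LOADLIB)
LNKLST ACTIVATE NAME(LNKLST01)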
We are currently working on an article about PDS and PDSE tuning, and we hope to have that ready for the next Tuning Letter issue.
The other types of data sets that we want to briefly touch on here are VSAM data sets. This is a much bigger topic than we can cover here, and we plan to dedicate an article to VSAM data set tuning in a future Tuning Letter. For now, we just want to point out that the standard VSAM buffering is frequently not optimal for batch programs, and you might find that you can achieve some quick savings through the use of Batch LSR or System-Managed Buffering. We had a short discussion about Batch LSR in the user experience called ‘Put Some Zip in SMP/E’s Steps’ in Tuning Letter 2020 No. 4.
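As a minimal, hypothetical sketch of the Batch LSR setup (the DD names, data set name, and buffer counts are all illustrative, and the BLSR subsystem name can vary by installation), the program’s DD is pointed at the BLSR subsystem, which in turn references the real data set:

//* Program opens DD name MASTER; BLSR supplies LSR buffering.
//* Buffer counts are illustrative and should be tuned per data set.
//MASTERX  DD DISP=SHR,DSN=PROD.VSAM.KSDS
//MASTER   DD SUBSYS=(BLSR,'DDNAME=MASTERX','BUFND=100','BUFNI=50')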
Both Batch LSR and System-Managed Buffering are part of z/OS, so there is no additional charge for them. There are also multiple vendor products that specialize in dynamic VSAM buffering, and would generally be expected to deliver better performance than Batch LSR or System-Managed Buffering. Another possibility is IAM (previously from Innovation Data Processing, now owned by BMC). You would need to determine whether the performance benefits these products can deliver are sufficient to justify their cost.
DFSMShsm Optimization
While HSM CPU consumption likely won’t show up if you are looking at CPU usage at the application job-level, it is likely that many of your production applications are using HSM services and thus driving HSM CPU consumption ‘behind the scenes’. Certainly, if you are looking at CPU consumption across all address spaces (including started tasks), it is very likely that the DFHSM address space will be up there among the ‘big hitters’.
We know many sites that are aware of HSM’s impact. However, most of HSM’s CPU consumption occurs during the batch window, away from the prime shift R4HA peak. And even though it can be a large consumer, it rarely causes problems. As a result, “Optimize HSM CPU usage” is another one of those items that hovers close to the bottom of your to-do list.
While it is frustrating to see optimization opportunities go unaddressed, it is understandable. If you are not CPU-constrained during your batch window, and if HSM is not contributing to your software bill, you probably won’t get much thanks for reducing its CPU consumption. However, if you move to TFP, all those CPU seconds and minutes and hours that HSM racks up now contribute to that bill. And don’t forget that HSM does this every day, so its total CPU time by the end of the month can come to quite a large number.
If you can drum up sufficient interest in optimizing HSM CPU consumption, there might be quite a broad selection of actions that you can take. IBM’s Glenn Wilcock worked with us on our ‘Optimizing Your HSM CPU Consumption’ article in Tuning Letter 2018 No. 4. That article is chock full of ways to reduce your HSM CPU consumption (with ‘Exploit zEDC’ being number one in that list) - I would be amazed if you don’t find at least one item in that article that will be valuable to you. As a bonus, Glenn recently highlighted DFSMS APAR OA58019 as one that can reduce CPU consumption in HSM and DSS handling of extended format sequential data sets - this is likely to be of most interest to sites that are using data set encryption.
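To give a flavor of just one item from that list (our sketch below is illustrative - check the article and the current DFSMShsm documentation for the exact syntax and prerequisites in your environment), zEDC exploitation is enabled with a SETSYS statement in the ARCCMDxx member:

/* ARCCMDxx: have DFSMShsm use zEDC for migration and backup data. */
/* Requires zEDC-capable hardware and software enablement; verify  */
/* the sub-parameters below against your release documentation.    */
SETSYS ZCOMPRESS(DASDMIGRATE(YES) TAPEMIGRATE(YES) -
       DASDBACKUP(YES) TAPEBACKUP(YES))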
Sort Tuning
According to IBM (and we have seen similar estimates from other vendors), sorting data accounts for between 10% and 25% of z/OS CPU consumption. Even if those estimates are twice the actual numbers, that is still a lot of CPU time being spent on sorts. The remaining sort products for z/OS (DFSORT and SYNCSORT) are both highly optimized, very intelligent products now. However, that doesn’t mean that there aren’t things that you can do to let them run even more efficiently. Mario started on a section about DFSORT for this article, but it grew into a full article in its own right, so you should check out ‘Sorting Out Your Sort Performance’ in Tuning Letter 2021 No. 1 for more information about sort and the things you can do to unleash its full potential.
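To give a flavor of the kind of quick, low-risk experiments that article covers (a sketch - the value shown is illustrative and the right settings are site-dependent), DFSORT options can be overridden for a single job via a DFSPARM DD, without touching the installation defaults:

//* Illustrative per-job DFSORT override via DFSPARM.
//* MAINSIZE=MAX lets DFSORT use the maximum storage your
//* installation limits allow, which can reduce I/O to the
//* intermediate work data sets. (SYNCSORT honors a similar
//* $ORTPARM DD.)
//DFSPARM  DD *
  OPTION MAINSIZE=MAX
/*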
References
These references provide more information about the types of application tuning discussed in this article.
- ‘Optimizing Your HSM CPU Consumption’ article in Tuning Letter 2018 No. 4.
- ‘Put Some Zip in SMP/E’s Steps’ in Tuning Letter 2020 No. 4.
- ‘Spring Clean Your Data Sets With DCOLLECT and a Little SMF’ in Tuning Letter 2020 No. 4.
- SHARE in Phoenix 2019 Session 24517, IBM Z Microprocessor Optimization Primer, by David Hutton.
- SHARE in Fort Worth 2020 Session 26542, Java on IBM z15 Update, by Rahil Shah.
- SHARE Virtual Summit 2021 Session 28425, Z Sort on z15 + Large Memory = Big performance!, by Joe Gentile.
- SHARE Virtual Summit 2021 Session 28608, Java Update for CICS TS V5.6, by Phil Wakelin.
Summary
In this series of three articles, we have sought to equip performance analysts responsible for managing the brave new world of EC where all CPU is chargeable. We have suggested the types of visibility and tooling that will enable you to quickly identify sources of increased CPU usage and isolate the driving causes. And we have compiled a broad array of potential CPU tuning opportunities, all based on real world customer experiences.
We have tried to provide a mix of tuning activities. Some can be carried out by the infrastructure support teams (system programmers, DBAs, and performance analysts) independently. Others can be led by the infrastructure staff, but will require coordination and possibly assistance from the production support and application development staff.
Our most important objective, however, was to try to get you thinking with a different mindset. We (the Z technical community) have become experts at tweaking our workloads and configurations to work with an R4HA-based pricing model. With the move to an Enterprise Consumption model, it will take all of us a little time to adjust to the idea that all tuning can deliver real financial, as well as customer service, benefits. To identify the most ‘lucrative’ targets for our tuning actions we need powerful and flexible reporting tools, such as IntelliMagic Vision.
We hope that you enjoyed this series-within-a-series, and we look forward to our next issue where we delve into new ways to extract the maximum value from the ‘SMF ocean’.