Written by Todd Havekost on November 18, 2020.
This is the first in a pair of articles on how you can leverage insights available through SMF data to help manage and optimize an environment operating under an Enterprise Consumption software license model.
A prominent theme among IT organizations today is an intense focus on expense reduction. For mainframe departments, this invariably involves seeking to reduce software expense from IBM and other vendors, because these commonly represent the dominant portion of the infrastructure outlay for the platform.
IBM's introduction of the Enterprise Consumption (EC) solution within Tailored Fit Pricing (TFP) brings with it a fundamental change in how sites should approach managing and reducing software expense. Instead of being driven only by CPU usage during the monthly peak four-hour interval, as is the case under the current Monthly License Charge (MLC) model, under the EC model, software is charged based on the total CPU consumption for
the entire month (at a much lower unit price, of course).
This article focuses on the EC model:
- Because it is new. It is a fundamentally different model from the one we have been using for the last 20 years.
- Because it effectively eliminates the traditional method of controlling software costs (capping), meaning that it is much more important that you proactively monitor and track your MSU consumption.
- Because EC means that cost control will be much more reliant on traditional tuning practices. This aspect of EC will be the subject of the next article in this mini-series.
This doesn't mean that traditional monitoring of the R4HA has been eliminated. IBM's position is that EC should be used for production systems, and a DevTest Container Solution (which is still based on R4HA) should be used for non-production systems. Both EC and the DevTest Container Solution have been described in previous Tuning Letter articles, but in this article we will be focusing on what happens after your company has decided to move to EC, and now you have to manage it.
As adoption of this new licensing model grows, it dramatically expands the scope for CPU tuning. No longer are workloads active during peak four-hour intervals the sole focus for this activity.
Important: Under the “brave new world” of EC, the benefits of CPU reduction initiatives extend to all workloads that run at any time.
Analysts involved in leading these tuning efforts will be tasked with identifying the top potential savings opportunities. This first article will focus on how insights from SMF data can be helpful in reporting and managing enterprise consumption. Areas that will be explored include tracking actual consumption compared to contracted quantities, identifying variances between intervals, and drilling down to identify the drivers of those changes, from higher levels (e.g., workloads) to more granular (e.g., address space, transaction). Also, an approach to receiving early notification of significant increases will be examined. The second article will explore numerous areas that can be pursued as sources of potential CPU reduction opportunities.
Tracking Your MSU Consumption - Top-level Views
EC contracts are typically structured for multiple years, with a committed baseline annual MSU consumption, a committed MLC amount (dollars or Euro or whatever), and sometimes a committed growth amount. If your actual usage exceeds the baseline MSUs, then you pay a discounted price per MSU for MSUs that exceed the baseline amount. On the other hand,
if your actual usage is less than the baseline amount, you still have to pay the committed MLC amount.
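The billing arithmetic just described can be sketched as a simple floor-plus-overage calculation. This is a minimal illustration only; the function name, parameters, and numbers are hypothetical, and actual EC contract terms (carry-forward credits, growth commitments, and so on) vary by agreement:

```python
def annual_ec_bill(actual_msus, baseline_msus, committed_mlc, overage_price_per_msu):
    """Sketch of EC annual billing: the committed MLC amount acts as a floor,
    and MSUs consumed beyond the baseline are charged at a discounted unit price."""
    overage = max(0.0, actual_msus - baseline_msus)
    return committed_mlc + overage * overage_price_per_msu

# Hypothetical numbers: 10,000 baseline MSUs, $5M committed MLC, $100/MSU overage
print(annual_ec_bill(10_500, 10_000, 5_000_000, 100))  # over baseline: floor + overage
print(annual_ec_bill(9_500, 10_000, 5_000_000, 100))   # under baseline: floor still applies
```

Note how the second call returns the full committed amount even though consumption fell short of the baseline, which is exactly the situation described above.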
In an ‘ideal’ environment (from a purely financial perspective) your actual MSU consumption would exactly match the baseline amount. This would mean that you don’t pay any more than your committed amount, but also that you don’t have to pay for more than you actually used. But from a business perspective, you probably hope that your business is growing,
which would typically result in more transactions, and therefore more MSUs being consumed. It would be nice if your tuning activities saved the same number of MSUs that are consumed by your additional transactions.
But achieving these desirable objectives depends on knowing where you stand compared to your committed MSUs. A starting point for managing this environment will involve comparing cumulative actual MSU consumption to the committed annual quantity in the contract. Figure 1 on page 5 depicts the analysis taking place at the end of a contract year that extends from August through the following July, with two lines added to show two possible scenarios:
- In Scenario 1, your actual MSU consumption for the year exceeds the baseline amount. There could be any number of reasons for this, but the bottom line is that your bill for the year will be higher than the baseline MLC amount. If your business has been growing and the extra consumption reflects that, the increased MLC bill might be in line with your expectations, so that would be fine. However, if the growth is because of avoidable problems, a larger MLC bill without any corresponding business growth might not be so fine.
- In Scenario 2, your actual MSU consumption is less than the baseline amount. Depending on your contract with IBM, you might be able to carry a credit forward to the following year, meaning that you could support increased business volumes next year without an increase in MLC costs - you will be a hero with your Finance colleagues. On
the other hand, if the decrease in MSUs is part of a long-term trend, someone will not be happy that your transaction volumes are decreasing but your MLC bills are not. In this case, you might want to avoid your Finance buddies for a while.
Figure 1 - Cumulative MSU Consumption by CPC (© IntelliMagic Vision).
Regardless of which situation you are in, it is critical that you are aware of the mismatch in advance, that you are aware of the scale of the problem (or opportunity), and that you can explain why your actual MSU consumption deviated from the baseline in the way that it did. Discovering that your bill exceeded expectations after the end of the year is probably not part of a successful career plan.
Extending the chart into the current year (see Figure 2 on page 6) enables you to see precisely where you are today. It also helps you see times when the growth was faster than average, and how this year is comparing to last year. (Note that the Y-axis scale has been changed to millions of MSUs to enhance readability.)
Figure 2 - Cumulative MSU Consumption (© IntelliMagic Vision).
In this example, the consumption is nearly identical in each month, resulting in a nearly straight line. However, many sites experience months that are lower than normal (perhaps when everyone is on vacation), and months when the consumption is higher than average. When business workloads reflect seasonal patterns, you may want to compare MSU consumption over time across multiple years to get a more informed picture about whether the consumption you are seeing is normal for that month. Figure 3 on page 7 shows year-to-date consumption for 2020 slightly exceeding the comparable interval from 2019, but both years following a similar pattern of lower MSU consumption in the first quarter, followed by higher usage in the second quarter.
Figure 3 - Cumulative MSU Consumption by Year (© IntelliMagic Vision)
If you find that the consumption is growing at an unexpected rate, you should investigate the source of that growth. It might be that the growth is due to some planned application or workload change, but you won’t know for sure until you drill down into the MSU consumption.
Figure 4 on page 8 shows an example of how you might break out the monthly MSU consumption by system (the chart shows data for the period from August 2019 to October 2020, so provides a nice view of the month to month change for each system - in this example, you can see that the consumption in each month is very consistent). Note that z/OS doesn't track its cumulative MSU consumption for the day/week/month/year - the SMF type 70 records contain that information for each interval. To get the monthly, and month-to-date view, you need some mechanism to capture that data from each interval’s SMF records (ideally at the system level). This will enable you to report on cumulative MSU consumption over any time period that is relevant to you.
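The accumulation mechanism described above can be sketched as follows. This assumes you have already extracted a per-interval MSU figure for each system from the type 70 records; the record layout, the extraction step, and all system names and numbers here are hypothetical:

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-interval records, already extracted from SMF type 70 data:
# (system_id, interval_end_date, MSUs consumed in that interval)
intervals = [
    ("SYS1", date(2020, 8, 1), 120.0),
    ("SYS1", date(2020, 8, 1), 118.5),
    ("SYS2", date(2020, 8, 1), 80.2),
    ("SYS1", date(2020, 8, 2), 121.3),
]

def month_to_date_msus(records):
    """Accumulate interval-level MSU consumption into per-system,
    per-month running totals, since z/OS does not keep a cumulative figure."""
    totals = defaultdict(float)
    for system, day, msus in records:
        totals[(system, day.year, day.month)] += msus
    return dict(totals)

print(month_to_date_msus(intervals))
```

The same accumulation can obviously be rolled up by day, week, or contract year, which is what enables cumulative charts like Figure 4.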
Tip: Remember that your DevTest systems will typically be in a DevTest container when using TFP, meaning that the MSU consumption of those systems should not be included in the MSUs that you are tracking for EC.
Figure 4 - Cumulative MSU Consumption by System ID (© IntelliMagic Vision)
Of course, the end of the contract year is not the time to discover that your MSU consumption has been running ahead of plan. You will want to be very aware of MSU consumption throughout the year, with particular focus on significant changes.
Analysis seeking to identify the causes of such changes might begin with views by sysplex or by WLM-defined workload. Figure 5 on page 9 shows an example of comparing CPU consumption by the different subsystems, for all systems in the EC ‘container’, for four consecutive months (March to June 2020). This view helps you see the variation at the total pricingplex level from one month to the next, and also helps you see which subsystems are growing or shrinking in each month.
Figure 5 - Interval (Monthly) MSU Consumption by Workload (© IntelliMagic Vision)
Note that official IBM pricing MSU consumption figures are based on the SMF type 70 records. When carrying out analysis at levels below the system level, remember that z/OS is not able to accurately assign all CPU time to workloads and service classes (reported in the SMF type 72 records) and address spaces (reported in the SMF type 30 records). These variances are quantified in system “capture ratios”, which can range from percentages in the mid-90s on production systems, to 70%, or even lower, on small test systems. Though MSU calculations at these lower levels do not represent “official” chargeable MSU consumption, they are still very helpful in determining relative relationships, e.g., between workloads or time periods.
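As a simple illustration, the capture ratio is just the fraction of total CPU time (from the type 70 records) that could be attributed to workloads (in the type 72 records). The numbers below are made up:

```python
def capture_ratio(captured_cpu_secs, total_cpu_secs):
    """Capture ratio: the fraction of total interval CPU time (SMF type 70)
    that z/OS attributed to workloads/service classes (SMF type 72)."""
    if total_cpu_secs == 0:
        return 0.0
    return captured_cpu_secs / total_cpu_secs

# Hypothetical production system: 2,850 of 3,000 CPU seconds attributed
ratio = capture_ratio(2850.0, 3000.0)
print(f"{ratio:.1%}")  # 95.0%
```

A low ratio on a small test system simply means a larger share of its CPU time is uncaptured overhead, so workload-level comparisons there carry more uncertainty.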
One focus of analysis may be comparisons across time intervals to identify changes. When business workloads reflect seasonal patterns, you may want to compare MSU consumption between a given month, and the corresponding month from the previous year. Figure 6 on page 10 compares total monthly MSU consumption year over year for the month of August. You can see that the usage is a little higher in August 2020 (on the right), and that the growth was driven primarily by the CICS and Db2 workloads.
Figure 6 - Interval (Monthly) MSU Consumption by Workload (© IntelliMagic Vision)
Area charts are helpful to depict aggregate consumption, but component variances can be more easily identified with line charts. Figure 7, which shows the period from March 2020 to August 2020, illustrates how a line chart makes it easier to see the month-to-month changes for each subsystem. In this example, it clearly shows growth in the CICS (blue line at the top) and SYSTEM (gray line, in the middle of the chart) workloads from March to May.
Figure 7 - Interval (Monthly) MSU Consumption by Workload (© IntelliMagic Vision)
Further analysis of this increase in the SYSTEM workload from March to May would likely involve drilling into MSU consumption by system ID. Figure 8 on page 11 shows systems 2, 3, 6 and 7 being responsible for the bulk of that increase. This visibility is important because SCRT only reports MSU consumption at the entire pricingplex level after you move to TFP. If
your total MSU consumption is increasing in an unexpected way, you need to be able to drill down to see what is driving that behavior.
Figure 8 - Interval MSU Consumption for Workload SYSTEM by System ID (© IntelliMagic Vision)
Analysis from here would typically involve drilling down into this data, looking at the address spaces that make up the workload(s) that are responsible for the growth in MSU consumption on each of the miscreant systems. Figure 9 on page 12 indicates that the spike in CPU consumption by the SYSTEM workload on SYS3 is driven by the XCFAS address space.
Figure 9 - Interval MSU Consumption for Workload and System by Address Space (© IntelliMagic Vision)
Now that you have identified the address spaces that are responsible for the MSU growth, you would follow your normal processes for investigating unusual behavior. In the next section we show how SMF data might help you with that investigation.
Visibility into Top CPU Consumers
Knowledge of the top CPU consumers at the workload or service class level will direct your next levels of more detailed analysis as you seek to identify and possibly optimize specific units of work. These may include system tasks, batch jobs, CICS transactions, Db2 work units, etc. Generating lists of top CPU consumers in these various areas will enable you to prioritize the biggest potential opportunities.
For example, if CICS is a top CPU workload, you will likely be interested in total CPU consumed per transaction ID (sometimes called “CPU intensity”), which reflects the combination of transaction volume and CPU per transaction. These top consuming transactions can be examined for optimization opportunities. Additionally, comparisons between different time intervals may also be helpful. Figure 10 on page 13 compares CPU
consumption, in thousands of MSUs, for each of the top 10 transaction IDs for two comparable months. In this example, rather than comparing transaction ABCD in one month to the same transaction in another month, you might be just trying to get a feel for whether the CICS transaction mix has changed, or if the same set of transactions is responsible for most of the CICS CPU usage in the two months you are comparing. (See later in the article for an approach to automatically detecting statistically significant changes.)
Figure 10 - CPU Usage by CICS Transaction ID (Top 10) (© IntelliMagic Vision)
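The “CPU intensity” ranking described above can be sketched in a few lines. The transaction IDs and figures here are hypothetical; in practice the volumes and per-transaction CPU times would come from your CICS SMF data:

```python
def top_cpu_consumers(transactions, n=10):
    """Rank transaction IDs by total CPU consumed ('CPU intensity' =
    transaction volume x average CPU per transaction)."""
    intensity = {
        tran: count * avg_cpu
        for tran, (count, avg_cpu) in transactions.items()
    }
    return sorted(intensity.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical input: transaction_id -> (monthly volume, avg CPU seconds/tran)
sample = {
    "ABCD": (2_000_000, 0.004),  # high volume, cheap transaction
    "EFGH": (50_000, 0.200),     # low volume, expensive transaction
    "IJKL": (10_000, 0.015),
}
print(top_cpu_consumers(sample, 3))
```

Note how the low-volume but expensive transaction ranks first: that is exactly why total intensity, not raw volume, is the right sort key when prioritizing tuning candidates.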
If Db2 is a significant CPU consumer in your environment, understanding which connection types are driving the majority of the general purpose CPU usage within Db2 will indicate where you should focus the next more detailed level of analysis. Once again, SMF comes to the rescue, helping you break out Db2 CPU usage by where the work came from. When
looking at this data, remember that work run on a zIIP does not count towards your software bill, so that time should not be included in your reports. An example of the type of report you can produce is shown in Figure 11.
Figure 11 - Db2 Class 2 GCP Usage by Connection Type (© IntelliMagic Vision)
From there, you will likely want to view the data in different ways depending on the connection type. For CICS Call Attach, focusing on Db2 general purpose CPU consumed by CICS transaction ID makes sense. For some other connection types (including IMS or the various Call Attach types), the next logical step may be to view usage by Plan (which correlates to applications). For DDF, you will likely want to view GCP time by AuthID, as
seen in Figure 12.
Figure 12 - DRDA (DDF) CPU Usage by Authorization ID (© IntelliMagic Vision)
In most environments, a significant amount of CPU capacity is consumed by system tasks. As you identify the top consumers in your environment, you may find Tuning Letter articles or SHARE presentations or other online material to guide your tuning efforts. In Figure 13 on page 15, DFHSM is the top consumer among started tasks (coincidentally, article ‘Optimizing Your HSM CPU Consumption’ in Tuning Letter 2018 No. 4 provides some great
pointers to reducing DFHSM CPU usage).
Figure 13 - CPU Usage for Workload STC by Address Space Name (Top 10) (© IntelliMagic Vision)
For many, if not most, environments, the batch workload will be one of the top consumers of CPU capacity. Thus, the total CPU consumed by the top jobs will be of great interest (see Figure 14). A view that accumulates this data over an entire month, so that month-end jobs are included, may be advisable.
Figure 14 - CPU Usage for Workload BATCH by Job Name (Top 10) (© IntelliMagic Vision)
Viewing total CPU consumed by program name across sets of address spaces may also be helpful in identifying opportunities for reducing CPU consumption. Though each instance of a program may consume a relatively small amount of CPU, it may represent a significant opportunity if it is invoked thousands of times. Figure 15 shows the CPU consumption for
the top 10 programs called by batch jobs. We will delve into these programs in a little more depth in the article in the next Tuning Letter.
Figure 15 - CPU Usage for Workload BATCH by Program Name (Top 10) (© IntelliMagic Vision)
Gaining Early Notification of CPU Increases
As discussed throughout this article, under the EC model, all general purpose CP (GCP) consumption by all workloads at all times is chargeable. This highlights the importance of getting early notifications of CPU consumption increases so that you can “contain the damage”, and not get to the end of the month and find that you unexpectedly consumed substantially more CPU than you did in the previous month.
The reporting identified up to this point in the article relies on your diligence to frequently review numerous reports to identify significant changes that drive increased CPU consumption in a timely manner. This is a challenging and time-consuming task at best, and points to the value of putting the computer to work to automatically perform this analysis for you.
This process may begin from various higher levels, e.g., by sysplex or sysid. Figure 16 on page 17 reflects one approach to automatically detecting CPU increases. It highlights a change in daily CPU usage for system SYS1 that exceeded 2 standard deviations when compared with the previous 30 days. Once you have been made aware of the change, isolating the primary driver(s) can proceed using standard drill down and reporting functionality.
Figure 16 - System Change Detection (© IntelliMagic Vision)
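One simple way to implement this kind of check yourself is a z-score test against a trailing window. This is a minimal sketch only (production change detection typically also accounts for day-of-week and seasonal patterns), and all the numbers are made up:

```python
from statistics import mean, stdev

def flag_change(history, today, threshold=2.0):
    """Flag today's usage if it deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > threshold * sigma

# Hypothetical 30 days of daily MSU consumption with mild day-to-day noise
baseline = [1000 + (i % 5) * 4 for i in range(30)]
print(flag_change(baseline, 1150))  # True: a jump well outside 2 sigma
print(flag_change(baseline, 1006))  # False: within normal variation
```

Run daily against the previous 30 days per system (or per transaction ID, as in Figure 17), this puts the computer to work scanning for out-of-line behavior so you only investigate genuine outliers.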
If the CICS workload is a primary driver of the CPU increase (as was the case in Figure 7 on page 10 earlier), you will want to proceed with analysis at the transaction level. The total CPU consumed by each transaction ID and the average CPU consumption per transaction may both be of interest, depending on the scenario. Figure 17 identifies modest gains (greater than 1 standard deviation) in total CPU consumption for transactions TR03, TR06, TR10 and TR12, and a more significant increase in CPU per transaction (exceeding 2 standard deviations) for transaction TR12.
Figure 17 - CICS Transaction Change Detection (© IntelliMagic Vision)
Further analysis of the increase in CPU per transaction for TR12 may be helped by a view that shows the average for the current day (blue line) in the context of both the average for the entire interval (yellow line) and the average by discrete measurement interval (red line). In Figure 18, the CPU increase for the current day (highlighted by the blue arrow) appears to be a continuation of an increase that began the previous week, accompanied by sizable spikes that also began in that same time frame. One ideal deliverable of this type of analysis is to proactively identify increases in DevTest, where they can potentially be addressed prior to migrating to Production.
Figure 18 - CICS CPU Per Transaction Patterns (© IntelliMagic Vision)
In a large, complex environment, things do change from day to day, for any number of valid reasons. You can’t afford to spend all your time investigating every variance to determine if it is a situation that warrants intervention. Conversely, without the safety net of a soft cap or group capacity limit to protect you from runaway activity (and runaway bills), there is definitely more need for proactive monitoring of MSU consumption. Using a methodology such as this to identify out-of-line situations or trends can help you keep control over your bills, without having to invest inordinate amounts of your time.
References
You can find more information about Tailored Fit Pricing Enterprise Consumption and associated reporting in the following documents and presentations:
- IntelliMagic zAcademy webinar Does IBM's Recent Tailoring of Tailored Fit Pricing Make It a Better Fit for You, by Cheryl Watson.
- IntelliMagic webinar Your Datacenter Under TFP: The New Rules of Measurement by Cheryl Watson and John Baker.
- ‘Optimizing Your HSM CPU Consumption’ in Tuning Letter 2018 No. 4.
- IBM’s Tailored Fit Pricing (TFP), in Tuning Letter 2019 No. 4.
- IBM’s TFP - Update, in Tuning Letter 2020 No. 1.
- IBM Container Pricing, in Tuning Letter 2018 No. 1.
- IBM Container Pricing, in Tuning Letter 2018 No. 3.
- SHARE 2020 webcast Tailored Fit Pricing: How to Manage Workload in a World Without Capping, by Andrew Mead, Roger Rogers, and Rebecca Levesque. Note that this is essentially a marketing presentation for TFP and supporting products, so it might paint a slightly more rosy picture of TFP than we would. Nevertheless, it does provide some interesting and helpful information.
Summary
The introduction of the Tailored Fit Pricing Enterprise Consumption software licensing model, in which all CPU consumed by all workloads at all times is chargeable, has far-reaching implications for managing CPU consumption in z/OS environments. Sites that have excellent visibility into CPU consumption across the z/OS infrastructure (including subsystems that are significant CPU consumers) will be well-positioned to effectively manage and optimize their environments to minimize software license expense.
The next article in this series will continue this Tailored Fit Pricing Enterprise Consumption theme and explore a wide variety of topics that may enable you to reduce CPU consumption, avoiding MSU overages or creating headroom for expanded business.
[Editor’s Note: If you have not already received a TFP proposal from IBM, you should expect to receive one soon. We have seen some proposals that really were an excellent fit for the client, and others that, in our opinion, were not appropriate for that particular customer. Nevertheless, we believe that most z/OS installations will be using TFP within the next few years. Therefore, it would be prudent for you to start thinking now about how you
would track and manage your CPU consumption and attendant software bills if your site were to move to TFP. We would like to thank Todd for this very timely, and characteristically insightful, article on this topic. And we are already looking forward to the second part of this article.]