One of the hottest topics in IT operations today is the application of AIOps and what it means for enterprises and the tools and processes they leverage. AIOps streamlines the management of IT operations and expedites the resolution of complex issues within modern IT environments. Earlier this year Sreekanth Ramakrishnan posted an article in this community that provided a comprehensive guide to best practices for taking a hybrid approach to AIOps for IBM Z.
In this blog, we'll take a closer look at the topic of performance and capacity management on IBM Z. This has been an area of focus dating right back to the infancy of the mainframe, and it remains a critical discipline for organizations as they continually balance operational costs with their IT needs. While some of the fundamentals remain the same, the skills required to perform this analysis and the turnaround time to deliver insights have evolved significantly over the years.
Ongoing analysis to optimize performance
The ultimate aim of the IT operations team is to ensure the infrastructure under their management is able to support the needs of the business. When issues occur, they may be the result of unexpected constraints or capacity issues within the environment. The mainframe today is part of a complex hybrid cloud ecosystem where transactions are initiated from many external sources, such as a website or mobile device, and ultimately drive workloads on IBM Z, such as updating a customer record within DB2 on z/OS. Workloads can spike at any time of the day, and these patterns are less predictable than ever, making it critical that adequate resources are available so that the risk of operational issues is minimized.
When analyzing operational data, teams can be overwhelmed by both the breadth and depth of information available to them. The mainframe is blessed with a rich set of metrics available through SMF and other sources; however, refining these to identify what's important and what isn't can be time consuming. This is often exacerbated by the multitude of products used by the operations team across different domains, resulting in a lack of system-wide insight.
Many mainframe operations teams have also depended on key domain experts, armed with years of experience and detailed knowledge of the unique nature of their enterprise's applications and environment, to know what to focus on when performing performance analysis. When challenged with the loss of these experts, newer members of the operations team need more guidance on where to focus their efforts. This widening skills and expertise gap makes it difficult to build the deep data insights and reports that are critical for performance root cause analysis, ultimately making it costly to identify and validate capacity and performance optimization opportunities.
Making intelligent decisions around capacity management
Traditionally, performance and capacity management has been a reactive task. Data would be collected on a daily basis, then processed in batch overnight to create a set of reports to analyze. This limits the ability to make timely decisions and to generate ad hoc reporting for deep dives into issues. Such approaches are not satisfactory today; the need is for near real-time access to information, with the ability to analyze and correlate data from multiple sources to make accurate decisions. The data needs to be collected, curated, and reported in near real time, not so much to provide alerting, but to have the detailed information at hand to make accurate decisions once an incident has been detected.
When making longer-term capacity planning decisions, the focus is often on making the best use of existing resources and on where to make smart investments, for example, purchasing new hardware or upgrading to the latest technologies. Blindly making these capital investments can lead to mistakes where the anticipated benefits are never fully realized. Pulling together the right data to make the correct decision can be difficult and time consuming, to say nothing of the capacity planning skills that may be needed to make judgements and recommendations. The ability to quickly and easily model changes in growth and other patterns can help identify where to optimize resources or take preventative actions to avoid performance and capacity constraints in the future.
Optimizing for cost
Cost management is also a task that often falls to the same team responsible for performance and capacity management. Given some of the ways software is licensed on the mainframe, the configuration and management of the z/OS environment can have a big impact on the amount that is charged for it. The operations team therefore needs to balance optimal performance against licensing costs. Sometimes this balance can go too far in one direction - if the system is capped or constrained too much, then workloads can get backed up and SLA compliance can be impacted - so having a clear view of which workloads are contributing to software costs, and how that contribution varies over time, is an essential set of reports for the performance team to review. This is true whether the enterprise is using a traditional rolling 4-hour average (R4HA) to manage software costs or the more recent Tailored Fit Pricing model.
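To make the R4HA concept concrete, here is a minimal sketch that computes a rolling 4-hour average over interval MSU samples. This is purely illustrative: the official figure is produced by IBM's Sub-Capacity Reporting Tool (SCRT) from SMF data, and the sample values and function names below are invented for demonstration.

```python
# Illustrative sketch only: a rolling 4-hour average (R4HA) of MSU consumption
# computed from evenly spaced interval samples. The billable value comes from
# IBM's SCRT, not from code like this.
from collections import deque

def rolling_4hr_average(msu_samples, interval_minutes=5):
    """msu_samples: chronological MSU readings taken every `interval_minutes`."""
    window_size = (4 * 60) // interval_minutes  # samples per 4-hour window
    window = deque(maxlen=window_size)
    r4ha = []
    for sample in msu_samples:
        window.append(sample)
        # Early values average a partial window until 4 hours of data exist.
        r4ha.append(sum(window) / len(window))
    return r4ha

# Under traditional sub-capacity pricing, the monthly charge is driven by the
# peak of this rolling average, not by total consumption:
samples = [320, 350, 400, 520, 610, 580, 470, 390]   # hypothetical MSU readings
peak = max(rolling_4hr_average(samples, interval_minutes=30))
```

The key point the sketch illustrates is that a short, sharp spike matters less than sustained consumption: smoothing over four hours is what makes shifting work to quieter periods an effective cost lever.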
When there is a complete view of the workloads as they relate to cost, informed decisions can be made about optimizing the existing workloads. This could mean moving non-critical workloads to quieter times on the system, removing low-value applications, reorganizing databases, or recompiling some applications with a more modern compiler to reduce the number of MSUs they consume. With evidence to support decision making, there is a greater chance of success in performance and capacity management.
IBM's solution to these challenges: IBM Z Performance and Capacity Analytics
These requirements for modern performance and capacity management have driven some of the most critical features found in IBM Z Performance and Capacity Analytics. To address the challenge of large volumes of operational data, Key Performance Metrics reports have been developed in conjunction with IBM experts to identify the critical data points that need to be assessed when analyzing performance across z/OS and major subsystems. The Key Performance Metrics also help lower the volume of data that needs to be collected and curated, by specifying a core set of required SMF records and fields and avoiding over-collection of redundant data.
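As a rough illustration of that filtering principle, the sketch below narrows SMF ingestion to a handful of well-known record types. To be clear, the actual records and fields are specified by the product's Key Performance Metrics definitions; this hypothetical subset simply shows the idea of collecting a targeted core set rather than everything.

```python
# Illustrative only: a hypothetical collection policy that ingests a core
# subset of SMF record types instead of the full firehose. The product's real
# Key Performance Metrics define the exact records and fields required.
CORE_SMF_RECORDS = {
    30: "Address space and job/step accounting",
    70: "RMF CPU activity",
    72: "RMF workload activity (WLM service classes)",
    89: "Usage data used for sub-capacity pricing",
    101: "Db2 accounting",
    110: "CICS transaction monitoring",
}

def should_collect(record_type: int) -> bool:
    """Filter incoming SMF records down to the curated core set."""
    return record_type in CORE_SMF_RECORDS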
With this information available for analysis, there are several practical applications that can be leveraged to draw important insights into your operational environment. Firstly, you can provide ongoing analysis and identification of deviations from expected performance using Smart Path profiling. This reporting gives insight into resource usage from the machine level down to the job level, creating dynamic thresholds, evaluated for each hour across the week, that identify exceptional performance status while limiting false positives. More recently, the Health Metrics Scorecard has been developed to give a holistic view of the environment's health through the ongoing application of key rules that can determine whether current configurations might be a cause of existing or future issues.
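To show why hour-of-week thresholds cut down false positives, here is a minimal sketch in the spirit of that approach. It is not the product's actual Smart Path algorithm; the mean-plus-k-standard-deviations rule and all names are assumptions for illustration.

```python
# Minimal sketch of per-hour-of-week dynamic thresholding (not the product's
# actual algorithm). A separate threshold is derived for each of the 168 hours
# in a week, so a busy Monday 9am is judged against other Monday 9ams rather
# than against a single static limit.
import statistics
from collections import defaultdict
from datetime import datetime

def build_thresholds(history, k=3.0):
    """history: iterable of (timestamp, metric_value) pairs."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        slot: statistics.mean(vals) + k * statistics.pstdev(vals)
        for slot, vals in buckets.items()
    }

def is_exceptional(ts: datetime, value: float, thresholds) -> bool:
    """Flag a reading only if it exceeds the limit for its hour-of-week slot."""
    limit = thresholds.get((ts.weekday(), ts.hour))
    return limit is not None and value > limit
```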
Proactive use of this data can also aid capacity planning tasks, such as applying forecasting algorithms to anticipate future problems, or leveraging selected simulation reports that answer "what if" questions when making environment changes. For example, modelling an existing workload on a new processor, or identifying the potential benefits of zIIP-eligible workload. Often it can be difficult to understand all the different variables that must be considered, so easy-to-use reporting that can apply various options can help narrow down the scope of investigation quickly and also provide evidence to support proposed capacity changes.
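As a taste of what forecasting adds, the sketch below fits a least-squares trend line to historical peak MSU values and projects it forward. This is a stand-in, not the product's forecasting algorithm, which would also need to account for seasonality and planned workload changes; the history values are invented.

```python
# Illustrative only: least-squares linear trend over historical peak MSU,
# projected forward as a simple capacity forecast.
def linear_forecast(values, periods_ahead):
    """Fit y = a + b*x by least squares and project `periods_ahead` steps."""
    n = len(values)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values)) / \
        sum((x - x_mean) ** 2 for x in xs)
    a = y_mean - b * x_mean
    return [a + b * (n + i) for i in range(periods_ahead)]

monthly_peak_msu = [410, 425, 440, 452, 470, 488]    # hypothetical history
next_quarter = linear_forecast(monthly_peak_msu, 3)  # projected future peaks
```

Even a projection this simple makes the conversation concrete: if the trend crosses an installed-capacity line in, say, nine months, that is the evidence needed to start a hardware or configuration discussion now.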
Finally, with a view to cost management - regardless of whether you are following a rolling 4-hour average or Tailored Fit Pricing licensing model - pre-defined reports provide easy analysis of workload usage and consumption hotspots. With this ongoing analysis, visualization of, say, current MSU consumption levels per container, or forecasting of future consumption, becomes a straightforward task. It also feeds into a process to aid the identification of opportunities for MSU optimization across workloads to drive down mainframe operational costs. This can be tied to resource accounting for accurate and effective tracking of resource usage, often needed for chargeback to internal or external customers.
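Hotspot analysis of this kind boils down to attributing consumption to workloads and ranking the results, as in the toy sketch below. The record layout and values are invented; in practice the pre-defined reports work from curated SMF data.

```python
# Illustrative only: ranking MSU consumption hotspots by workload, the kind
# of question the pre-defined cost reports answer. Data is hypothetical.
from collections import Counter

usage_records = [  # (workload, msu) samples derived from curated SMF data
    ("CICSPROD", 180), ("DB2PROD", 220), ("BATCH", 95),
    ("CICSPROD", 210), ("DB2PROD", 240), ("TESTAPP", 30),
]

msu_by_workload = Counter()
for workload, msu in usage_records:
    msu_by_workload[workload] += msu

for workload, msu in msu_by_workload.most_common(3):  # top consumers first
    print(f"{workload}: {msu} MSU")
```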
Where to learn more
Regardless of where you are on your journey to adopting AIOps within your IBM Z environment, there are many resources available to help you learn more. If you have not yet read the new AIOps for IBM Z handbook, it is a great place to start to understand the concepts and technologies around AIOps.
To learn more about IBM Z Performance and Capacity Analytics, you can explore the product page. If you would like to see what your operational data looks like within IBM Z Performance and Capacity Analytics, we have various options for trialling the product, including our "Rapid Proof of Concept" program, which requires no software to be installed in your environment. Please feel free to reach out to me if this interests you or if you have any questions related to this product. You may also post a question within the AIOps on IBM Z Community.