Modern Automation for z/OS
Part III: Automated Operations use cases and best practices
Over the last couple of months, in many conversations with customers and colleagues about modernization in the context of automation, I have noticed considerable confusion around the term "automation" on z/OS. Often, this confusion stems from unfamiliarity with the different automation areas and the tools that are available to do the job. With this multi-part blog, I would like to clear up this confusion and share my perspective on automation as someone who has been working in the z/OS automation business for many years.
Starting with this part of the series, I want to discuss some of the core use cases around automated operations and my recommended best practices for addressing them.
Refer to Part I of this blog series for a complete overview of the topics covered so far.
Background
Before I discuss the use cases, I think it would be good to level set on how automated operations work on z/OS. Essentially, automated operations are based on capturing console messages issued through the z/OS supervisor calls SVC 35 for Write-To-Operator (WTO) or Write-To-Operator-With-Reply (WTOR), and SVC 87 for Delete-Operator-Message (DOM). These supervisor calls create a message queue element that stores the message text and other attributes describing the message. One attribute, for instance, is the so-called routing code, which indicates to which consoles the message is sent. Another attribute worth highlighting is the descriptor code, which indicates how significant the message is for the operation of the system. Message standards that describe how to use these codes have been in place on z/OS for many years, and both automated operations products and operations teams rely on this information.
Before a message is possibly shown on the system console or possibly automated, it travels through various layers of the operating system that offer administrators and tools options to adjust message attributes. I say "possibly" because display and automation are themselves attributes of a message that can, of course, be changed along the way.
The first layer is the z/OS Message Processing Facility (MPF), which defines rules for how messages are presented to the user and how messages are managed. These rules are specified in an MPFLSTxx member of SYS1.PARMLIB and hence are owned by the system programmer team. Following MPF processing, Message Flood Automation (MFA) gets control and, based on installation-provided policies in a MSGFLDxx parmlib member, decides whether a message flood is occurring, for instance due to a device or software failure. The third layer is an optional installation-coded exit routine, if one is in use. Because everything up to this point requires changes to system parameters, all of these definitions are under the control of the system programmer.
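To give a flavor of what such rules look like, here is a minimal MPFLSTxx sketch; the message IDs are picked for illustration only:

    IEF403I,SUP(NO),AUTO(YES)
    IEA989I,SUP(YES),AUTO(NO)
    .DEFAULT,SUP(NO),AUTO(YES)

The first entry keeps the job-started message IEF403I on the consoles and marks it eligible for automation, the second suppresses the chatty IEA989I from the consoles, and the .DEFAULT statement sets the handling for all messages not listed explicitly.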
Subsequently, all subsystems registered on z/OS are informed about a message one after another, and each subsystem in turn has access to the message attributes and can modify them. A product such as IBM Z NetView allows operations teams to determine exactly if and how messages are displayed on the console and whether they should be processed further for automation. This gives operations teams the flexibility to adapt the system to the specific needs of operators and subject matter experts without changing the z/OS parmlib. NetView's Message Revision Table is a highly efficient mechanism for revising message attributes before the message is passed on to the next subsystem. If the message needs to be automated, it is passed to the NetView address space, where the Automation Table determines further actions. Examples include automatically replying to a WTOR, running a REXX script that issues a z/OS command in response to a WTO, or transforming the message into an event that is then passed to the enterprise event management system, for instance IBM Netcool OMNIbus.
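As a minimal sketch of the Automation Table format, the following entry runs a hypothetical REXX exec MYRECOV whenever job MYAPP ends; the message ID is real, but the exec and job names are placeholders:

    * Run REXX exec MYRECOV when job MYAPP ends (IEF404I = job ended)
    IF MSGID = 'IEF404I' & JOBNAME = 'MYAPP'
    THEN EXEC(CMD('MYRECOV ' JOBNAME) ROUTE(ONE AUTO1));

AUTO1 is one of NetView's standard automation operator tasks, under which such commands typically run.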
In addition to the examples above, the Automation Table is also used to let IBM Z System Automation know about the status of every automated resource. With resources and their relationships defined in policy, System Automation can continuously monitor the operational status of the resources, perform target-actual comparisons, and adjust the resources based on the rules defined in the policy. We will see a specific example of how this is used when we discuss the first use case, IPL and shutdown, below.
At the end of the message's journey through the subsystem ring, z/OS sends the message to the operator log for archiving and, unless it is to be suppressed, also to the system consoles, using the display attributes that have finally been set for it.
IPL and shutdown
Upon initial program load (IPL), a control program, here z/OS, is given control of a Logical Partition (LPAR) of the IBM Z server. After initialization of the operating system's nucleus, various types of address spaces must be started. Installations don't have to take care of mandatory system address spaces (such as WLM). Everything else, however, is up to the operations team. This includes other z/OS support address spaces such as the Library Lookaside Facility or Global Resource Serialization (GRS), and continues with security (e.g. z/OS Security Server RACF), monitors (e.g. IBM Resource Measurement Facility or IBM OMEGAMON), middleware such as Db2 and CICS, and finally applications such as IBM Machine Learning for z/OS. The sequence in which these address spaces start does matter, however, and this is where System Automation's policy-based automation comes into play.
The operating system itself provides only a basic means of starting address spaces. One can define START commands in a COMMNDxx member of SYS1.PARMLIB, and when this member is processed, the commands are all kicked off in parallel. This assumes that there are no dependencies between the commands, which is generally not the case when you consider a complete system IPL.
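For illustration, COMMNDxx entries look like the following; note that there is no way to express an order or a dependency between them:

    COM='S VLF,SUB=MSTR'
    COM='S LLA,SUB=MSTR'
    COM='S RMF'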
Looking back twenty-five years, before System Automation's policy was introduced, and at how some other automated operations products on the market still work today, script-based IPL and shutdown was common practice. The issue was that the startup and shutdown sequence, including the ability to recover if something did not work as expected, had to be described in the form of REXX code. That led to hundreds of REXX scripts that needed to be maintained and tested, and it kept the automation team from implementing higher levels of automation.
With System Automation's policy-based automation, however, such orchestration is possible merely by defining the resources and their relationships in the policy database, as illustrated by the picture below depicting the *BASE policy, one of several best practice policies that come with System Automation out of the box. System Automation takes a goal-oriented automation approach. That means no code is necessary to start or stop the address spaces; everything is defined in policy. The automation manager takes care that the right orders are sent to the automation agents to always strive for the goal, i.e. the desired status of every resource. For an IPL, the desired status of every resource is usually the available status. For a shutdown, it is the unavailable status.
To illustrate this one step further, the policy defines, for instance, that the resource RMF has a HasParent relationship to JES2. During IPL, the desired status is to start every resource. However, the relationship enforces that RMF can only be started when JES2 is available. How is this done? When JES2 is started and finally active, it issues message HASP073. Through message automation, with the help of NetView's Automation Table, System Automation detects the message and updates the status of the JES2 resource. This in turn triggers the re-evaluation of all policy rules. RMF and any other resources that depend on the availability of JES2 can now be started by System Automation.
During system shutdown, the process runs in reverse order. Since every resource is desired to be unavailable, the HasParent relationship enforces that JES2 can only be stopped when all of its dependent resources (RMF and others) are unavailable. For instance, RMF is unavailable when System Automation detects message ERB102I. This and every other message signaling the termination of one of JES2's dependents triggers the re-evaluation of all policy rules. This continues until all dependents are observed unavailable, and that is the point at which System Automation issues the STOP command to finally stop JES2.
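The Automation Table entries behind this flow follow the same pattern shown earlier. The following is only a sketch, assuming System Automation's ACTIVMSG and TERMMSG routines with these parameters; the entries actually shipped with the product are more elaborate:

    * JES2 is active - update the resource status in System Automation
    IF MSGID = 'HASP073'
    THEN EXEC(CMD('ACTIVMSG JOBNAME=JES2,UP=YES') ROUTE(ONE %AOFOPGSSOPER%));
    * RMF has ended - update the resource status in System Automation
    IF MSGID = 'ERB102I'
    THEN EXEC(CMD('TERMMSG JOBNAME=RMF,FINAL=YES') ROUTE(ONE %AOFOPGSSOPER%));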
Given the explanation above, to me the only reasonable approach for this scenario is to use System Automation for IPL and shutdown, because defining the correct startup and shutdown sequence in policy, including how to recover if something unforeseen happens, is much easier to maintain and less error-prone than a logic-based definition. As a side effect of using message automation and asynchronous notifications, System Automation can handle many IPL or shutdown commands in parallel, which in turn increases the availability of the system.
Usually, operations teams initiate a planned IPL directly from the IBM Z Hardware Management Console (HMC) as part of activating LPARs. An ordered shutdown, on the other hand, is initiated from System Automation, which ensures that all applications, subsystems, and system workloads are stopped properly so that no data is at risk of being lost.
However, there are also other ways to IPL and shut down a system. For instance, when GDPS is in place (see also Disaster Recovery below), operators use a GDPS panel in NetView or the GDPS graphical user interface for these tasks. GDPS in turn involves System Automation behind the scenes to activate or deactivate the LPARs and to start or stop the workloads.
If you followed closely, you noticed the asymmetry in this process. The system is always brought up via a hardware interface, either from the HMC or through the APIs provided by the hardware, whereas the system shutdown is always initiated from the running System Automation.
Following the same pattern, you can now start to think about integrating these IPL and shutdown flows into the datacenter-wide automation that you can build using Ansible. With the help of the IBM Z Hardware Management Console collection, ibm_zhmc, from the Red Hat Ansible Certified Content for IBM Z solution homepage, you can include hardware operations such as activating and deactivating LPARs in a playbook. And with the uri module that Ansible provides out of the box, you can invoke a REST API of System Automation to shut down the system.
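A minimal playbook sketch could look as follows. The host names, credentials, CPC and LPAR names, and the System Automation REST endpoint are all hypothetical placeholders, and I am assuming the zhmc_lpar module from the ibm_zhmc collection for the LPAR operation; check your version of the collection for the exact module and parameters:

    ---
    # Sketch: orderly shutdown via System Automation, then activate the LPAR
    - name: Re-IPL a z/OS system
      hosts: localhost
      gather_facts: false
      tasks:
        - name: Shut down the system through a System Automation REST API
          ansible.builtin.uri:
            url: "https://sa.example.com/automation/v1/shutdown"  # hypothetical endpoint
            method: POST
            user: "{{ sa_user }}"
            password: "{{ sa_password }}"
            force_basic_auth: true
            status_code: 200

        - name: Activate the LPAR from the HMC
          ibm.ibm_zhmc.zhmc_lpar:
            hmc_host: "hmc.example.com"
            hmc_auth:
              userid: "{{ hmc_user }}"
              password: "{{ hmc_password }}"
            cpc_name: CPC1
            name: LPAR1
            state: active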
Disaster recovery
IBM offers several GDPS solutions that help protect IBM Z customers from data loss and outages in case of disasters. I'd like to illustrate the importance of automated operations by using GDPS Metro, one of several fully supported GDPS solutions based on IBM Z NetView and IBM Z System Automation, as an example.
GDPS Metro is a near-continuous availability and disaster recovery solution across two sites separated by metropolitan distance. It is based on the IBM Metro Mirror synchronous disk mirroring technology (Peer-to-Peer Remote Copy) provided with IBM DS8900 disk storage subsystems. IBM Z systems under GDPS control are divided into production systems and so-called control systems. Within each of these systems, an instance of IBM Z NetView and IBM Z System Automation is running.
In most cases, GDPS helps to manage planned outages. When the time for system maintenance has come, the operations team uses the GDPS GUI or NetView panels on one of the control systems to initiate system shutdown and IPL. This use case includes what I described in the previous section, with the only difference that the process is initiated through, and monitored by, GDPS.
GDPS makes extensive use of NetView's programming environment and message automation capabilities for communication between the control and production systems, for executing requests, for repetitive health checks of the infrastructure and, very importantly, to react quickly to unforeseen events, such as the loss of a data mirror link, the loss of a disk storage subsystem, or even the loss of one or more LPARs.
For the different types of outages, customers prepare GDPS takeover scripts that define the steps required for recovery. For instance, when customers run their parallel sysplex across two sites in an active-standby configuration, the loss of power at the primary site will trigger a takeover prompt in the form of a WTOR. The operations team needs to decide whether the production workloads should be moved to the secondary site, and once this is confirmed, GDPS executes the corresponding takeover script that has been prepared for this case.
The script can make use of System Automation's hardware automation capabilities to activate additional capacity on the secondary site, for instance when those systems normally run as smaller development and test systems that do not require the larger capacity of the production systems. Once prepared, the designated backup LPARs can be activated and IPLed to get the production workloads running on these systems as quickly as possible.
For some time now, GDPS has also offered REST APIs to control its behavior programmatically from the outside. While there is no Ansible collection available today that provides GDPS-specific plugins, roles, or modules, it is still possible to interact with GDPS from a playbook by using the uri module that comes with Ansible out of the box. The module enables you to create a task in the playbook that invokes the REST API, for instance to run a GDPS script or to suspend or resume the communication with the IBM Z Support Element.
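Such a task could look like the following sketch, where the URL path, payload, and script name are hypothetical placeholders to be replaced with the values documented for your GDPS release:

    - name: Run a prepared GDPS takeover script
      ansible.builtin.uri:
        url: "https://gdps.example.com/rest/v1/scripts"   # hypothetical endpoint
        method: POST
        body_format: json
        body:
          script: TAKEOVER1          # hypothetical script name
        user: "{{ gdps_user }}"
        password: "{{ gdps_password }}"
        force_basic_auth: true
        status_code: 200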
In summary, GDPS is the control point and orchestrator for disaster recovery. Its tight relationship with NetView and System Automation requires that the latter are used to interface with the hardware and to start and stop the workloads on the system.
Summary
In this blog, I tried to explain in more detail why we need NetView and System Automation in the automated operations domain on IBM Z and z/OS. At the core, it is about preserving and continuing to leverage the knowledge about what is automated and how it is automated that customers have captured over many years in their existing policy. While it might look old school to some, message automation is still the most effective and direct form of automated operations on z/OS.
Availability and cost are probably the most important reasons why customers perform automated operations with the help of NetView and System Automation. The products are cornerstones for recycling systems, or parts of a system, as quickly as possible and with confidence, and they provide the basis for disaster recovery driven by GDPS.
At the same time, thanks to their REST APIs, these products can also be used together with Ansible. Automation administrators can create playbooks to orchestrate tasks that span different platforms and operating systems across the hybrid cloud, with a z/OS IPL or shutdown seamlessly embedded rather than delegated to a separate process.
In the next part of this series, I would like to talk about an exciting use case: I am going to illustrate how you can quickly provision workloads on z/OS using Ansible while giving the operations team full visibility and control over these workloads. I look forward to seeing you then.
P.S. If you would like to read about a specific automation scenario, please leave a comment so that I can consider it for a future article.