In my last blog, I gave an overview of Kubernetes and highlighted how it orchestrates application deployments. There are certainly many more capabilities and concepts to discuss, but I don't want to overwhelm you with the details at this point. Rather, I think it is a good time to review how IBM zSystems shops orchestrate their z/OS systems today. On z/OS, this discipline is called Automated Operations. Several vendors offer products for automated operations, but in this review I would like to write about IBM's System Automation. So, let's dive in.
What is System Automation?
System Automation is an IBM product for automated operations on z/OS. It has a long history and has evolved in lockstep with three things: the needs of the business for high availability; the needs of mainframe departments for simplification, cost reduction, and faster, more reliable automated operator responses; and the progression of the IBM zSystems and z/OS platform, whose growing capabilities have brought an ever-growing set of workloads.
System Automation is an IBM Z NetView application that leverages NetView's powerful automation engine to monitor application and system status and to automate interactions without the need for human operators. To automate the z/OS system, System Automation uses a policy that describes the set of automated resources, their relationships to each other, and their desired status. This approach eliminates the need for customers to write their own code to control the correct start-up, shutdown, and failover sequences of the resources. Through the central role of the Automation Manager in a sysplex, System Automation keeps track of all the resources in the entire sysplex and reacts immediately if their observed status deviates from their desired status.
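The idea of comparing observed status with desired status can be sketched in a few lines of Python. This is only an illustration of the goal-driven principle, not how the Automation Manager is implemented. The function name `plan_actions`, the two status values, and the resource names are assumptions chosen for this example.

```python
from typing import Dict, List, Tuple

# Illustrative status values; a real automation product uses a richer status model.
AVAILABLE = "AVAILABLE"
UNAVAILABLE = "UNAVAILABLE"


def plan_actions(desired: Dict[str, str], observed: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (action, resource) pairs for every resource whose observed
    status deviates from its desired status."""
    actions = []
    for resource, want in sorted(desired.items()):
        # A resource we have no observation for is treated as unavailable.
        have = observed.get(resource, UNAVAILABLE)
        if have == want:
            continue  # already in the desired status, nothing to do
        actions.append(("START" if want == AVAILABLE else "STOP", resource))
    return actions


desired = {"DB2": AVAILABLE, "CICS": AVAILABLE, "BATCH": UNAVAILABLE}
observed = {"DB2": AVAILABLE, "CICS": UNAVAILABLE, "BATCH": AVAILABLE}
print(plan_actions(desired, observed))
```

The sketch only decides what should happen to each resource. It deliberately ignores the order of the actions, which is where relationships between resources come in.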
The management scope of System Automation is not limited to the sysplex. In fact, many customers use System Automation as the single point of control to operate their entire z/OS enterprise. With the customizable, graphical user interface component Service Management Unite, operators can not only monitor and control System Automation owned resources, but they can also interact with IBM Z NetView, IBM Z OMEGAMON [1] and IBM Z Workload Scheduler [2] to integrate performance monitoring and workload automation under one umbrella.
Applications and how they are automated
At the core, most if not all automated operations products on z/OS, including System Automation, are based on messages. These are text strings that follow a certain standard and that are written to a system console using the z/OS write to operator (WTO) and write to operator with reply (WTOR) macros. In some cases, it is the application (started task, job, process) that issues the message; in others, it is the system. Each message has an ID followed by some explanatory text, and together they convey application and/or system status. Here is an example of a simple status message issued by the z/OS Data Gatherer:
ERB105I III: DATA GATHERER ACTIVE
where "ERB105I" is the identifying message ID and "III: DATA GATHERER ACTIVE" indicates that the z/OS Monitor III Data Gatherer is active.
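To make the anatomy of such a message concrete, here is a minimal Python sketch that splits a console message into its ID and its explanatory text. `ConsoleMessage` and `parse_message` are hypothetical names chosen for illustration; they are not part of z/OS or NetView. Real automation receives messages through NetView rather than as plain strings. The sample string is assembled from the ID and text quoted above.

```python
from typing import NamedTuple


class ConsoleMessage(NamedTuple):
    msg_id: str  # identifying message ID, e.g. "ERB105I"
    text: str    # explanatory text that follows the ID


def parse_message(line: str) -> ConsoleMessage:
    """Split a console message into message ID and text.

    Assumption for this sketch: the ID is the first blank-delimited token.
    """
    msg_id, _, text = line.strip().partition(" ")
    return ConsoleMessage(msg_id, text.strip())


msg = parse_message("ERB105I III: DATA GATHERER ACTIVE")
print(msg.msg_id)  # ERB105I
print(msg.text)    # III: DATA GATHERER ACTIVE
```

Automation decisions are typically keyed on the message ID, which is why the sketch keeps it separate from the free-form text.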
Additionally, messages have routing and descriptor codes. With the routing code, the issuer can influence which console a message is sent to. With the descriptor code, the issuer can indicate the criticality of the message. Before the invention of the automated operator in the mid-1980s, human operators were responsible for watching such messages and responding to them. The concept is still in use today. It may seem anachronistic, but on the other hand it is a very reliable mechanism: it logs information to the OPERLOG, informs operators about important situations, and at the same time enables operations teams to respond and act automatically.
Using NetView, the Message Revision Table, and Automation Tables, System Automation can trap messages of interest and respond with an automated action. This allows System Automation to follow the lifecycle of an application: it can see when an application is started and when it has been fully initialized. Likewise, it can see when an application is stopped or abends and when it has finally terminated.
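The way trapped messages drive an application's lifecycle can be sketched as a simple lookup from message ID to status. Everything specific in this Python example is an assumption made for illustration: the message IDs (`APP001I` and so on) are invented, the status names are simplified, and the real mechanism is NetView's automation table rather than a dictionary.

```python
from typing import Dict, Iterable, List

# Hypothetical message IDs mapped to simplified, illustrative lifecycle states.
STATUS_BY_MESSAGE: Dict[str, str] = {
    "APP001I": "STARTING",  # application has been started
    "APP002I": "UP",        # initialization complete
    "APP003I": "STOPPING",  # shutdown in progress
    "APP004I": "DOWN",      # application has terminated
    "APP009E": "ABENDED",   # abnormal end
}


def track_status(messages: Iterable[str], initial: str = "DOWN") -> List[str]:
    """Follow the lifecycle of one application by trapping messages of interest.

    Returns the sequence of statuses the application went through.
    Messages with unknown IDs are ignored, just as untrapped messages would be.
    """
    status = initial
    history = [status]
    for line in messages:
        tokens = line.split(maxsplit=1)
        if not tokens:
            continue  # skip blank lines
        new_status = STATUS_BY_MESSAGE.get(tokens[0])
        if new_status is not None and new_status != status:
            status = new_status
            history.append(status)
    return history


log = [
    "APP001I INITIALIZATION STARTED",
    "XYZ123I SOME UNRELATED MESSAGE",
    "APP002I INITIALIZATION COMPLETE",
    "APP009E ABNORMAL END",
]
print(track_status(log))
```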
Any change of status, whether due to an error or caused by an operator who manually interacted with the application, is evaluated immediately by the automation and can result in a series of coordinated actions that aim to bring the system back to its desired operational status as defined in the automation policy.
But what do I mean by "coordinated actions" or "bringing the system back to its desired operational status"? This is called orchestration. Very likely, the response does not affect only the application where the status change first occurred; it can also affect other applications, or even the whole system.
Orchestrating z/OS workloads
The most obvious need for orchestration is during IPL and shutdown. You want to make sure that the workloads are started and stopped in the right sequence. A batch job that requires a Db2 database shouldn't run before the database is available, right? Otherwise, the batch job will fail, and operators are burdened with unnecessary messages. Similarly, online CICS or IMS transactions that require database access cannot complete successfully unless the database system has been started first.
With System Automation, the automation administrator creates a policy that describes these dependencies in the form of relationships. So, if a CICS region requires a Db2 database, this dependency can be expressed in the System Automation policy using a simple HasParent relationship, which reads like this: start CICS when Db2 is available, and stop Db2 only after CICS is unavailable.
Db2 in turn has its own requirements and depends on Resource Recovery Services and JES. Again, this is modelled using a HasParent relationship. Db2 itself is not just a single resource. In fact, it is a group of resources consisting of the master, the database manager, and the internal resource lock manager address spaces. So rather than having multiple relationships from CICS to Db2, there is only a single relationship to the group that represents the entire Db2 subsystem. Using groups therefore simplifies not only the definition work but also the interaction with the system, as the group and all its members can be started and stopped with a single command.
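The HasParent relationships described above form a dependency graph, and a valid start sequence is a topological ordering of that graph. The following Python sketch uses the standard library's `graphlib` module to derive start and stop orders for the CICS, Db2, RRS, and JES example. The dictionary layout and the function names are assumptions for illustration. Real System Automation policy is not defined this way, and the Db2 group is treated here as a single node.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each resource maps to the set of parents it depends on (HasParent).
HAS_PARENT: Dict[str, Set[str]] = {
    "CICS": {"DB2"},
    "DB2": {"RRS", "JES"},
    "RRS": set(),
    "JES": set(),
}


def start_order(has_parent: Dict[str, Set[str]]) -> List[str]:
    """Parents first: a resource is started only after all its parents."""
    return list(TopologicalSorter(has_parent).static_order())


def stop_order(has_parent: Dict[str, Set[str]]) -> List[str]:
    """Children first: a parent is stopped only after its dependents."""
    return list(reversed(start_order(has_parent)))


print(start_order(HAS_PARENT))  # RRS and JES (in either order), then DB2, then CICS
print(stop_order(HAS_PARENT))   # CICS, then DB2, then RRS and JES
```

Because RRS and JES do not depend on each other in this example, either may come first. Only the relative order of parent and child is fixed.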
There is a lot more to say, but going into all the details of how to define policy would go beyond the scope of this blog. Still, I think you can see how this approach of policy-based automation can be applied to many resources. It is quite typical to let System Automation manage several hundred resources on a single z/OS system.
As IBM zSystems are designed for 24x7 availability, we must also consider another major use case for orchestration: planned and unplanned outages. Here, System Automation leverages and fully supports the capabilities of z/OS and Parallel Sysplex. When Parallel Sysplex was introduced in the mid-1990s, one of the design points was to provide five nines of availability from the application point of view. If a system needs to be taken down for maintenance, or if it fails, other systems in the Parallel Sysplex can still process the application workloads.
In the policy, the System Automation administrator can define powerful groups that determine on which systems resources should preferably be started, how many resources should be started to meet the demand, and which backup systems and resources are available to handle a planned or unplanned outage.
So-called MOVE-groups strive to have exactly one active member within the sysplex. The member can itself be a group or a single application. Another type of group, the so-called SERVER-group, provides scalability of members across the systems in a sysplex. With this group, the number of active members can be scaled up or down based on the workload demand by setting an availability target. SERVER-groups can also be used to recycle their members step by step, one after the other, to pick up new service without impacting the overall availability of the workload.
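The difference between the two group types can be sketched as one selection function with a different availability target: a MOVE-group behaves like a target of exactly one, while a SERVER-group takes whatever target the administrator sets. The Python below is a simplified illustration only. The function name `select_active`, the numeric preference values, and the system names are assumptions, and the actual decision logic in System Automation takes more factors into account.

```python
from typing import List, Sequence, Set, Tuple

# (system name, preference): a higher preference means the system is favored.
Member = Tuple[str, int]


def select_active(members: Sequence[Member], available: Set[str], target: int) -> List[str]:
    """Pick up to `target` members on available systems, most preferred first."""
    candidates = [m for m in members if m[0] in available]
    # sorted() is stable, so members with equal preference keep their policy order.
    ranked = sorted(candidates, key=lambda m: m[1], reverse=True)
    return [system for system, _ in ranked[:target]]


members = [("SYS1", 700), ("SYS2", 500), ("SYS3", 300)]  # hypothetical values

# MOVE-group: exactly one active member; it moves when its system is lost.
print(select_active(members, {"SYS1", "SYS2", "SYS3"}, target=1))  # ['SYS1']
print(select_active(members, {"SYS2", "SYS3"}, target=1))          # ['SYS2']

# SERVER-group: availability target of two members across the sysplex.
print(select_active(members, {"SYS1", "SYS2", "SYS3"}, target=2))  # ['SYS1', 'SYS2']
print(select_active(members, {"SYS1", "SYS3"}, target=2))          # ['SYS1', 'SYS3']
```

Scaling a SERVER-group up or down then amounts to changing `target` and re-evaluating the selection.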
Following the relationship principles mentioned above, a failover scenario is handled in a very similar way. For both planned and unplanned application failover, the resources are started on the backup system in the order specified in the policy. In the planned failover case, System Automation can make sure that all the prerequisite resources are running on the backup system before it stops the application and moves it over. This way, it minimizes the overall time of the application outage.
Beyond automated operations
In many installations, System Automation is at the heart of the data center. There are various integration points with other systems management and service management products beyond just the interaction with the operating system.
Many customers use System Automation to send events to their favorite event management system, for instance IBM Netcool OMNIbus or IBM Cloud Pak for Watson AIOps. These events can originate from messages as described above, but they can also be created from SNMP traps or performance exceptions, for instance from OMEGAMON situations.
Very often, customers integrate System Automation with their batch scheduling system, for instance IBM Z Workload Scheduler. The two systems management products complement each other nicely: System Automation takes care of the underlying platform and the availability of the middleware, whereas Workload Scheduler takes care of the automation and scheduling of the line-of-business application workloads running on top of this middleware.
Customers often also integrate their automated operations with existing monitoring capabilities, for instance as an event relay or to act proactively when the monitors report metrics rising above or falling below warning thresholds. System Automation, for instance, integrates with the IBM OMEGAMON monitors to pull data from them or to react to exceptions and situations.
Very importantly, System Automation is an integral part of IBM's GDPS disaster recovery solutions, where it is responsible for GDPS-controlled start-up and shutdown and for preparing the backup systems for failover to another site or region.
So, System Automation plays a very important role in operating an IBM zSystems mainframe environment, and it goes beyond the basic tasks of starting and stopping applications. It orchestrates the sequence in which entire workloads are brought to the system and integrates their status with customers' service management products.
Summary
I hope this blog has illustrated the most important aspects of automated operations on z/OS. I've touched on how System Automation takes advantage of its policy-based and goal-oriented automation concept, always managing the resources toward their desired availability status, and how powerful groups can be leveraged to ensure high availability of the workloads in the z/OS sysplex.
Please send me your comments if you can think of other aspects that I should mention here regarding how automated operations is implemented in your environment.
Since this blog series is about z/OS containers and their orchestration on z/OS, I think we are now at a point where you have enough understanding of containerization and Kubernetes on one side and of how operations are practiced today on the other. So, I would like to contrast both approaches and discuss similarities and differences in my next blog. I am looking forward to seeing you there.
[1] In combination with the IBM Z Service Management Suite
[2] In combination with the IBM Z Service Automation Suite