Automated alerts about technology problems have been used for decades; think of automobile dashboard lights signaling "check engine," to avoid engine destruction.
Closer to enterprise computing but still decades ago, my small software company had a gadget plugged into power and phone lines—we fondly called it the "Radio Shack Robot"—to monitor power, temperature, noise level, water level and such. Costing less than $200, it saved our non-IBM-but-compatible mainframe from melting down when the air conditioning failed and the computer didn't turn itself off. So when the robot called, I'd drive in, turn off the computer, restart the air conditioning and—when temperature was again tolerable—re-IPL VM. (A newer version of the robot, differently branded and quite useful, is available from sensaphone.com.)
Hardware (processors, disk and tape drives, etc.) has provided alerts since the mainframe's early days. At first they were mostly red indicator lights relying on operator awareness. Mike Riggs, Manager of Systems and Database Administration at an agency of the Supreme Court of Virginia, notes that more recently, equipment is not just self-aware and self-diagnosing but able to summon help, often before problems are noticed.
Monitoring software and its operating environment is more problematic. Broad choices are to configure the native environment (OS and applications) to monitor and alert, purchase and configure monitoring/alerting tools, or build custom mechanisms with tools such as REXX scripting and CMS Pipelines. Many installations blend these options, typically developing agents or other probes connected to native interfaces or commercial products. Unique environments require locally developed monitoring, but those have ongoing and sometimes-hidden support costs; they should be a last resort.
A middle ground between seat-of-pants problem detection and automation can be a dashboard integrating disparate information for easy inspection and at-a-glance alerting. Michael Schmutzok, Mainframe Senior Systems Programmer at Shands HealthCare, described the sort of homegrown tool sites and system programmers often develop first:
“Using an internal Web page as a simple system health check display: red/yellow/green stop light indicators showing status of various system metrics (e.g., MVS Paging space, JES2 SPOOL usage, JOE/JQE usage, CICS and DB2 up/down status, DASD pool free space statistics, tape scratch pool usage). Data is collected via an hourly batch [Interactive Output Facility]—an [System Display and Search Facility] software alternative—job issuing various MVS/JES2/HSM commands with results screen scraped using REXX programs. An hourly job recreates the page. Many times this averted…a JES2 SPOOL issue or noticed that an SMF dataset hadn't been dumped. DASD indicators alerted and allowed creating/adding new disks to DASD pools before free space became an issue. If an indicator turned yellow, something was done about it before it reached red state.”
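A stoplight page of the kind described in the quote can be sketched in a few lines. The Python below is only an illustration, not the Shands tool: the metric names echo the quote, but the thresholds, the function names status_for and render_page, and the sample readings are all assumptions made for the example. A real job would fill the readings from command output rather than a hard-coded dictionary.

```python
from html import escape

# Hypothetical thresholds in percent used: (yellow_at, red_at).
# Real values are site-specific and are assumptions here.
THRESHOLDS = {
    "JES2 SPOOL usage": (70, 85),
    "Tape scratch pool usage": (75, 90),
    "DASD pool usage": (80, 90),
}


def status_for(metric, percent_used):
    """Return 'green', 'yellow' or 'red' for one metric reading."""
    yellow_at, red_at = THRESHOLDS[metric]
    if percent_used >= red_at:
        return "red"
    if percent_used >= yellow_at:
        return "yellow"
    return "green"


def render_page(readings):
    """Build a minimal HTML table with one colored cell per metric."""
    rows = []
    for metric, percent_used in readings.items():
        color = status_for(metric, percent_used)
        rows.append(
            f'<tr><td>{escape(metric)}</td>'
            f'<td style="background:{color}">{percent_used:.0f}%</td></tr>'
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    # Sample readings are invented for illustration.
    sample = {
        "JES2 SPOOL usage": 72.0,
        "Tape scratch pool usage": 40.0,
        "DASD pool usage": 93.0,
    }
    print(render_page(sample))
```

Regenerating the page on a timer, as the hourly job in the quote does, keeps the display current without anyone having to ask for it.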
This sort of tool is an economical way to start on a small scale, identify critical metrics, begin monitoring, and evolve as requirements and risks clarify. The next step can be a heartbeat tool that simply checks status or measures and periodically reports "all is well," either to humans or a watching process. Absence of the message or a negative message indicates something to research. See “Commercial Monitoring Tools” below for product and tool descriptions.
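The heartbeat idea, where silence or a negative report is itself the signal, might look like the following Python sketch. The class name HeartbeatWatcher, the source names and the 60-second window are hypothetical choices for illustration; the clock is injectable only so the logic can be exercised without waiting.

```python
import time


class HeartbeatWatcher:
    """Track 'all is well' reports and flag sources that go quiet."""

    def __init__(self, expected_sources, max_silence_seconds, clock=time.monotonic):
        self.max_silence = max_silence_seconds
        self.clock = clock
        started = clock()
        # Treat startup as the first heartbeat so a source that never
        # reports is still flagged once the silence window passes.
        self.last_seen = {name: (started, True) for name in expected_sources}

    def record(self, source, ok=True):
        """Store the latest report from a source (ok=False is a negative report)."""
        self.last_seen[source] = (self.clock(), ok)

    def problems(self):
        """Return (source, reason) pairs that deserve a look."""
        now = self.clock()
        found = []
        for source, (seen_at, ok) in sorted(self.last_seen.items()):
            if not ok:
                found.append((source, "negative report"))
            elif now - seen_at > self.max_silence:
                found.append((source, "heartbeat missing"))
        return found


if __name__ == "__main__":
    # Hypothetical source names; a real watcher would be fed by the probes.
    watcher = HeartbeatWatcher(["CICSPROD", "DB2PROD"], max_silence_seconds=60)
    watcher.record("CICSPROD")
    print(watcher.problems())
```

The watching process then needs only to call problems() periodically and pass anything it returns to whatever notification path the site uses.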
As elsewhere, automation is the key to reliable alerting. z/VM, z/OS, Linux and z/VSE all include scripting and timer-driven tools; use them for predictable and repeatable data gathering. If measurements are only taken when something is wrong, suspected or being griped about, there's no healthy-system baseline with which to compare.
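One way to put a healthy-system baseline to work is to compare each new reading with the spread of readings gathered while the system was behaving. The Python sketch below assumes a simple mean and standard deviation test with a three-sigma limit; the function name deviates, the limit and the sample figures are illustrative assumptions, and many sites will want something more tailored, such as separate baselines by time of day.

```python
import statistics


def deviates(value, healthy_samples, max_sigmas=3.0):
    """Return True if value is far from the healthy baseline.

    healthy_samples must hold at least two readings taken during
    normal operation; statistics.stdev raises an error otherwise.
    """
    mean = statistics.mean(healthy_samples)
    spread = statistics.stdev(healthy_samples)
    if spread == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != mean
    return abs(value - mean) > max_sigmas * spread


if __name__ == "__main__":
    # Invented CPU-busy percentages from routine, timer-driven collection.
    baseline = [40, 42, 41, 39, 43]
    for reading in (41.5, 60):
        print(reading, "deviates" if deviates(reading, baseline) else "normal")
```

The test is only as good as the baseline, which is the point of collecting data on a schedule rather than only during trouble.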
A challenge in automating alerting is tuning triggers to avoid both false alarms and missed alerts. The latter can progress to cascading problems but the former can lead to either ignoring messages or disabling alerts. Some wisdom can be gained by comparing notes with other installations and researching figures of merit online and through user groups, including Computer Measurement Group, which specializes in "ensuring the efficiency and scalability of IT service delivery to the enterprise through measurement, quantitative analysis and forecasting."
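Two common tuning techniques are requiring several consecutive breaches before raising an alert and using a lower clearing level than raising level, so a value hovering near the limit doesn't flap. The Python sketch below shows both under assumed numbers; the class name DebouncedAlert and the 90/80 levels are hypothetical, and this is one approach among many rather than a recommended setting.

```python
class DebouncedAlert:
    """Raise after N consecutive breaches; clear only below a lower level."""

    def __init__(self, raise_at, clear_at, breaches_needed=3):
        if clear_at > raise_at:
            raise ValueError("clear_at must not exceed raise_at")
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.breaches_needed = breaches_needed
        self.active = False
        self.breaches = 0

    def update(self, value):
        """Feed one reading; return 'raised', 'cleared' or None."""
        if not self.active:
            if value >= self.raise_at:
                self.breaches += 1
                if self.breaches >= self.breaches_needed:
                    self.active = True
                    return "raised"
            else:
                # A single good reading resets the count, so brief spikes
                # never accumulate into an alert.
                self.breaches = 0
        elif value <= self.clear_at:
            self.active = False
            self.breaches = 0
            return "cleared"
        return None


if __name__ == "__main__":
    alert = DebouncedAlert(raise_at=90, clear_at=80, breaches_needed=3)
    for reading in [95, 70, 91, 92, 93, 85, 79]:
        print(reading, alert.update(reading))
```

Raising breaches_needed trades slower detection for fewer false alarms, which is exactly the balance the paragraph above describes.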
Ideally, each problem should be encountered and resolved only once. While that's unattainable, techniques and tools can drive towards it. Two resources for this quest, both from mainframer Dan Skwire, are the book "First Fault Software Problem Solving" and a related LinkedIn group.
The mindset should be to not only resolve problems but to ensure that a failure won't occur again. Following IT Infrastructure Library (ITIL) practices or good mainframe techniques means analyzing every failure to learn its root cause and adding an early warning system for it.
Maintain System Health
Better than automated problem detection and quick resolution, of course, is maintaining system health. This returns to dashboards and trend monitoring, either displaying or alerting before a metric becomes critical.
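Trend monitoring can be as simple as fitting a straight line through recent readings and estimating when it will cross a limit. The Python sketch below uses an ordinary least-squares fit; the function name hours_until, the sample readings and the 80 percent limit are assumptions for illustration, and real growth is rarely this linear, so the result is a rough warning rather than a forecast.

```python
def hours_until(threshold, readings):
    """Estimate hours from the last reading until threshold is reached.

    readings is a list of (hour, value) pairs in time order. Returns
    None when there is too little data or the trend is flat or falling.
    """
    n = len(readings)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in readings) / n
    mean_v = sum(v for _, v in readings) / n
    denom = sum((t - mean_t) ** 2 for t, _ in readings)
    if denom == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in readings) / denom
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing_time = (threshold - intercept) / slope
    last_time = readings[-1][0]
    # Never report a negative wait if the limit is already passed.
    return max(0.0, crossing_time - last_time)


if __name__ == "__main__":
    # Invented hourly DASD pool usage percentages.
    usage = [(0, 50), (1, 52), (2, 54), (3, 56)]
    print(hours_until(80, usage))
```

A dashboard could turn an indicator yellow when the projected time falls below, say, a working day, giving staff room to act before the metric itself goes critical.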
Additional best practices are personal checklists supplementing automation or dealing with things not yet automated. This might include verifying that access to key systems isn't broken by mysterious firewall changes. Of course, such checking can be automated by PC-hosted test programs mimicking user connectivity and behavior.
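An automated version of the firewall check could be a small program that tries to open the same TCP connections users depend on. The Python sketch below does only that; the host names and ports are placeholders, the names can_connect and failing_endpoints are invented for the example, and a successful connection shows the path is open, not that the application behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refusals, timeouts and name-resolution failures.
        return False


def failing_endpoints(endpoints, probe=can_connect):
    """Return the names of endpoints whose probe fails.

    endpoints maps a display name to a (host, port) pair. The probe is
    a parameter so the check can be tested without a network.
    """
    return [name for name, (host, port) in endpoints.items()
            if not probe(host, port)]


if __name__ == "__main__":
    # Placeholder addresses; substitute the systems users really reach.
    checks = {
        "TN3270 access": ("mainframe.example.com", 23),
        "Intranet health page": ("intranet.example.com", 443),
    }
    for name in failing_endpoints(checks):
        print("Cannot reach:", name)
```

Run from a PC on the user side of the firewall, such a check sees what users see, which a probe running on the mainframe itself cannot.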
Today's infrastructures, composed of multiple and heterogeneous systems, can exhibit equally varied but unanticipated symptoms, problems and disasters. Passive and reactive monitoring/control systems simply aren't adequate. Modern systems can be instrumented to alert via pager, text message, email and telephone. Warnings don't yet come via telepathy, but perhaps that will come with implanted Google chips.
When faced with an alert, after verifying that a problem or anomaly exists, determine if it's due to a bug, misconfiguration, misbehaving VM or subsystem, etc. Maintain a healthy skepticism to avoid being trapped by misleading symptoms or by too-quickly anchoring (i.e., being fixated on the first explanation considered).
Commercial Monitoring Tools
Abundant commercial monitoring tools target different computing environments, observe different events and indicators, alert in different ways and—of course—hit different price points. A representative sample includes:
Dashboard for Prevention
- IBM Tivoli OMEGAMON XE products to monitor all things mainframe: z/OS and subsystems, z/VM, Linux, networks, etc. Tivoli products use a common framework to process and display information, IBM Tivoli Monitoring.
- BMC Mainview, providing a tools suite for monitoring diverse mainframe environments, including z/OS and subsystems, networks, Linux and related z/VM systems, batch workloads, worldwide data center management integration, etc.
- CA Technologies' SYSVIEW, which provides monitoring with alerts and problem identification, unifies views of application performance and transaction flow, and gives real-time insights into application performance.
- CA Unified Infrastructure Management (formerly Nimsoft Monitor), a scalable IT monitoring solution providing full visibility into systems and infrastructure performance.
- ASG TMON product suite integrating multiple component tools for monitoring z/OS, CICS, IMS, TCP/IP, DB2, WebSphere MQ, VTAM, z/VSE and more.
- IBM Resource Measurement Facility, the company's strategic product for z/OS performance measurement and management. This is the base product to collect performance data for z/OS and Sysplex environments.
- IBM z/OS Management Facility, which provides a Web-based interface that manages aspects of z/OS systems through a browser. For new users it includes a framework of automated tasks, tool tips and online help.
- IBM zAware, an integrated, self-learning, z/OS and Linux analytics solution to help identify unusual system behavior in near real time. Running in a special-purpose firmware partition isolates it from production environments and allows one zAware to monitor multiple systems.
- IBM Health Checker for z/OS, a foundation to simplify and automate identification of potential configuration problems before they impact system availability. It compares active values and settings to those suggested by IBM or installation defined.
- IBM z/OS Predictive Failure Analysis, which detects problems requiring analysis of system history and current trends, potentially avoiding soft failures.
- IBM CICS Explorer—an example of a narrowly focused system management tool—offering a simple, integrated and intuitive way of managing one or more CICS systems.
- IBM Operations Manager for VM, which automates monitoring and management of z/VM and Linux guests, helping address issues before they impact service-level agreements. Systems programmers and administrators can automate routine maintenance tasks in response to system alerts; users more easily debug problems by viewing and interacting with consoles for service machines and Linux guests.
- Cullen programming's Virtual Machine Resource System Monitor, a real-time system performance monitor evaluating VM system state, keeping technical support staff aware of changing resource demands and loads.
- Velocity Software, which bills itself as "the z/VM and Linux performance management company." It offers an extensive set of performance analysis, capacity planning, accounting and operational support tools.
- The open source project Xymon Clients for System z OSs, providing Xymon for z/OS, z/VM and z/VSE. Originally developed and maintained by Richard Smrcina under GPLv2 in 2003-2010, the project Savannah took responsibility for further maintenance in 2014.
- CSI International BIM-FAQS/ASO, a z/VSE systems operations tool. It provides extensive system automation, including full message management and enhanced system console support; automated console messages; end-of-job console reporting; attention routine command support; CMS user message routing; and hardcopy file printing, backup and merge.
- SNMP, a standard protocol for managing IP network devices. Many commercial and shareware tools accept SNMP notifications, so agents can be written to report on mainframe subsystems or installation-specific components.
A less popular protocol is (paraphrasing the old TV commercial) "We find out about problems the old-fashioned way: when customers complain," often before the most sophisticated alert system triggers. But these complaints are useful, suggesting enhancements to automated monitoring. Just don't make users think it's their job to detect and report problems.
Consider two reasons to implement dashboarding, automation and alerting systems: to anticipate and promptly respond to problems, and to maintain the mainframe's reputation for premier reliability.
Gabe Goldberg has developed, worked with, and written about technology for decades. Email him at email@example.com.