
Code Install Improvements in 8.41 Phase 2

By Archive User posted Tue December 19, 2017 07:28 PM

IBM TS7700 has been a solid part of IBM Z solutions for over 10 years, supporting z/OS, z/VM, z/TPF and z/VSE. Over 80 percent of Fortune 500 companies around the globe continue to rely on proven IBM Z Systems technologies for their most critical workloads. IBM continues to meet this demand with the TS7700 virtual tape solution by using the grid as cloud tape storage for IBM Z. A TS7700 grid consists of one to eight clusters. IBM Z hosts view up to 496 devices per cluster (that is, up to 3968 devices in a single eight-cluster grid). Each cluster supports cumulative FICON throughput of over 3.6 GB/s, for up to eight times that figure across a full grid. The TS7700 provides the first zero-RPO synchronous copy method. There is no concept of primary or secondary nodes on the TS7700. All data in the grid is tracked for consistency, regardless of which cluster a device is connected to and regardless of where the data exists in the grid.
Because all clusters in a grid are equal peers, a cluster can be put into service and taken offline and, as long as another copy of the data exists elsewhere in the grid, the host still has access to that data. Once the cluster is offline, the remaining clusters keep track of changes that occur during the outage. When the cluster is brought back online, the peer clusters update it with those changes and queue any necessary copies to it.
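The offline tracking described above can be pictured with a small toy model. This is only an illustrative sketch: the `Grid` class, its method names and the volume serials are hypothetical and do not reflect TS7700 internals.

```python
class Grid:
    """Toy model of a grid of equal peers (hypothetical, for illustration only)."""

    def __init__(self, cluster_names):
        self.online = set(cluster_names)
        # Maps each offline cluster to the volumes changed while it was away.
        self.pending = {}

    def vary_offline(self, cluster):
        self.online.discard(cluster)
        self.pending[cluster] = set()

    def write_volume(self, volser):
        # The peers that remain online note the change for every offline cluster.
        for changed in self.pending.values():
            changed.add(volser)

    def vary_online(self, cluster):
        # Return the list of copies to queue to the rejoining cluster.
        self.online.add(cluster)
        return sorted(self.pending.pop(cluster, set()))


grid = Grid(["c0", "c1", "c2"])
grid.vary_offline("c2")
grid.write_volume("A00001")
grid.write_volume("A00002")
grid.write_volume("A00001")  # a rewrite of the same volume is tracked once
copy_queue = grid.vary_online("c2")
```

Writes made after the cluster rejoins are no longer tracked as pending, because no cluster is offline at that point.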
With more enterprise data moving to more cost-effective storage tiers, storage targets that were once considered third- or fourth-tier are now expected to match the most critical tiers of storage. IBM is constantly reminded that the TS7700 is no longer considered a “tape backup solution”. The TS7700 is considered a critical part of business continuance for all customers, and any outage on a cluster needs to be minimized. IBM is challenged to minimize outages for code upgrades and to provide code changes to its customers in a prompt, efficient manner. With the 8.41.2xx.xx release, IBM has focused on minimizing the outage time required for a code upgrade. The outage for a code install is outlined in three phases: Pre-Install, Install and Post-Install.

Pre-Install
One of the largest issues with placing a TS7700 in service and varying a cluster offline is the process of varying the host devices to that cluster offline. When a TS7700 distributed library needs to go into service, a host user goes through a series of manual steps in preparation. This includes manually varying offline all devices for that distributed library across all LPARs and canceling or performing a SWAP on any long-running jobs. These steps can lengthen a scheduled outage.
With the release of 8.41.2xx.xx, the IBM TS7700 supports the tape Control Unit Initiated Reconfiguration (CUIR) function. This is a software mechanism for tape controllers to automatically request an IBM Z (z/OS) host to vary that TS7700 cluster’s devices offline when service is required. Once service has completed, the TS7700 can also automatically request the same devices be varied online.
Tape CUIR and related commands were introduced to help automate and simplify the process. With this new capability the TS7700 can determine which LPARs are attached and which of them support specific CUIR capabilities.
The TS7700 starts supporting the CUIR function when all clusters in the grid have a microcode level of 8.41.2xx.xx or later. Once the TS7700 composite library supports CUIR, it notifies the host that it is CUIR capable. The z/OS host starts supporting the tape CUIR function once APAR OA52376 is installed. The host notifies the TS7700 that it is CUIR capable per path group (LPAR). This tells the TS7700 which LPARs understand the new CUIR attention messages. The TS7700 saves this information so it can later send library notification attention messages to those LPARs during service-prep or after online events. Only native z/OS hosts (LPARs) support the tape CUIR function.
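The two-sided capability check described above can be sketched as follows. This is a minimal illustration, assuming code levels are dotted numeric strings and reading "8.41.2xx.xx" as a minimum of 8.41.200. The function names, the example code levels and the LPAR names are all hypothetical.

```python
# Assumed numeric reading of "8.41.2xx.xx or later" (an assumption, not an IBM constant).
CUIR_MIN_LEVEL = (8, 41, 200)


def parse_level(level):
    """Turn a dotted code level such as '8.41.200.13' into a tuple of ints."""
    return tuple(int(part) for part in level.split("."))


def grid_supports_cuir(cluster_levels):
    """The grid is CUIR capable only when every cluster meets the minimum level."""
    return all(
        parse_level(level)[:3] >= CUIR_MIN_LEVEL
        for level in cluster_levels.values()
    )


def lpars_to_notify(grid_capable, lpar_reports):
    """Return the path groups (LPARs) that reported CUIR support.

    lpar_reports maps an LPAR name to True if that host reported CUIR capability.
    """
    if not grid_capable:
        return []
    return sorted(lpar for lpar, capable in lpar_reports.items() if capable)


# Made-up code levels and LPAR names, for illustration only.
mixed_grid = {"c0": "8.41.200.13", "c1": "8.41.100.16"}
upgraded_grid = {"c0": "8.41.200.13", "c1": "8.41.201.5"}
reports = {"LPAR1": True, "LPAR2": False, "LPAR3": True}

notify = lpars_to_notify(grid_supports_cuir(upgraded_grid), reports)
```

In this sketch a single down-level cluster keeps the whole grid from advertising CUIR, which mirrors the "all clusters" condition in the text.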
The Automatic Vary Devices Online (AONLINE) and Manual Vary Device attention messages will trigger the z/OS host to automatically perform the task of varying the devices for that distributed library either online or offline. The automatic attention messages are triggered during Service-Prep and after service is canceled and the TS7700 is online.
The ability to enable or disable CUIR Service Vary and AONLINE Service Vary automation for service-prep has grid scope (composite library). Both options are set to “Disabled” by default. New LI REQ commands are provided to enable and disable CUIR and AONLINE Service Varies.
When CUIR Service Vary is enabled and Service Prep has been invoked, the TS7700 will track grouped (online) devices to all path groups that reported CUIR as being supported and will not enter service until all grouped devices are varied offline (at this point they become ungrouped). A non-busy, offline device is a device that has no path groups grouped to it.
The TS7700 will provide information during service-prep on how many LPARs remain busy and information on which LPARs do not support the command and need to be manually varied offline.
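The service-prep bookkeeping described above can be sketched as a simple status function. This is illustrative only: the function name, the data shapes and the device addresses are hypothetical, and the sketch assumes the cluster is ready for service only once no LPAR has any device still grouped.

```python
def service_prep_status(grouped, cuir_capable):
    """Summarize which LPARs still hold grouped (online) devices.

    grouped:      maps LPAR name -> set of device addresses still grouped
    cuir_capable: set of LPAR names that reported CUIR support
    """
    automatic_pending = sorted(
        lpar for lpar, devices in grouped.items()
        if devices and lpar in cuir_capable
    )
    manual_vary_needed = sorted(
        lpar for lpar, devices in grouped.items()
        if devices and lpar not in cuir_capable
    )
    return {
        "automatic_pending": automatic_pending,    # CUIR will vary these offline
        "manual_vary_needed": manual_vary_needed,  # an operator must vary these
        # Assumption: service starts only when nothing is grouped anywhere.
        "ready_for_service": not automatic_pending and not manual_vary_needed,
    }


during_prep = service_prep_status(
    {"LPAR1": {"0A00", "0A01"}, "LPAR2": {"0A00"}, "LPAR3": set()},
    cuir_capable={"LPAR1", "LPAR3"},
)
after_vary = service_prep_status(
    {"LPAR1": set(), "LPAR2": set(), "LPAR3": set()},
    cuir_capable={"LPAR1", "LPAR3"},
)
```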
In parallel with the effort of varying the devices offline, the new code image is pre-staged on a boot disk using a mksysb backup.

Install
Once a cluster is in service and offline, there are two final processes that need to be performed to activate the new code level. The first process includes backing up the database, exporting filesystems, creating new filesystems if needed and pre-staging any processor or firmware upgrades that require a reboot. The system is then rebooted from the mksysb image with the new code level. During the reboot the system hardware is discovered and the system is ready to complete the second process of code activation. The final process consists of hardware configuration, database migrations, importing and configuring filesystems, upgrading drive (DDM) firmware and firmware upgrades of the various controllers and components within the cluster.
In 8.41.2xx.xx, we modified our filesystem export and import processes to handle the network shared disks (NSDs) in parallel. On a fully populated cache, this change reduced the filesystem export from 75 minutes to 12 minutes and the filesystem import from 160 minutes to 60 minutes. Database migrations also improved by using Db2 exports and imports in place of database migrates. Increased parallelization of the database, hardware and cache configuration processes contributed further to the improvements.
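The general idea of processing NSDs in parallel rather than one at a time can be sketched with a thread pool. This is not the TS7700 implementation: `export_nsd` is a hypothetical stand-in that only sleeps to simulate I/O-bound work, and the NSD names and worker count are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def export_nsd(name, seconds=0.01):
    """Stand-in for exporting one NSD; the sleep simulates I/O-bound work."""
    time.sleep(seconds)
    return f"{name}:exported"


def export_serial(nsds):
    """Export the NSDs one after another."""
    return [export_nsd(name) for name in nsds]


def export_parallel(nsds, workers=8):
    """Export the NSDs concurrently; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(export_nsd, nsds))


nsds = [f"nsd{i}" for i in range(16)]
serial_result = export_serial(nsds)
parallel_result = export_parallel(nsds)
```

Both functions return the same results. The parallel version simply overlaps the waiting time of independent exports, which is why the benefit grows with the number of NSDs in the cache.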
Future improvements include concurrent DDM firmware upgrades. On systems with large cache (i.e. large number of DDMs), if the firmware on the DDMs requires an upgrade then this will add to the outage time. The TS7700 is capable of performing the DDM firmware upgrade concurrently and the process can run in background while the system is online. The DDM upgrade that is executed while the system is offline is much faster than the online process, but it can add 30 minutes to the upgrade time. There is always a tradeoff with concurrent background processes. There is a cost to performance while the process is running in the background and competing with other activity on the cluster. It will be up to the customer to determine what the best solution is for their business. IBM will provide an interface for the customer to throttle the upgrades during periods of high activity.
In 8.41.2xx.xx, these modifications reduced the outage during the code load process by more than 150 minutes for a large configuration!
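As a quick check, the filesystem export and import figures quoted for a fully populated cache already account for the stated saving on their own. The variable names below are only for illustration.

```python
# Times in minutes, taken from the figures quoted for a fully populated cache.
export_before, export_after = 75, 12
import_before, import_after = 160, 60

export_saved = export_before - export_after  # minutes saved on export
import_saved = import_before - import_after  # minutes saved on import
total_saved = export_saved + import_saved

print(f"export: {export_saved} min, import: {import_saved} min, total: {total_saved} min")
```

The two filesystem steps alone save 63 + 100 = 163 minutes, which is consistent with the "more than 150 minutes" claim before counting the database and configuration improvements.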

Post-Install
If the Automatic Vary Devices Online (AONLINE) is enabled, when service is cancelled on the TS7700, an attention message will trigger the IBM Z host to automatically perform the task of varying the devices for that distributed library online.
When a TS7700 is varied back online to a grid, it performs a token merge process with the other peer clusters in the grid. Through this process the TS7700 learns what changed while it was offline. Depending on how long the TS7700 was offline and the amount of activity on the grid, a large number of volumes may have been modified and need to be reconciled. In prior code levels, the time to reconcile was multiplied by the number of clusters in the grid: each cluster that remained online presented an almost identical list of modified volumes to the cluster being varied online. In 8.41.1xx.xx, the token merge process was modified so that each cluster in the grid presents a unique set of modified volumes to the TS7700 coming online. On a system with high activity, this improved a hot token merge from 50 minutes down to 10 minutes in a five-cluster grid. Overall, the TS7700 is now fully online in 10 minutes or less, rather than roughly 10 minutes multiplied by the number of clusters in the grid. This improvement applies to any online sequence of a TS7700, not just one following a code upgrade.
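The difference between the two token merge behaviors can be illustrated with a toy comparison. This is a sketch only: the round-robin split is an assumption chosen for simplicity, not the actual TS7700 partitioning scheme, and the function names, peer names and volume serials are hypothetical.

```python
def merge_lists_before(peers, modified):
    """Prior behavior: every online peer presents essentially the same full list."""
    return {peer: sorted(modified) for peer in peers}


def merge_lists_after(peers, modified):
    """Newer behavior: each peer presents a disjoint share of the modified volumes.

    A round-robin split is used here purely for illustration.
    """
    shares = {peer: [] for peer in peers}
    for index, volser in enumerate(sorted(modified)):
        shares[peers[index % len(peers)]].append(volser)
    return shares


def entries_to_process(lists):
    """Total number of list entries the rejoining cluster has to reconcile."""
    return sum(len(entries) for entries in lists.values())


peers = ["c1", "c2", "c3", "c4"]  # clusters that stayed online
modified = {f"V{i:05d}" for i in range(100)}

before = entries_to_process(merge_lists_before(peers, modified))
after_lists = merge_lists_after(peers, modified)
after = entries_to_process(after_lists)
```

In this toy model the rejoining cluster reconciles each modified volume once instead of once per peer, so the work no longer scales with the size of the grid.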
IBM is very pleased with the success of the TS7700 in the enterprise mainframe space. The TS7700 is vital to business continuance, and IBM has accepted the challenge to minimize any outages associated with code upgrades. The enhancements in the 8.41.2xx.xx release are a major first step.
