Tape Storage


TS7700 Grid Resiliency

By Archive User posted Wed December 27, 2017 10:57 PM



The IBM TS7700 family is the latest in the line of tape virtualization products for the IBM Z platform. It is a highly scalable, reliable, high-performance tape solution that supports today's demanding IT environments.

The first generation of the TS7700 family (TS7740) was shipped in 2006. For more than 10 years, the TS7700 family has continued to evolve. IBM TS7760 is the latest model of the family and became generally available in June 2016. TS7760 adopts IBM POWER8 technology, making it the most powerful and reliable member of the family.

Clients can configure a TS7700 Grid by interconnecting up to eight TS7700 clusters through industry-standard IP network infrastructure. Client data redundancy and product resiliency are achieved through design concepts that make the product unique in the enterprise tape market space.

In this article, I will focus on the most important concepts which make the TS7700 unique and the most resilient tape virtualization product in the world for IBM Z.

Virtual Tape Grid Cloud Architecture

Many of the TS7700's resiliency capabilities are built upon the "Virtual Tape Grid Cloud" architecture. Similar to modern cloud storage systems, TS7700 clients do not need to be aware of where their data exists in their TS7700 Grid even if the grid is made up of multiple (two to eight) TS7700 clusters. Each TS7700 cluster's devices within an entire grid always have access to all virtual tape copies within the grid. There are two key design concepts behind the Virtual Tape Grid Cloud architecture.

The first concept is the "Virtual Tape Composite Library". IBM Z hosts view the entire TS7700 Grid as one large "composite library" with up to 3,968 virtual tape devices. Each TS7700 cluster in the grid is configured as a "distributed library", but IBM Z operating systems with DFSMS OAM's exclusive TS7700 support see them all as one combined composite library. This frees IBM Z operating systems from managing data-to-distributed-library mapping, since they can assume that every virtual tape exists somewhere in the TS7700 Grid. No user intervention on the IBM Z side is required during a planned or unplanned TS7700 cluster outage. This concept is unique to the IBM TS7700 family and differentiates it from its competitors in the market. It not only simplifies business continuance, but also allows I/O performance to scale horizontally within a TS7700 Grid: the more clusters in the grid, the more virtual tape devices, and thus greater I/O performance and connectivity. I do not cover the performance benefits in depth in this article, but if you're interested in those details, please refer to the IBM TS7700 R4 (TS7760) Performance White Paper.
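The horizontal scaling of virtual devices can be sketched in a few lines. This is an illustrative calculation, not IBM code; the 496-devices-per-cluster figure is an assumption chosen to be consistent with the 3,968-device maximum for an eight-cluster grid cited above.

```python
# Illustrative sketch: how the virtual device count an IBM Z host sees
# scales with the number of clusters in a TS7700 composite library.
# DEVICES_PER_CLUSTER is an assumption (3,968 / 8 = 496).
DEVICES_PER_CLUSTER = 496

def composite_library_devices(num_clusters: int) -> int:
    """Total virtual tape devices presented by the composite library."""
    if not 1 <= num_clusters <= 8:
        raise ValueError("a TS7700 Grid contains at most eight clusters")
    return num_clusters * DEVICES_PER_CLUSTER

print(composite_library_devices(8))  # 3968
```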

The second key design concept is "Virtual Tape Grid Cloud Access". There is no concept of primary, secondary or standby clusters in a TS7700 Grid. All clusters are equal players, as in modern cloud storage systems. Data is accessible from any virtual tape device of any TS7700 cluster, regardless of where the data physically resides. If a copy does not exist on the "local" TS7700 cluster, that cluster automatically looks for a copy elsewhere in the grid and directs the workload through the IP network to the cluster where the data actually resides. All of this activity is hidden from the operator. The host and applications are not aware that data is being accessed remotely through the TS7700 Grid network.
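The grid cloud access behavior can be sketched as a simple lookup. This is a hypothetical model of the idea, not the TS7700's actual selection algorithm; the cluster names, volume serials, and data structures are invented for the example.

```python
# Illustrative sketch of grid cloud access: a mount on any cluster's
# device succeeds even when the only copy lives on a peer. The real
# TS7700 TVC selection logic is more sophisticated than this.
def resolve_mount(volume: str, local: str, copies: dict[str, set[str]]) -> str:
    """Return the cluster whose cache will serve I/O for `volume`."""
    if volume in copies.get(local, set()):
        return local                      # serve from the local cache
    for cluster, vols in copies.items():  # otherwise find any peer copy;
        if volume in vols:                # I/O flows over the grid IP links,
            return cluster                # invisibly to the host
    raise LookupError(f"{volume} not found anywhere in the grid")

copies = {"CL0": {"VT0001"}, "CL1": {"VT0002"}}
print(resolve_mount("VT0002", "CL0", copies))  # CL1 — remote, hidden from host
```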

These two key design concepts are the foundation of the Virtual Tape Grid Cloud architecture. If you're interested in the details of this architecture, please refer to "Chapter 2. Architecture, components, and functional characteristics" of the IBM TS7700 Release 4.0 Guide.

Virtual Tape Granular Replication

In order to achieve enterprise-level resiliency in a storage system, data redundancy is essential. The TS7700 provides "Virtual Tape Granular Replication". Each virtual tape, regardless of where it is created, can have up to eight copies in a grid. An operator can specify when replication occurs: synchronously (SYNC copy), immediately as part of IBM Z close processing (RUN copy), or asynchronously (Deferred copy). The method can be specified at individual virtual tape granularity through DFSMS policy management. In addition, virtual tape granular replication means each site always knows which virtual tapes are valid. In the event of an outage, there is no ambiguity as to which virtual tapes have or have not completed replication.
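The idea of per-volume replication policy can be sketched as a small table. The policy names and the per-cluster mode assignments below are invented for illustration; in practice these settings come from DFSMS policy constructs, not from code like this.

```python
# Illustrative sketch of virtual-tape-granular replication policy:
# each policy maps clusters to a copy mode (SYNC, RUN, or DEFERRED),
# allowing up to eight copies per volume. All names are hypothetical.
COPY_POLICIES = {
    "CRITICAL": {"CL0": "SYNC", "CL1": "SYNC", "CL2": "DEFERRED"},
    "STANDARD": {"CL0": "RUN",  "CL1": "DEFERRED"},
}

def copy_targets(policy: str, mode: str) -> list[str]:
    """Clusters that receive a copy with the given mode under a policy."""
    return sorted(c for c, m in COPY_POLICIES[policy].items() if m == mode)

print(copy_targets("CRITICAL", "SYNC"))  # ['CL0', 'CL1'] — the zero-RPO pair
```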

Synchronous mode replication enables configurations to achieve a zero recovery point objective and recovery time objective (RPO/RTO). The TS7700 family introduced the world's first virtual-tape-granular synchronous copy (SYNC copy) in 2011. Since then, its popularity has grown as numerous clients all over the world have adopted this enterprise-level feature. Before SYNC copy was available, operators who needed to minimize RPO relied on RUN copy; however, RPO was only near zero in certain use cases, such as with DFSMShsm ML2. With SYNC copy, up to two TS7700 clusters are kept consistent after each implicit or explicit tape sync operation. The replication occurs independently of receiving a Rewind/Unload (RUN) command from the IBM Z host. As a result, the TS7700 achieves a zero RPO for critical tape workloads, and there are no distance limitations given its industry-unique, designed-for-tape method of synchronizing data. Additional RUN or Deferred copies can occur once the RUN command is issued. This frees IBM Z operating systems from host-side duplexing to achieve equivalent RPOs. If you're considering introducing SYNC copy in your environment, you'll find the white paper IBM TS7700 Series Best Practices - Synchronous Mode Copy very useful.

TS7700 Grid Resiliency

No matter how redundant IT systems are, problems can still occur. Such problems may be caused by multiple hardware component failures, software problems, or infrastructure issues such as network anomalies. In certain cases, the problematic or affected component must be fenced in order to prevent further impact on the whole system. The TS7700 has introduced tools and automatic detection and fencing features in recent releases, inspired by IBM Z Parallel Sysplex failure detection concepts.

There are a number of capabilities incorporated within the TS7700 that detect a hardware or software failure and report it through IBM Z host messages or the TS7700 web interface. The system can also be configured to automatically report the failure to IBM and, depending upon the severity, dispatch a service technician. In very rare cases, if the TS7700 determines that a detected issue is too critical to continue operating, it can isolate the offending cluster from the grid. These capabilities allow the TS7700 to serve as an enterprise tape solution.

We've delivered even greater resiliency in R3.3 PGA2 and R4.1.1 by providing an IBM z/OS console library request command called "DIAGDATA". This command makes visible to an operator any symptoms that may impact tape processing. The command is informational only and takes no action, but it helps operators diagnose and isolate a TS7700 cluster that is not functioning properly. Please refer to the white paper IBM TS7700 Series z/OS Host Command Line Request DIAGDATA Guidance.

In addition to the "DIAGDATA" library request command, we've provided several knobs to make the TS7700 even more resilient in certain situations. For example, operators can use the "LOWRANK" library request command to override the tape volume cache (TVC) selection algorithm so that specific clusters are less preferred for mounts or copies. Operators can also shorten the time required for service-prep of a cluster. Another example is the "PHYSLIB" command, which can force a TS7700 tape-attached cluster to make virtual tape copies even if the attached physical library is in a degraded condition. Please refer to the IBM TS7700 Series z/OS Host Command Line Request User's Guide for the full list of available command line requests. If you've never looked at the guide, you should find it useful, and perhaps you'll find a solution you've been looking for!
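These keywords are issued through the standard z/OS LIBRARY REQUEST console command. As a rough sketch of the invocation form (GRIDLIB is a placeholder composite library name; consult the User's Guide for the exact keywords and any additional operands each one accepts):

```
LIBRARY REQUEST,GRIDLIB,DIAGDATA
LI REQ,GRIDLIB,DIAGDATA          /* LI REQ is the common abbreviation */
```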

Fully Automated Data Resynchronization

As a TS7700 cluster comes online, it needs to synchronize with the other clusters in the grid based on which virtual tapes changed while it was unavailable. How much user intervention is required to perform this synchronization? None at all. All important events targeting a virtual tape (such as mount, demount, category change, etc.) are tracked by "tokens" and those tokens are used to automatically reconcile changes after an outage. For example, let's say Cluster X is coming online following a code upgrade. The other clusters in the grid inform Cluster X which virtual tapes were updated during that time period. Once Cluster X is told which virtual tapes have been updated, it becomes immediately available to IBM Z hosts. In other words, Cluster X does not need to wait for all the data to be copied over before becoming usable. The data resynchronization is done in a background process. Whether it's a planned outage or something more extreme such as a full DR site failover and failback, the TS7700 handles the resynchronization in all directions automatically.
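The token idea described above can be sketched as a comparison of per-volume version counters. This is a simplified, hypothetical model: the field names and token values are invented, and the real TS7700 token mechanism tracks richer state than a single integer.

```python
# Illustrative sketch of token-based reconciliation: a cluster returning
# to the grid compares its per-volume tokens with a peer's to learn which
# virtual tapes changed during the outage; data then copies in the
# background while the cluster is already usable by hosts.
def stale_volumes(local: dict[str, int], peer: dict[str, int]) -> set[str]:
    """Volumes whose peer token is newer than ours (need background resync)."""
    return {vol for vol, tok in peer.items() if tok > local.get(vol, -1)}

local = {"VT0001": 5, "VT0002": 7}
peer = {"VT0001": 5, "VT0002": 9, "VT0003": 1}  # changed during our outage
print(sorted(stale_volumes(local, peer)))  # ['VT0002', 'VT0003']
```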

New TS7700 Grid Resiliency Enhancements Introduced in Release 4.1.2

With Release 4.1.2, the TS7700 introduces several important improvements that make the TS7700 Grid more resilient than ever before. Throughout the TS7700's successful 10+ year history, we've seen a few, albeit very rare, cases where a single cluster in a TS7700 Grid can affect its peer clusters. In addition to addressing many of the issues known today, the improvements in Release 4.1.2 are designed to handle even issues of which we're not yet aware.

These improvements include the ability to proactively identify a problem and isolate it from the grid using various methods. Operators can now configure a TS7700 cluster to automatically fence a peer cluster from the grid based upon user-defined thresholds. Furthermore, operators can specify what action should take place after the automatic fencing. The IBM development team is planning to release a Best Practices Guide for this new, enhanced Grid resiliency function, and it will be available on the IBM Techdocs website very soon.
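The threshold-and-action idea can be sketched as follows. Everything here is hypothetical: the metric, threshold value, and action names are invented for illustration, and the real function is configured through the TS7700's interfaces as the forthcoming Best Practices Guide will describe.

```python
# Illustrative sketch of threshold-driven peer fencing: if a peer cluster
# exceeds a user-defined failure threshold, the configured action fires.
from dataclasses import dataclass

@dataclass
class FencePolicy:
    max_handshake_failures: int = 3  # hypothetical user-defined threshold
    action: str = "ISOLATE"          # hypothetical action name

def evaluate(peer_failures: dict[str, int], policy: FencePolicy) -> dict[str, str]:
    """Decide, per peer cluster, whether the configured action fires."""
    return {
        peer: (policy.action if count > policy.max_handshake_failures else "OK")
        for peer, count in peer_failures.items()
    }

print(evaluate({"CL1": 1, "CL2": 5}, FencePolicy()))  # CL2 exceeds the threshold
```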