Protecting z/OS from Faulty SAN Storage Links

By Archive User posted Mon September 17, 2018 05:46 PM

Executive Summary
IBM recommends that clients exploit the “Improved Channel Path Recovery” (ICR) function, first delivered in z/OS 1.13 (2011), to quickly fence failing I/O resources and minimize any impact on the production workload.

The ICR function allows clients to set a z/OS policy for fencing I/O paths to a storage control unit. The policy specifies the scope of recovery (control unit or individual device) and the threshold: the number of errors, occurring within a specified time interval, that triggers the fencing action. IBM recommends a threshold of 10 errors within one minute and a control unit scope for the recovery.
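The threshold-within-an-interval idea can be illustrated with a small sliding-window counter. This is only a sketch of the concept, not how IOS implements the policy: the class name `PathErrorMonitor`, its methods, and the use of timestamps in seconds are all assumptions made for illustration. The defaults mirror the recommended 10 errors within one minute.

```python
from collections import deque


class PathErrorMonitor:
    """Sliding-window error counter for one I/O path (hypothetical, illustrative only)."""

    def __init__(self, threshold=10, interval_seconds=60.0):
        self.threshold = threshold
        self.interval_seconds = interval_seconds
        self._errors = deque()  # timestamps of recent errors, oldest first
        self.fenced = False

    def record_error(self, now):
        """Record an error at time `now` (seconds); return True once the path is fenced."""
        self._errors.append(now)
        # Drop errors that have aged out of the monitoring interval.
        while self._errors and now - self._errors[0] >= self.interval_seconds:
            self._errors.popleft()
        if len(self._errors) >= self.threshold:
            self.fenced = True
        return self.fenced


# Ten errors spaced 5 seconds apart all fall inside one 60-second window.
busy = PathErrorMonitor(threshold=10, interval_seconds=60.0)
busy_results = [busy.record_error(t * 5.0) for t in range(10)]

# Ten errors spaced 10 seconds apart never accumulate to ten within 60 seconds.
quiet = PathErrorMonitor(threshold=10, interval_seconds=60.0)
quiet_results = [quiet.record_error(t * 10.0) for t in range(10)]

print(busy_results[-1], quiet_results[-1])
```

In this sketch the first path is fenced on its tenth error, while the second path, with the same number of errors spread over a longer period, is not.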

In a Fibre Channel (FC) Storage Area Network (SAN) using FICON protocols, a single bit error may surface to the operating system in a number of ways. If the bit error is in the frame payload (the data being transferred), either the FC-2 layer will detect a frame CRC error or the FICON FC-4 layer will detect an end-to-end CRC error. The operation is aborted, the IBM Z channel subsystem surfaces an interface control check (IFCC), and z/OS issues an IOS050I message. Most IFCC errors can be retried by the device error recovery procedures (ERPs), depending on the device type and the context of the particular set of commands being issued. Before allowing the operation to be retried, the operating system performs recovery operations to make sure that the device is functioning properly and that higher-level recovery functions are not needed (e.g. selective reset, Reset Allegiance, etc.). These recovery operations typically execute very quickly and add minimal delay to the production work.
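As a rough illustration of why a single flipped payload bit is caught, the sketch below computes a 32-bit CRC over a payload, flips one bit, and compares the checksums. It uses Python's `zlib.crc32` purely as a stand-in for a 32-bit CRC; it does not reproduce Fibre Channel framing or the FICON end-to-end check, and the payload contents and the position of the flipped bit are made up for the example.

```python
import zlib

# Hypothetical payload; real frames carry channel program data.
payload = bytearray(b"FICON frame payload: 4 KiB record image" * 8)
crc_sent = zlib.crc32(payload)  # checksum the sender would append

# Simulate a single bit error on the link.
corrupted = bytearray(payload)
corrupted[17] ^= 0x04

# The receiver recomputes the CRC and compares it with the one sent.
crc_received = zlib.crc32(corrupted)
frame_ok = crc_received == crc_sent
print(frame_ok)  # False: a CRC detects every single-bit error
```

A mismatch like this is what leads to the operation being aborted and retried, rather than corrupted data being accepted.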

If the bit error occurs in a frame delimiter, header or trailer, the receiving end will not recognize the frame, and the frame is lost. The IBM Z FICON protocols will detect lost frames within 1.5 seconds and surface an IFCC error with a qualifier that indicates an interface timeout occurred. The z/OS operating system issues an IOS051I message in this case. The same operating system recovery procedures are run when interface timeouts occur, with some minor differences. Most interface timeout IFCC errors can be retried by the device error recovery procedures (ERPs), depending on the device type and the context of the particular set of commands being issued.

The frequency of bit errors depends on the quality of the cabling infrastructure, and faster link speeds are more sensitive to that quality. Abuse of the cables can lead to an increase in error rates. Abuse includes extreme bending and twisting of the optical cables, leaving the dust covers off when removing the cables, etc. IBM Z channels, control units and SAN switches support technologies such as Forward Error Correction (FEC) codes to provide a self-healing capability from bit errors, and technologies such as Read Diagnostic Parameters (RDP) to provide a self-diagnosing capability to quickly identify the faulty links.

Prior to the ICR function, the only mechanism for z/OS to fence failing paths was to depend on the DASD ERP exhausting its retry count after seeing 10 consecutive errors for a single I/O operation. If a retry succeeded on the failing path, no recovery action was performed. This could subject the client to many I/O delays and impact the performance of the work.

Intermittent bit errors can be very disruptive to the production workload. If the errors are detected as interface timeout conditions, individual I/O operations are delayed by one and a half seconds. A single transaction may have many I/O operations incur an error and experience multiple delays (see Figure 1).
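The difference between the two triggers, and the accumulated delay, can be made concrete with a toy model. Everything here is a simplification under stated assumptions: the traffic pattern (every 20th I/O on a path times out, and its retry succeeds) is hypothetical, the function names are invented for the example, and the count-based rule treats the whole list as falling inside one monitoring interval.

```python
TIMEOUT_DELAY_S = 1.5  # delay per interface timeout, as stated in the text


def consecutive_rule_fences(outcomes, retry_limit=10):
    """Pre-ICR style (simplified): fence only after `retry_limit` errors in a row."""
    run = 0
    for failed in outcomes:
        run = run + 1 if failed else 0  # any success resets the count
        if run >= retry_limit:
            return True
    return False


def count_rule_fences(outcomes, threshold=10):
    """ICR style (simplified): fence once `threshold` errors occur in the interval."""
    return sum(outcomes) >= threshold


# Hypothetical interval of traffic on one path: every 20th I/O times out.
outcomes = [(i % 20 == 0) for i in range(400)]
errors = sum(outcomes)
total_delay_s = errors * TIMEOUT_DELAY_S

print(errors, total_delay_s)                                     # 20 30.0
print(consecutive_rule_fences(outcomes), count_rule_fences(outcomes))  # False True
```

With intermittent errors the consecutive-error rule never fires, so the path stays in use and the delays keep accumulating, whereas a count-within-interval rule would have fenced the path.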

Improved Channel Recovery
IBM Z clients typically invest a lot of money in creating a robust I/O infrastructure with enough redundancy to be able to execute their workload in the event of any single failure in the I/O infrastructure (e.g. SAN switch failure, control unit failure, channel failure, book failure, HBA failure, etc.). Most IBM clients would prefer to fence failing components quickly to avoid impact on the production work. z/OS does not enable the ICR function by default because it changes operating system behavior, which clients prefer to control.

In z/OS Version 1 Release 13 (2011), a new policy was created for the Input/Output Supervisor (IOS) component. Clients can specify a policy for how aggressively IOS should fence paths to devices that are generating intermittent errors (see Figure 2).

Figure 3 below shows the syntax of the IECIOSxx Parmlib member for specifying the Improved Channel Recovery policy.

Figure 4 shows the policy settings recommended by IBM and adopted by the IBM Z user community.

zHyperWrite and Improved Channel Recovery Intersection
IBM zHyperWrite allows z/OS and middleware such as Db2 and IMS to accelerate the execution of critical write I/O requests by 50% when using synchronous replication such as IBM’s Metro Mirror technology. z/OS does this by executing the I/O requests to the primary device and the secondary device in parallel and depending on middleware recovery functions during DR recovery processing. For zHyperWrite, FICON channels are used to access both the primary and the secondary copy of the data. These I/O operations are subject to intermittent bit errors on both devices. Faults are just as likely to occur while accessing the primary device as the secondary device.
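A simple latency model shows why an error on either leg matters equally. The sketch assumes that a parallel write completes only when both the primary and the secondary writes have completed, so its latency is the larger of the two; that assumption, the function name, and the 0.2 ms service time are illustrative and not taken from the product documentation. The 1.5 second figure is the interface timeout described earlier.

```python
TIMEOUT_DELAY_S = 1.5  # interface timeout delay, as stated in the text


def parallel_write_latency(primary_s, secondary_s):
    """Assumed model: the write is done when the slower of the two legs is done."""
    return max(primary_s, secondary_s)


normal = 0.0002  # hypothetical 0.2 ms service time per leg

clean = parallel_write_latency(normal, normal)
primary_hit = parallel_write_latency(normal + TIMEOUT_DELAY_S, normal)
secondary_hit = parallel_write_latency(normal, normal + TIMEOUT_DELAY_S)

# A timeout on either leg delays the whole write by the same amount.
print(primary_hit == secondary_hit, primary_hit > clean)
```

Under this model, paths to the secondary control unit need the same protection from intermittent errors as paths to the primary.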

The use of the ICR policy described above is the best way to make sure intermittent path failures are quickly fenced to avoid disruptions.

1 comment



Wed December 18, 2019 07:51 PM

Please correct Figure 4; it is identical to Figure 3.