Mainframe Storage

Enhancing performance, reliability, and security ensuring the availability of critical business workloads

Storage Controller Health (SCH) Status Messages

By Beth Peterson posted Mon March 23, 2020 08:15 PM

What are Storage Controller Health Status Messages?

The DS8K Storage Controller Health Message function sends alerts at the Logical Control Unit (LCU) level when resources are unavailable or under service. This process runs on both primary and secondary storage systems in a mirrored environment, so an alert could relate to either.

To understand more about the message codes themselves, the best place to start is the following IBM Knowledge Center entries:

Critical Codes:
IEA077A CRITICAL CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text

Serious Codes:
IEA076E SERIOUS CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text

Moderate Codes:
IEA074I MODERATE CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text
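As a rough illustration of how an automation script might pick these messages apart, here is a minimal Python sketch. It assumes the message arrives as a single line that matches the layouts above exactly, with no console prefix or timestamp, and that the RANK|DA|INTF keyword appears as exactly one of RANK, DA, or INTF. The function name `parse_sch_message` and every value in the sample message are hypothetical.

```python
import re

# Pattern built from the three message layouts listed above.
# Assumption: exactly one of RANK, DA, or INTF appears before the final text.
SCH_PATTERN = re.compile(
    r"^(?P<msgid>IEA07[467][AEI])\s+"
    r"(?P<severity>CRITICAL|SERIOUS|MODERATE) CONTROLLER HEALTH,"
    r"MC=(?P<mc>[0-9A-Fa-f]{2}),"
    r"TOKEN=(?P<token>[^,]+),"
    r"SSID=(?P<ssid>[^,]+),"
    r"DEVICE NED=(?P<ned>[^,]+),"
    r"(?P<resource_type>RANK|DA|INTF)=(?P<resource>[^,]*),"
    r"(?P<text>.*)$"
)


def parse_sch_message(line):
    """Return the message fields as a dict, or None if the line does not match."""
    match = SCH_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["mc"] = fields["mc"].upper()  # normalize the hex message code
    return fields


# Hypothetical sample; the token, SSID, NED, and rank values are made up.
sample = (
    "IEA074I MODERATE CONTROLLER HEALTH,MC=02,TOKEN=1A2B,SSID=4000,"
    "DEVICE NED=2107.996.IBM.75.0000000ABC12.0400,RANK=0003,"
    "RAID ARRAY REBUILD IN PROGRESS"
)
fields = parse_sch_message(sample)
print(fields["msgid"], fields["mc"], fields["token"])
```

Real console output may wrap or prefix these messages, so the pattern would likely need adjusting for a given environment.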

Attentions

When notifying Hosts of these events, the Storage Controller presents Attention status for every Logical Subsystem, on every path-group associated with that Logical Subsystem. A Host accessing multiple Logical Subsystems gets one notification per Logical Subsystem. For an array rebuild affecting 16 Logical Subsystems in a 16-way Sysplex, 256 attentions (16 x 16) would be raised for MC x'02' (RAID ARRAY REBUILD IN PROGRESS) and then again for MC x'03' (RAID ARRAY REBUILD COMPLETE). To enable duplicate messages to be ignored, each message has a Token. The Token has a unique value for each event, and the message issued for each Logical Subsystem (LSS) affected by that event contains the same Token. The Token value can then be used to identify whether a message has already been seen for another LSS.
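The arithmetic and the Token-based filtering can be sketched in a few lines of Python. This is only an illustration: it assumes one path-group per system in the Sysplex, and it keys on the pair (MC, Token) because the post does not say whether the start and completion messages of one event share a Token. The names `count_attentions` and `TokenDeduplicator` are hypothetical.

```python
def count_attentions(lss_count, path_groups_per_lss):
    """Attentions raised for one message code: one per LSS per path-group."""
    return lss_count * path_groups_per_lss


class TokenDeduplicator:
    """Remembers (MC, Token) pairs so repeats for other LSSs can be ignored."""

    def __init__(self):
        self._seen = set()

    def is_new(self, mc, token):
        key = (mc.upper(), token)
        if key in self._seen:
            return False  # already reported for another LSS or path-group
        self._seen.add(key)
        return True


# 16 Logical Subsystems in a 16-way Sysplex, all reporting the same event.
total = count_attentions(16, 16)
dedup = TokenDeduplicator()
messages = [("02", "1A2B")] * total  # hypothetical Token value
new_events = sum(dedup.is_new(mc, token) for mc, token in messages)
print(total, new_events)  # 256 1
```

A long-running monitor would also need to expire old entries from the set, which this sketch leaves out.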

These messages are monitored by applications such as GDPS and can also be automated with System Automation/NetView. Host actions can be triggered by these messages based on user-specified policies. One such action might be to trigger a HyperSwap in a replicated system, from a primary that is experiencing impacting conditions to a secondary that is not.

Categories of Events

There are several categories of events. The categories are ordered in decreasing severity.

1. These critical/acute messages indicate an unplanned condition and can indicate data loss or a loss of access. These errors cause the IEA077A messages on z/OS and may trigger an unplanned HyperSwap. These conditions also trigger a Call Home to alert IBM Support of the issue. It is important to monitor both the IEA077A alerts and the Call Home alerts to ensure that the alert is observed and addressed. Call Home allows IBM Service to address an issue promptly; if Call Home is not functioning correctly, the IBM response can be delayed.

MC	Severity	Description
C0	Acute	Pinned Non-retryable Error in device
C1	Acute	Data loss occurred (FC-08 state)
C2	Acute	Data availability lost (FC-06 state)
C3	Acute	RAID Rank not available (FC-01 state)
C4	Acute	Device Adapter Pair Reset started, access lost. The message is normally IEA076E MC x'42' unless a product switch is set to produce this acute alert and allow for an elevated response to the event.

2. The following IEA076E messages indicate that an unplanned condition has occurred. They require human intervention to determine whether action should be taken. For instance, if the secondary volume of a primary volume is itself reporting alerts, errors, or warnings, it may not be advisable to HyperSwap. In addition, these conditions trigger the Call Home mechanism.

MC	Severity	Description
41	Serious	Data Loss Error occurred during background media scrub
42	Serious	Device Adapter Pair Reset started, access lost

3. The following IEA076E messages also indicate that an unplanned condition has occurred. They can be monitored by the operator or an application. These conditions can impact performance, so the user might want to determine whether a planned HyperSwap should be performed or, if the events are on a Metro Mirror secondary, whether to suspend mirroring. These conditions do not generate a Call Home.

MC	Severity	Description
40	Serious	PPRC device I/O operations from primary to secondary are timing out. These operations are retried on different paths for up to 30 seconds.
80	Serious	Storage Controller experiencing repetitive warmstarts, for example, any 10 warmstarts within a 1-hour window.

4. This set of IEA074I message codes indicates an unplanned condition that requires human intervention for action to be taken. These conditions also trigger the Call Home mechanism. The user might also want to determine whether a planned HyperSwap should be performed. Some of these conditions create a single point of failure: the storage system has reduced redundancy, and a second issue could be problematic.

MC	Severity	Description
04	Moderate	Single cluster mode due to error. Call Home will occur only if the server is fenced.
07	Moderate	Device Adapter Fenced or Quiesced. This condition may degrade performance of the storage system. Device Adapter redundancy has been lost.
22	Moderate	Secondary Storage Controller failover

5. The next set of IEA074I message codes indicates that an unplanned condition has occurred. An operator or application may monitor these messages, but it is not required. There is no Call Home except potentially for MC x'01' and x'02'.

MC	Severity	Description
01	Moderate	Device in Preemptive Reconstruct (PER) mode. This mode may last up to 2 minutes with the frequency of offload governed by a threshold. Note: Call Home may occur for PER Mode if enabled by a product switch.
02	Moderate	Device RAID Array is rebuilding. The rebuild may last a number of hours depending on the size of drive. Call Home will be performed if required for drive replacement.
0D	Moderate	Host Adapter Recovery has started. The channel connections will be reset.
10	Moderate	PPRC path degraded due to high failure rate
20	Moderate	Secondary Storage Controller experienced recovery action. This legacy message is no longer used for warmstart, failover, or failback.
21	Moderate	Secondary Storage Controller warmstart

6. The next set of IEA074I message codes indicates that an unplanned condition has been resolved. No Call Home is performed. Monitoring applications can use these messages to clear an alert generated by the corresponding error event.

MC	Severity	Description
03	Moderate	Device RAID Array finished rebuilding. x'02' marked the start of the event.
06	Moderate	Back to Dual cluster mode. x'04' or x'05' marked the start of the event.
0E	Moderate	Host Adapter Recovery has ended. x'0D' marked the start of the event.
0F	Moderate	Device Adapter Pair Reset has completed. x'42' or x'C4' marked the start of the event.
11	Moderate	PPRC path no longer degraded due to high failure rate. x'10' marked the start of the event.
23	Moderate	Secondary Storage Controller failback. x'22' marked the start of the event.
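One way a monitoring application could use these resolution codes to clear alerts is sketched below in Python. The pairings come straight from the table above, with the message codes written using the digit zero (0D, 0F). The set-based tracking, the name `apply_event`, and the choice to track a single storage system are assumptions made for illustration.

```python
# Resolution code -> start code(s) it clears, taken from the table above.
RESOLVED_BY = {
    "03": {"02"},
    "06": {"04", "05"},
    "0E": {"0D"},
    "0F": {"42", "C4"},
    "11": {"10"},
    "23": {"22"},
}

# Every code that opens an alert which a later message can clear.
START_CODES = set().union(*RESOLVED_BY.values())


def apply_event(open_alerts, mc):
    """Update the set of open alerts (in place) for one incoming message code."""
    mc = mc.upper()
    if mc in RESOLVED_BY:
        open_alerts.difference_update(RESOLVED_BY[mc])  # clear matching starts
    elif mc in START_CODES:
        open_alerts.add(mc)
    # Codes without a listed start/resolve pairing are ignored here.
    return open_alerts


alerts = set()
for code in ["02", "0d", "03"]:  # rebuild starts, HA recovery starts, rebuild ends
    apply_event(alerts, code)
print(sorted(alerts))  # ['0D']
```

A production monitor would likely track open alerts per storage system or per LSS rather than in one global set.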

7. The next set of IEA074I messages indicates a planned condition. These messages are provided so the user is aware of the time period during which an action on the control unit is occurring. These conditions do not invoke the Call Home mechanism.

MC	Severity	Description
05	Moderate	Single cluster mode due to Code load or Service mode.
09	Moderate	SFI Code Activation has started.
0A	Moderate	SFI Code Activation has completed.
0B	Moderate	HA Code Activation has started.
0C	Moderate	HA Code Activation has completed.
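The seven categories above can be collapsed into a small lookup table that an automation policy might consult. The following Python sketch encodes only what the tables state; the three-valued Call Home flag ("yes", "no", "conditional") and the names `CATEGORIES`, `MC_INFO`, and `lookup` are illustrative assumptions, not part of any IBM interface. Message codes are written with the digit zero (0D, 0F).

```python
# category -> (z/OS message ID, default Call Home behavior, message codes)
CATEGORIES = {
    1: ("IEA077A", "yes", ["C0", "C1", "C2", "C3", "C4"]),
    2: ("IEA076E", "yes", ["41", "42"]),
    3: ("IEA076E", "no", ["40", "80"]),
    4: ("IEA074I", "yes", ["04", "07", "22"]),
    5: ("IEA074I", "no", ["01", "02", "0D", "10", "20", "21"]),
    6: ("IEA074I", "no", ["03", "06", "0E", "0F", "11", "23"]),
    7: ("IEA074I", "no", ["05", "09", "0A", "0B", "0C"]),
}

# Codes whose Call Home depends on a product switch, drive replacement,
# or the server being fenced, per the notes in the tables.
CONDITIONAL_CALL_HOME = {"01", "02", "04"}

MC_INFO = {
    mc: {
        "category": category,
        "message_id": message_id,
        "call_home": "conditional" if mc in CONDITIONAL_CALL_HOME else call_home,
    }
    for category, (message_id, call_home, codes) in CATEGORIES.items()
    for mc in codes
}


def lookup(mc):
    """Return category details for a message code, or None if it is not listed."""
    return MC_INFO.get(mc.upper())


print(lookup("c1"))
print(lookup("02"))
```

Because C4 is normally reported as IEA076E MC x'42' unless a product switch is set, a policy built on this table should treat both codes as the same underlying event.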

Thanks to Alan McClure, IBM GDPS Development and Level 3 support, Todd Sorenson, DS8K Platform and Error Recovery Team Lead, and Stephen Spor, zSeries Channel Verification Systems Test Engineering, for their expertise.

#EnterpriseStorage
#Storage
#StorageManagementandReporting
#monitoring
#DS8000
#Customerexperienceandengagement
#PrimaryStorage
#DS8900
#Real-timeanalytics
#DS8880

Permalink

https://community.ibm.com/community/user/blogs/beth-peterson1/2020/03/23/storage-controller-health-sch-status-messages
