zPET - IBM Z and z/OS Platform Evaluation and Test

ARM Reactions When z/OS Applications Terminate – with System Automation for z/OS settings

  
Automatic Restart Manager (ARM) is a z/OS recovery function intended to help improve availability by providing fast, efficient restarts of critical applications after a failure. In our previous blog, “ARM Reactions When z/OS Applications Terminate”, we shared some of our experiences with ARM, using one of our address spaces called ZCXINST6 (an IBM z/OS Container Extensions address space). In that blog, ZCXINST6 was not defined to IBM System Automation for z/OS (SA z/OS), and only an ARM policy was in effect to control its recovery. Many z/OS environments use SA z/OS to control automated recovery of their applications, and SA z/OS can be configured to be aware of ARM and to coordinate recovery with ARM. In this blog we share some of our experiences when we had SA z/OS coordinate with ARM.

ARM configuration

1. We defined element ZCXINST6 in our ARM policy. The definition below means that ZCXINST6 should be restarted if the element terminates abnormally. For details on the parameters, refer to the topic “Automatic restart management parameters for administrative data utility” in the book “z/OS MVS Setting Up a Sysplex”.

2. We defined the element restart exit, IXC_ELEM_RESTART, for ARM with module AOFPERRE. This exit is used to coordinate the restart of an element with SA z/OS. ARM invokes the exit once for each element that is to be restarted, on the system where the element will be restarted.


SA z/OS configuration

1. We defined ZCXINST6 as an APL type in our SA z/OS policy and specified its ARM Element Name as SYSGLZ_ZCXINST6 to correspond with our ARM policy definition. To do this, we selected the '6 APL' entry on the SA z/OS policy 'Entry Type Selection' panel, located ZCXINST6 in the application list, and then opened the 'Application info' panel for ZCXINST6 to specify the ARM Element Name: SYSGLZ_ZCXINST6.

2. From SA INGLIST, we set the automation flag for ZCXINST6 to YES:

3. We added the SA minor resource OARM, as the following shows. The RESTART flag is set to N, which means that SA takes over the restart from ARM and ARM does not perform the restart.
 

Testing
The table below shows the different ways an application can be stopped or terminated and, for each, whether ARM or SA tries to restart the application. ARM takes no action when an application is intentionally cancelled, but SA can handle this situation:
The following are the tests we performed using the IBM z/OS Container Extensions (zCX) address space that we called ZCXINST6. We stopped zCX using different commands and monitored whether the termination triggered ARM or SA to recover it. In Test 1 through Test 5, the “restart flag” of resource “OARM” is set to N, which means that SA takes over the restart from ARM and ARM does not perform the restart when a failure occurs. In Test 6 and Test 7, the “restart flag” of resource “OARM” is turned on, which means that ARM controls the restart of ZCXINST6 when a failure occurs.
After starting ZCXINST6, we used the command “D XCF,ARMS,DETAIL” to confirm that ZCXINST6 is under ARM control. The “TOTAL RESTARTS: 0” field indicates that ARM has not restarted ZCXINST6 since ZCXINST6 was started:

Test 1:
We used the command “P ZCXINST6” to stop ZCXINST6. We observed that ZCXINST6 stopped and SA restarted it. ARM did NOT take any action, because the stop was an intentional user action.

Test 2:
We used the command “C ZCXINST6” to try to stop ZCXINST6. We observed that ZCXINST6 cannot be stopped with CANCEL. “FORCE” is recommended instead.

Test 3:
We used the command “FORCE ZCXINST6,ARM” to stop ZCXINST6. We observed that ZCXINST6 stopped and SA restarted it. ARM does NOT attempt restarts for the command “FORCE XXX,ARM”. This worked as designed and expected:

Test 4:
We used the command “FORCE ZCXINST6,ARM,ARMRESTART” to stop ZCXINST6. We observed that the element restart exit prevented the ARM restart, so ARM did not take any action. SA restarted ZCXINST6.

Test 5:
We used “$CS(jobid),force” to simulate an abnormal termination. We observed that the element restart exit prevented the ARM restart, so ARM did not take any action. SA restarted ZCXINST6.

In the previous tests, SA was configured to control the element’s restart. For the following tests, we modified the SA configuration to turn on the “restart flag” for resource “OARM”, so that ARM now controls ZCXINST6 restarts when a failure occurs.
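With ARM performing the restarts in Test 6 and Test 7, the “TOTAL RESTARTS” field of “D XCF,ARMS,DETAIL” is the value to compare before and after each test. The following is a minimal Python sketch of that comparison, assuming the command output has been captured as text by some means outside the scope of this blog and that the capture contains a single element. The function names and the one-line sample captures are hypothetical; only the “TOTAL RESTARTS” label comes from the display described above.

```python
import re

# "TOTAL RESTARTS" followed by a colon and a number. Spacing around the
# colon is allowed to vary because the exact column layout is not assumed.
TOTAL_RESTARTS = re.compile(r"TOTAL RESTARTS\s*:\s*(\d+)")


def total_restarts(display_output: str) -> int:
    """Return the first TOTAL RESTARTS value found in the captured text."""
    match = TOTAL_RESTARTS.search(display_output)
    if match is None:
        raise ValueError("no TOTAL RESTARTS field found")
    return int(match.group(1))


def arm_restart_count(before: str, after: str) -> int:
    """Number of ARM restarts that happened between two captures."""
    return total_restarts(after) - total_restarts(before)


# Hypothetical one-line captures; real command output has many more fields.
before_test = "TOTAL RESTARTS: 0"
after_test = "TOTAL RESTARTS: 1"
print(arm_restart_count(before_test, after_test))  # prints 1
```

A difference of 0 after a termination would suggest that ARM did not perform the restart, which is the situation in the tests where SA restarted the element.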

Test 6:
We used the command “FORCE ZCXINST6,ARM,ARMRESTART” to stop ZCXINST6. ARM then restarted it as expected:
From the output of the command “D XCF,ARMS,DETAIL”, we can see that the count of “TOTAL RESTARTS” increased by 1, because ARM had performed one restart:

Test 7:
We used “$CS(jobid),force” to simulate an abnormal termination. ARM restarted ZCXINST6 when it detected that the job had stopped:
From the following output of command “D XCF,ARMS,DETAIL”, we can see the count of “TOTAL RESTARTS” increased by 1 again:

In this blog we shared our experience that when ARM and SA z/OS are both defined for an element’s restart and recovery, SA z/OS can be configured to be aware of ARM and to coordinate with ARM. Both can take the necessary recovery actions when needed. One difference between ARM and SA z/OS is that ARM does not restart an application that is intentionally cancelled, such as by a cancel command. SA z/OS, however, can restart the application both when it is intentionally cancelled and when a failure or abend occurs. By using the two together, your installation can decide how, or whether, an element is restarted, according to the different characteristics of ARM and SA z/OS.
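The outcomes of Test 1 through Test 7 can also be collected into a small lookup. The following Python sketch records only what we observed in those tests; it does not model ARM or SA z/OS behavior in general. The names OBSERVED and who_restarted are hypothetical, and the OARM “restart flag” is represented as a boolean (False for N, True for turned on) purely for illustration.

```python
# Key: (command as issued, OARM restart flag turned on?).
# Value: what was observed in the corresponding test.
OBSERVED = {
    ("P ZCXINST6", False): "SA",                      # Test 1
    ("C ZCXINST6", False): "not stopped",             # Test 2: CANCEL did not stop zCX
    ("FORCE ZCXINST6,ARM", False): "SA",              # Test 3
    ("FORCE ZCXINST6,ARM,ARMRESTART", False): "SA",   # Test 4: exit prevented ARM restart
    ("$CS(jobid),force", False): "SA",                # Test 5: exit prevented ARM restart
    ("FORCE ZCXINST6,ARM,ARMRESTART", True): "ARM",   # Test 6
    ("$CS(jobid),force", True): "ARM",                # Test 7
}


def who_restarted(command: str, arm_flag_on: bool) -> str:
    """Return the observed outcome, or 'not tested' for other combinations."""
    return OBSERVED.get((command, arm_flag_on), "not tested")


# Print the observations in test order.
for (command, flag_on), outcome in OBSERVED.items():
    flag = "on" if flag_on else "N"
    print(f"{command:32} restart flag {flag:2} -> {outcome}")
```

Combinations we did not run, such as “P ZCXINST6” with the restart flag turned on, are deliberately reported as “not tested” rather than inferred.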



Authors:

Zhao Yu Wang (wangzyu@cn.ibm.com)
Jing Wen Chen (bjchenjw@cn.ibm.com)
Kieron Hinds (kdhinds@us.ibm.com)
Lora Milczewski (loran@us.ibm.com)