A timely reaction to events and issues can be crucial for keeping your business running, and stable, prudent business operations matter now more than ever. IBM OMEGAMON products can help you monitor your environment and take action as soon as an issue arises. One way to automate a response to an issue reported by IBM OMEGAMON is to use z/OS System REXX. Let me show you how you could set it up.
Note: This example is not meant to be a replacement for any System Automation tools, which remain essential for managing your z/OS systems.
System REXX overview
System REXX is available to all z/OS users. It is a component of z/OS that allows a REXX exec to invoke system commands and return results to the caller in a variety of ways. System REXX execs can be initiated through an assembler macro interface called AXREXX or through an operator command. A System REXX exec must reside in a library that is part of the REXXLIB concatenation used by AXR. You can display that concatenation by issuing the F AXR,SYSREXX REXXLIB command.
Let’s quickly go through a few of the main System REXX functions (some of them are used in the example below):
AXRWTO - issue a write to operator (WTO), that is, a message to the system log. Note: If the message is long, you should use AXRMLWTO (multi-line WTO).
AXRWTOR - issue a write to operator with reply (WTOR), for cases where you need a reply from the Operator before continuing with your automated actions.
AXRCMD - issue an MVS command.
AXRWAIT(n) - pause the exec for n seconds.
Scenario
A started task (STC) starts using more CPU than it should. We want to alert the Operator and take a DUMP of the started task. If CPU usage is still above the threshold on the next iteration, we stop the started task and report it to the Operator.
How can we achieve this?
First of all, we need to create an IBM OMEGAMON situation that monitors the CPU usage and, when needed, calls the System REXX script via Take Action.
The ACTION section should contain the MODIFY (F) AXR command that calls your REXX script (in this example it's called CPUREXX). Together with the REXX name we pass two additional parameters: the STC name and the CPU percent used. The resulting command has the form F AXR,CPUREXX OMDMSTC 40.0.
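Because the two values arrive as free text on an operator command, it can be worth checking them before acting on them. The following is a minimal sketch, not part of the original CPUREXX: CheckParms is a hypothetical helper, the return code values are arbitrary, and the sample string stands in for what ARG would deliver at run time. It uses only standard REXX built-ins, so it can be tried outside System REXX.

```rexx
/* REXX - sketch: validate the two values passed on F AXR,CPUREXX    */
/* The sample string stands in for what ARG would deliver at run time */
parms = 'OMDMSTC 40.0'
PARSE UPPER VALUE parms WITH stc cpu .
SAY CheckParms(stc, cpu)                 /* prints 0 for the sample */
EXIT 0

/* CheckParms (hypothetical helper): 0 = usable, 8 = reject */
CheckParms: PROCEDURE
  PARSE ARG stc, cpu
  /* A job name is 1 to 8 characters long */
  IF stc = '' | LENGTH(stc) > 8 THEN RETURN 8
  /* The CPU value must be a valid REXX number */
  IF DATATYPE(cpu, 'N') = 0 THEN RETURN 8
  RETURN 0
```

A non-zero result could be reported with a WTO and the exec ended, so that a malformed Take Action command never reaches the DUMP or STOP logic.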
With the IBM OMEGAMON situation created, we are ready to move on to the System REXX part.
I have created a script called CPUREXX. We will now review the main parts of it:
- Alert the Operator that the STC CPU usage is above 30%:
ARG parms
PARSE VALUE parms WITH stc cpu .
/* Alert Operator */
msg = 'STC - 'stc' is using 'cpu' CPU. It might be looping.'
CALL ISSUEWTO(msg ',2E')
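The ISSUEWTO function looks like this:
/*------------------------------------------------------------------------*/
/* ISSUEWTO: Issue WTO to the syslog */
/* Input: 'message text ,nb' where nb is the message id suffix (e.g. 2E) */
/*------------------------------------------------------------------------*/
ISSUEWTO:
  ARG text
  /* Split at the last comma so the message text may contain commas */
  p = LASTPOS(',', text)
  IF p = 0 THEN RETURN 4
  nb = STRIP(SUBSTR(text, p + 1))
  text = STRIP(LEFT(text, p - 1))
  /* Nothing to issue */
  IF text = '' THEN RETURN 4
  /* Issue an alert to the syslog */
  rcode = AXRWTO('TEST00'nb text)
  IF rcode > 0 THEN EXIT rcode
  RETURN 0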
How it looks in the SYSLOG
11:37:36.97 S0347946 00000290 F AXR,CPUREXX OMDMSTC 40.0
11:37:36.98 S0280889 00000090 TEST002E STC - OMDMSTC IS USING 40.0 CPU. IT MIGHT BE LOOPING.
- Take a DUMP of the STC if one has not been taken yet:
/* Have we taken the STC DUMP already? */
/* YES - issue STOP command; */
/* NO - take the DUMP and exit. */
cmd = 'DISPLAY DUMP,T,DSN=ALL'
AxrCmdrc = AXRCMD(cmd,disp.,3)
IF AxrCmdrc = 0 THEN
  DO
    /* D DUMP,T,DSN=ALL */
    /* IEE853I 10.25.39 SYS1.DUMP TITLES 582 */
    /* SYS1.DUMP DATA SETS AVAILABLE=000 AND FULL=000 */
    /* CAPTURED DUMPS=0000, SPACE USED=00000000M, SPACE FREE=00040960M */
    /* SVCDUMP.D230921.T142523.RSB5.#MASTER#.S00009 TITLE=CPUREXX */
    /* DUMP TAKEN TIME=10.25.24 DATE=09/21/2023 */
    /* */
    dumpds = ''
    DO i = 1 TO disp.0
      IF WORDPOS('TITLE=CPUREXX',disp.i) > 0 THEN
        DO
          /* Keep the data set name that precedes the title */
          PARSE VALUE disp.i WITH dumpi 'TITLE=CPUREXX' .
          dumpds = dumpds dumpi
        END
    END
  END
ELSE
  DO
    /* No commas in the text: ISSUEWTO uses a comma as its delimiter */
    msg = 'DISPLAY DUMP command failed. Exiting.'
    CALL ISSUEWTO(msg ',5E')
    EXIT 0
  END
/* If DUMP is not yet taken, then take it. Otherwise stop the STC */
IF dumpds = '' THEN CALL TAKE_DUMP
ELSE CALL STOP_STC
/* Do not fall through into the subroutines below */
EXIT 0
/*******************************************************************************/
/* TAKE_DUMP: */
/*******************************************************************************/
TAKE_DUMP:
  msg = 'DUMP was not taken by CPUREXX. Taking it now.'
  CALL ISSUEWTO(msg ',6I')
  cmd = "DUMP TITLE='CPUREXX'"
  AxrCmdrc = AXRCMD(cmd,dump.,3)
  IF AxrCmdrc = 0 THEN
    DO
      /* We got WTOR number, so PARSE it */
      /* -DUMP TITLE='CPUREXX' */
      /* *0782 IEE094D SPECIFY OPERAND(S) FOR DUMP COMMAND */
      PARSE VALUE dump.1 WITH '*'wtor .
      /* Reply to the WTOR */
      IF wtor /= '' THEN
        DO
          /* asid is set earlier in the exec from the D A,stc response (not shown) */
          cmd = 'R 'wtor',ASID='asid',END'
          AxrCmdrc = AXRCMD(cmd,dump.,3)
        END
      ELSE
        DO
          msg = 'DUMP - not taken as WTOR number is unknown. Exiting.'
          CALL ISSUEWTO(msg ',5W')
          EXIT 0
        END
    END
  ELSE
    DO
      /* The DUMP command itself failed, so do not report success */
      msg = 'DUMP command failed. RC='AxrCmdrc'. Exiting.'
      CALL ISSUEWTO(msg ',5E')
      EXIT 0
    END
  /* All good - DUMP is taken. Let Operator know and take next action. */
  msg = stc' DUMP taken. It will be recycled in 5min if CPU is not reduced.'
  CALL ISSUEWTO(msg ',6I')
  RETURN 0
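The title scan at the start of this step decides whether the exec takes a DUMP or stops the STC, so it is a good candidate for trying out away from the system. The sketch below is an illustration only: FindDumps is a hypothetical helper, the disp. stem is filled by hand with lines copied from the SYSLOG in this article rather than by AXRCMD, and it assumes the dump title contains no blanks.

```rexx
/* REXX - sketch: find dumps with a given title in captured D DUMP output */
/* disp. is filled by hand here; in the exec AXRCMD would fill it         */
disp.0 = 3
disp.1 = 'SYS1.DUMP DATA SETS AVAILABLE=000 AND FULL=000'
disp.2 = 'SVCDUMP.D230925.T115043.RSB5.S3TMS55O.S00040 TITLE=CT/ENGINE'
disp.3 = 'SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046 TITLE=CPUREXX'
SAY FindDumps('CPUREXX')
EXIT 0

/* FindDumps (hypothetical helper): blank-separated names, or '' if none */
FindDumps: PROCEDURE EXPOSE disp.
  PARSE ARG title
  found = ''
  DO i = 1 TO disp.0
    /* TITLE=xxx must appear as a whole word on the line */
    IF WORDPOS('TITLE='title, disp.i) > 0 THEN
      found = STRIP(found WORD(disp.i, 1))   /* first word = data set name */
  END
  RETURN found
```

With the sample lines, only the third entry matches, so the CT/ENGINE dump taken earlier in the day does not stop the exec from taking its own DUMP.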
How it looks in the SYSLOG
11:37:36.98 *AXT03B5 00000290 D A,OMDMSTC
11:37:36.99 *AXT03B5 00000090 CNZ4106I 11.37.36 DISPLAY ACTIVITY 190
190 00000090 JOBS M/S TS USERS SYSAS INITS ACTIVE/MAX VTAM OAS
190 00000090 00005 00080 00010 00052 00329 00010/00300 00070
190 00000090 OMDMSTC OMDMSTC IEFBR14 IN S A=0039 PER=NO SMC=000
190 00000090 PGN=N/A DMN=N/A AFF=NONE
190 00000090 CT=080.242S ET=198.099S
190 00000090 WUID=S0435420 USERID=OMOMPSTC
190 00000090 WKL=MONITORS SCL=MONITORS P=1
190 00000090 RGP=OMEGTRG1 SRVR=NO QSC=NO
190 00000090 ADDR SPACE ASTE=3FAD2E40
11:37:36.99 S0280889 00000090 TEST003I STC OMDMSTC ASID IS 0039
11:37:36.99 *AXT03B5 00000290 DISPLAY DUMP,T,DSN=ALL
11:37:37.01 00000281 IEF196I IGD103I SMS ALLOCATED TO DDNAME SYS00208
11:37:37.07 00000281 IEF196I IGD104I SVCDUMP.D230925.T115043.RSB5.S3TMS55O.S00040 RETAINED,
11:37:37.07 00000281 IEF196I DDNAME=SYS00208
11:37:37.07 *AXT03B5 00000090 IEE853I 11.37.36 SYS1.DUMP TITLES 196
196 00000090 SYS1.DUMP DATA SETS AVAILABLE=000 AND FULL=000
196 00000090 CAPTURED DUMPS=0000, SPACE USED=00000000M, SPACE FREE=00040960M
196 00000090 SVCDUMP.D230925.T115043.RSB5.S3TMS55O.S00040 TITLE=CT/ENGINE
196 00000090 STORAGE QUIESCE
196 00000090 DUMP TAKEN TIME=07.50.44 DATE=09/25/2023
11:37:37.07 S0280889 00000090 TEST006I DUMP WAS NOT TAKEN BY CPUREXX. TAKING IT NOW.
11:37:37.07 *AXT03B5 00000290 DUMP TITLE='CPUREXX'
11:37:37.08 *AXT03B5 00000090 *1153 IEE094D SPECIFY OPERAND(S) FOR DUMP COMMAND
11:37:37.09 *AXT03B5 00000290 R 1153,ASID=0039,END
11:37:37.09 *AXT03B5 00000090 IEE600I REPLY TO 1153 IS;ASID=0039,END
11:37:37.10 00000090 IEA045I AN SVC DUMP HAS STARTED AT TIME=11.37.37 DATE=09/25/2023 202
202 00000090 FOR ASID (0039)
202 00000090 QUIESCE = NO
11:37:37.33 S0280889 00000090 TEST006I OMDMSTC DUMP TAKEN. IT WILL BE RECYCLED IN 5MIN IF CPU IS NOT
REDUCED.
11:37:39.13 S0435420 00000090 IEA794I SVC DUMP HAS CAPTURED: 204
204 00000090 DUMPID=046 REQUESTED BY JOB (*MASTER*)
204 00000090 DUMP TITLE=CPUREXX
11:37:39.18 00000281 IEF196I IGD17070I DATA SET SVCDUMP.D230925.T153737.RSB5.#MASTER#.S0004
6
11:37:39.18 00000281 IEF196I ALLOCATED SUCCESSFULLY WITH 3 STRIPE(S).
11:37:39.18 00000281 IEF196I IGD17160I DATA SET SVCDUMP.D230925.T153737.RSB5.#MASTER#.S0004
6
11:37:39.18 00000281 IEF196I IS ELIGIBLE FOR COMPRESSION
11:37:39.18 00000281 IEF196I IGD101I SMS ALLOCATED TO DDNAME (SYS00047)
11:37:39.18 00000281 IEF196I DSN (SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046)
11:37:39.18 00000281 IEF196I STORCLAS (DUMP) MGMTCLAS (DUMP) DATACLAS (EFCOMP5)
11:37:39.18 00000281 IEF196I VOL SER NOS= DMLK03,DMLK00,DMLK01
11:37:41.08 00000281 IEF196I IGD104I SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046 RETAINED,
11:37:41.09 00000281 IEF196I DDNAME=SYS00047
11:37:41.10 00000090 IEA611I COMPLETE DUMP ON SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046
215
215 00000090 DUMPID=046 REQUESTED BY JOB (*MASTER*)
215 00000090 FOR ASID (0039)
215 00000090 INCIDENT TOKEN: RSPLEX0K RSB5 09/25/2023 15:38:04
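The SYSLOG above shows a D A,OMDMSTC command and a TEST003I message reporting the ASID, but the step that fills the asid variable is not among the parts of CPUREXX reviewed here. The following is a minimal sketch of how that value and the WTOR reply could be derived, not the original code: GetAsid is a hypothetical helper, and the act. and dump. stems are filled by hand with lines from the SYSLOG instead of by AXRCMD.

```rexx
/* REXX - sketch: build the reply to IEE094D from captured output   */
/* act. and dump. are filled by hand; AXRCMD would fill them in the */
/* real exec                                                        */
act.0 = 2
act.1 = 'CNZ4106I 11.37.36 DISPLAY ACTIVITY 190'
act.2 = 'OMDMSTC OMDMSTC IEFBR14 IN S A=0039 PER=NO SMC=000'
dump.1 = '*1153 IEE094D SPECIFY OPERAND(S) FOR DUMP COMMAND'

/* The reply id is the word that follows the asterisk */
PARSE VAR dump.1 '*' wtor .
SAY 'R 'wtor',ASID='GetAsid()',END'      /* prints R 1153,ASID=0039,END */
EXIT 0

/* GetAsid (hypothetical helper): value after ' A=', or '' if absent */
GetAsid: PROCEDURE EXPOSE act.
  DO i = 1 TO act.0
    PARSE VAR act.i ' A=' asid .
    IF asid \== '' THEN RETURN asid
  END
  RETURN ''
```

The printed command matches the reply seen in the SYSLOG. An empty result from GetAsid would mean the STC is no longer active, in which case skipping the DUMP is probably the safer choice.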
- STOP the STC if CPU usage is still above the threshold and alert the Operator:
/*******************************************************************************/
/* STOP_STC: */
/*******************************************************************************/
STOP_STC:
  msg = 'DUMP('STRIP(dumpds)') - already taken. Stopping 'stc'.'
  CALL ISSUEWTO(msg',7I')
  /* Issue stop command for the STC */
  cmd = 'STOP 'stc
  AxrCmdrc = AXRCMD(cmd,pcmd.,5)
  IF AxrCmdrc = 0 & WORDPOS('REJECTED-TASK BUSY',pcmd.1) = 0 THEN
    DO
      msg = 'STOP command for 'stc' issued. Exiting.'
      CALL ISSUEWTO(msg ',7I')
    END
  ELSE
    DO
      msg = 'STOP command for 'stc' failed. RC='AxrCmdrc'. Cancelling 'stc'.'
      CALL ISSUEWTO(msg ',8E')
      /* CANCEL STC if STOP command hasn't worked */
      AxrCmdrc = AXRCMD('CANCEL 'stc,ccmd.,5)
    END
  RETURN 0
How it looks in the SYSLOG
11:42:36.94 S0347946 00000290 F AXR,CPUREXX OMDMSTC 49.4
11:42:36.95 S0280889 00000090 TEST002E STC - OMDMSTC IS USING 49.4 CPU. IT MIGHT BE LOOPING.
11:42:36.95 *AXT03B5 00000290 D A,OMDMSTC
11:42:36.96 *AXT03B5 00000090 CNZ4106I 11.42.36 DISPLAY ACTIVITY 486
486 00000090 JOBS M/S TS USERS SYSAS INITS ACTIVE/MAX VTAM OAS
486 00000090 00005 00080 00011 00052 00329 00011/00300 00070
486 00000090 OMDMSTC OMDMSTC IEFBR14 IN S A=0039 PER=NO SMC=000
486 00000090 PGN=N/A DMN=N/A AFF=NONE
486 00000090 CT=203.344S ET=498.069S
486 00000090 WUID=S0435420 USERID=OMOMPSTC
486 00000090 WKL=MONITORS SCL=MONITORS P=1
486 00000090 RGP=OMEGTRG1 SRVR=NO QSC=NO
486 00000090 ADDR SPACE ASTE=3FAD2E40
11:42:36.96 S0280889 00000090 TEST003I STC OMDMSTC ASID IS 0039
11:42:36.96 *AXT03B5 00000290 DISPLAY DUMP,T,DSN=ALL
11:42:36.97 00000281 IEF196I IGD103I SMS ALLOCATED TO DDNAME SYS00209
11:42:36.98 00000281 IEF196I IGD104I SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046 RETAINED,
11:42:36.98 00000281 IEF196I DDNAME=SYS00209
11:42:36.98 *AXT03B5 00000090 IEE853I 11.42.36 SYS1.DUMP TITLES 492
492 00000090 SYS1.DUMP DATA SETS AVAILABLE=000 AND FULL=000
492 00000090 CAPTURED DUMPS=0000, SPACE USED=00000000M, SPACE FREE=00040960M
492 00000090 SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046 TITLE=CPUREXX
492 00000090 DUMP TAKEN TIME=11.37.37 DATE=09/25/2023
11:42:36.98 S0280889 00000090 TEST007I DUMP(SVCDUMP.D230925.T153737.RSB5.#MASTER#.S00046) - ALREADY
TAKEN. STOPPING OMDMSTC.
11:42:36.98 *AXT03B5 00000290 STOP OMDMSTC
11:42:41.99 S0280889 00000090 TEST008E STOP COMMAND FOR OMDMSTC FAILED. RC=4. CANCELLING OMDMSTC.
11:42:41.99 *AXT03B5 00000290 CANCEL OMDMSTC
11:42:41.99 S0280889 00000090 IEE301I OMDMSTC CANCEL COMMAND ACCEPTED
Summary
We have just gone through how IBM OMEGAMON situations and z/OS System REXX can help automate recovery actions in response to an anomaly seen on a z/OS system. A quick reaction helps to improve the resiliency and efficiency of your environment. We hope you find it useful.