z/TPF

Anomaly Detection (PJ48181)

By Michael Shershin posted 7 hours ago

  

Have you ever encountered a situation where you receive reports that some transactions take a long time to complete on your z/TPF system?  These delays can impact service level agreements (SLAs).

First, you check the high-level system metrics such as utilization, resource usage, and message rates, and everything looks normal.  Then you run data collection, which confirms that your z/TPF system is running as expected.  After that, you run resource usage by owner name collection (also known as ZMOWN), runtime metrics collection (RTMC) name-value pair collection, or both.  These tools group your transactions by category so that you can identify which transactions, on average, experience long ECB lifetimes.  However, the differences in average ECB lifetime across the categories are not significant, so you cannot determine which transactions are taking longer than expected.  You could use IP trace to locate some transactions that have a long response time.  However, using IP trace is time-consuming, and it does not tell you why a transaction took longer than expected.  All you know is that some type of anomaly happened on your z/TPF system and caused some transactions to take a long time to complete.

APAR PJ48181 helps you detect anomalies that affect transaction response times.  When an anomaly occurs, diagnostic information is collected to help you identify the cause of the anomaly.  If RTMC name-value pair collection is active, this data is also sent to RTMC for analysis.

The intention of anomaly detection is to identify and report outlier conditions that cause a transaction to take much longer than expected. 

APAR PJ48181 supports four anomaly types:

MODQWAIT: This anomaly measures the time it takes for a DASD I/O to complete, from when the request is added to the DASD module queue to when the I/O interrupt signals completion.  MODQWAIT anomalies often occur when the DASD module queue is heavily loaded, which causes record retrieval to take a long time to complete and increases the ECB’s lifetime.

DISPATCHWAIT: This anomaly measures the time it takes for an ECB to be dispatched.  In the case of a DASD I/O, it is the time from when the DASD I/O interrupt is received to when the ECB starts running on the I-stream again.  DISPATCHWAIT anomalies occur when the ready list is overloaded with work.  If the work on the ready list takes a long time to complete, dispatch of the ECB is delayed, which increases the ECB’s overall lifetime.

ISTIME: This anomaly measures the amount of time that a single dispatch of an ECB runs on an I-stream before giving up control.  If the ECB runs on an I-stream for a long period of time, the ECB’s lifetime increases and might be longer than expected.

ECBDASDWAIT: This anomaly measures the total time that an ECB waits for DASD I/O over the life of the ECB.  It is possible that every DASD access for an ECB takes the expected amount of time to complete.  However, if the ECB does many DASD accesses, the ECB’s lifetime might still be longer than expected.

The following figures show a pictorial representation of each anomaly type.  The first four examples are based on a simple transaction that finds two records from DASD and performs one file operation with no release.  The baseline ECB lifetime for this transaction is 1,200 microseconds (or 1.2 milliseconds).
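To make the four measurements concrete, here is a minimal Python sketch that derives them from a timeline of events.  This is not z/TPF code, and nothing in it comes from the APAR: the `DasdAccess` record, the `measure` and `find_anomalies` functions, every timestamp, and every limit value are hypothetical.  The sketch assumes an ECB that waits on three DASD accesses within a 1,200-microsecond lifetime, assumes the ECB gives up control at the moment each request is queued, and assumes that ECBDASDWAIT is the sum of the queue-to-interrupt times.

```python
from dataclasses import dataclass


@dataclass
class DasdAccess:
    """One DASD access; all timestamps are in microseconds (hypothetical)."""
    queued_at: int      # request added to the DASD module queue
    interrupt_at: int   # I/O interrupt signals completion
    dispatched_at: int  # ECB starts running on the I-stream again


def measure(start, end, accesses):
    """Return the value that would be checked against each limit."""
    modqwait = [a.interrupt_at - a.queued_at for a in accesses]
    dispatchwait = [a.dispatched_at - a.interrupt_at for a in accesses]
    # Each run on the I-stream lasts from a dispatch to the next wait (or exit).
    run_starts = [start] + [a.dispatched_at for a in accesses]
    run_ends = [a.queued_at for a in accesses] + [end]
    istime = [e - s for s, e in zip(run_starts, run_ends)]
    return {
        "MODQWAIT": max(modqwait),        # slowest single DASD I/O
        "DISPATCHWAIT": max(dispatchwait),  # longest wait to be dispatched
        "ISTIME": max(istime),            # longest single run on an I-stream
        "ECBDASDWAIT": sum(modqwait),     # assumed: total queue-to-interrupt time
    }


def find_anomalies(measured, limits):
    """List the anomaly types whose measured value exceeds its limit."""
    return [name for name, value in measured.items() if value > limits[name]]


# Hypothetical limits in microseconds; real limits are set by the installation.
LIMITS = {"MODQWAIT": 5000, "DISPATCHWAIT": 1000,
          "ISTIME": 1000, "ECBDASDWAIT": 10000}

# Baseline: three accesses, ECB exits at 1,200 microseconds.
baseline = [DasdAccess(100, 400, 420),
            DasdAccess(520, 820, 840),
            DasdAccess(940, 1140, 1160)]
baseline_measured = measure(0, 1200, baseline)

# Same ECB, but the second I/O sits on the module queue for 6,300 microseconds.
slow = [DasdAccess(100, 400, 420),
        DasdAccess(520, 6820, 6840),
        DasdAccess(6940, 7140, 7160)]
slow_measured = measure(0, 7200, slow)

print(find_anomalies(baseline_measured, LIMITS))
print(find_anomalies(slow_measured, LIMITS))
```

In the slow case only the second access exceeds the hypothetical MODQWAIT limit, even though the ECB lifetime grows sixfold.  That is the kind of outlier that averages across a transaction category tend to hide.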

Anomaly detection support includes the following:

  • The ZMAND command is provided to manage anomaly detection.  You can use this command to set and display limits for the anomaly types.
  • The ECB attribute LOGANOMALY is provided to control whether an ECB is allowed to log an anomaly.  By default, it is set to YES, so anomalies are logged when encountered.  For utilities and monitors, which are not transactions, you can set LOGANOMALY to NO to prevent logging.
  • A new column was added to the program configuration file so that the LOGANOMALY ECB attribute can be set when the program is entered.
  • The ZRTMC command includes a new control that limits the number of anomalies sent to RTMC when name-value pair collection is active.  This helps prevent RTMC from becoming overloaded when multiple anomalies occur on z/TPF.
  • A control was added to dump processing.  If a dump takes longer than a user-specified period, anomalies will not be logged for 3 seconds after the dump completes.  This assumes that a long dump is likely to trigger anomalies, which do not need to be logged.  You can use the ZMAND command to set this control.
  • Usage data for each anomaly type is collected to help you identify a reasonable time limit for that anomaly type.  Usage data is a series of time-range buckets, each of which holds a count of the number of times that a value in that range was checked for an anomaly.  You can use the ZMAND command to produce a usage data report.
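
The usage data idea in the last item can be sketched in a few lines of Python.  This is an illustration only, not the ZMAND report format or the z/TPF implementation: the bucket boundaries, the sample values, and the `build_usage` and `suggest_limit` helpers are all hypothetical, and picking the limit from a target fraction of checked values is just one possible way to read such a report.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in microseconds.
# Bucket i counts values in (previous bound, BUCKET_UPPER_BOUNDS[i]];
# one extra bucket at the end counts everything above the last bound.
BUCKET_UPPER_BOUNDS = [100, 500, 1000, 5000, 10000]


def build_usage(values):
    """Count how many checked values fall into each time-range bucket."""
    counts = [0] * (len(BUCKET_UPPER_BOUNDS) + 1)
    for value in values:
        counts[bisect_left(BUCKET_UPPER_BOUNDS, value)] += 1
    return counts


def suggest_limit(counts, fraction):
    """Return the smallest bucket bound that covers `fraction` of all values.

    Returns None if there is no data or if only the open-ended last
    bucket would reach the requested fraction.
    """
    total = sum(counts)
    if total == 0:
        return None
    running = 0
    for bound, count in zip(BUCKET_UPPER_BOUNDS, counts):
        running += count
        if running >= fraction * total:
            return bound
    return None


# Invented sample: 1,000 checked values for one anomaly type.
sample = [80] * 900 + [300] * 90 + [900] * 8 + [7000] * 2
counts = build_usage(sample)

print(counts)
print(suggest_limit(counts, 0.995))
```

With this invented sample, almost all values land in the first three buckets.  A limit near the top of the third bucket would therefore flag only the two outliers rather than normal traffic.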

For more information about APAR PJ48181, see the APEDIT.
