Lessons from the field #5: Monitor and keep your WebSphere environments running smoothly

By Kevin Grigorenko posted Wed May 26, 2021 01:28 PM

  

For a video presentation of this topic, see the replay of the Customer Advisory Board session.

Gauges: Small Airplanes, Large Airplanes and Space Shuttles

Fifteen years ago, most administrators had very limited monitoring of their WebSphere Application Server (WAS) environments. Many only watched CPU and memory on the node, and Java heap usage of WAS. Some customers still run like this today.

These gauges are insufficient. It’s akin to flying an airplane with half the gauges broken. A pilot can still look out the window, and that might usually work, but at night or in bad weather the lack of proper gauges becomes dangerous due to issues like spatial disorientation. Running critical business infrastructure requires more sophisticated monitoring, akin to a large, modern airplane with all the right gauges and alerts.

Today, most customers have moved in the direction of more sophisticated monitoring, but this has led to different problems:

  1. Enabling too many gauges makes WAS more like a space shuttle than an airplane, and this is usually overkill that causes confusion.
  2. Despite the availability of the necessary gauges, critical gauges are missing from dashboards, or they’re not watched or alerted on.
  3. Administrators sometimes become overwhelmed during incidents due to poor dashboards, not knowing which gauges to look at, and poor tools, processes, and automation to properly respond.

This article discusses the critical gauges to use to monitor WebSphere environments, how to configure alerts for gauge thresholds, and outlines processes and automation to help keep environments running smoothly.

Deployment Size

First, decide which type of deployment you’re targeting: a small airplane, a large airplane, or a space shuttle. In the vast majority of cases, we recommend targeting a large airplane, even if you might think your application or deployment is intuitively “small”. The reason is that there is little marginal cost to enabling the gauges of a large airplane and they are often valuable.

The following are minimum gauges for small and large airplane deployment targets for the most common types of HTTP-based applications. Additional metrics apply for JMS, EJB, etc. The term “node” is used to refer to the operating environment of a WAS process such as a physical server, virtual machine, or cloud container.

Small Airplane Deployment

[Image credit: NASA]


Minimum end-to-end gauges:

  1. Average response time
  2. Total throughput

Minimum gauges for each node:

  1. CPU utilization
  2. Memory utilization (excluding file cache)
  3. Operating System Logs’ Warning and Error counts

Minimum gauges for each WAS Java process (JVM):

  1. Thread pool utilization
  2. Connection pool utilization
  3. Java heap utilization
  4. WAS Logs’ Warning and Error counts (System*/messages*, native*/console*, ffdc)

Minimum gauges for each reverse proxy web server:

  1. Connection counts
  2. Logs’ Warning and Error counts (error_log*, http_plugin.log)

Large Airplane Deployment

[Image credit: NASA]


The large airplane deployment is the generally recommended minimum set of gauges.

Minimum end-to-end gauges:

  1. Average response time
  2. Maximum response time
  3. Total throughput
  4. Error throughput (e.g. HTTP code >= 400)
  5. Application-specific metrics
  6. Request arrival rate

Minimum gauges for each node:

  1. Is it alive and responsive?
  2. CPU utilization
  3. Memory utilization (excluding file cache)
  4. Network utilization
  5. Disk utilization and average service time
  6. Filesystem consumption
  7. Count of TCP retransmissions on LAN interfaces
  8. Operating System Logs’ Warning and Error counts

Minimum gauges for each WAS Java process (JVM):

  1. Is it alive and responsive?
  2. JVM process CPU utilization
  3. Thread pool utilization
  4. Connection pool utilization
  5. Java heap utilization
  6. Proportion of time in garbage collection
  7. Average database response time
  8. Average HTTP response time
  9. WAS Logs’ Warning and Error counts (System*/messages*, native*/console*, ffdc)

Minimum gauges for each reverse proxy web server:

  1. Connection counts including breakdown by connection state
  2. Thread pool utilization
  3. Access log with response time
  4. Logs’ Warning and Error counts (error_log*, http_plugin.log)

WAS Gauge Data 

WebSphere Application Server exposes gauge data through these major mechanisms:

WAS traditional:

  • Performance Monitoring Infrastructure (PMI) counters through JMX MBeans, the PerfServlet, and the Tivoli Performance Viewer in the administrative console

Liberty:

  • Prometheus style /metrics HTTP endpoint through mpMetrics
  • Java MXBeans through monitor
  • JAX-RS OpenTracing with Jaeger, Zipkin, etc. through mpOpenTracing

Most commonly, this gauge data is integrated into monitoring products. Speak with your IBM account representative for details.
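
To spot-check the raw Liberty data yourself, here is a minimal sketch that scrapes the Prometheus-style /metrics endpoint over HTTP, assuming mpMetrics is enabled and the server listens on localhost:9080 without authentication (the URL and security settings are assumptions for illustration):

# Minimal sketch: scrape a Liberty mpMetrics endpoint and print the gauge lines.
# The URL is an assumption; your server may require HTTPS and authentication.
import urllib.request

METRICS_URL = "http://localhost:9080/metrics"  # assumed host and port

with urllib.request.urlopen(METRICS_URL, timeout=10) as response:
    for raw_line in response.read().decode("utf-8").splitlines():
        line = raw_line.strip()
        if line and not line.startswith("#"):  # skip Prometheus HELP/TYPE comments
            print(line)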

Alerts

Slow Application Processing Detection

WebSphere Application Server provides built-in slow application processing detection. When the threshold is breached, a warning is printed to WAS logs and may be detected by the warning and error gauge.

In general, we recommend setting the threshold (X) to the maximum expected response time (in seconds) plus 20%.

Configure WAS traditional hung thread detection and watch for WSVR0605W warnings:

  • com.ibm.websphere.threadmonitor.interval=1
  • com.ibm.websphere.threadmonitor.threshold=X

Configure the Liberty requestTiming feature (if needed, tune sampleRate to reduce overhead) and watch for TRAS0112W warnings:

<featureManager><feature>requestTiming-1.0</feature></featureManager>
<requestTiming slowRequestThreshold="Xs" sampleRate="1" />
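
These warnings flow into the WAS log files, so the warning and error gauge can alert on them. As a rough sketch, a script can count the slow-request warnings for that gauge (the log path glob below is an assumption; point it at your SystemOut*.log or messages*.log files):

# Minimal sketch: count slow application processing warnings in WAS/Liberty logs.
# The log path is an assumption; adjust it to your environment.
import glob

SLOW_REQUEST_IDS = ("WSVR0605W", "TRAS0112W")

count = 0
for path in glob.glob("/opt/ibm/wlp/usr/servers/*/logs/messages*.log"):  # assumed path
    with open(path, errors="replace") as log:
        for line in log:
            if any(msg_id in line for msg_id in SLOW_REQUEST_IDS):
                count += 1
print("Slow request warnings:", count)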

Gauge Alerts

Below are some ideas for gauge alerts, which should be tuned to your particular environment, service level agreements, and applications. Many of them reference a “historical average”, meaning the average for that time of day, day of week, season, and so on. Alternatively, you may choose simple thresholds based on observations, SLAs, etc.

Note that the alerts often use != rather than >. For example, average and maximum response times != historical average. Intuitively, you might think that you only need to check for response times exceeding a threshold; however, excessively low response times may also be a symptom of a problem, such as a failing backend that quickly returns errors to the user (e.g. HTTP 500).
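
To make the != idea concrete, here is a minimal sketch of a two-sided check against a historical average; the 20% tolerance band and the sample values are illustrative assumptions to be tuned per gauge:

# Minimal sketch: flag a gauge that deviates from its historical average in either
# direction. The tolerance band and example values are illustrative assumptions.
def deviates(current, historical_average, tolerance=0.20):
    low = historical_average * (1 - tolerance)
    high = historical_average * (1 + tolerance)
    return current < low or current > high

# Example: response times in milliseconds.
print(deviates(current=900, historical_average=400))  # True: too slow
print(deviates(current=45, historical_average=400))   # True: suspiciously fast (failing fast?)
print(deviates(current=410, historical_average=400))  # False: within the band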

End-to-end gauge alerts:

  1. Average and maximum response times != historical average for > ~1 minute
    1. For systems that are responding to humans, ideally, a nice target maximum is 400ms which is approximately the human perception threshold (Herzog et al., 2016)
  2. Request arrival rate != historical average for > ~1 minute
  3. Total throughput != historical average for > ~1 minute
  4. Error throughput != historical average for > ~1 minute

Per-node gauge alerts:

  1. CPU utilization > ~90% for > ~5 minutes
  2. RAM usage (excluding file cache) > ~90%
  3. Disk utilization > ~90% for > ~1 minute
  4. Disk service time > ~10ms
  5. Filesystem consumption > ~90% (see the sketch after this list)
  6. Network utilization > ~90%
  7. Network ping time on a LAN > ~1ms
  8. Network TCP retransmits on a LAN != 0
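
As an example of one of the simpler per-node checks, here is a minimal sketch of the filesystem consumption alert; the mount point and the ~90% threshold are assumptions to adjust for your environment:

# Minimal sketch: alert when filesystem consumption exceeds ~90%.
# The mount point and threshold are assumptions for illustration.
import shutil

MOUNT_POINT = "/"   # assumed filesystem to watch
THRESHOLD = 0.90

usage = shutil.disk_usage(MOUNT_POINT)
consumed = usage.used / usage.total
if consumed > THRESHOLD:
    print("ALERT: %s is %.0f%% full" % (MOUNT_POINT, consumed * 100))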

Per-JVM gauge alerts:

  1. WAS traditional application thread pool usage > ~90% for > ~1 minute
  2. Liberty thread pool usage != historical average for > ~1 minute
  3. JDBC connection pool usage > ~90% for > ~1 minute
  4. JDBC average response time > historical average for > ~1 minute
  5. Java garbage collection proportion of time in GC > ~10% for > ~1 minute
  6. Java heap utilization > ~90% of -Xmx for > ~1 minute
  7. WAS traditional logs’ warning and error counts from ffdc, " W " or " E " messages in System*log, and “JVM” messages in native*log > historical average for > ~1 minute
    1. There should be a separate alert for slow application processing detection warnings (WSVR0605W)
  8. Liberty logs’ warning and error counts from ffdc, " W " or " E " messages in messages*log, and “JVM” messages in console.log > historical average for > ~1 minute
    1. There should be a separate alert for slow application processing detection warnings (TRAS0112W)

Per-reverse proxy web server gauge alerts:

  1. Connection counts != historical average for > ~1 minute
  2. Thread pool usage != historical average for > ~1 minute

Where is the problem?

Modern airplanes have advanced gauges, gauge alerts, self-correction, and auto-pilot capabilities; however, sometimes a human is needed to take manual control or hypothesize where a complicated problem may be, especially in emergencies or when many gauge alerts are going off at once. Similarly, the above gauges and gauge alerts are a good way to find many problems, but a problem determination methodology is still needed, especially for emergencies or very complicated problems.

One of the most common complicated problems is determining the cause of poor performance or a bottleneck. This is particularly true in cloud and microservice architectures which have many tiers and backend services.

For such problems, it’s useful to apply a bit (but not too much!) of queuing theory from computer science. There are three key proximate variables in evaluating performance at each tier:

  1. Request arrival rate
  2. Processing concurrency
  3. Response time

These are indirectly affected by many other variables; for example, CPU saturation will impact response times. Nevertheless, these three variables are the foundation of performance.
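
One well-known way to relate these three variables is Little's Law: average concurrency equals arrival rate multiplied by average response time. A small worked sketch with assumed example numbers:

# Little's Law: average concurrency = arrival rate x average response time.
# The arrival rate and response times below are assumed example values.
arrival_rate = 200      # requests per second arriving at a tier
response_time = 0.25    # average seconds to process one request

print("Threads busy on average:", arrival_rate * response_time)  # 50.0

# If response time doubles (e.g. a slow database), the same arrival rate needs
# twice the concurrency, so a 50-thread pool would saturate and requests queue.
print("Threads needed at 0.5s:", arrival_rate * 0.5)              # 100.0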

It’s useful to have a mental model like a toll road, where each tier (e.g. WebSphere, the database, etc.) is a row of toll booths (application threads) serving your user requests (cars):

[Image: Example Queuing Network]


Throughput per tier is how many cars are processed per unit time at that tier, and it depends on the number of toll booths (concurrency/threads), the arrival rate of cars, and the average toll booth processing time (response time). Except for the first tier, the arrival rate at a downstream tier (e.g. the database) is a direct result of the throughput of the previous tier (e.g. WebSphere), which helps you reason about where a problem is. Conversely, a slowdown at one tier of toll booths (e.g. the database) may affect the upstream toll booths (e.g. WebSphere).

The analogy isn’t perfect since a real-life toll road has different starting and destination points. Just imagine that when a request takes its “offramp” (which might be after the final tier, or it might be between tiers due to caching, etc.) it’s looping back “home” to the requesting computer.

Imagine you’re a toll road administrator and you hear that there’s a problem. There’s a big back-up of cars at the database tier of toll booths. What are your first thoughts?

If you were to call or walk up to each toll booth at the database tier in the image above and ask each operator what the holdup is, that’s like a thread dump (see the sketch after this list), and it helps you hypothesize what’s slow:

  • Is something causing the toll booth operators to perform work slowly? Maybe their electronic cash register is being slow (e.g. high CPU utilization)? In the above example, the WebSphere threads are backed up because the database tier is backed up.
  • Check how many live toll booths there are. Did a lot more than average call out of work? Add backup toll booth operators if available.
  • Is there a sudden burst of traffic on the road? Maybe there was some sort of event? Can you divert some traffic, add toll booth operators, or make them faster?
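
For WAS processes running on IBM Java, “asking each toll booth operator” is typically a thread dump: sending SIGQUIT (kill -3) to the JVM writes a javacore file showing what every thread is doing. A minimal sketch, where the process ID is an assumption you would look up first:

# Minimal sketch: request a thread dump (a javacore on IBM Java) via SIGQUIT.
# The process ID is an assumption; find it with ps, your monitoring tool, etc.
import os
import signal

JVM_PID = 12345  # assumed process ID of the WAS or Liberty JVM

os.kill(JVM_PID, signal.SIGQUIT)  # IBM Java writes javacore*.txt; HotSpot prints to stdout

Taking two or three dumps a few seconds apart makes it much easier to tell a genuinely stuck thread from one that is simply busy.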

Procedures

There are also many non-technical procedures that help you keep your environments running smoothly, including:

  1. Having an up-to-date infrastructure diagram to help reason about problems
  2. Using a proper performance testing strategy before rolling out application changes
  3. Using a maintenance strategy to reduce downtime
  4. Performing a failover strategy in emergencies
  5. Executing standardized runbooks and automation

Automation is worth dwelling on because I’m constantly amazed at how much time highly paid engineers spend on mundane or repeatable tasks. In addition to the labor costs, this dramatically impacts time-to-resolution and serviceability because engineers often can’t manually gather sufficient diagnostics across many servers at the right time. Manual actions simply don’t scale. No matter how good an engineer is, getting thread dumps on even a handful of servers is often impractical, not to mention that it often involves complex security procedures and jumpboxes, and humans make mistakes.

Automation is a critical component of keeping environments running smoothly. It often has a high return on investment, freeing developers and site reliability engineers from repetitive work. Playbooks are a good first step, but it’s important to fully automate or script the actions themselves, including gathering diagnostics, transferring logs, etc. This is often done by having a central “automation” node on the same LAN as the target nodes, but it should be planned with your security team as they may have guidelines on things such as password-protected SSH keys and other automation techniques.
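
As a rough sketch of the kind of scripting meant here (the hostnames, remote paths, and use of SSH/scp from a central automation node are all assumptions to agree on with your security team):

# Minimal sketch: request thread dumps and pull logs from several nodes over SSH.
# Hostnames, remote paths, and SSH key handling are assumptions for illustration.
import subprocess

NODES = ["appnode1.example.com", "appnode2.example.com"]        # assumed hostnames
REMOTE_LOGS = "/opt/ibm/wlp/usr/servers/defaultServer/logs/"    # assumed log path

for node in NODES:
    # Ask any Liberty JVMs on the node for a javacore (SIGQUIT), then copy logs back.
    subprocess.run(["ssh", node, "pkill -QUIT -f 'java.*wlp'"], check=False)
    subprocess.run(["scp", "-r", node + ":" + REMOTE_LOGS, "./" + node], check=False)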

IBM WebSphere Automation for IBM Cloud Pak for Watson AIOps

With the value of automation in mind, IBM WebSphere Automation for IBM Cloud Pak for Watson AIOps (WebSphere Automation) aims to help with automation of WebSphere Application Server traditional and Liberty estates to quickly unlock value with increased security, resiliency and performance. An example of this value is a single dashboard of WAS installations and unresolved security CVEs, vulnerability tracking, and security bulletin notifications. Speak with your IBM account representative for details.

Summary

In sum:

  1. Make sure you’re looking at the right gauges.
  2. Set up alerts when gauges exceed thresholds.
  3. Use a mental model of the architecture to hypothesize where a complicated problem is.
  4. Automate as much as possible.

Appendix

Space Shuttle Deployment

[Image credit: NASA]

For the few deployments that require more advanced gauges, consider the following additional items:

For each node:

  1. System/user/wait/nice/etc. CPU utilization
  2. CPU context switches
  3. CPU cache misses
  4. CPU migrations
  5. Interrupt counts
  6. CPU and memory pressure statistics
  7. Paging rates
  8. Per-CPU hyperthread utilization
  9. Periodic On-CPU stack sampling
  10. Socket counts by state
  11. Network counters over time (e.g. delayed ACKs)
  12. Network usage by process ID
  13. DNS response times

For each WAS Java process (JVM):

  1. Per-thread CPU utilization
  2. CPU utilization for GC threads
  3. CPU utilization for JIT threads
  4. Request queue time
  5. Global GC pause time
  6. Nursery GC pause time
  7. Size of requested object driving GC
  8. Access log with response time including time-to-first response byte
  9. Average HTTP response bytes
  10. Per-servlet/JSP average and maximum response times
  11. Live HTTP session count
  12. Rolled-back transaction count
  13. Prepared statement cache discard count
  14. Lock contention rates
  15. Deadlock count
  16. TLS handshakes

See our team's previous post in the Lessons from the field series: High Impact AIX Network Tuning


#app-platform-swat
#automation-portfolio-specialists-app-platform
#BestPractices
#WAS
#WebSphereApplicationServer(WAS)