For a video presentation of this topic, see the replay of the Customer Advisory Board session.
Gauges: Small Airplanes, Large Airplanes and Space Shuttles
Fifteen years ago, most administrators had very limited monitoring of their WebSphere Application Server (WAS) environments. Many only watched CPU and memory on the node, and Java heap usage of WAS. Some customers still run like this today.
These gauges are insufficient. It’s akin to flying an airplane with half the gauges broken. A pilot can still look out the window, and that might usually work; but at night or in bad weather, the lack of proper gauges becomes dangerous due to issues such as spatial disorientation. Running critical business infrastructure requires more sophisticated monitoring, akin to a large, modern airplane with all the right gauges and alerts.
Today, most customers have moved in the direction of more sophisticated monitoring, but this has led to different problems:
- Enabling too many gauges makes WAS more like a space shuttle than an airplane, and this is usually overkill that causes confusion.
- Despite the availability of the necessary gauges, critical gauges are missing from dashboards, or they’re not watched or alerted on.
- Administrators sometimes become overwhelmed during incidents due to poor dashboards, not knowing which gauges to look at, and poor tools, processes, and automation to properly respond.
This article discusses the critical gauges to use to monitor WebSphere environments, how to configure alerts for gauge thresholds, and outlines processes and automation to help keep environments running smoothly.
Deployment Size
First, decide which type of deployment you’re targeting: a small airplane, a large airplane, or a space shuttle. In the vast majority of cases, we recommend targeting a large airplane, even if your application or deployment intuitively seems “small”. The reason is that the marginal cost of enabling the gauges of a large airplane is small, and they are often valuable.
The following are minimum gauges for small and large airplane deployment targets for the most common types of HTTP-based applications. Additional metrics apply for JMS, EJB, etc. The term “node” is used to refer to the operating environment of a WAS process such as a physical server, virtual machine, or cloud container.
Small Airplane Deployment
Minimum end-to-end gauges:
- Average response time
- Total throughput
Minimum gauges for each node:
- CPU utilization
- Memory utilization (excluding file cache)
- Operating System Logs’ Warning and Error counts
Minimum gauges for each WAS Java process (JVM):
- Thread pool utilization
- Connection pool utilization
- Java heap utilization
- WAS Logs’ Warning and Error counts (System*/messages*, native*/console*, ffdc)
Minimum gauges for each reverse proxy web server:
- Connection counts
- Logs’ Warning and Error counts (error_log*, http_plugin.log)
Large Airplane Deployment
The large airplane deployment is the generally recommended minimum set of gauges.
Minimum end-to-end gauges:
- Average response time
- Maximum response time
- Total throughput
- Error throughput (e.g. HTTP code >= 400)
- Application-specific metrics
- Request arrival rate
Minimum gauges for each node:
- Is it alive and responsive?
- CPU utilization
- Memory utilization (excluding file cache)
- Network utilization
- Disk utilization and average service time
- Filesystem consumption
- Count of TCP retransmissions on LAN interfaces
- Operating System Logs’ Warning and Error counts
Minimum gauges for each WAS Java process (JVM):
- Is it alive and responsive?
- JVM process CPU utilization
- Thread pool utilization
- Connection pool utilization
- Java heap utilization
- Proportion of time in garbage collection
- Average database response time
- Average HTTP response time
- WAS Logs’ Warning and Error counts (System*/messages*, native*/console*, ffdc)
Minimum gauges for each reverse proxy web server:
- Connection counts including breakdown by connection state
- Thread pool utilization
- Access log with response time
- Logs’ Warning and Error counts (error_log*, http_plugin.log)
WAS Gauge Data
WebSphere Application Server exposes gauge data through these major mechanisms:
WAS traditional:
- Performance Monitoring Infrastructure (PMI) through JMX MBeans and the performance servlet (PerfServlet)
Liberty:
- Prometheus-style /metrics HTTP endpoint through mpMetrics
- Java MXBeans through monitor
- JAX-RS OpenTracing with Jaeger, Zipkin, etc. through mpOpenTracing
Most commonly, this gauge data is integrated into monitoring products. Speak with your IBM account representative for details.
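For example, here is a minimal sketch of pulling gauge data yourself from a Liberty mpMetrics endpoint. The host, port, and metric-name filters are assumptions; in practice a Prometheus-compatible monitoring product would typically do this scraping for you.

```python
# Minimal sketch: scrape a Liberty mpMetrics endpoint and print selected gauges.
# The host/port and the metric-name filters below are assumptions; adjust to your server.
import urllib.request

METRICS_URL = "http://localhost:9080/metrics"  # hypothetical Liberty HTTP endpoint

def scrape(url=METRICS_URL):
    """Return a dict of metric line -> value from a Prometheus-style /metrics endpoint."""
    gauges = {}
    with urllib.request.urlopen(url) as response:
        for raw in response.read().decode("utf-8").splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
                continue
            name, _, value = line.rpartition(" ")
            try:
                gauges[name] = float(value)
            except ValueError:
                continue  # ignore anything that is not a simple numeric sample
    return gauges

if __name__ == "__main__":
    for name, value in sorted(scrape().items()):
        # Print a few gauges of interest, e.g. heap usage and thread pool size.
        if "heap" in name.lower() or "threadpool" in name.lower():
            print(f"{name} = {value}")
```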
Alerts
Slow Application Processing Detection
WebSphere Application Server provides built-in slow application processing detection. When the threshold is breached, a warning is printed to WAS logs and may be detected by the warning and error gauge.
In general, we recommend setting the threshold (X) to the maximum expected response time (in seconds) plus 20%.
Configure WAS traditional hung thread detection and watch for WSVR0605W warnings:
- com.ibm.websphere.threadmonitor.interval=1
- com.ibm.websphere.threadmonitor.threshold=X
Configure the Liberty requestTiming feature (if needed, tune sampleRate to reduce overhead) and watch for TRAS0112W warnings:
<featureManager><feature>requestTiming-1.0</feature></featureManager>
<requestTiming slowRequestThreshold="Xs" sampleRate="1" />
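If your monitoring product does not already scan WAS logs, the following is a minimal sketch of watching a Liberty messages.log for these slow-request warnings and other warning/error message IDs. The log path is an assumption, and the print statement stands in for a call to your real alerting tool.

```python
# Minimal sketch: follow a Liberty messages.log and flag slow-request and warning/error entries.
# The log path is an assumption; replace print() with a call to your alerting tool.
import re
import time

LOG_FILE = "/opt/ibm/wlp/usr/servers/defaultServer/logs/messages.log"  # hypothetical path
# TRAS0112W is the slow request warning; the second pattern matches generic W/E message IDs.
PATTERN = re.compile(r"\b(TRAS0112W|[A-Z]{4,5}\d{4}[EW])\b")

def follow(path):
    """Yield lines appended to the file, similar to 'tail -f'."""
    with open(path, "r", errors="replace") as handle:
        handle.seek(0, 2)  # start at the end of the file
        while True:
            line = handle.readline()
            if line:
                yield line
            else:
                time.sleep(1)

for entry in follow(LOG_FILE):
    match = PATTERN.search(entry)
    if match:
        print(f"ALERT ({match.group(1)}): {entry.strip()}")
```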
Gauge Alerts
Below are some ideas for gauge alerts which should be tuned to your particular environment, service level agreements, and applications. Many points make reference to a “historical average” which implies “for that time of day, day of week, season, etc.” Alternatively, you may choose simple thresholds based on observations, SLAs, etc.
Note that the alert often uses != rather than >. For example, average and maximum response times != historical average. Intuitively, you might think that you only have to check for response times exceeding a threshold; however, if they are excessively low, this may also be a symptom of a problem, such as backend errors that are quickly returned to the user (e.g. HTTP 500).
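As a concrete illustration of such a check, the sketch below flags a metric that deviates from its historical average in either direction. The three-sigma threshold and the sample history are assumptions; in practice the history would come from your monitoring store for the same time-of-day and day-of-week window.

```python
# Minimal sketch: flag a metric that differs from its historical average in either direction.
# The history values and the three-sigma threshold are assumptions for illustration.
from statistics import mean, stdev

def deviates(current, history, sigmas=3.0):
    """Return True if 'current' is more than 'sigmas' standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return current != mu
    return abs(current - mu) > sigmas * sd

# Example: average response times (ms) for this window over the last few weeks.
history = [120, 115, 130, 125, 118]
print(deviates(45, history))   # True: suspiciously fast (e.g. fast-failing errors)
print(deviates(122, history))  # False: in line with history
```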
End-to-end gauge alerts:
- Average and maximum response times != historical average for > ~1 minute
- For systems that respond to humans, a good target maximum is 400ms, which is approximately the human perception threshold (Herzog et al., 2016)
- Request arrival rate != historical average for > ~1 minute
- Total throughput != historical average for > ~1 minute
- Error throughput != historical average for > ~1 minute
Per-node gauge alerts:
- CPU utilization > ~90% for > ~5 minutes
- RAM usage (excluding file cache) > ~90%
- Disk utilization > ~90% for > ~1 minute
- Disk service time > ~10ms
- Filesystem consumption > ~90%
- Network utilization > ~90%
- Network ping time on a LAN > ~1ms
- Network TCP retransmits on a LAN != 0
Per-JVM gauge alerts:
- WAS traditional application thread pool usage > ~90% for > ~1 minute
- Liberty thread pool usage != historical average for > ~1 minute
- JDBC connection pool usage > ~90% for > ~1 minute
- JDBC average response time > historical average for > ~1 minute
- Java garbage collection proportion of time in GC > ~10% for > ~1 minute
- Java heap utilization > ~90% of -Xmx for > ~1 minute
- WAS traditional logs’ warning and error counts from ffdc, " W " or " E " messages in System*log, and “JVM” messages in native*log > historical average for > ~1 minute
- There should be a separate alert for slow application processing detection warnings (WSVR0605W)
- Liberty logs’ warning and error counts from ffdc, " W " or " E " messages in messages*log, and “JVM” messages in console.log > historical average for > ~1 minute
- There should be a separate alert for slow application processing detection warnings (TRAS0112W)
Per-reverse proxy web server gauge alerts:
- Connection counts != historical average for > ~1 minute
- Thread pool usage != historical average for > ~1 minute
Where is the problem?
Modern airplanes have advanced gauges, gauge alerts, self-correction, and auto-pilot capabilities; however, sometimes a human is needed to take manual control or hypothesize where a complicated problem may be, especially in emergencies or when many gauge alerts are going off at once. Similarly, the above gauges and gauge alerts are a good way to find many problems, but a problem determination methodology is still needed, especially for emergencies or very complicated problems.
One of the most common complicated problems is determining the cause of poor performance or a bottleneck. This is particularly true in cloud and microservice architectures which have many tiers and backend services.
For such problems, it’s useful to apply a bit (but not too much!) of queuing theory from computer science. There are three key proximate variables in evaluating performance at each tier:
- Request arrival rate
- Processing concurrency
- Response time
These are indirectly affected by many variables. For example, CPU saturation will impact response times, etc. Nevertheless, these three variables are the foundation of performance.
It’s useful to have a mental model like a toll road with each tier (e.g. WebSphere, Database, etc.) of toll booths (application threads) serving your user requests (cars):
Throughput per tier is how many cars are processed per unit of time, and it depends on the number of toll booths (concurrency/threads), the arrival rate of cars, and the average toll booth processing time (response time). Except for the first tier, the arrival rate at a downstream tier (e.g. the database) is a direct result of the throughput of the previous tier (e.g. WebSphere), and this helps you reason about where a problem is. Conversely, a slowdown at one tier of toll booths (e.g. the database) may back up the upstream toll booths (e.g. WebSphere).
The analogy isn’t perfect since a real-life toll road has different starting and destination points. Just imagine that when a request takes its “offramp” (which might be after the final tier, or it might be between tiers due to caching, etc.) it’s looping back “home” to the requesting computer.
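To make the relationship concrete, a rough rule of thumb from queuing theory (Little’s Law) is that the number of requests in flight at a tier is approximately its arrival rate multiplied by its average response time. The sketch below applies this per tier; all of the tier names and numbers are made up for illustration, so substitute measurements from your own gauges.

```python
# Illustrative sketch of Little's Law per tier: requests in flight ≈ arrival rate × response time.
# All tier names and numbers are made up; substitute measurements from your own gauges.
tiers = {
    # tier name: (arrival rate in req/s, average response time in s, configured threads)
    "web server": (200, 0.05, 100),
    "WebSphere":  (200, 0.20, 50),
    "database":   (250, 0.18, 30),
}

for tier, (rate, resp, threads) in tiers.items():
    in_flight = rate * resp            # approximate concurrent requests at this tier
    utilization = in_flight / threads  # how busy the "toll booths" are
    flag = "  <-- likely bottleneck" if utilization >= 0.9 else ""
    print(f"{tier:>10}: ~{in_flight:.0f} in flight of {threads} threads ({utilization:.0%}){flag}")
```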
Imagine you’re a toll road administrator and you hear that there’s a problem. There’s a big back-up of cars at the database tier of toll booths. What are your first thoughts?
If you were to call or walk up to each toll booth at the database tier in the image above and ask each operator what the holdup is, that would be like taking a thread dump, and it helps you hypothesize what’s slow:
- Is something causing the toll booth operators to perform work slowly? Maybe their electronic cash register is being slow (e.g. high CPU utilization)? In the above example, the WebSphere threads are backed up because the database tier is backed up.
- Check how many live toll booths there are. Did a lot more than average call out of work? Add backup toll booth operators if available.
- Is there a sudden burst of traffic on the road? Maybe there was some sort of event? Can you divert some traffic, add toll booth operators, or make them faster?
Procedures
There are also many non-technical procedures that help you keep your environments running smoothly, including:
- Having an up-to-date infrastructure diagram to help reason about problems
- Using a proper performance testing strategy before rolling out application changes
- Using a maintenance strategy to reduce downtime
- Performing a failover strategy in emergencies
- Executing standardized runbooks and automation
Automation is worth dwelling on because I’m constantly amazed at how much time highly paid engineers spend on mundane or repeatable tasks. In addition to the labor costs, this dramatically impacts time-to-resolution and serviceability because engineers often can’t manually gather sufficient diagnostics across many servers at the right time. Manual actions simply don’t scale. No matter how good an engineer is, getting thread dumps on even a handful of servers is often impractical, not to mention that it often involves complex security procedures and jump boxes, and humans make mistakes.
Automation is a critical component of keeping environments running smoothly. It often has a high return on investment, freeing up developers and site reliability engineers for higher-value work. Playbooks are a good start, but it’s important to fully automate or script the actions themselves, including gathering diagnostics, transferring logs, etc. This is often done by having a central “automation” node on the same LAN as the target nodes, but it should be planned with your security team as they may have guidelines on things such as password-protected SSH keys and other automation techniques.
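As one sketch of what scripting the actions themselves can look like, the example below requests thread dumps (javacores) from several Liberty servers over SSH and copies the resulting files back for analysis. The host names, user, server name, and paths are all assumptions, and your security team’s guidance on SSH keys and jump boxes applies.

```python
# Minimal sketch: trigger thread dumps (javacores) on several Liberty servers over SSH
# and copy the resulting files back for analysis. Host names, user, server name, and paths
# are assumptions; adapt to your topology and your security team's guidelines.
import os
import subprocess

HOSTS = ["was-node1.example.com", "was-node2.example.com"]  # hypothetical nodes
USER = "wasadmin"
SERVER = "defaultServer"
WLP_BIN = "/opt/ibm/wlp/bin"
SERVER_DIR = f"/opt/ibm/wlp/usr/servers/{SERVER}"

for host in HOSTS:
    target = f"{USER}@{host}"
    os.makedirs(f"diag/{host}", exist_ok=True)
    # 'server javadump' asks the JVM to write a javacore*.txt into the server directory.
    subprocess.run(["ssh", target, f"{WLP_BIN}/server javadump {SERVER}"], check=True)
    # Pull back any javacores for offline analysis (e.g. with a thread dump analyzer).
    subprocess.run(["scp", f"{target}:{SERVER_DIR}/javacore*.txt", f"diag/{host}/"], check=True)
```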
IBM WebSphere Automation for IBM Cloud Pak for Watson AIOps
With the value of automation in mind, IBM WebSphere Automation for IBM Cloud Pak for Watson AIOps (WebSphere Automation) aims to help with automation of WebSphere Application Server traditional and Liberty estates to quickly unlock value with increased security, resiliency and performance. An example of this value is a single dashboard of WAS installations and unresolved security CVEs, vulnerability tracking, and security bulletin notifications. Speak with your IBM account representative for details.
Summary
In sum:
- Make sure you’re looking at the right gauges.
- Set up alerts when gauges exceed thresholds.
- Use a mental model of the architecture to hypothesize where a complicated problem is.
- Automate as much as possible.
Appendix
Space Shuttle Deployment
For the few deployments that require more advanced gauges, consider the following additional items:
For each node:
- System/user/wait/nice/etc. CPU utilization
- CPU context switches
- CPU cache misses
- CPU migrations
- Interrupt counts
- CPU and memory pressure statistics
- Paging rates
- Per-CPU hyperthread utilization
- Periodic On-CPU stack sampling
- Socket counts by state
- Network counters over time (e.g. delayed ACKs)
- Network usage by process ID
- DNS response times
For each WAS Java process (JVM):
- Per-thread CPU utilization
- CPU utilization for GC threads
- CPU utilization for JIT threads
- Request queue time
- Global GC pause time
- Nursery GC pause time
- Size of requested object driving GC
- Access log with response time including time-to-first response byte
- Average HTTP response bytes
- Per-servlet/JSP average and maximum response times
- Live HTTP session count
- Rolled-back transaction count
- Prepared statement cache discard count
- Lock contention rates
- Deadlock count
- TLS handshakes
See our team's previous post in the Lessons from the field series:
High Impact AIX Network Tuning