If your Java workload is experiencing high CPU usage in production, the basic workflow is generally the following. This article will quickly review the first three steps and then focus on the fourth step: reviewing a native sampling profiler.
Garbage Collection
The first thing to check is whether your garbage collection (GC) is healthy. If the proportion of time in garbage collection is greater than ~10%, then garbage collection is likely using excessive CPU and the application is not performing much useful work. On recent versions of Linux, this is quick to check by looking at CPU usage per thread of a Java process using
top -H -p $PID
. If you see "GC" threads consistently using a large proportion of CPU, then you probably have some GC tuning to do. To perform a comprehensive analysis,
enable verbose garbage collection (which has an overhead of < ~0.2%) and check the proportion of time in garbage collection. IBM provides a free, graphical, as-is tool called the
IBM Monitoring and Diagnostic Tools - Garbage Collection and Memory Visualizer (GCMV). Load the verbosegc file, change the X-axis to the
relevant time period, click "Report," and review the "Proportion of time spent in garbage collection pauses (%)".
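For example, a minimal way to enable verbose garbage collection might look like the following (the log file names, rotation settings, and the application name app.jar are illustrative; verify the exact option syntax against your JVM's documentation):
# IBM Java/Semeru/OpenJ9: rotate across 5 files of 10,000 GC cycles each
java -Xverbosegclog:verbosegc.%seq.log,5,10000 -jar app.jar
# HotSpot (Java 9+): unified GC logging to a file
java -Xlog:gc*:file=gc.log:time,level,tags -jar app.jar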
Thread Dumps
If garbage collection is healthy, the next thing to do is to
gather a handful of thread dumps about 30 seconds apart (
kill -3 $PID
). Then, look for patterns of stack tops that are using CPU and see if you can optimize them. IBM provides a free, graphical, as-is tool to help with this called the
IBM Thread and Monitor Dump Analyzer (TMDA). One quick tip for the tool is to open the "Thread Detail" view, sort by "Stack Depth" in descending order, and review any deep stacks. For example, if you take a handful of thread dumps and find many stack tops performing SSL/TLS handshakes before calling a web service, then a simple configuration change to ensure re-use of pooled, persistent connections can eliminate that CPU hotspot.
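As a sketch, the thread dumps can be gathered with a simple loop, assuming $PID is the Java process ID (IBM Java/Semeru/OpenJ9 write javacore files to the working directory; HotSpot prints the thread dump to the process's stdout):
# Take 4 thread dumps, 30 seconds apart
for i in 1 2 3 4; do kill -3 $PID; sleep 30; done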
Java Sampling Profiler
If nothing obvious shows up in the thread dumps, the next thing to do is to run a Java sampling profiler. You can imagine this as roughly the same as taking thousands or tens of thousands of thread dumps and using a tool to statistically summarize things for you. For IBM Java and OpenJ9, there is a free, graphical, as-is tool called
IBM Monitoring and Diagnostic Tools - Health Center and it may be enabled in
headless mode. For HotSpot Java, there is a tool called
JDK Mission Control. In general, both JVMs use clever mechanisms to keep the overhead of such tooling below ~2%. In addition to reviewing stack tops, these tools also allow you to break down CPU usage by "tree": a high-level application method may execute thousands of other methods, and the tool accumulates all of those "leaf" sample CPU percentages up the tree so that you can see approximately how much total CPU those higher-level application methods consume as candidates for optimization. We don't cover tracing profilers in this article because they are generally impractical for production environments other than
targeted tracing.
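For example, the profilers above might be enabled with options like the following (treat the exact option names and values as assumptions to verify against the tool documentation for your JVM version; app.jar is illustrative):
# IBM Java/Semeru/OpenJ9: Health Center in headless mode
java -Xhealthcenter:level=headless -jar app.jar
# HotSpot: record JDK Flight Recorder data for analysis in JDK Mission Control
java -XX:StartFlightRecording=duration=120s,filename=recording.jfr -jar app.jar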
Native Sampling Profiler
If you've performed the above steps and the causes of the high CPU usage are inconclusive or insufficient, the next thing to do is to run a native sampling profiler. This will provide insight into any hotspots within the JVM itself, as well as any potential hotspots in the Linux kernel. In general, the tool for this job is the Linux perf tool.
Preparing Java
IBM Java/Semeru/OpenJ9 offer the command line argument -Xjit:perfTool
that writes /tmp/perf-$PID.map
which is used by perf
to resolve JIT-compiled Java method names. Restart the Java process with this argument before running perf
. If not all symbols are resolved, try adding -Xlp:codecache:pagesize=4k
. Only the last -Xjit
option is processed, so if there is additional JIT tuning, combine the perfTool
option with that tuning; for example, -Xjit:perfTool,exclude={com/example/generated/*}
. To get assembly-annotated JIT-compiled methods, use libperf-jvmti.so instead of -Xjit:perfTool
.
On HotSpot Java, use -XX:+PreserveFramePointer
and something like libperf-jvmti.so or perf-map-agent to resolve JIT-compiled methods.
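As a concrete sketch, the restarted command lines might look like the following (the application name, the agent path, and the jitdump post-processing step for libperf-jvmti.so are illustrative assumptions):
# IBM Java/Semeru/OpenJ9: write /tmp/perf-$PID.map for perf to resolve JIT-compiled methods
java -Xjit:perfTool -jar app.jar
# HotSpot: keep frame pointers and load the perf jvmti agent for JIT symbol resolution
java -XX:+PreserveFramePointer -agentpath:/path/to/libperf-jvmti.so -jar app.jar
# When using libperf-jvmti.so, record with a monotonic clock and merge the jitdump afterwards
perf record -k 1 -T -F 99 -a -g -- sleep 120
perf inject --jit -i perf.data -o perf.jit.data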
Installing perf
Use your Linux package manager to install perf
. For example:
- Modern Fedora/RHEL/CentOS/ubi/ubi-init:
dnf install -y perf
- Older Fedora/RHEL/CentOS:
yum install -y perf
- Debian:
apt-get update && DEBIAN_FRONTEND=noninteractive TZ=${TZ:-UTC} apt-get -y install linux-perf
- Ubuntu:
apt-get update && DEBIAN_FRONTEND=noninteractive TZ=${TZ:-UTC} apt-get -y install linux-tools-common linux-tools-$(uname -r)
- Alpine:
apk update && apk add perf
Note that kernel debug symbols are not required since perf
simply reads /proc/kallsyms
. You may need to install other debug symbols in some cases (e.g. glibc-debuginfo
).
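A quick way to confirm the installation and kernel symbol availability:
perf --version
# Kernel symbols come from /proc/kallsyms; addresses may display as zeros for non-root users depending on kernel.kptr_restrict
head /proc/kallsyms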
Running perf
There are many ways to run perf
depending on what information you want to gather. For our initial purposes, we simply want to get global CPU stack samples.
Although perf
may be run without root access for analyzing program activity, in general, it's best to run with root access so that kernel activity is also included.
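If you do run without root, the kernel's perf_event_paranoid setting may restrict what can be sampled; a rough check and (temporary) relaxation looks something like this (value meanings are summarized in the perf_event_open man page):
# Show the current restriction level (2 is a common default; -1 is the least restrictive)
cat /proc/sys/kernel/perf_event_paranoid
# Temporarily relax it as root to allow system-wide and kernel profiling
sysctl kernel.perf_event_paranoid=-1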
If you just want to get a quick feel for current activity, simply run perf top. Use the -z
option to show non-accumulated (per-refresh) samples.
For more formal analysis, gather data to a file for a few minutes during the issue. On IBM Java/Semeru/OpenJ9, use the --call-graph dwarf,65528
option so that perf walks native library stacks using DWARF debugging information, since the JVM is built with frame pointer omission (it is not compiled with -fno-omit-frame-pointer).
perf record --call-graph dwarf,65528 -T -F 99 -a -g -- sleep 120
On HotSpot Java, omit the --call-graph dwarf,65528
option since HotSpot is compiled without frame pointer omission (and -XX:+PreserveFramePointer keeps frame pointers in JIT-compiled code), so the default frame-pointer-based call stack walking works and is faster:
perf record -T -F 99 -a -g -- sleep 120
This will create a perf.data
file.
It's also possible to correlate captured perf data to other events based on wall-clock time.
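For example, a rough way to line up samples with wall-clock time is to combine the recording's start time from the perf.data header with the per-sample timestamps enabled by -T (field names assume a reasonably recent perf):
# Show when the recording was captured
perf script --header-only | grep captured
# Print per-sample timestamps alongside process and symbol information
perf script -F comm,pid,tid,time,sym | head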
Analyzing perf
For a simple summary, run perf report in the same directory as the perf.data
file. For example:
$ perf report --header -n --stdio
# Total Lost Samples: 0
#
# Samples: 38K of event 'cpu-clock'
# Event count (approx.): 384525248680
#
# Children Self Samples Command Shared Object Symbol
# ........ ........ ............ ........... ................. ..........................................
#
68.78% 68.78% 26185 init [kernel.kallsyms] [k] native_safe_halt
|
---native_safe_halt
15.60% 15.60% 5937 swapper [kernel.kallsyms] [k] native_safe_halt
|
---native_safe_halt
15.60% 15.59% 5933 main perf-18353.map [.] BurnCPU.main([Ljava/lang/String;)V_hot
|
---BurnCPU.main([Ljava/lang/String;)V_hot
In this example on an 8-CPU node, one Java thread was consuming about an entire core in BurnCPU.main, which is similar to the analysis one can get from a Java sampling profiler; however, if there were any native or kernel hotspots, those would also show up in the perf output.
For more advanced analysis, consider creating a FlameGraph:
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf.svg
# Open perf.svg in a browser
Sharing perf data
To share perf.data
with support, run perf archive and upload the perf.data
file, the resulting perf.data.tar.bz2
or perf.data.tar.gz
file, and the /tmp/perf-$PID.map
file.
When analyzing on another system, the perf.data
file must be in the current directory, and the perf.data.tar.bz2
or perf.data.tar.gz
file must be extracted into ~/.debug
:
mkdir -p ~/.debug/
tar xf perf.data.tar.bz2 -C ~/.debug
Then place the perf-$PID.map
file in /tmp
before running perf report
.
Perf in Docker & Kubernetes
When running non-privileged containers, perf
cannot be run within the containers themselves. Instead, run perf
using a root debug container on the worker node and use utilities such as runc
to find container-to-node PID mappings if needed.
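As a hedged sketch on Kubernetes (the node name, image, and container ID are illustrative; kubectl debug node requires a reasonably recent kubectl, and the --profile=sysadmin option or an equivalent privileged configuration is typically needed for perf to work; the debug pod runs in the host PID namespace with the node's filesystem mounted under /host):
# Start a root debug container on the worker node
kubectl debug node/$NODE -it --profile=sysadmin --image=fedora
# Inside the debug container, install perf and profile the whole node
dnf install -y perf
perf record -T -F 99 -a -g -- sleep 120
# If a container-to-node PID mapping is needed, the container runtime can report it, for example:
crictl inspect $CONTAINER_ID | grep -m1 '"pid"'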
Conclusion
To round out the workflow described at the beginning of the article, if all of the above doesn't uncover any significant wins, thoroughly review tuning documents such as the IBM WebSphere Application Server Performance Cookbook. For example, with proper caching (e.g. client caching with HTTP response headers, servlet response caching, enterprise caching products, etc.), significant units of work may simply be offloaded or eliminated. Various other techniques such as CPU pinning may help optimize CPU cache efficiency, and so on.
If you've exhausted all reasonable options, the next step is to horizontally scale out by adding more nodes and/or vertically scale up by adding more or bigger CPUs (e.g. faster clock speed, bigger CPU caches, etc.). If there are cost or other constraints to such scaling, then consider engaging your Linux vendor's support process and/or paid professional services with performance tuning expertise; finally, if there are no other options, consider throttling and queuing incoming work.
In summary, there is a well-developed workflow for dealing with high CPU usage on Linux with Java in production. Each step is progressively more complicated, culminating in using Linux perf for native stack sampling to look "under the hood" of production Linux runtimes. Although perf is initially intimidating, it's often relatively easy to use and analyze for the most common use cases.