Lessons from the field #11: Native CPU analysis on Linux with Java in Production

By Kevin Grigorenko posted Wed November 24, 2021 10:03 AM

  
If your Java workload is experiencing high CPU in production, the basic workflow is generally: check garbage collection health, review thread dumps, run a Java sampling profiler, and then run a native sampling profiler. This article will quickly review the first three steps and then we'll focus on the fourth step: reviewing a native sampling profiler.


Garbage Collection

The first thing to check is whether your garbage collection (GC) is healthy. If the proportion of time in garbage collection is greater than ~10%, then garbage collection is likely using excessive CPU and the application is not performing much useful work. On recent versions of Linux, this is easy to quickly check by looking at CPU usage per thread of a Java process using top -H -p $PID. If you see "GC" threads consistently using a large proportion of CPU, then you probably have some GC tuning to do. To perform a comprehensive analysis, enable verbose garbage collection (which has an overhead of < ~0.2%) and check the proportion of time in garbage collection. IBM provides a free, graphical, as-is tool called the IBM Monitoring and Diagnostic Tools - Garbage Collection and Memory Visualizer (GCMV). Load the verbosegc file, change the X-axis to the relevant time period, click "Report," and review the "Proportion of time spent in garbage collection pauses (%)".
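
For example, a quick sketch of both checks from a shell (the verbose GC flag names below are for IBM Java/Semeru/OpenJ9 and for HotSpot Java 9+ unified logging, respectively, and may vary by version; the log file names are just examples):

# Per-thread CPU usage of the Java process; GC threads consistently near the top suggest GC tuning work
top -H -p $PID

# Enable verbose GC logging at startup (requires a restart)
# IBM Java / Semeru / OpenJ9:
java -Xverbosegclog:verbosegc.%pid.log ...
# HotSpot (Java 9+ unified logging):
java -Xlog:gc*:file=gc.log ...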

Thread Dumps

If garbage collection is healthy, the next thing to do is to gather a handful of thread dumps about 30 seconds apart (kill -3 $PID). Then, look for patterns of stack tops that are using CPU and see if you can optimize them. IBM provides a free, graphical, as-is tool to help with this called the IBM Thread and Monitor Dump Analyzer (TMDA). One quick tip for the tool is to open the "Thread Detail" view, sort by "Stack Depth" in descending order and review any deep stacks. For example, if you take a bunch of thread dumps and you find a lot of stack tops performing SSL/TLS handshakes before calling a web service, then a simple configuration change of ensuring the re-use of pooled, persistent connections will eliminate that CPU hotspot.
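
For example, a minimal sketch of gathering five thread dumps about 30 seconds apart (the JVM writes the thread dumps itself, typically as javacore files in its working directory for IBM Java/Semeru/OpenJ9):

# Send SIGQUIT five times, roughly 30 seconds apart
for i in 1 2 3 4 5; do kill -3 $PID; sleep 30; done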

Java Sampling Profiler

If nothing obvious shows up in the thread dumps, the next thing to do is to run a Java sampling profiler. You can imagine this as roughly the same as taking thousands or tens of thousands of thread dumps and using a tool to statistically summarize things for you. For IBM Java and OpenJ9, there is a free, graphical, as-is tool called IBM Monitoring and Diagnostic Tools - Health Center and it may be enabled in headless mode. For HotSpot Java, there is a tool called JDK Mission Control. In general, both JVMs use clever mechanisms to keep the overhead of such tooling below ~2%. In addition to reviewing stack tops, the tools also allow you to break down CPU usage by "tree". What this means is that you might have a high-level application method that executes thousands of other methods and the tool will accumulate all of those "leaf" sample CPU percentages up the tree so that you can analyze approximately how much total CPU usage those higher level application methods are consuming as candidates for optimization. We don't mention tracing profilers in this article because they are generally impractical for production environments other than targeted tracing.
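
As a rough sketch, both profilers may be enabled with startup options such as the following (exact option names, availability, and defaults depend on your JVM version; the duration and file names are just examples):

# IBM Java / Semeru / OpenJ9: Health Center in headless mode (writes .hcd files for the Health Center client)
java -Xhealthcenter:level=headless ...

# HotSpot: record a Java Flight Recorder file for analysis in JDK Mission Control
java -XX:StartFlightRecording=duration=120s,filename=recording.jfr ...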

Native Sampling Profiler

If you've performed the above steps and the causes of the high CPU usage are inconclusive or insufficient, the next thing to do is to run a native sampling profiler. This will provide insight into any hotspots within the JVM itself, as well as any potential hotspots in the Linux kernel. In general, the tool for this job is the Linux perf tool.

Preparing Java

IBM Java/Semeru/OpenJ9 offer the command-line argument -Xjit:perfTool, which writes /tmp/perf-$PID.map, the file perf uses to resolve JIT-compiled Java method names. Restart the Java process with this argument before running perf. If not all symbols are resolved, try adding -Xlp:codecache:pagesize=4k. Only the last -Xjit option is processed, so if there is additional JIT tuning, combine the perfTool option with that tuning; for example, -Xjit:perfTool,exclude={com/example/generated/*}. To get assembly-annotated JIT-compiled methods, use libperf-jvmti.so instead of -Xjit:perfTool.

On HotSpot Java, use -XX:+PreserveFramePointer and something like libperf-jvmti.so or perf-map-agent to resolve JIT-compiled methods.
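
Putting this together, a sketch of the startup options for each JVM (the path to libperf-jvmti.so is an assumption, and whether it is shipped at all depends on how perf is packaged on your distribution):

# IBM Java / Semeru / OpenJ9: write /tmp/perf-$PID.map so perf can resolve JIT-compiled method names
java -Xjit:perfTool -Xlp:codecache:pagesize=4k ...

# HotSpot: keep frame pointers and load a JVMTI agent that emits JIT symbol information
java -XX:+PreserveFramePointer -agentpath:/usr/lib64/libperf-jvmti.so ...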

Installing perf

Use your Linux package manager to install perf. For example:

  • Modern Fedora/RHEL/CentOS/ubi/ubi-init:

    dnf install -y perf

  • Older Fedora/RHEL/CentOS:

    yum install -y perf

  • Debian/Ubuntu:

    apt-get update && DEBIAN_FRONTEND=noninteractive TZ=${TZ:-UTC} apt-get -y install perf

  • Alpine:

    apk update && apk add perf

Note that kernel debug symbols are not required since perf simply reads /proc/kallsyms. You may need to install other debug symbols in some cases (e.g. glibc-debuginfo).
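
For example, on Fedora/RHEL/CentOS, glibc debug symbols can usually be installed with the debuginfo plugin (package and plugin availability vary by distribution and subscription):

dnf debuginfo-install -y glibc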

Running perf

There are many ways to run perf depending on what information you want to gather. For our initial purposes, we simply want to get global CPU stack samples.

Although perf may be run without root access for analyzing program activity, in general, it's best to run with root access so that kernel activity is also included.

If you just want to get a feeling of activity, simply run perf top. Use the -z option for non-accumulated snapshots.
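
For example:

# System-wide live view of the hottest symbols; -z zeroes the counts between screen refreshes
perf top -z

# Or restrict sampling to a single process
perf top -z -p $PID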

For more formal analysis, gather data to a file for a few minutes during the issue. On IBM Java/Semeru/OpenJ9, use the --call-graph dwarf,65528 option for native library stack walking since the JVM is not compiled with -fno-omit-frame-pointer.

perf record --call-graph dwarf,65528 -T -F 99 -a -g -- sleep 120

On HotSpot Java, omit the --call-graph dwarf,65528 option since HotSpot is compiled without frame pointer omission and the default frame-pointer-based call-stack walking is faster:

perf record -T -F 99 -a -g -- sleep 120

This will create a perf.data file.

It's also possible to correlate captured perf data to other events based on wall-clock time.
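
For example, perf script can print the per-sample timestamps recorded because of the -T option; note that these are kernel clock timestamps, so relating them to wall-clock time may require noting the system clock offset around the time of capture:

perf script -F comm,pid,tid,time,ip,sym | head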

Analyzing perf

For a simple summary, run perf report in the same directory as the perf.data file. For example:

$ perf report --header -n --stdio
# Total Lost Samples: 0
#
# Samples: 38K of event 'cpu-clock'
# Event count (approx.): 384525248680
#
# Children      Self       Samples  Command      Shared Object      Symbol                                   
# ........  ........  ............  ...........  .................  ..........................................
#
    68.78%    68.78%         26185  init         [kernel.kallsyms]  [k] native_safe_halt
            |
            ---native_safe_halt
    15.60%    15.60%          5937  swapper      [kernel.kallsyms]  [k] native_safe_halt
            |
            ---native_safe_halt
    15.60%    15.59%          5933  main         perf-18353.map     [.] BurnCPU.main([Ljava/lang/String;)V_hot
            |
            ---BurnCPU.main([Ljava/lang/String;)V_hot

In this example on an 8-CPU node, one Java thread was consuming about an entire core in BurnCPU.main, which is similar to the analysis one can get from a Java sampling profiler; however, if there were any native or kernel hotspots, those would also show up in the perf output.

For more advanced analysis, consider creating a FlameGraph:

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf.svg
# Open perf.svg in a browser

Sharing perf data

To share perf.data with support, run perf archive and upload the perf.data file, the resulting perf.data.tar.bz2 or perf.data.tar.gz file, and the /tmp/perf-$PID.map file.

When analyzing on another system, the perf.data file must be in the current directory, and the perf.data.tar.bz2 or perf.data.tar.gz file must be extracted into ~/.debug:

mkdir ~/.debug/
tar xf perf.data.tar.bz2 -C ~/.debug

Then place the perf-$PID.map file in /tmp before running perf report.

Perf in Docker & Kubernetes

When running non-privileged containers, perf cannot be run within the containers themselves. Instead, run perf using a root debug container on the worker node and use utilities such as runc to find container-to-node PID mappings if needed.
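
For example, one way to do this on Kubernetes, assuming kubectl debug is available, the cluster allows an elevated debug pod, and the node's container runtime provides crictl (the image name and variables are placeholders):

# Start a debug pod on the worker node; the node's root filesystem is mounted under /host
kubectl debug node/$NODE -it --image=registry.access.redhat.com/ubi9/ubi
chroot /host

# Map the target container to its host PID, then point perf at that PID (or run it system-wide)
crictl ps
crictl inspect --output go-template --template '{{.info.pid}}' $CONTAINER_ID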

Conclusion

To round out the workflow described at the beginning of the article: if none of the above uncovers any significant wins, thoroughly review tuning documents such as the IBM WebSphere Application Server Performance Cookbook. For example, with proper caching (e.g. client caching with HTTP response headers, servlet response caching, enterprise caching products, etc.), significant units of work may simply be offloaded or eliminated. Various other techniques such as CPU pinning may help optimize CPU cache efficiency, and so on.
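
As a quick illustration of the client-caching point, it's easy to check whether responses for cacheable resources advertise caching to clients (the URL is a placeholder):

curl -sI https://example.com/app/styles.css | grep -i cache-control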

If you've exhausted all reasonable options, the next step is to horizontally scale out by adding more nodes and/or vertically scale up by adding more or bigger CPUs (e.g. faster clock speed, bigger CPU caches, etc.). If there are cost or other constraints to such scaling, then consider engaging your Linux vendor's support process and/or paid professional services with performance tuning expertise; finally, if there are no other options, consider throttling and queuing incoming work.

In summary, there is a well-developed workflow for dealing with high CPU usage on Linux with Java in production. Each step is progressively more complicated, culminating in using Linux perf for native stack sampling to look "under the hood" of production Linux runtimes. Although perf is initially intimidating, it's often relatively easy to use and analyze for the most common use cases.


#app-platform-swat
#automation-portfolio-specialists-app-platform
#Java
#Linux
#performance
#troubleshoot
#Websphere
#WebSphereApplicationServer(WAS)
#WebSphereLiberty