WebSphere Application Server & Liberty


Lessons from the field #12: Native CPU analysis on Linux with Java in OpenShift

By Kevin Grigorenko posted Wed December 29, 2021 08:00 AM


In our previous blog post, we discussed the value of running the Linux perf native CPU sampling profiler to investigate Java CPU usage in production. However, perf generally requires root access, and OpenShift application containers normally don't run with such privileged access.

This post describes how to run perf in OpenShift. We'll create a diagnostic container with perf installed, run that container with root access on the worker node, find the target container's process ID using runc, and finally run perf as usual.

Preparing Java

First, it's best to restart the JVM with certain parameters to improve perf call stacks. With containers, this means adding Java arguments to your Dockerfile and rebuilding your deployment in OpenShift:

  • IBM Java/Semeru/OpenJ9 offer the command line argument -Xjit:perfTool that writes /tmp/perf-$PID.map which is used by perf to resolve JIT-compiled Java method names. If not all symbols are resolved, try adding -Xlp:codecache:pagesize=4k. Only the last -Xjit option is processed, so if there is additional JIT tuning, combine the perfTool option with that tuning; for example, -Xjit:perfTool,exclude={com/example/generated/*}.
  • On HotSpot Java, use -XX:+PreserveFramePointer and something like libperf-jvmti.so or perf-map-agent for JIT-compiled methods.

Here's an example Dockerfile of a Java program that burns one CPU running on Semeru Runtime Open Edition with -Xjit:perfTool:

FROM ibm-semeru-runtimes:open-17-jdk
RUN printf 'public class BurnCPU { public static void main(String... args) { System.out.println("Burning 1 CPU..."); while (true) {} } }' > BurnCPU.java && javac BurnCPU.java
CMD ["java", "-Xjit:perfTool", "BurnCPU"]

This is published on DockerHub so you may create and run a deployment directly for testing:

$ oc create deployment burncpu --image=docker.io/kgibm/burncpu

Creating a perf container

We'll need a container image with perf and some other useful utilities that we'll run on the target worker node. Here's an example Dockerfile based on Fedora (note that this distribution doesn't need to match the distribution of your target container):

FROM fedora
RUN dnf install -y perf runc procps-ng binutils less lsof psmisc sysstat vim zip util-linux && \
    dnf clean all

Then build and push this image to your OpenShift registry. Alternatively, you may use a public image from DockerHub that was built from the above: docker.io/kgibm/perfcontainer

Running the perf container

Next, use kubectl, oc, or the OpenShift web console under Workloads > Pods to find the worker node where your Java pod is running.
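For example, assuming the burncpu deployment created earlier (oc create deployment applies an app=burncpu label), listing the pod with -o wide includes a NODE column showing the worker node:

```shell
# Show the pod along with the worker node it is scheduled on (NODE column)
oc get pods -l app=burncpu -o wide
```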

Once you've found the worker node, start a debug container on that worker node and point to the image created in the previous section. For example:

$ oc debug node/<node-name> -t --image=docker.io/kgibm/perfcontainer:latest
Creating debug namespace/openshift-debug-node-5bdmv ...
Starting pod/<pod-name> ...
To use host binaries, run `chroot /host`
Pod IP: <pod-ip>
If you don't see a command prompt, try pressing enter.

Note that OpenShift tends to clean up debug pods very aggressively (an idle timeout of about one minute) if no active command is running, so you can keep a command like top running and press Ctrl+C when you're ready to run more commands.

Run perf top to make sure it's working and press Ctrl+C once you've confirmed.

If it's not working, depending on the error message, you may need to enable perf on the node with a command such as:

sysctl -w kernel.perf_event_paranoid=-1

Finding the target process

Next, we'll want to find the target process ID. We can list all Java processes, use runc to dump the container names and then find the right process ID. For example:

sh-5.1# for containerid in $(for pid in $(pgrep -f java); do runc --root /host/run/runc list | grep $pid; done | awk '{print $1}'); do runc --root /host/run/runc state $containerid | grep -e '"id"' -e '"pid"' -e '"rootfs"' -e '"io.kubernetes.pod.name"'; done
  "id": "47f4957dda2502616f717ddae284a467847c4df679fcc308e4d78f3f9624f473",
  "pid": 38956,
  "rootfs": "/var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/merged",
    "io.kubernetes.pod.name": "burncpu-8dbb7b7d5-rm8l8",

In the above example, there's a single Java process on the worker node: its container name is burncpu-8dbb7b7d5-rm8l8, its worker node PID is 38956, and its ephemeral filesystem is at /var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/merged. Note that from the point of view of the debug container, host paths are mounted under /host/, so the ephemeral filesystem is actually at /host/var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/merged.
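If you already know the pod name, a variant of the loop above can filter directly on it. This is a sketch; the pod name is the one from this post's example, and the grep pattern assumes the JSON formatting shown in the runc state output above:

```shell
POD=burncpu-8dbb7b7d5-rm8l8
# Iterate over all container IDs known to runc and print the worker node PID
# and rootfs of the container whose Kubernetes pod name matches $POD
for c in $(runc --root /host/run/runc list -q); do
  if runc --root /host/run/runc state "$c" | grep -q "\"io.kubernetes.pod.name\": \"$POD\""; then
    runc --root /host/run/runc state "$c" | grep -e '"pid"' -e '"rootfs"'
  fi
done
```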

For IBM Java/Semeru/OpenJ9 with -Xjit:perfTool, we'll want to find the /tmp/perf-$PID.map file. This is generated in the /tmp folder of the running container rather than on the worker node. To find it, we take the ephemeral filesystem path above, go up one directory, and then descend into the diff directory:

sh-5.1# ls -l /host/var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/diff
total 0
drwxr-xr-x. 2 root root 18 Dec 28 16:04 etc
drwxrwxrwt. 3 root root 53 Dec 28 16:04 tmp

There's our container's tmp directory and we can list its contents to show the perf map file:

sh-5.1# ls -l /host/var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/diff/tmp/
total 32
-rw-r-----. 1 root root 30591 Dec 28 16:04 perf-1.map

However, the perf map file is named using the PID as seen from inside the container (PID 1). For our perf command to resolve symbols correctly, we need to take the worker node PID we found from runc above (in this example, 38956) and create a symbolic link to this file in our debug container's /tmp directory named with that PID:

sh-5.1# ln -s /host/var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/diff/tmp/perf-1.map /tmp/perf-38956.map
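This step can be generalized into a small helper. link_perf_map is a hypothetical name (not from the original post); its arguments are the container rootfs (the merged directory from runc state), the in-container PID, and the worker node PID:

```shell
# Sketch of a hypothetical helper that links a container's perf map file to
# where perf on the host expects it: /tmp/perf-<worker-node-pid>.map
link_perf_map() {
  local rootfs="$1" container_pid="$2" host_pid="$3"
  # The container's /tmp lives in the overlay "diff" directory next to "merged"
  local diff_dir
  diff_dir="$(dirname "$rootfs")/diff"
  ln -sf "$diff_dir/tmp/perf-$container_pid.map" "/tmp/perf-$host_pid.map"
}

# Example with the values from this post:
link_perf_map /host/var/lib/containers/storage/overlay/645b2b122388c89ea956197ec0795cb31381c76948417c0dc32cca06bf17aaac/merged 1 38956
```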

Finally, we can run perf like we normally would. For this usage, we'll generally want to focus on just our target PID. For example:

perf record --call-graph dwarf,65528 -F 99 -g -p 38956 -- sleep 15

Then, we can analyze the perf.data file as we normally would. The following basic report quickly shows the CPU usage of our process, and perf has successfully resolved the top stack frame to BurnCPU.main:

# perf report -n --show-cpu-utilization | head -20
# To display the perf.data header info, please use --header/--header-only options.
# Total Lost Samples: 0
# Samples: 1K of event 'cycles'
# Event count (approx.): 45922815255
# Children      Self       sys       usr       Samples  Command         Shared Object      Symbol                                   
# ........  ........  ........  ........  ............  ..............  .................  ..........................................
    98.54%    98.54%     0.00%    98.54%          1470  main            [JIT] tid 38956    [.] BurnCPU.main([Ljava/lang/String;)V_hot

Be careful about the debug container idle timeout mentioned before. Once OpenShift deletes the pod, your perf.data file will be gone. The simplest approach is to run perf archive, grab the /tmp/perf-${PID}.map file, copy these files to the worker node into something like /host/tmp, and then download the files from the worker node.
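As a sketch (the PID and file names follow this post's example), preserving the files might look like:

```shell
# Inside the debug pod: bundle the symbols referenced by perf.data
perf archive perf.data   # writes perf.data.tar.bz2

# Copy everything to the worker node's /tmp, which survives the debug pod
cp perf.data perf.data.tar.bz2 /tmp/perf-38956.map /host/tmp/
```

From your workstation, you could then retrieve them with another debug session, for example `oc debug node/<node-name> -- cat /host/tmp/perf.data > perf.data`.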

This exercise is complete and you may exit the debug pod, which deletes it.

Clean up

If you used the burncpu container for testing, don't forget to delete the deployment:

$ oc delete deployment burncpu


In summary, this post described how to run the Linux perf native CPU sampling profiler on Java workloads in OpenShift in production. We create a diagnostic container with perf installed and then run it as root on the worker node that is running the target Java process. We find the worker node PID of the target Java process using runc. Then, we create a symbolic link to the perf.map file in the diagnostic container's tmp directory. Finally, we run perf as we normally would.