Lessons from the field #18: Automating OpenShift live container debugging

By Kevin Grigorenko posted Tue June 21, 2022 07:00 AM


In a previous post, we discussed how to perform live container debugging on OpenShift using worker node debug pods. That approach is useful when you want to gather diagnostics from a running container without restarting it, but the container’s image lacks the necessary diagnostic tools. In this post, we’ll discuss containerdiag, a new utility we’ve developed that simplifies and automates this process.

The major caveat is that this tool requires you to be logged in as a user with the cluster-admin superuser privilege. We are working on alternative tools that accomplish the same goal without requiring superuser privileges.

All of the details of this containerdiag tool are documented at a new MustGather page. This post will show simple examples of how to use the tool.

MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Linux on Containers

Running containerdiag with WebSphere Liberty

If you’ve deployed your application with the Open Liberty Operator or the WebSphere Liberty Operator, the operators provide some basic serviceability actions such as server dumps; however, as with OpenShift in general, you are limited to the diagnostic capabilities and tools built into the image.

If you’ve deployed your application using a custom image without a Liberty operator, you’re similarly limited.

With that in mind, one of the most common requests from IBM Support for performance, hang, or high CPU issues on Linux is the linperf.sh script that gathers operating system information and thread dumps; however, Liberty images often don’t have all of the tools that linperf.sh depends on.

This limitation is resolved with containerdiag. The simplified steps are as follows:

  1. Download containerdiag.sh or containerdiag.bat from the MustGather page.
  2. Log in with the oc command as a user that has cluster-admin superuser privileges.
  3. Identify the Liberty deployment name (or a specific pod name with -p) and execute the script:
    containerdiag.sh -d $DEPLOYMENT libertyperf.sh
  4. When it completes, you’ll see a “Files are ready for download” prompt. In a new window, use the specified command to download the diagnostics; for example:
    oc cp worker1-debug:/tmp/containerdiag.abc.tar.gz containerdiag.abc.tar.gz --namespace=ffzhc74l4c
  5. After that completes, go back to your first terminal and type “ok” and press enter to continue and destroy the debug pod.
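
Before running step 3, it can help to confirm that the current oc session actually has the required privileges. The following is a minimal sketch, not part of containerdiag itself: the function names has_cluster_admin and run_libertyperf are hypothetical, and the check assumes that being allowed every verb on every resource in all namespaces is a reasonable stand-in for cluster-admin.

```shell
#!/bin/sh
# Sketch: pre-flight check before starting containerdiag.
# has_cluster_admin and run_libertyperf are hypothetical helper names.

# Succeeds only if there is an active oc login and the user may perform
# any verb on any resource in all namespaces.
has_cluster_admin() {
  # "oc whoami" fails when there is no active login session.
  oc whoami >/dev/null 2>&1 || return 1
  # "oc auth can-i" prints "yes" or "no".
  [ "$(oc auth can-i '*' '*' --all-namespaces 2>/dev/null)" = "yes" ]
}

# Runs the command from step 3 only when the pre-flight check passes.
# Assumes containerdiag.sh is in the current directory.
run_libertyperf() {
  deployment="$1"
  if ! has_cluster_admin; then
    echo "Not logged in with cluster-admin privileges" >&2
    return 1
  fi
  ./containerdiag.sh -d "$deployment" libertyperf.sh
}

# Example (commented out so that sourcing this file has no side effects):
# run_libertyperf "$DEPLOYMENT"
```

The interactive “Files are ready for download” prompt in steps 4 and 5 is unchanged by this wrapper; it only guards against starting the run with insufficient privileges.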

The resulting containerdiag*.tar.gz file includes everything that linperf.sh gathers (linperf_RESULTS.tar.gz, javacores, and Liberty logs) as well as a Liberty server dump, cgroup information about each pod, and general information about the node. Here is a truncated view of some of the data available:

├── linperf_RESULTS.tar.gz
├── node
│   ├── info
│   │   ├── journalctl_errwarn.txt
│   │   [...]
├── pods
│   └── liberty1-585d8dfd6-jq8cn
│       ├── cgroup
│       │   ├── cpu
│       │   │   ├── cpu.stat
│       │   │   [...]
│       ├── defaultServer
│       │   ├── server.env
│       │   └── server.xml
│       ├── defaultServer.dump-22.06.20_13.37.26.zip
│       ├── javacore.20220620.133617.1.0070.txt
│       ├── javacore.20220620.133647.1.0071.txt
│       ├── javacore.20220620.133717.1.0072.txt
│       ├── logs
│       │   └── messages.log
│       └── stdouterr.log
└── run_stdouterr.log
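
Once the archive is downloaded, a quick way to check what was collected is to extract it and list the javacores. This is a minimal sketch using only standard tar, find, and sort; the function names extract_diag and list_javacores and the directory name diag are hypothetical, and the file pattern is taken from the javacore names shown in the listing above.

```shell
#!/bin/sh
# Sketch: unpack a downloaded containerdiag archive and list its javacores.
# extract_diag and list_javacores are hypothetical helper names.

# Extract the archive given as $1 into the directory given as $2.
extract_diag() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Print every javacore file under the directory $1, sorted by path.
list_javacores() {
  find "$1" -type f -name 'javacore.*.txt' | sort
}

# Example (commented out so that sourcing this file has no side effects):
# extract_diag containerdiag.abc.tar.gz diag
# list_javacores diag
```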

Running containerdiag with WebSphere Application Server traditional Base

If you’re running a container with WebSphere Application Server traditional Base, all the steps above are the same except that you should replace libertyperf.sh with twasperf.sh:

containerdiag.sh -d $DEPLOYMENT twasperf.sh
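
If you script these runs for both runtimes, the only difference is the script name. A minimal sketch of that choice follows; the function perf_script_for and the labels liberty and twas are hypothetical names used for illustration, while the two script names come from the commands above.

```shell
#!/bin/sh
# Sketch: map a runtime label to the matching containerdiag script.
# perf_script_for and the labels "liberty" and "twas" are hypothetical.
perf_script_for() {
  case "$1" in
    liberty) echo "libertyperf.sh" ;;
    twas)    echo "twasperf.sh" ;;
    *)       echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# Example (commented out so that sourcing this file has no side effects):
# containerdiag.sh -d "$DEPLOYMENT" "$(perf_script_for twas)"
```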

Running containerdiag to gather network trace

tcpdump is built in, along with a script that gathers a network trace for a specified number of seconds:

containerdiag.sh -d $DEPLOYMENT -q tcpdump.sh -0 $SECONDS
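
One practical note when scripting this command in bash: SECONDS is a special variable that bash sets to the number of seconds since the shell started, so it is safer to hold the duration in a differently named variable and to validate it first. A minimal sketch, where DURATION and is_positive_int are hypothetical names that are not part of containerdiag:

```shell
#!/bin/sh
# Sketch: validate a capture duration before passing it to containerdiag.
# is_positive_int and DURATION are hypothetical names.

# Succeeds only if $1 consists solely of digits and is greater than zero.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) [ "$1" -gt 0 ] ;;
  esac
}

# Example (commented out so that sourcing this file has no side effects):
# DURATION=30
# is_positive_int "$DURATION" &&
#   containerdiag.sh -d "$DEPLOYMENT" -q tcpdump.sh -0 "$DURATION"
```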

Running containerdiag to gather Linux perf stack sampling

The Linux perf native sampling profiler is built in, along with a script that runs it for a specified number of seconds:

containerdiag.sh -d $DEPLOYMENT -q perf.sh -0 $SECONDS

The script also automatically creates a flame graph from the profile data.

Custom Scripts

You may also run any custom scripts that you’d like with the available tools, or build your own image on top of the containerdiag image.

Future Work

We understand that the cluster-admin requirement is too onerous for some customers, and we’re developing alternative methods that don’t require it. However, some diagnostics, such as network trace and Linux perf, simply require root, and some customers can grant cluster-admin privileges for diagnostic purposes. We therefore thought it useful to release containerdiag now, so that customers have an option for gathering diagnostics in production without restarting pods while we work on other solutions.

We welcome any feedback and issue reports at the GitHub repository for containerdiag.