How to debug Java application Crashes

By Ravali Yatham posted Wed July 13, 2022 08:48 AM

  

co-authored by: Pasam Soujanya

Introduction

Application crashes, especially those that generate core dumps, are among the most difficult problems to diagnose and fix because of the complexity involved in identifying the crash context. Troubleshooting them requires special tools, artefacts and methodologies. This blog describes the fundamental forms of crashes, lists some of the most common application faults, and walks through typical problem determination steps. It also covers best practices that help your application code and configuration avoid crashes and, when a crash is unavoidable, make it easier to troubleshoot.

Definition

When running Java applications, users might encounter application crashes. A crash (also known as a fatal error) causes the application to terminate abruptly. An application crashes when it performs an operation that is not permitted either by the operating system or by the language semantics of the executing platform (in our case, the Java language and the Java Virtual Machine). In such cases, the JVM collects the artefacts needed for postmortem debugging and then terminates the application in a controlled manner, so that the developer can diagnose and fix the issue.

Types of Crashes

1) Native crashes (signal based): When the application performs an operation that is illegal from the processor's point of view, the process receives a signal. The JVM prints useful information to the console, collects a core dump, a javacore and a heap dump, and the process then terminates. Example: a crash due to SIGSEGV (segmentation violation).

2) Application level crashes (exception based): Unchecked exceptions that are handled by neither the application nor the JVM. The application terminates abnormally after printing the call sequence (stack trace) of the unhandled exception. Example: NullPointerException.

3) Resource usage based crashes (stack or heap overflow): The JVM reserves space for each application thread's stack and for objects (the heap). Excessive use of a thread stack or of the heap can exhaust that space. This results in a crash, after useful error messages are printed and a core dump and heap dump are collected. Stack exhaustion usually manifests as a StackOverflowError, and heap exhaustion as an OutOfMemoryError. For the object heap, the garbage collector first attempts to reclaim unused objects, but at times even that cannot satisfy the application's memory demand.

In this blog, we will focus on each of the three types of crashes in detail.

Reasons for crash

There are various possible reasons for a crash to occur in your Java application. In general, these are caused by one or more bugs in:

  • JVM. Example: attempting to write to an unallocated memory region.
  • Java Standard Class library. Example: incorrect handling of a datatype results in an exception.
  • Third party Java modules. Example: an API incorrectly throwing an IOException.
  • Application (Java code or native code). Example: an incorrect computation results in division by 0.
  • Operating system (OS). Example: a kernel panic due to memory corruption.


Why can't a crash be mitigated?

An obvious question that comes to many users' minds is this: the JVM is a managed runtime, and when the application crashes the JVM is able to intercept the problem, collect artefacts and print useful messages, so why can't it absorb the crash and move on? The answer is that continuing would be unsafe. Take a simple example of a banking application that crashes due to an illegal memory access. Assume that the top-level action was to update the balance sheet of a specific customer account, and that the bad access occurred during that update. Because the CPU does not permit the application to write to that location, suppose the application or the JVM decides to absorb the crash, abandon the failed action (the memory write) and move on with the rest of the code sequence. The consequences would be much worse than the crash: because the balance sheet was not updated, the remaining calculations would be bogus, sections of the bank database would be corrupted, and the results of the whole program would be meaningless.


In short, such abnormal conditions must cause the process to stop immediately, without executing any further application code. The JVM is still free to run in order to collect the necessary diagnostic artefacts. These help the developer or the support engineer diagnose the root cause and fix it at the source.

Tools that aid crash debugging

Unlike other production anomalies, crash debugging usually requires one or more tools, depending on the nature of the crash, the crashing context and the execution environment (platform, architecture, etc.). The most commonly used tools for crash debugging are listed below.

Crash debuggers fall into two categories: i) native debuggers, and ii) runtime-aware debuggers.

1) Native debugger: Native debuggers work directly with core dumps in the platform's native format. On the positive side, the tool understands the core file format natively. On the negative side, these debuggers do not understand runtime artefacts such as the JVM's internal state and data structures, interpreted and compiled code and symbols, artificial call stack frames, and so on.

Platform    Native debugger
Linux       gdb
AIX         dbx
macOS       lldb
Windows     WinDbg

In this blog, the illustrations are carried out on the Linux platform, so let's look at the GNU debugger (gdb).

gdb allows you to examine and control the execution of code, and is useful for evaluating the causes of crashes or general incorrect behavior. It is useful for debugging native libraries and the JVM itself. Reference: GDB Debugging techniques

2) Runtime-aware debugger: These are specialisations or extensions built upon the native debuggers. As a result, they are not only capable of loading native core dumps, but also understand runtime-specific artefacts. For Java debugging, a good runtime-aware debugger is the Dump Viewer (jdmpview). It allows you to examine the contents of system dumps produced by the OpenJ9 VM. You can run the dump viewer on one platform to work with dumps from another platform. One of its most attractive features is that it can synthesise and reconstruct the call stack as a combination of native (C/C++), interpreted (synthetic) and JIT compiled (Java) method calls. More details about the usage of the tool and its options can be found here: jdmpview


Crash artefacts

If your Java application crashes, a number of diagnostic data files are useful for diagnosing the problem, namely:
  1. Javacore: A formatted and pre-analyzed text file that is created by the JVM during an event or by manual intervention. It contains vital information about the running JVM process, such as the JVM command line, environment information, and snapshot information about all running threads, their stack traces and the monitors (locks) held by the threads.
  2. System (core) dump: A snapshot of the entire address space of the process. This dump plays a vital role in crash debugging, because it contains the entire state of the process at the time of the crash.
  3. Snaptrace: This contains tracepoint data held in trace buffers. If the JVM terminated because of an assertion failure (assertions check the correctness of program code and data at vital control flow points), the code location where the assertion fired can be found in the snap trace.
  4. Jitdump: The Just-In-Time (JIT) compiler produces a binary dump of diagnostic data when a general protection fault (GPF) or abort event occurs, which helps post-mortem analysis, specifically for JIT crashes.
  5. Standard error (stderr) logs: These contain high-level information, such as the error or exception message, which helps identify the type of crash.
However, not all of these dumps and traces are generated by default when a crash occurs. For the JVM to generate them, certain diagnostic settings need to be configured before running your application. The document Mustgather for Runtimes Java Technology provides Java must-gather information by platform and problem scenario. Certain types of crashes are very rare and difficult to reproduce, so if we miss one chance, we might need to wait weeks or months for the next crash. For this reason, it is highly recommended to proactively configure your application to produce these artefacts upon a crash event.


Crash problem determination: case studies

1) Native crashes (signal based): 

Let's consider a Java program that makes a JNI call to a native (C) method containing buggy code that terminates the program with a segmentation error. Complete details on how to run the JNI program are given here: https://github.com/yathamravali/JNIDemo

Below is the native code, which tries to copy a string into a buffer that is too small to hold it. More specifically, the character pointer `name` points to dynamically allocated heap memory of a fixed size, 8 bytes. In line 8, the strcpy library call tries to copy a 13-character string literal (14 bytes including the terminating NUL) into that 8-byte memory area. This is an out-of-bounds write, so at run time the program terminated. As the code belongs to JNI and is outside the control of the Java Virtual Machine, no bounds checks or other runtime error checking facilities are available from the VM.

Sample code:
#include "Crash.h"
#include "jni.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
JNIEXPORT void JNICALL Java_Crash_printHello(JNIEnv *env, jobject obj){
        char *name=(char *)malloc(8);   /* only 8 bytes are allocated */
        strcpy(name,"Ravali Yatham");   /* deliberate bug: writes 14 bytes (13 characters + NUL) into the 8-byte buffer */
        return;
}


Compile the above code as shown below to build a shared library:
> gcc -fPIC -g -I/home/ravali/Java8SR6/include -o libCrash.so -shared Crash.c
Here the -g option is used to include debug information in the generated shared library.

Note:
In general, the JVM libraries that are shipped with the JRE build don't contain debug information. The debug files need to be supplied separately when loading the dump into a native debugger in order to get line numbers.
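For orientation, the Java side of this call could look roughly like the sketch below. It is inferred from the JNI symbol name Java_Crash_printHello (an instance method printHello on a class Crash in the default package) and from the library name libCrash.so. The actual class in the linked repository may differ, so treat the details as assumptions.

```java
// Hypothetical Java caller, reconstructed from the native symbol name
// Java_Crash_printHello; the real source in the JNIDemo repository may differ.
class Crash {
    // Implemented in libCrash.so. The (JNIEnv*, jobject) parameters of the C
    // function imply an instance method with no arguments and no return value.
    native void printHello();

    public static void main(String[] args) {
        // On Linux, "Crash" maps to libCrash.so, looked up via java.library.path
        System.loadLibrary("Crash");
        // The faulty strcpy runs inside this native call
        new Crash().printHello();
    }
}
```

Loading the library in main rather than in a static initializer is only a choice made for this sketch; either placement works as long as the library is loaded before the native method is called.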

Now run the Java program. The IBM JVM has dump agents enabled by default for the gpf event, and these generate all of the artefacts provided that the required OS settings are in place.
> java -Djava.library.path=. Crash
Unhandled exception
Type=Segmentation error vmState=0x00040000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F477C3AC7D0 Handler2=00007F4777B9F670 InaccessibleAddress=FFFFFFFFFFFFFFA0
RDI=0000000000000000 RSI=0000000000000000 RAX=FFFFFFFFFFFFFFA0 RBX=0000000000000010
RCX=00007F4778000020 RDX=5920696C61766152 R8=00007F47780008D0 R9=00007F477DEB0C40
R10=0000000000000000 R11=0000000000000000 R12=0000000000000000 R13=00007F477C47CCCC
R14=00007F477D30B700 R15=0000000000000000
RIP=00007F47606CC646 GS=0000 FS=0000 RSP=00007F477D30B400
Module=./libCrash.so
Module_base_address=00007F47606CC000 Symbol=Java_Crash_printHello
Symbol_address=00007F47606CC61A
Target=2_90_20191106_432135 (Linux 4.15.0-188-generic)
CPU=amd64 (4 logical CPUs) (0x1f27fd000 RAM)
----------- Stack Backtrace -----------
Java_Crash_printHello+0x2c (0x00007F47606CC646 [libCrash.so+0x646])
(0x00007F477C44E314 [libj9vm29.so+0x141314])
(0x00007F477C44BA37 [libj9vm29.so+0x13ea37])
(0x00007F477C339384 [libj9vm29.so+0x2c384])
(0x00007F477C326100 [libj9vm29.so+0x19100])
(0x00007F477C3E7A12 [libj9vm29.so+0xdaa12])
---------------------------------------

The standard error message in the console output contains minimal information about the fault, such as the register contents and the module in which the crash happened. If you look closely at the stack backtrace above, the frames from the library libj9vm29.so have unresolved method names. This is because those libraries are not built with debug information. This is where a native debugger helps: it resolves the method names from the library base address and offset, and with debug symbols included it can even show the line numbers of the crashing method.
Now let's focus on debugging. The following steps need to be followed in order:

a) Load coredump to gdb debugger
(gdb) exec-file /root/Java8SR6/bin/java
(gdb) core core.20220707.041602.17619.0001.dmp 

b) Print backtrace

(gdb) where
#12 <signal handler called>
#13 0x00007f47606cc646 in Java_Crash_printHello (env=0x17d4700, obj=0x18a1ee0) at Crash.c:8
#14 0x00007f477c44e314 in ffi_call_unix64 () at x86/unix64.S:76
#15 0x00007f477c44ba37 in ffi_call (cif=<optimized out>, fn=<optimized out>, rvalue=<optimized out>, avalue=<optimized out>) at x86/ffi64.c:525
#16 0x00007f477c339384 in VM_BytecodeInterpreter::cJNICallout (isStatic=<optimized out>, function=<optimized out>, returnStorage=<optimized out>, returnType=<optimized out>, javaArgs=<optimized out>, 
    receiverAddress=0x18a1ee0, _pc=<optimized out>, _sp=<optimized out>, this=<optimized out>) at BytecodeInterpreter.hpp:2417
#17 VM_BytecodeInterpreter::callCFunction (returnType=<optimized out>, isStatic=<optimized out>, bp=<optimized out>, javaArgs=<optimized out>, receiverAddress=<optimized out>, 
    jniMethodStartAddress=<optimized out>, _pc=<optimized out>, _sp=<optimized out>, this=<optimized out>) at BytecodeInterpreter.hpp:2257
#18 VM_BytecodeInterpreter::runJNINative (_pc=<optimized out>, _sp=<optimized out>, this=<optimized out>) at BytecodeInterpreter.hpp:2149
#19 VM_BytecodeInterpreter::run (this=0x0, this@entry=0x7f477d30b8c0, vmThread=0xffffffffffffffa0) at BytecodeInterpreter.hpp:9548
#20 0x00007f477c326100 in bytecodeLoop (currentThread=<optimized out>) at BytecodeInterpreter.cpp:109
#21 0x00007f477c3e7a12 in c_cInterpreter () at xcinterp.s:160
#22 0x00007f477c398f28 in runCallInMethod (env=0x7f477d30b9d0, receiver=<optimized out>, clazz=0x18a1f50, methodID=0x7f477843c278, args=0x7f477d30bd88) at callin.cpp:1083
#23 0x00007f477c3afcb9 in gpProtectedRunCallInMethod (entryArg=0x7f477d30bd40) at jnicsup.cpp:258
#24 0x00007f4777ba03d3 in omrsig_protect (portLibrary=0x7f477cb083a0 <j9portLibrary>, fn=0x7f477c3f0bf0 <signalProtectAndRunGlue>, fn_arg=0x7f477d30bce0, 
    handler=0x7f477c3ac7d0 <structuredSignalHandler>, handler_arg=0x17d4700, flags=506, result=0x7f477d30bcd8) at ../../omr/port/unix/omrsignal.c:425
#25 0x00007f477c3f0c8c in gpProtectAndRun (function=0x7f477c3afc80 <gpProtectedRunCallInMethod(void*)>, env=0x17d4700, args=0x7f477d30bd40) at jniprotect.c:78
#26 0x00007f477c3b15ff in gpCheckCallin (env=0x17d4700, receiver=receiver@entry=0x0, cls=0x18a1f50, methodID=0x7f477843c278, args=args@entry=0x7f477d30bd88) at jnicsup.cpp:441
#27 0x00007f477c3af68a in callStaticVoidMethod (env=<optimized out>, cls=<optimized out>, methodID=<optimized out>) at jnicgen.c:288
#28 0x00007f477e0ca2cb in JavaMain () from /root/Java8SR6/bin/../lib/amd64/jli/libjli.so
#29 0x00007f477e2e36db in start_thread (arg=0x7f477d30c700) at pthread_create.c:463
#30 0x00007f477dbe671f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The backtrace helps identify the calling sequence that led to the crash. Each line in the backtrace represents one stack frame, that is, the data associated with one function call. The frame contains the arguments given to the function, the function's local variables, and the address at which the function is executing.
In the backtrace above, frame 13 is where the crash happened.

c) Dump the crashing frame
(gdb) f 13
#13 0x00007f47606cc646 in Java_Crash_printHello (env=0x17d4700, obj=0x18a1ee0) at Crash.c:8
8                strcpy(name,"Ravali Yatham");
d) Dump the contents of name
(gdb) print name
$1 = 0xa0007f47780130e0 <error: Cannot access memory at address 0xa0007f47780130e0>
You can see that the memory at that address is inaccessible, which is why the program terminated.


2) Application level crashes (exception based):
Let's take the Java program below, which tries to combine two strings. Accessing an array element that is out of range throws an ArrayIndexOutOfBoundsException. The code does not handle it, so it surfaces at run time as an unhandled exception.

Sample Code:
public class AIOBE {
        public void addStrings(String args[]) {
                String result = args[0]+args[2];   // args[2] does not exist when only one argument is passed
                System.out.println("Combination of Strings " +result);
        }

        public void display(String args[]){
                System.out.println("Adding strings");
                addStrings(args);
        }
        public static void main(String args[]) {
                AIOBE test = new AIOBE();
                test.display(args);
        }
}

Compile as: javac AIOBE.java
Test as: java AIOBE "Java"
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 2
        at AIOBE.addStrings(AIOBE.java:3)
        at AIOBE.display(AIOBE.java:9)
        at AIOBE.main(AIOBE.java:13)

Let's read the stack trace frame by frame, starting from the bottom:
at AIOBE.main(AIOBE.java:13)
At line 13, main calls test.display(args), passing the command line arguments to the display method, which fails at line 9.

at AIOBE.display(AIOBE.java:9)
At line 9, display calls addStrings(args), passing the same command line arguments on to the addStrings method, which fails at line 3.

at AIOBE.addStrings(AIOBE.java:3)
At line 3, addStrings concatenates two of the strings passed down from display, which are the command line arguments. The indexes are hardcoded as
args[0] -- first command line argument
args[2] -- third command line argument --> The error lies here, as we passed only one string ("Java") to the program


In short, the code accesses an element that is out of range: we passed only one element (at index 0), but tried to fetch the third element (at index 2). Hence the JVM threw an ArrayIndexOutOfBoundsException at run time, and because nothing caught it, the program terminated.
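One way to avoid this kind of crash is to validate the argument count before indexing into the array. The following is a minimal sketch of that idea, not the original program: the class name SafeConcat, the usage message and the choice of IllegalArgumentException are all illustrative assumptions.

```java
// Sketch: validate the argument count before indexing, so that bad input
// produces a usage message instead of an unhandled exception.
// SafeConcat and its messages are illustrative, not from the original program.
class SafeConcat {
    static String addStrings(String[] args) {
        // args[2] is only valid when at least three arguments were supplied
        if (args == null || args.length < 3) {
            int count = (args == null) ? 0 : args.length;
            throw new IllegalArgumentException("expected at least 3 arguments, got " + count);
        }
        return args[0] + args[2];
    }

    public static void main(String[] args) {
        try {
            System.out.println("Combination of Strings " + addStrings(args));
        } catch (IllegalArgumentException e) {
            // Handle the problem at the boundary where it can be explained to the user
            System.err.println("Usage: java SafeConcat <first> <second> <third> (" + e.getMessage() + ")");
            System.exit(1);
        }
    }
}
```

Running it with a single argument, as in the failing test above, would print the usage message and exit with a non-zero status rather than terminating with a stack trace.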


3) Resource usage based crashes (stack or heap overflow)
Let's consider the Java program below, which calculates the factorial of a number. In this example, the recursive method factorial() calls itself over and over again, because no terminating condition is provided for the recursive calls, until the Java thread stack reaches its maximum size. At that point the program exits with a java.lang.StackOverflowError.

Sample code:
class Factorial {
  static int factorial(int n) {
    return (n * factorial(n-1));   // no base condition, so the recursion never ends
  }
  public static void main(String args[]) {
    int number = 4;
    System.out.println("Factorial of " + number + " is: " + factorial(number));
  }
}

Compile as: javac Factorial.java
Test as: java Factorial
Exception in thread "main" java.lang.StackOverflowError
        at Factorial.factorial(Factorial.java:3)
        at Factorial.factorial(Factorial.java:3)
        at Factorial.factorial(Factorial.java:3)
        at Factorial.factorial(Factorial.java:3)

Here we have only one method. In a real scenario, examine the stack trace for a repeating pattern of line numbers. Once the line of code is identified, check whether the recursion has a base (terminating) condition. If not, the code should be fixed. Take a close look at line 3 of the factorial method: there is no base condition, so the method never returns to its caller.

Adding the base condition below to the factorial method fixes the problem:
if (n == 0 || n == 1)
            return 1;
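Put together, a corrected version of the program might look like the sketch below. The class name FactorialFixed and the check for negative input are additions made for illustration; only the base condition comes from the text above.

```java
// Corrected variant of the Factorial example with a terminating condition.
// The class name FactorialFixed and the negative-input check are illustrative additions.
class FactorialFixed {
    static int factorial(int n) {
        if (n < 0) {
            // Without this guard a negative input would again recurse without end
            throw new IllegalArgumentException("n must be non-negative: " + n);
        }
        if (n == 0 || n == 1) {
            return 1; // base condition: the recursion stops here
        }
        return n * factorial(n - 1);
    }

    public static void main(String[] args) {
        int number = 4;
        // prints "Factorial of 4 is: 24"
        System.out.println("Factorial of " + number + " is: " + factorial(number));
    }
}
```

Note that an int result overflows for inputs above 12, so a real implementation would probably use long or BigInteger.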

What if the code has been updated to implement correct recursion but the program still throws a java.lang.StackOverflowError? In that case the thread stack size can be increased to allow a larger number of nested invocations. The stack size is set with the -Xss option on the JVM command line when starting the application.
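As an alternative to raising -Xss for every thread, the standard Thread constructor that takes a stackSize argument can request a larger stack for just the one thread that needs deep recursion. The sketch below illustrates this under some assumptions: the names DeepRecursion, sumTo and sumOnBigStack and the 64 MB figure are made up for the example, and the Java API documentation describes the stack size as a suggestion that a JVM may ignore on some platforms.

```java
// Sketch: run a deeply recursive (but correct) computation on a thread that
// requests a larger stack. All names and the 64 MB figure are illustrative;
// the JVM is free to treat the requested stack size as a hint.
class DeepRecursion {
    // Correct recursion whose depth grows linearly with n
    static long sumTo(int n) {
        return n == 0 ? 0 : n + sumTo(n - 1);
    }

    static long sumOnBigStack(int n, long stackBytes) {
        long[] result = new long[1];
        Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(null, () -> {
            try {
                result[0] = sumTo(n);
            } catch (Throwable t) {
                failure[0] = t; // includes StackOverflowError if the stack is still too small
            }
        }, "deep-recursion", stackBytes);
        worker.start();
        try {
            worker.join(); // join makes the worker's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("recursion failed", failure[0]);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(sumOnBigStack(50_000, 64L * 1024 * 1024));
    }
}
```

This keeps the larger stack local to one thread, which can matter in applications with many threads, since -Xss applies to all of them.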


Crash related Best Practices

  • Make sure you are using the latest version of every product because there are often many code changes and bug fixes available.
  • Make sure that the required settings for log collection are in place so that when an abnormal situation occurs all the logs are collected for diagnosis / root cause analysis.
  • Make the best use of exception handling. Wherever you invoke APIs that are designed to throw, identify the right location to catch, absorb or mitigate the exception.
  • Some common exceptions (such as IOException and SocketException) are unavoidable in a large application. Don't let them propagate further up the call stack and become unhandled exceptions. A typical caller of APIs that throw IOException may retry a few times before abandoning the call.
  • Don't ignore crashes by setting up scripts to clean up the dumps and re-spin your application. When we do this, we are potentially ignoring application / configuration / resource issues and making the application highly inefficient.
  • Use JNI with caution. As most of the versatile JVM features around error detection will be missing in the JNI environment, faults occurring there will cause fatal errors to the application and the JVM.
  • Calibrate your application with varying loads and identify the peak memory usage. Set up the heap limits accordingly, so as to avoid crashes due to Java heap exhaustion.
  • Users might encounter crashes due to shared class cache corruption. If so, clean the shared class cache and restart the application.
  • If you encounter problems with the verifier turned off (-Xverify:none), remove this option and try to reproduce the problem.
  • Make use of javacore information effectively. It is rich with data that represents the internal state of the virtual machine, and can help solve a great share of application anomalies that lead to crash.
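The retry advice in the list above (retry a call that throws IOException a few times before abandoning it) can be sketched as follows. This is only one possible shape: the IoRetry class, the IoAction interface, the linear backoff and the decision to give up with an UncheckedIOException are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch of "retry a few times before abandoning the call".
// IoRetry, IoAction and the backoff policy are illustrative choices.
class IoRetry {
    @FunctionalInterface
    interface IoAction<T> {
        T run() throws IOException;
    }

    static <T> T withRetry(IoAction<T> action, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e; // remember the failure and try again if attempts remain
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis * attempt); // simple linear backoff
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        break; // stop retrying if the thread was interrupted
                    }
                }
            }
        }
        // Abandon the call in a controlled way, keeping the last failure as the cause
        throw new UncheckedIOException("I/O action failed, giving up", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String value = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("transient failure " + calls[0]);
            }
            return "ok";
        }, 5, 10);
        System.out.println(value + " after " + calls[0] + " calls"); // prints "ok after 3 calls"
    }
}
```

The caller still has to decide what to do when the retries are exhausted. The point is that the decision is made deliberately at a known place, instead of the exception reaching the top of the stack unhandled.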
