1) Native crashes (signal based):
Consider a Java program that makes a JNI call to a native (C) method containing a bug that terminates the program with a segmentation error. Complete instructions for running the JNI program are available at https://github.com/yathamravali/JNIDemo
The native code below copies a string into a buffer that is too small to hold it. More specifically, the character pointer `name` points to a dynamically allocated 8-byte block on the heap. On line 8, the strcpy library call writes a 13-character string literal (14 bytes including the terminating null) into that 8-byte area. This overflows the buffer, and at run time the program terminates. Because the code is native JNI code and runs outside the control of the Java Virtual Machine, the VM's bounds checks and other runtime error checks do not apply to it.
Sample code:
#include "Crash.h"
#include "jni.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
JNIEXPORT void JNICALL Java_Crash_printHello(JNIEnv *env, jobject obj){
    char *name=(char *)malloc(8);
    strcpy(name,"Ravali Yatham");
    return;
}
Compile the above code as shown below to build a shared library:
> gcc -fPIC -g -I/home/ravali/Java8SR6/include -o libCrash.so -shared Crash.c
Here the -g option includes debug information in the generated shared library.
Note: In general, the JVM libraries shipped with the JRE build do not contain debug information. The debug files need to be loaded separately into the native debugger to get line numbers.
Now run the Java program. The IBM JVM has the dump agent enabled by default for the gpf event, which generates all the diagnostic artifacts, provided the required OS settings are in place.
> java -Djava.library.path=. Crash
Unhandled exception
Type=Segmentation error vmState=0x00040000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F477C3AC7D0 Handler2=00007F4777B9F670 InaccessibleAddress=FFFFFFFFFFFFFFA0
RDI=0000000000000000 RSI=0000000000000000 RAX=FFFFFFFFFFFFFFA0 RBX=0000000000000010
RCX=00007F4778000020 RDX=5920696C61766152 R8=00007F47780008D0 R9=00007F477DEB0C40
R10=0000000000000000 R11=0000000000000000 R12=0000000000000000 R13=00007F477C47CCCC
R14=00007F477D30B700 R15=0000000000000000
RIP=00007F47606CC646 GS=0000 FS=0000 RSP=00007F477D30B400
Module=./libCrash.so
Module_base_address=00007F47606CC000 Symbol=Java_Crash_printHello
Symbol_address=00007F47606CC61A
Target=2_90_20191106_432135 (Linux 4.15.0-188-generic)
CPU=amd64 (4 logical CPUs) (0x1f27fd000 RAM)
----------- Stack Backtrace -----------
Java_Crash_printHello+0x2c (0x00007F47606CC646 [libCrash.so+0x646])
(0x00007F477C44E314 [libj9vm29.so+0x141314])
(0x00007F477C44BA37 [libj9vm29.so+0x13ea37])
(0x00007F477C339384 [libj9vm29.so+0x2c384])
(0x00007F477C326100 [libj9vm29.so+0x19100])
(0x00007F477C3E7A12 [libj9vm29.so+0xdaa12])
---------------------------------------
The standard error message in the console output contains minimal information about the fault, such as the register contents and the module in which the crash happened. If you look closely at the stack backtrace above, the frames that belong to the library libj9vm29.so have unresolved method names. This is because that library is not built with debug information. This is where a native debugger helps: it resolves the method names from the library base address and offset. With debug symbols included, it can even report the line number in the crashing method.
Now let's focus on debugging. Follow these steps in order:
a) Load the core dump into the gdb debugger
(gdb) exec-file /root/Java8SR6/bin/java
(gdb) core core.20220707.041602.17619.0001.dmp
b) Print backtrace
(gdb) where
#12 <signal handler called>
#13 0x00007f47606cc646 in Java_Crash_printHello (env=0x17d4700, obj=0x18a1ee0) at Crash.c:8
#14 0x00007f477c44e314 in ffi_call_unix64 () at x86/unix64.S:76
#15 0x00007f477c44ba37 in ffi_call (cif=<optimized out>, fn=<optimized out>, rvalue=<optimized out>, avalue=<optimized out>) at x86/ffi64.c:525
#16 0x00007f477c339384 in VM_BytecodeInterpreter::cJNICallout (isStatic=<optimized out>, function=<optimized out>, returnStorage=<optimized out>, returnType=<optimized out>, javaArgs=<optimized out>,
receiverAddress=0x18a1ee0, _pc=<optimized out>, _sp=<optimized out>, this=<optimized out>) at BytecodeInterpreter.hpp:2417
#17 VM_BytecodeInterpreter::callCFunction (returnType=<optimized out>, isStatic=<optimized out>, bp=<optimized out>, javaArgs=<optimized out>, receiverAddress=<optimized out>,
jniMethodStartAddress=<optimized out>, _pc=<optimized out>, _sp=<optimized out>, this=<optimized out>) at BytecodeInterpreter.hpp:2257
#18 VM_BytecodeInterpreter::runJNINative (_pc=<optimized out>, _sp=<optimized out>, this=<optimized out>) at BytecodeInterpreter.hpp:2149
#19 VM_BytecodeInterpreter::run (this=0x0, this@entry=0x7f477d30b8c0, vmThread=0xffffffffffffffa0) at BytecodeInterpreter.hpp:9548
#20 0x00007f477c326100 in bytecodeLoop (currentThread=<optimized out>) at BytecodeInterpreter.cpp:109
#21 0x00007f477c3e7a12 in c_cInterpreter () at xcinterp.s:160
#22 0x00007f477c398f28 in runCallInMethod (env=0x7f477d30b9d0, receiver=<optimized out>, clazz=0x18a1f50, methodID=0x7f477843c278, args=0x7f477d30bd88) at callin.cpp:1083
#23 0x00007f477c3afcb9 in gpProtectedRunCallInMethod (entryArg=0x7f477d30bd40) at jnicsup.cpp:258
#24 0x00007f4777ba03d3 in omrsig_protect (portLibrary=0x7f477cb083a0 <j9portLibrary>, fn=0x7f477c3f0bf0 <signalProtectAndRunGlue>, fn_arg=0x7f477d30bce0,
handler=0x7f477c3ac7d0 <structuredSignalHandler>, handler_arg=0x17d4700, flags=506, result=0x7f477d30bcd8) at ../../omr/port/unix/omrsignal.c:425
#25 0x00007f477c3f0c8c in gpProtectAndRun (function=0x7f477c3afc80 <gpProtectedRunCallInMethod(void*)>, env=0x17d4700, args=0x7f477d30bd40) at jniprotect.c:78
#26 0x00007f477c3b15ff in gpCheckCallin (env=0x17d4700, receiver=receiver@entry=0x0, cls=0x18a1f50, methodID=0x7f477843c278, args=args@entry=0x7f477d30bd88) at jnicsup.cpp:441
#27 0x00007f477c3af68a in callStaticVoidMethod (env=<optimized out>, cls=<optimized out>, methodID=<optimized out>) at jnicgen.c:288
#28 0x00007f477e0ca2cb in JavaMain () from /root/Java8SR6/bin/../lib/amd64/jli/libjli.so
#29 0x00007f477e2e36db in start_thread (arg=0x7f477d30c700) at pthread_create.c:463
#30 0x00007f477dbe671f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The backtrace helps identify the calling sequence that led to the crash. Each line in the backtrace represents one stack frame, that is, the data associated with one function call. The frame contains the arguments given to the function, the function's local variables, and the address at which the function is executing.
In the backtrace above, frame 13 is where the crash happened.
c) Dump the crashing frame
(gdb) f 13
#13 0x00007f47606cc646 in Java_Crash_printHello (env=0x17d4700, obj=0x18a1ee0) at Crash.c:8
8 strcpy(name,"Ravali Yatham");
d) Dump the contents of name
(gdb) print name
$1 = 0xa0007f47780130e0 <error: Cannot access memory at address 0xa0007f47780130e0>
You can see that the memory is inaccessible, which is why the program terminated.
2) Application level crashes (exception based):
Consider the Java program below, which tries to combine two strings. Accessing an array element that is out of range throws an ArrayIndexOutOfBoundsException. The code does not handle the exception, so it propagates out of main and is reported by the JVM at run time.
Sample code:
public class AIOBE {
    public void addStrings(String args[]) {
        String result = args[0]+args[2];
        System.out.println("Combination of Strings " +result);
    }

    public void display(String args[]){
        System.out.println("Adding strings");
        addStrings(args);
    }
    public static void main(String args[]) {
        AIOBE test = new AIOBE();
        test.display(args);
    }
}
Compile as: javac AIOBE.java
Test as: java AIOBE "Java"
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 2
at AIOBE.addStrings(AIOBE.java:3)
at AIOBE.display(AIOBE.java:9)
at AIOBE.main(AIOBE.java:13)
Let's read the stack trace line by line, starting from the bottom:
at AIOBE.main(AIOBE.java:13)
At line 13, main calls test.display(args), which passes the command line arguments to the display method. This leads to the error at line 9.
at AIOBE.display(AIOBE.java:9)
At line 9, display calls addStrings(args), which forwards the same command line arguments to the addStrings method. This leads to the error at line 3.
at AIOBE.addStrings(AIOBE.java:3)
At line 3, addStrings concatenates two of the command line arguments passed down from display. The array indices are hardcoded:
args[0] -- first command line argument
args[2] -- third command line argument --> The error lies here, because only one string, "Java", is passed to the main class
The code tries to access an element that is out of range: only one element was passed, at index 0, but the code fetches the third element, at index 2. Hence an ArrayIndexOutOfBoundsException is thrown at run time.
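One way to avoid this exception is to check the number of arguments before indexing into the array. The following is a minimal sketch, not part of the original program: the class name SafeAIOBE and the helper combine are hypothetical, and returning null for too few arguments is just one possible design.

```java
public class SafeAIOBE {
    // Hypothetical helper: concatenates the first and third arguments,
    // or returns null when fewer than three arguments were supplied.
    public static String combine(String[] args) {
        if (args.length < 3) {
            return null;
        }
        return args[0] + args[2];
    }

    public static void main(String[] args) {
        String result = combine(args);
        if (result == null) {
            // Report the problem instead of letting the JVM throw
            // ArrayIndexOutOfBoundsException.
            System.err.println("Usage: java SafeAIOBE <first> <second> <third>");
            return;
        }
        System.out.println("Combination of Strings " + result);
    }
}
```

Running this sketch with a single argument prints the usage message instead of a stack trace.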
3) Resource usage based crashes (stack or heap overflow)
Consider the Java program below, which calculates the factorial of a number. The recursive method factorial() calls itself over and over because no terminating condition is provided for the recursive calls. When the maximum size of the Java thread stack is reached, the program exits with a java.lang.StackOverflowError.
Sample code:
class Factorial {
    static int factorial(int n) {
        return (n * factorial(n-1));
    }
    public static void main(String args[]){
        int number=4;
        System.out.println("Factorial of "+number+" is: "+factorial(number));
    }
}
Compile as: javac Factorial.java
Test as: java Factorial
Exception in thread "main" java.lang.StackOverflowError
at Factorial.factorial(Factorial.java:3)
at Factorial.factorial(Factorial.java:3)
at Factorial.factorial(Factorial.java:3)
at Factorial.factorial(Factorial.java:3)
Here there is only one method. In a real scenario, examine the stack trace for a repeating pattern of line numbers. Once the line of code is identified, inspect the code to see whether the recursion has a base (terminating) condition. If it does not, the code should be fixed. Look closely at line 3 in the method factorial: there is no base condition, so the method never returns to its caller.
Adding the base condition below to the factorial method resolves the problem:
if (n == 0 || n == 1)
return 1;
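Put together, the corrected program might look like the sketch below. The class name FactorialFixed is hypothetical and chosen only to keep it separate from the original Factorial class.

```java
public class FactorialFixed {
    public static int factorial(int n) {
        // Base condition from the text: stop recursing at 0 or 1.
        if (n == 0 || n == 1)
            return 1;
        return n * factorial(n - 1);
    }

    public static void main(String[] args) {
        int number = 4;
        System.out.println("Factorial of " + number + " is: " + factorial(number));
    }
}
```

Note that a negative argument would still never reach this base condition, so a caller-side check or a condition such as n <= 1 is worth considering.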
What if the code has been updated to implement correct recursion but the program still throws a java.lang.StackOverflowError? In that case the recursion is legitimately deeper than the thread stack allows, and the thread stack size can be increased to permit a larger number of nested invocations. The stack size is set with the -Xss option on the JVM command line when starting the application.
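Besides the JVM-wide -Xss option, the standard java.lang.Thread class has a constructor that accepts a requested stack size for a single thread. The sketch below assumes a hypothetical class DeepRecursion with a correctly terminated recursive method sumTo; the depth and the 256 MB stack size are illustrative values only. The Java documentation describes the stack size argument as a suggestion that the virtual machine is free to ignore, so the behaviour may vary by platform.

```java
public class DeepRecursion {
    // Correctly terminated recursion: sums 1..n, one stack frame per value.
    public static long sumTo(int n) {
        if (n == 0) {
            return 0;
        }
        return n + sumTo(n - 1);
    }

    // Runs sumTo(depth) on a worker thread created with the requested stack size.
    public static long runWithStack(int depth, long stackBytes) {
        long[] result = new long[1];
        Thread worker = new Thread(null, () -> result[0] = sumTo(depth),
                "deep-recursion", stackBytes);
        worker.start();
        try {
            worker.join(); // wait for the worker so result[0] is visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // Illustrative values: 50,000 nested calls on a 256 MB stack.
        System.out.println(runWithStack(50_000, 256L * 1024 * 1024));
    }
}
```

This approach keeps the larger stack confined to the one thread that needs it, whereas -Xss changes the size for the threads of the whole application.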
Crash related Best Practices
- Make sure you are using the latest version of every product because there are often many code changes and bug fixes available.
- Make sure that the required settings for log collection are in place so that when an abnormal situation occurs all the logs are collected for diagnosis / root cause analysis.
- Make the best use of exception handling. When invoking APIs that are designed to throw, identify the right location to catch, absorb, or mitigate the exception.
- Some common exceptions (such as IOException and SocketException) are unavoidable in a large application. Don't let them propagate further up the stack and become unhandled exceptions. A typical caller of APIs that throw IOException may retry a few times before abandoning the call.
- Don't ignore crashes by setting up scripts that clean up the dumps and restart your application. Doing so potentially hides application, configuration, or resource issues and makes the application highly inefficient.
- Use JNI with caution. Because most of the JVM's error detection features are unavailable in native code, faults that occur there are fatal to both the application and the JVM.
- Calibrate your application with varying loads and identify the peak memory usage. Set up the heap limits accordingly, so as to avoid crashes due to Java heap exhaustion.
- Crashes can be caused by corruption of the shared class cache. If you suspect this, clean the shared class cache and restart the application.
- If you encounter problems with the verifier turned off (-Xverify:none), remove this option and try to reproduce the problem.
- Make effective use of javacore information. It is rich with data that represents the internal state of the virtual machine, and it can help solve a great share of the application anomalies that lead to a crash.
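The retry advice for IOException in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the class Retry, the interface IoCall, and the method withRetries are hypothetical names, and a real application would usually add a delay between attempts and limit retries to operations that are safe to repeat.

```java
import java.io.IOException;

public class Retry {
    // Hypothetical functional interface for an operation that may throw IOException.
    @FunctionalInterface
    public interface IoCall<T> {
        T call() throws IOException;
    }

    // Invokes the call up to maxAttempts times and rethrows the last
    // IOException if every attempt fails.
    public static <T> T withRetries(IoCall<T> call, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e;
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        try {
            // Simulated operation that fails twice before succeeding.
            String value = withRetries(() -> {
                if (++calls[0] < 3) {
                    throw new IOException("transient failure");
                }
                return "ok";
            }, 5);
            System.out.println(value + " after " + calls[0] + " attempts");
        } catch (IOException e) {
            // Handle the failure here rather than letting it go unhandled.
            System.err.println("Giving up: " + e.getMessage());
        }
    }
}
```

The caller still decides what to do once the retries are exhausted, which keeps the exception from surfacing as an unhandled failure.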