Author: @Bharath Srinivas
Co-Authored By: @GIREESH PUNATHIL
Introduction
In part 1 of this Java application monitoring series, Java Monitoring 101: Basic Concepts, we introduced the essential ingredients of JVM monitoring: metrics and their constituent events, and the mechanisms to capture those events. In part 2, Java Monitoring 101: Tools, we introduced common monitoring tools used in the Java ecosystem, their advantages and challenges, and how they can be used effectively to monitor Java applications. In this article, we will dive into how we can use these monitoring tools and practices to debug common problems our cloud-native Java applications may face. We’ll explore the following problems:
- Performance monitoring
- Memory leaks
- Freezes or hangs
- Loops
- Crashes
Debugging Problems using monitoring
While the main aim of monitoring is to improve the performance and reliability of your application, these tools also help detect common production issues. Let’s take a look at five of the most common, basic problems that can be identified through monitoring and how they can be resolved.
Performance Monitoring
One of the key aspects of monitoring is finding performance-related issues. For this, we monitor key performance metrics such as response time, throughput, CPU and memory usage, and error rates.
Steps to monitor performance in your application:
- The first step is to define baseline criteria for all the metrics, capturing the key performance metrics when the application is running in a normal environment. This serves as a reference point for comparison with metrics collected in the problem state.
- When a performance issue is suspected, profile these metrics again.
- Compare each of the metrics with the corresponding values in the baseline.
- Evaluate the differences and isolate the anomalous metric(s).
- Once the anomaly has been identified, use tools or methods specific to it to diagnose and rectify the issue.
There are several enterprise-grade application performance monitoring tools available; a few examples are Instana, Datadog, and LogRocket.
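If you want a lightweight way to sample some of these metrics without attaching an external tool, the JDK's built-in MXBeans can be read directly. The following is a minimal sketch, not a production implementation; the class name MetricsSnapshot is just illustrative. It prints heap usage, live thread count, and the system load average, and logging this output periodically is one simple way to build a baseline.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Illustrative example class; in a real deployment these values would be
// exported to a monitoring backend rather than printed.
public class MetricsSnapshot {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        // Heap usage: compare these numbers against your recorded baseline
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.printf("Heap used: %d MB of %d MB committed%n",
                heap.getUsed() / (1024 * 1024), heap.getCommitted() / (1024 * 1024));

        // Live thread count is a cheap proxy for concurrency pressure
        System.out.println("Live threads: " + threads.getThreadCount());

        // System load average over the last minute (-1.0 if the platform does not report it)
        System.out.println("System load average: " + os.getSystemLoadAverage());
    }
}
```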
Memory Leak
High memory usage results in frequent GC cycles, often in memory swapping between RAM and disk, and in reduced cache efficiency, all of which affect the overall performance of the application. So it is very important to monitor the memory an application is using. At times, an application holds on to more memory than it is intended to use, typically because objects that are no longer needed remain reachable and cannot be garbage collected. This condition is referred to as a memory leak.
Steps to monitor memory leaks in your application:
- As in performance monitoring, define a baseline for memory usage.
- When a leak is suspected, collect a heap snapshot or a heap dump (Different Ways to Capture Java Heap Dumps - GeeksforGeeks), which gives an idea of how much memory is allocated to each object.
- Next, locate the Java object or objects that show inappropriate memory growth.
- Walk the dominator tree (the hierarchy of objects leading back to the root object) of each such object to find what is responsible for holding it.
- Identify those objects in the codebase and analyse why they are growing or leaking.
- Repeat this process for all the growing objects.
There are several enterprise-grade memory leak detection tools available; examples include Eclipse MAT and Valgrind.
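On HotSpot/OpenJDK-based JVMs, a heap dump can also be triggered programmatically, which is handy when you want to capture the heap at a precise moment rather than attaching jmap or jcmd by hand. The sketch below uses the HotSpotDiagnosticMXBean; the class name HeapDumper and the file name suspected-leak.hprof are placeholders, not part of any standard.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Illustrative example; works on HotSpot/OpenJDK JVMs only.
public class HeapDumper {
    public static void main(String[] args) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);

        // Example file name: recent JDKs require it to end in .hprof, and the
        // file must not already exist. The second argument limits the dump to
        // live (reachable) objects.
        diagnostics.dumpHeap("suspected-leak.hprof", true);
    }
}
```

The resulting .hprof file can then be opened in a tool such as Eclipse MAT to walk the dominator tree as described above.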
Freeze or Hang
The term ‘freeze’ refers to the scenario where an application uses a zero or near-zero amount of CPU, meaning the application has stopped making progress where it was expected to be doing meaningful work.
Steps to monitor freeze conditions in your application:
- The application internally runs separate threads to carry out its tasks, so use a thread profile to collect thread information (Capturing a Java Thread Dump | Baeldung).
- Analyse each thread to figure out its internal state: running, waiting, blocked, and so on.
- A thread may be waiting for another thread, so identify the relationships between threads by walking through the entire list of waiting threads.
- Analyse the target thread to figure out its internal state.
- Iterate this process until you find one or more threads that are not themselves waiting but are responsible for the other threads chaining up in a wait queue.
- Map this back to the code to identify the root cause and remove the bottleneck.
There are several tools available for diagnosing freezes; a few examples are jstack, VisualVM, and gdb.
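The same information that a thread dump gives you can also be read in code through the ThreadMXBean, which is useful if you want an application to check itself for deadlocks and wait chains. The following is a minimal sketch under that assumption; the class name FreezeInspector is illustrative, and in practice this kind of check is usually run from a JMX client or a management endpoint rather than from main.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative example: report deadlocked threads, then dump all thread states.
public class FreezeInspector {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Threads deadlocked on monitors or ownable synchronizers (null if none)
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked != null) {
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                System.out.println("Deadlocked: " + info.getThreadName()
                        + " waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }

        // Dump the state of every live thread so wait chains can be traced manually
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.println(info.getThreadName() + " -> " + info.getThreadState());
        }
    }
}
```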
Loop
High CPU usage can occur when there is a tight loop in the application. A tight loop is a condition where the code gets stuck in a loop whose termination condition was incorrectly computed, so execution never leaves the loop.
Steps to monitor loop conditions in your application:
- List all the threads that are running in the application.
- Find the ones that are using an abnormal amount of CPU.
- Collect CPU profiles for those threads.
- Examine the CPU profile of each thread to see where the time is spent.
- Map it back to the application code and figure out the reason for the loop.
There are several enterprise-grade tools available that can detect loops; examples include New Relic and AppDynamics.
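The per-thread CPU usage mentioned in the steps above can be sampled through the ThreadMXBean: take two readings of each thread's CPU time, and a large delta points at a thread that is spinning. The sketch below is only illustrative; the class name CpuHogFinder, the 5-second interval, and the 1-second threshold are arbitrary choices, and tools such as top -H combined with a thread dump give you the same picture from outside the process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative example: flag threads that burn a lot of CPU over a sampling interval.
public class CpuHogFinder {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.getAllThreadIds();

        // First sample of per-thread CPU time, in nanoseconds (-1 if unsupported)
        long[] before = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            before[i] = threads.getThreadCpuTime(ids[i]);
        }

        Thread.sleep(5000); // arbitrary sampling interval

        // Second sample: a large delta suggests a tight loop in that thread
        for (int i = 0; i < ids.length; i++) {
            long delta = threads.getThreadCpuTime(ids[i]) - before[i];
            ThreadInfo info = threads.getThreadInfo(ids[i]);
            if (info != null && delta > 1_000_000_000L) { // more than 1 s of CPU in 5 s
                System.out.println(info.getThreadName() + " used "
                        + delta / 1_000_000 + " ms of CPU during the sample");
            }
        }
    }
}
```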
Crash
A ‘crash’ is an unexpected termination of a running application due to errors or malfunctions in the application code or in the JVM.
Steps to monitor crash conditions in your application:
- Collect a crash dump at the crash site.
- Use postmortem debugging tools like GDB (GDB (Step by Step Introduction) - GeeksforGeeks).
- List the crashing thread.
- List the crashing instructions.
- Trace back through both code and data using information on memory, the call stack, and register values.
- Identify the source of the anomalous code or anomalous data.
There are several enterprise-grade crash diagnostic tools available; examples include Sentry. You can also use core dumps and log files to investigate crash issues.
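It also helps to configure the JVM up front so that it leaves useful artifacts behind when it does crash or run out of memory. The launch command below is only a sketch; the paths and myapp.jar are placeholders for your own deployment. It redirects the HotSpot fatal error log and requests a heap dump on OutOfMemoryError, giving the postmortem tools above something to work with.

```sh
# Placeholder paths and jar name; adjust for your own application.
java -XX:ErrorFile=/var/log/myapp/hs_err_pid%p.log \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/log/myapp/ \
     -jar myapp.jar
```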
Summary and next steps:
In this article, we have explored some of the common problems our cloud-native Java applications may face that can be solved using monitoring. We covered a wide variety of problems, including performance, memory leaks, crashes, and high/low CPU, and the respective steps to resolve them. However, it’s worth bearing in mind that this is not a fully comprehensive list of problems that monitoring can help with; there are additional problems, such as error rates, throughput, and I/O bottlenecks, that aren’t covered in this article.
Although it’s been useful identifying the tools we can use for effective monitoring of our Java applications and exploring the problems they can help us solve, we also need to ensure that our monitoring is efficient and that we get the most out of our monitoring data. So, in the final article of this series, Java Monitoring 101: Best Practices, we will look at some monitoring best practices to make our monitoring more efficient and useful.
If you’re interested in continuing your learning of monitoring, check out this free, in-depth learning course on edx: [Monitoring and Observability for Application Developers](https://www.edx.org/learn/devops/ibm-monitoring-and-observability-for-application-developers). This course provides a comprehensive overview of monitoring and observability, and teaches you the hands-on skills to employ monitoring, observability, and logging for your application.
Alternatively, take a look at the interactive guides available on the Open Liberty website that take you through how to use further monitoring/observability tools for your Java applications at the application level: https://openliberty.io/guides/#observability.