Introduction
Even a well-written and well-configured App Connect Enterprise (ACE) message flow can experience performance problems when it processes messages or requests. A common cause is that a resource it depends on, such as a database, a messaging system or an external service (for example, an address check or currency conversion service), is running slowly. Because the message flow depends on that resource, it runs slowly too.
This may well happen without warning. It is important to understand how you can diagnose different types of performance issues. In this article we will look at three different situations:
- Regular processing – understanding when a message flow is performing well.
- Low message rate due to delays.
- Low message rate due to high CPU usage.
We will look at how to diagnose these issues through a series of demos. A set of four ACE services running within integration servers will demonstrate a mix of issues, and we will walk through how to identify what is happening and how to investigate further.
Tools and data to use
Before we get into the demos, we need to talk about the approach and the tools used for the diagnosis.
System level monitoring
To diagnose any performance issue, you need data at two levels: at the system level, to show how busy the machine is (CPU, memory and I/O), and from within the application being investigated, to show what is happening at the application level. In my experience it is always best to start from the top and drill down: look at what is happening at the system level before delving into a particular application. For example, performance might suffer because someone released a batch job early and it is consuming a lot of CPU, which impacts processing within ACE. A system-level monitoring tool lets you see whether that is happening. The first step is always to identify which application is dominating whichever resource (CPU, memory or I/O) is experiencing an issue.
For system monitoring, we can use freely available tools such as nmon or top, which show both system-level and process-level activity. A tool such as vmstat is not much help here: it shows how much CPU and memory is being used at the system level, but not which processes are using the CPU, and that is important for this type of performance investigation. In a system with hundreds of processes and tens of integration servers, it is essential to know which processes are using CPU and which are not. If CPU usage in the system is very high, we want to know which processes are using the CPU so that we can focus the investigation on them. Conversely, if an ACE application is running slowly, we want to know whether its integration server is using any CPU at all, and whether that CPU is being used by the application under investigation or by others in the same integration server.
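As a rough illustration of ranking processes by CPU usage, the following Python sketch parses the output of `ps -eo pid,pcpu,rss,args` and sorts it by the CPU column. The function names (`snapshot`, `parse_ps`, `top_cpu`) are hypothetical and are not part of ACE or nmon. Note also that on Linux, ps reports CPU averaged over the lifetime of each process rather than over the last interval, so this only approximates what nmon or top display.

```python
import subprocess


def snapshot():
    """Return raw `ps` output: PID, %CPU, resident size (KB) and command line."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,rss,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(text):
    """Turn `ps -eo pid,pcpu,rss,args` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)  # the command line may itself contain spaces
        if len(parts) < 4:
            continue
        pid, pcpu, rss, args = parts
        rows.append(
            {"pid": int(pid), "cpu": float(pcpu), "rss_kb": int(rss), "args": args}
        )
    return rows


def top_cpu(rows, n=5):
    """Return the n processes with the highest CPU figure, busiest first."""
    return sorted(rows, key=lambda r: r["cpu"], reverse=True)[:n]
```

Calling `top_cpu(parse_ps(snapshot()))` on the machine under investigation would return the five busiest processes, which is the same question the process list in nmon or top answers interactively.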
Below is an example of a display from a system monitoring tool called nmon (https://nmon.sourceforge.io).
At the top it shows system-level CPU activity, split into user (green) and system (red) activity. Beneath this section, process-level activity is shown. There is one line for each process, showing the process identifier (PID), the percentage of CPU being used (%CPU Used), the resident memory size in kilobytes (ResSize KB), and the command and arguments that invoked the process. The CPU figure is a percentage of one CPU, so a process that was fully using 2 CPUs would show 200%.
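The "percentage of one CPU" convention can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper names and the 8-CPU machine size are assumptions chosen for the example, not values taken from the nmon display.

```python
def cpus_in_use(pct_of_one_cpu):
    """Convert a per-process %CPU figure into the equivalent number of CPUs."""
    return pct_of_one_cpu / 100.0


def share_of_machine(pct_of_one_cpu, n_cpus):
    """Express a per-process %CPU figure as a percentage of the whole machine."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return pct_of_one_cpu / n_cpus


# A process showing 200% is fully using 2 CPUs.
print(cpus_in_use(200.0))          # 2.0
# On a hypothetical 8-CPU machine that is a quarter of the total capacity.
print(share_of_machine(200.0, 8))  # 25.0
```

This is why a single process can legitimately show more than 100% in the process list while the system-level CPU figure remains well below 100%.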
In this display the process list is ordered by CPU usage, and the process using the most CPU is process 15442. It is using 56.2% of one CPU, it has a resident memory size of 412020 KB (about 412 MB), and it is running the command IntegrationServer, so it is an ACE integration server, with a run directory of /ACE/PDMisc. This is already a good start to understanding what is happening. There is some activity on the machine, the system is around 11% CPU busy, and the process currently consuming the most CPU is an independent integration server with a work directory of /ACE/PDMisc. This information also tells us what is not happening. If we had another integration server with a run directory of /ACE/Server1, we could immediately see that it is not using any CPU at the moment. So, if a problem had been reported for Server1, we would know that it was not a CPU usage problem.
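The same reasoning can be sketched in a few lines of Python. Everything here is illustrative: the first row reuses the figures quoted above, while the Server1 row (its PID and memory size) and the sshd row are invented, and the `--work-dir` form of the command line is an assumption about how the integration servers were started.

```python
# Hypothetical process snapshot: (PID, %CPU of one CPU, resident KB, command line).
PROCESSES = [
    (15442, 56.2, 412020, "IntegrationServer --work-dir /ACE/PDMisc"),
    (15510, 0.0, 250000, "IntegrationServer --work-dir /ACE/Server1"),  # invented row
    (901, 1.3, 8000, "sshd"),                                           # invented row
]


def servers_by_cpu(processes):
    """Return only the integration server processes, busiest first."""
    servers = [p for p in processes if "IntegrationServer" in p[3]]
    return sorted(servers, key=lambda p: p[1], reverse=True)


def cpu_for(processes, work_dir):
    """Sum the %CPU of every process whose command line names this work directory."""
    # Match whole arguments so that /ACE/Server1 does not also match /ACE/Server10.
    return sum(p[1] for p in processes if work_dir in p[3].split())


for pid, cpu, rss_kb, command in servers_by_cpu(PROCESSES):
    print(pid, cpu, command)

# A server that shows no CPU usage is unlikely to have a CPU usage problem.
print(cpu_for(PROCESSES, "/ACE/Server1"))
```

With this sample data the loop lists process 15442 first, and the lookup for /ACE/Server1 shows that it is using no CPU, which points the investigation away from CPU usage for that server.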