App Connect


Useful Techniques for Investigating ACE Performance Issues

By Tim Dunn posted 2 days ago

  

Introduction

Although an App Connect Enterprise (ACE) message flow might be optimally written and configured, it can still experience a variety of performance issues when running and processing messages and requests. This may be because the resources it accesses, such as a database, a messaging system, or an external service (for example an address check or currency conversion service), are running slowly; because the message flow depends on those resources, it will run slowly too.

This may well happen without warning. It is important to understand how you can diagnose different types of performance issues. In this article we will look at three different situations:

  1. Regular processing – understanding when a message flow is performing well.
  2. Low message rate due to delays.
  3. Low message rate due to high CPU usage.

We will look at how to diagnose these issues with a series of demos. Four ACE services running within integration servers will demonstrate a mix of issues, and we will walk through how to identify what is happening and how to investigate further.

Tools and data to use

Before we get into the demos we need to talk about the approach and the tools that are used to do the diagnosis.

System level monitoring

To diagnose any performance issue, data is needed at the system level, to show how busy the machine is (CPU, memory and I/O), and from within the application being investigated, so that it is possible to understand what is happening at an application level. In my experience it is always best to start from the top and drill down: look at what is happening at the system level before delving into a particular application. Performance might suffer because someone released a batch job early and it is consuming a lot of CPU, which in turn impacts processing within ACE. A system level monitoring tool allows you to see whether that is happening. The first step is always to identify which application is dominating CPU, memory or I/O activity, whichever of those resources is experiencing an issue.

For system monitoring, we can use freely available tools like nmon or top, which show system and process level activity. A tool such as vmstat is not much help here: it shows how much CPU and memory is being used at the system level, but it does not show which processes are using the CPU, and that is important for this type of performance investigation. In a system where there could be hundreds of processes and tens of integration servers, it is essential to know which processes are using CPU and which are not. If CPU usage in the system is very high, we want to know which process(es) are using the CPU so that we can focus the investigation on them. Conversely, if an ACE application is running slowly, we want to understand whether the integration server is using any CPU at all, whether from the application under investigation or from others in that integration server.
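To make this concrete, here is a minimal Python sketch of the kind of "who is using the CPU" check a monitoring tool performs. It parses `ps -eo pid,pcpu,rss,comm` style output (column names and the sample figures are illustrative, not captured from a real system) and returns the busiest processes first:

```python
def top_cpu_processes(ps_output: str, limit: int = 3):
    """Parse `ps -eo pid,pcpu,rss,comm` style output and return the
    processes using the most CPU as (pid, pcpu, rss_kb, command) tuples."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        pid, pcpu, rss, comm = line.split(None, 3)
        rows.append((int(pid), float(pcpu), int(rss), comm))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:limit]

# Illustrative captured output; on a live Linux system you could instead run
#   subprocess.run(["ps", "-eo", "pid,pcpu,rss,comm"], capture_output=True, text=True)
sample = """\
  PID %CPU   RSS COMMAND
15442 56.2 412020 IntegrationServer
 3301  2.1  80400 java
  917  0.3  15264 sshd
"""
print(top_cpu_processes(sample, limit=2))
# -> [(15442, 56.2, 412020, 'IntegrationServer'), (3301, 2.1, 80400, 'java')]
```

Sorting by %CPU rather than eyeballing the raw list is the point: it immediately surfaces both the dominant process and, by omission, the processes that are idle when you expected them to be busy.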

Below is an example of a display from a system monitoring tool called nmon (https://nmon.sourceforge.io).

At the top it shows system level CPU activity, split into user (in green) and system (in red) activity. Beneath this section, process level activity is shown. There is a line for each process showing the process identifier (PID); the CPU being used (%CPU Used), expressed as a percentage of one CPU, so a process fully using two CPUs would show 200%; the resident memory size in kilobytes (ResSize KB); and the command and arguments that invoked the process.

Showing system monitoring
In this display the process list is ordered by CPU usage, and the process using the most CPU is process 15442. It is using 56.2% of one CPU, it has a resident memory set size of 412020 KB, or 412 MB, and it is running the command IntegrationServer – so it is an ACE integration server – with a run directory of /ACE/PDMisc. This is already a good start to understanding what is happening: there is some activity on the machine, the system is around 11% CPU busy, and the process consuming the most CPU is an independent integration server with a work directory of /ACE/PDMisc. This information also tells us what is not happening. Say we had another integration server with a run directory of /ACE/Server1; we can immediately see that it is not using any CPU at the moment. So, if a problem had been reported for Server1, we would know that it was not a CPU usage problem.

ACE Processes

As the focus of this article is investigating performance problems with ACE, it is important to understand which processes can be running when using ACE. The process names differ depending on whether you are running an integration node with associated integration servers, or independent integration servers.

With an integration node, the processes and their usage are as follows.

Process structure

Usage                      Name
Availability               bipservice
Administration             bipbroker
MQTT PubSub                bipMQTT
Node-wide HTTP listener    biphttplistener
ACE runtime                DataFlowEngine


With an independent integration server, the process structure is as follows.

Process Structure
There is a single process which provides all the function needed.
Usage          Name
ACE runtime    IntegrationServer


ACE runtime execution data

The nmon or top command displays are very helpful in showing process level activity, but they do not allow us to understand what is happening within a process at the application level; for that we need instrumentation within the process. This is always going to be application dependent. In this case we are working with ACE, and it provides excellent instrumentation on CPU usage with a feature called accounting and statistics data. This feature, which is enabled by default in ACE v13 onwards, records the invocation of every message flow and captures a wealth of information about the invocation: the time it occurred, the elapsed and CPU times at the message flow level, and so on. This is done with a low overhead; ACE product trace would have 10-50 times the overhead to show message flow execution and would not capture CPU usage or other key data.

The accounting and statistics data is aggregated into time sliced buckets and published in a variety of formats that are configurable. The data can be aggregated into short-term buckets of 20 seconds duration or long-term buckets ranging from 1 min to 24 hours. The short-term reporting buckets are very useful for understanding very recent runtime behaviour, such as what happened in the last 20 seconds. The long-term buckets can be used for problem determination but are more suited towards capacity planning and charge back as the data is aggregated over longer periods of time.
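The time-sliced aggregation described above can be sketched in a few lines of Python. This is an illustrative model of the bucketing idea only, assuming hypothetical per-invocation records of (timestamp in seconds, elapsed ms, CPU ms); it is not ACE's actual data format:

```python
from collections import defaultdict

def bucket_flow_stats(invocations, bucket_secs=20):
    """Aggregate per-invocation (timestamp_secs, elapsed_ms, cpu_ms) records
    into fixed-width time buckets, the way short-term accounting and
    statistics snapshots summarise recent activity."""
    buckets = defaultdict(lambda: {"count": 0, "elapsed_ms": 0.0, "cpu_ms": 0.0})
    for ts, elapsed, cpu in invocations:
        start = int(ts // bucket_secs) * bucket_secs  # bucket start time
        b = buckets[start]
        b["count"] += 1
        b["elapsed_ms"] += elapsed
        b["cpu_ms"] += cpu
    return dict(buckets)

# Three invocations: two land in the 0-20s bucket, one in the 20-40s bucket
stats = bucket_flow_stats([(1.5, 12.0, 3.0), (18.0, 8.0, 2.0), (21.0, 15.0, 4.0)])
print(stats[0]["count"], stats[0]["elapsed_ms"])   # -> 2 20.0
print(stats[20]["count"], stats[20]["cpu_ms"])     # -> 1 4.0
```

The trade-off the article describes falls out of the bucket width: 20-second buckets answer "what just happened?", while widening `bucket_secs` to minutes or hours gives the smoother totals suited to capacity planning and charge back.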

The focus of this article is not to discuss the feature in any depth but rather to show how to use it when investigating message flow performance issues. For more information, see Message flow statistics and accounting data

An additional source of information that can help find the cause of performance problems is a set of integration server usage statistics, known as resource statistics. These collect information at the integration server level for each of the resource managers within an integration server; for example, they show the number of TCP/IP conversations there have been and the number of bytes processed. This data is aggregated at the integration server level, so it does not show the usage of a particular resource manager by a particular message flow or invocation.
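The server-level aggregation, and its limitation, can be modelled in a small sketch. The resource manager and metric names below are illustrative assumptions, not ACE's actual resource statistics schema:

```python
from collections import defaultdict

def summarise_resource_stats(events):
    """Aggregate (resource_manager, metric, value) events at the
    integration-server level, as resource statistics do: totals per
    manager and metric, with no attribution to individual message flows."""
    totals = defaultdict(float)
    for manager, metric, value in events:
        totals[(manager, metric)] += value  # flow identity is not recorded
    return dict(totals)

# Hypothetical events from two different flows sharing one server
events = [
    ("TCPIPClientNodes", "OpenConnections", 3),
    ("TCPIPClientNodes", "BytesSent", 2048),
    ("TCPIPClientNodes", "BytesSent", 1024),
    ("JVM", "CumulativeGCTimeMs", 12),
]
print(summarise_resource_stats(events)[("TCPIPClientNodes", "BytesSent")])
# -> 3072.0
```

Note that once the two BytesSent events are summed, there is no way to tell which message flow sent which bytes: that is exactly why resource statistics complement, rather than replace, the per-flow accounting and statistics data.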

Again, the focus of this article is on the practical investigation of performance problems and not to discuss a product feature in any depth. For more information see Resource statistics.

Time to investigate

In this section we are going to run four different services within ACE, look at their performance characteristics, and characterise their behaviour. In some of the services the performance is not as expected, so we will drill in to find out why.

The services are all simple message flows and are designed to illustrate specific points. In practice, services are typically much more complex, and, in that case, a single service or message flow might exhibit multiple examples of the issues that we are about to see. Do not assume that in practice there would be a maximum of one issue per service or message flow.

This analysis is done using the tools discussed above. Additionally, in some cases, when the situation requires it, we look at the message flow source using the ACE Toolkit so that we can see the code being run.

To simulate user traffic, the open source utility PerfHarness is used; see https://github.com/ot4i/perf-harness for more details. It allows messages to be sent over a range of transport protocols (MQ, HTTP, TCP/IP, JMS) at high speed and very efficiently.

Let’s now run each of the services, observe the performance and investigate what is going on.

Service 1

Service 1 is a simple flow that receives an HTTP Input message, invokes a Compute node (ESQL) and sends an HTTP Reply. The schematic of the message flow is as follows:

Message Flow
In the video we learned how to generate some HTTP traffic to invoke a service and observe the performance of the system and of the message flow itself.

Service 2

Service 2 is a simple flow that receives an HTTP Input message, invokes a Compute node (ESQL) and sends an HTTP Reply. The schematic of the message flow is as follows:

Message Flow
In the video we observed that the message flow was running more slowly than expected. The message flow had a high elapsed time but low CPU usage. It was possible to identify the cause of the delay using ACE accounting and statistics data and by looking at the message flow source using the ACE Toolkit.

Service 3

Service 3 is a simple flow that receives an HTTP Input message, invokes a Compute node (ESQL) and sends an HTTP Reply. The schematic of the message flow is as follows:

Message Flow
In the video we observed that the message flow was running more slowly than expected and that CPU usage on the system was much higher than expected. Using nmon, we were able to identify that the high CPU usage was caused by the integration server hosting Service 3. By using ACE accounting and statistics data, we were able to identify that the message flow had a high elapsed time and high CPU usage. So, to resolve the issue, it was necessary to identify which message flow node was using the high amount of CPU. Then, by looking at the message flow source, it was possible to identify the code causing the high CPU usage and high elapsed time. Although there was no large wait time, as there was with Service 2, the elapsed time was still high in this case because node elapsed time includes CPU time; both times were high, which is noticeably different from a high elapsed time with low CPU usage.

Service 4

Service 4 consists of two message flows. The first is a client message flow that invokes a service called Server 4 Service. The schematic of the two message flows is as follows:

Message Flows

In the video we observed that Service 4 was running slowly and had a long elapsed time when invoking a service in an HTTP Request node. The service being invoked was identified as Server 4 Service, and the performance of that message flow was examined. It had a high elapsed time and low CPU usage, indicating that there was a delay somewhere in the message flow. As the message flow had a high elapsed time but low CPU usage, the investigation needed to focus on sources of delay rather than on high CPU usage. The node causing the delay was identified, and by using the ACE Toolkit to look at the message flow source, it was possible to identify the line of code causing the issue.

Conclusion

This article has shown you how to investigate some common performance problems that might be experienced when running ACE message flows using some simple tools and techniques.

The nmon tool allows you to quickly and easily see what is happening at a system level: how busy the system as a whole is, which processes are using CPU and, just as importantly in some cases, which are not using CPU when you would expect them to be.

The ACE Web UI provides invaluable insight into message flow runtime performance through its visualisation of ACE accounting and statistics snapshot data. This data shows what is happening inside the message flow as the performance problem is occurring.

Also shown was the importance of distinguishing between the following cases:

  1. Low elapsed times and low CPU usage in all nodes indicate that all is well: the message flow can run quickly provided there is sufficient CPU available.
  2. High elapsed times and low CPU usage indicate a delay in processing somewhere in the message flow. Use the message flow node level data to determine which node is responsible, then review the message flow source to identify the cause of the problem.
  3. High elapsed times and high CPU usage indicate that the message flow is using a large amount of CPU in processing. If large amounts of data are being processed this might be expected, but in many cases it indicates a problem whose cause you need to find in order to improve the efficiency of that code.
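The three cases above amount to a simple decision rule over a flow's snapshot figures. Here is a minimal sketch; the 100 ms threshold and the "at least half of elapsed is CPU" cut-off are illustrative assumptions, since what counts as "high" depends entirely on the service's expected response time:

```python
def classify_flow(avg_elapsed_ms: float, avg_cpu_ms: float,
                  elapsed_threshold_ms: float = 100.0) -> str:
    """Classify snapshot figures into the three cases described above.
    Thresholds are illustrative and should be tuned per service."""
    if avg_elapsed_ms < elapsed_threshold_ms:
        return "healthy: low elapsed and low CPU"
    if avg_cpu_ms >= 0.5 * avg_elapsed_ms:  # most of the elapsed time is CPU
        return "high CPU: inefficient processing in a node"
    return "delay: waiting on an external resource"

print(classify_flow(20, 5))      # fast flow
print(classify_flow(900, 20))    # long wait, little CPU: a delay (Service 2/4 pattern)
print(classify_flow(800, 750))   # elapsed mostly CPU: high CPU usage (Service 3 pattern)
```

Because node elapsed time includes CPU time, comparing the two figures, rather than looking at elapsed time alone, is what separates a genuine delay from a CPU-bound node.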

The message flows used in this tutorial were designed to illustrate a variety of different issues. There was a single issue per message flow to illustrate the issues more clearly. Note that in practice, message flows are more complex, and a single message flow might have multiple performance issues that need to be resolved.
