Introduction
Over the last several years, latency problems, particularly network latency problems, have shown up on IBM Power systems. SAP/HANA is a dominant workload on Power, and this blog focuses on that workload, though the techniques need not be limited to it. Extraneous factors can also impact network latency, as discussed below.
The purpose of this blog is to give IBM Power support personnel the tools to quickly narrow down the possible causes of network latency problems. Use this blog to obtain hints about the problem areas and dig deeper. The primary focus here is on Linux, but select techniques can be extrapolated to other operating systems as well.
Latency
So, what is latency?
Latency is a measurement of delay in a system. Network latency is the amount of time it takes for data to travel from one point to another across a network. A network with high latency will have slower response times, while a low-latency network will have faster response times.
The first aspect to consider is the physical layout of the two communicating entities, logical partitions (LPARs) in our case. How are these LPARs connected? Are they on a single system (Central Electronics Complex, or CEC), or are they two LPARs on different CECs? Are the two LPARs co-located, or are they physically distant? And how are they connected? Are they connected through a bank of switches? All these factors play a role in the network latency, and sometimes we have little to no control over such aspects. These constitute the hardware latency, and sometimes a bit of customer education might be required to set expectations appropriately.
In this blog, we also show some tools and techniques to get a reasonable estimate of the hardware latency.
Beyond the hardware latency, there are the software latencies of each of the LPARs in use. The software latency is the latency of the entire stack, which includes the Linux latency as well as the application latency. Here, we will provide tools to estimate the Linux stack latency as well as the application latency.
The nuances of software latencies are discussed further in the following sections.
Scenario 1: Application latency
Ping is a latency measurement tool that has been in use for ages. It provides a way to quickly measure latency.
However, ping has limitations. It uses Internet Control Message Protocol (ICMP) packets that can be dropped on a heavily loaded system. Most applications use Transmission Control Protocol (TCP), and ping can’t be used to measure the incremental contributions of TCP latency. Additionally, ping responses are returned by the kernel, so ping can’t be used to measure application latency, which might include scheduler latencies as well.
SAP installations typically use niping, a TCP ping equivalent. See Figure 1 below for an example setup using two LPARs on different systems.
Figure 1. A typical two-system setup used in SAP installations
The remote system could be a Linux or an IBM AIX LPAR, and we will illustrate how this can be used to derive the latencies of the various components in the target Linux system, or even the latencies of the switching fabric between the two systems.
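For reference, a niping latency measurement involves starting a niping server on the remote LPAR and running the client from the Linux LPAR under study. The invocation below is a sketch; the options shown (-B for buffer size, -L for loop count) are typical for a small-packet latency test, but consult your SAP documentation for the exact parameters appropriate to your installation.
On the remote LPAR, start the niping server:
# niping -s -I 0
On the Linux LPAR, run the latency test; the resulting report includes the av2 value referenced later in this blog:
# niping -c -H <remote-host> -B 1 -L 100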
We will use the Linux ftrace mechanism to get the latency data. Modern Linux systems come with ftrace built in, so nothing additional needs to be installed, and it should be available on customer systems.
Perform the following steps to set up select trace points in the Linux kernel. Tweak them as needed to suit your purpose.
- Change to the tracing directory.
# cd /sys/kernel/debug/tracing
- Here we enable the two trace points in the Linux kernel (Rx path).
Note: In Figure 1, we used a slightly different trace point in the Rx path compared to the one illustrated in the following command. When these examples were run, we were looking for driver-independent trace points and thus chose to use napi. Alter the commands to suit your needs.
# echo napi_gro_receive tcp_rcv_established > set_ftrace_filter
# echo function > current_tracer
- The format of the trace file is different, depending on the options we select. As we are interested in latency, turn on the corresponding format.
# echo latency-format > trace_options
- A note of caution: script the next three steps to keep the capture window short (a minimal script sketch is shown below).
# echo 1 > tracing_on
In our experiments, we found that usleep 10 worked well. You may experiment with what works best in your environment. The traces can generate a lot of data in a short time, overwhelm the trace buffer, and impact system performance as well.
# usleep 10
- Turn tracing off to prevent overwrites.
# echo 0 > tracing_on
- Upon successive runs, the trace file gets overwritten, so create a copy.
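Putting the capture steps together, here is a minimal script sketch. It assumes the filter and tracer have already been set up as shown above; the output file name is just an example.
#!/bin/bash
cd /sys/kernel/debug/tracing || exit 1
echo 1 > tracing_on            # start tracing
usleep 10                      # keep the capture window short
echo 0 > tracing_on            # stop before the trace buffer is overwhelmed
cp trace /tmp/trace_result01   # successive runs overwrite the trace file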
Here is a sample trace file that we shall analyze further.
# cat trace_result01 | head -30
# tracer: function
#
# function latency trace v1.1.5 on 5.3.18-178.1.INTERNAL_20210823-default
# --------------------------------------------------------------------
# latency: 0 us, #119/119, CPU#17 | (M:server VP:0, KP:0, SP:0 HP:0 #P:32)
# -----------------
# | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
<idle>-0 0..s. 40044us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40047us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40049us+: tcp_rcv_established <-tcp_v4_do_rcv
<idle>-0 0..s. 40070us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40072us : tcp_rcv_established <-tcp_v4_do_rcv
<idle>-0 0..s. 40075us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40076us+: tcp_rcv_established <-tcp_v4_do_rcv
<idle>-0 0..s. 40107us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40108us : tcp_rcv_established <-tcp_v4_do_rcv
<idle>-0 0..s. 40111us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40112us : tcp_rcv_established <-tcp_v4_do_rcv
<idle>-0 0..s. 40115us+: napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40135us : napi_gro_receive <-ibmveth_poll
To compute the Rx latency in the Linux kernel, let us consider the following two sample lines from the above sample:
<idle>-0 0..s. 40047us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40049us+: tcp_rcv_established <-tcp_v4_do_rcv
In the Rx path, a packet hits napi_gro_receive() first before going up the TCP/IP stack, so that function should appear first in the sequence, followed immediately by tcp_rcv_established(). The difference between the two timestamps (40049us - 40047us) tells us that in this particular case, the Linux Rx network stack took 2us.
Lines similar to the following should be ignored for our purposes:
<idle>-0 0..s. 40115us+: napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40135us : napi_gro_receive <-ibmveth_poll
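To avoid pairing the timestamps by hand, a small script can walk the trace and print the Rx latency for each napi_gro_receive/tcp_rcv_established pair. This is a sketch that assumes the trace lines look like the sample above; swapping the patterns to __tcp_transmit_skb and mlx5e_xmit gives the Tx estimate discussed next.
# awk '
    /napi_gro_receive/ {                  # remember the driver-entry timestamp
        match($0, /[0-9]+us/); rx = substr($0, RSTART, RLENGTH - 2)
    }
    /tcp_rcv_established/ && rx != "" {   # pair it with the next TCP entry
        match($0, /[0-9]+us/); tcp = substr($0, RSTART, RLENGTH - 2)
        printf "Rx stack latency: %dus\n", tcp - rx
        rx = ""
    }' trace_result01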
- Let us reference Figure 1 and use ftrace to evaluate the Tx latency. On the Tx path, the packet first hits __tcp_transmit_skb(), followed by mlx5e_xmit(). From the trace data, the difference between the mlx5e_xmit() and __tcp_transmit_skb() timestamps gives us an estimate of the Tx latency.
Given below is an example of an iperf3 run used to estimate Tx latency on a Linux system.
# grep -B 1 mlx5e_xmit trace_results02
iperf3-6764 57.... 70282us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70289us : mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70297us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70299us+: mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70342us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70344us+: mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70389us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70391us+: mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70435us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70437us+: mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70481us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70483us+: mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70528us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70530us+: mlx5e_xmit <-dev_hard_start_xmit
iperf3-6764 57.... 70573us : __tcp_transmit_skb <-tcp_cleanup_rbuf
iperf3-6764 57.... 70575us+: mlx5e_xmit <-dev_hard_start_xmit
...
In the example above, the latency for the packet given by the first two lines is 7us.
We can extend these techniques to estimate the application latency as well. Referencing Figure 1 again, the latency difference between __tcp_transmit_skb() and tcp_rcv_established() gives us an approximate estimate of the application latency in us.
Similarly, the latency difference between mlx5e_xmit() and mlx5e_handle_rx_cqe_mpwrq() gives an estimate of the entire Linux and application latencies. That is, it is the sum of the Rx latency, application latency, and Tx latency.
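For example, on a system with the Mellanox adapter shown in Figure 1, all four trace points can be captured in a single run by widening the filter as shown below. The mlx5e_* functions are specific to the mlx5 driver; substitute the receive and transmit functions of your NIC driver as needed.
# echo mlx5e_handle_rx_cqe_mpwrq tcp_rcv_established __tcp_transmit_skb mlx5e_xmit > set_ftrace_filter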
- Now that we can estimate the entire Linux latency, let us extend this to estimate the switching fabric latency. The assumption is that the remote system has Rx and Tx latencies similar to those of the Linux system.
Revisiting Figure 1, you can derive the switching fabric latency as follows:
niping av2 latency to the remote system – Rx and Tx latencies of the remote system – Linux system latency = switching fabric latency
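As a purely illustrative example with hypothetical numbers: if the niping av2 latency is 60 us, the remote system's combined Rx and Tx latency is estimated at 10 us, and the Linux system latency (Rx, application, and Tx) at 20 us, then roughly 30 us can be attributed to the switching fabric.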
Thus, using the niping av2 latency and Linux tracing tools, we can isolate the various component latencies. This allows users to narrow the problem down to a particular component and take a deeper look if necessary.
In one instance, it turned out that the application latency was unexpectedly high. Redesigning some aspects of the application solved the latency problems.
Scenario 2: Over-driving the CPUs
Now, let us switch to the case where the two LPARs are on the same CEC, that is, the DB server and the application servers are running on the same system. Because the two LPARs are on the same CEC, they use the ibmveth interface for communication.
Here we will use a different set of tools to analyze the problem, specifically nmon. Nigel's Monitor (nmon) is a system performance monitoring tool originally developed by IBM for the AIX operating system and later ported to Linux on several CPU architectures.
The main benefit of nmon is that it allows you to monitor different aspects of your system, such as CPU utilization, memory, disk busy, network utilization, and more, in a single, concise view.
In addition to interactively monitoring your system, you can also use nmon in batch mode to collect and save performance data for analysis.
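For example, a batch-mode collection that takes a snapshot every 10 seconds for an hour might look like the following; adjust the interval (-s) and snapshot count (-c) to your needs. The resulting .nmon file can then be post-processed offline.
# nmon -f -s 10 -c 360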
Figure 2 below shows the CPU utilization of a sample DB server. Under peak load, the CPU utilization crosses 40% and sporadically may be even higher.
Figure 3 below shows the nmon graph for the corresponding application server.
As you can observe, under load the application servers are pegged at almost 100% CPU utilization. Typical SAP/HANA applications are fairly network-traffic intensive, and the same set of CPUs processes application data as well as network traffic. In this case, because the application servers are pegged at nearly 100% CPU utilization, there is little room left to process network traffic. This may result in packet drops and thus retransmits (for TCP, which is almost always the case), further adding to the latency.
Here are some things to consider:
- If the servers are using shared processor pools, you could switch to use dedicated processors and reevaluate.
- Figures 2 and 3 depict the aggregate CPU utilization. Using nmon, it is possible to dig deeper into the individual core utilization and assess whether setting CPU affinity for application threads might help reduce the CPU contention (see the sketch below).
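As a sketch of the affinity approach, taskset can pin an application to a specific set of cores. The core range and process name below are hypothetical and would need to match your partition's topology.
Launch a process pinned to cores 0 through 7:
# taskset -c 0-7 ./app_server
Or re-pin an already running process:
# taskset -cp 0-7 <pid>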
Switching to dedicated cores alleviated the problem of over-driving the CPUs and lowered the network latency to an acceptable level. Additional tuning considerations could be factored in as well; refer to the Tuning Guide for more details.
Scenario 3: VIOS configuration
VIOS configurations can indirectly impact Linux LPARs and network latency as well, for example, when virtual Ethernet is used to connect two systems through the Shared Ethernet Adapter (SEA) interface. The scenario described earlier can apply here too: ensure that the cores assigned to the VIOS are not over-committed.
Summary
In this blog, we discussed the various aspects that could impact Linux network latency. Scenario 1 explained how to attribute latency to the various components, Scenario 2 described what happens when CPUs are over-subscribed, and Scenario 3 provided suggestions to optimize the VIOS configuration.
Acknowledgment
Thanks to Marvin Heffler for his review, comments, and continued support through the course of this work.