In the previous blog post, network latency problems with SAP/HANA workload, we provided techniques to get an initial estimate of the latency of the Linux networking stack.
We made some assumptions in that blog while computing the latency. That was good enough for a first level approximation. In this blog, we remove those assumptions and provide techniques to derive the latency of each packet.
First level approximation of latency
Here is a snippet from the previous blog:
<idle>-0 0..s. 40044us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40047us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40049us+: tcp_rcv_established <-tcp_v4_do_rcv
<idle>-0 0..s. 40070us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40072us : tcp_rcv_established <-tcp_v4_do_rcv
The assumption was that the second occurrence of napi_gro_receive() and the first occurrence of tcp_rcv_established() were part of the same stream. In other words, we assumed that the following two trace entries were handling the same Transmission Control Protocol (TCP) packet and computed the receive latency as 2 us.
<idle>-0 0..s. 40047us : napi_gro_receive <-ibmveth_poll
<idle>-0 0..s. 40049us+: tcp_rcv_established <-tcp_v4_do_rcv
This was good as a first level approximation, but there was no way to confirm that these traces were for the same TCP packet. In this blog, we will dive deep into the Linux tracing techniques to ascertain just that.
Multiple TCP sessions
We will start by using netperf to generate traffic for a single TCP Request & Response (RR) session. Let us use a variant of the tracing techniques described in the previous blog, network latency problems with SAP/HANA workload, and start with trace-cmd. Following is a snippet of the single TCP RR session:
<idle>-0 21..s. 253959.953536382: (+25254) function: napi_gro_receive [36950:0xf00:32]
netperf-11096 22.... 253959.953544606: (+8224) function: tcp_sendmsg [33478:0x2460:32]
netperf-11096 22.... 253959.953546656: (+2050) function: mlx5e_xmit [2050:0x2484:32]
netperf-11096 22.... 253959.953548450: (+1794) function: tcp_recvmsg [1794:0x24a8:32]
Is tcp_recvmsg() handling the same packet that was seen earlier in napi_gro_receive()? Why does tcp_sendmsg() precede tcp_recvmsg()?
<idle>-0 21..s. 253959.953573998: (+25548) function: napi_gro_receive [37616:0xf24:32]
netperf-11096 22.... 253959.953581988: (+7990) function: tcp_sendmsg [33538:0x24cc:32]
netperf-11096 22.... 253959.953583916: (+1928) function: mlx5e_xmit [1928:0x24f0:32]
netperf-11096 22.... 253959.953585854: (+1938) function: tcp_recvmsg [1938:0x2514:32]
<idle>-0 21..s. 253959.953611162: (+25308) function: napi_gro_receive [37164:0xf48:32]
netperf-11096 22.... 253959.953619462: (+8300) function: tcp_sendmsg [33608:0x2538:32]
netperf-11096 22.... 253959.953621436: (+1974) function: mlx5e_xmit [1974:0x255c:32]
netperf-11096 22.... 253959.953623146: (+1710) function: tcp_recvmsg [1710:0x2580:32]
<idle>-0 21..s. 253959.953647910: (+24764) function: napi_gro_receive [36748:0xf6c:32]
netperf-11096 22.... 253959.953655998: (+8088) function: tcp_sendmsg [32852:0x25a4:32]
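For reference, a trace like the one above can be generated along the following lines. This is a hedged sketch: the exact trace-cmd options used in the previous blog may differ, and <server-ip> is a placeholder for the netperf server address.
# netperf -H <server-ip> -t TCP_RR -l 30 &
# trace-cmd record -p function -l napi_gro_receive -l tcp_sendmsg -l mlx5e_xmit -l tcp_recvmsg sleep 5
# trace-cmd report | head -20
The sleep 5 at the end of the trace-cmd record command limits the capture window while the TCP_RR session keeps running in the background.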
Some interesting questions crop up when looking at the trace data in the snippet:
- As the inline commentary mentions, why did tcp_sendmsg() precede tcp_recvmsg()?
- Are there multiple TCP connections even if we only ran a single TCP RR session?
Let us take a step back and look at the TCP sessions on an idle system. Maybe that will provide some clues. Following is an example of an idle LPAR with a netperf server running but with no traffic flowing.
# ss -lt
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
LISTEN 0 5 127.0.0.1:44321 0.0.0.0:*
LISTEN 0 10 0.0.0.0:tvdumtray-port 0.0.0.0:*
LISTEN 0 10 0.0.0.0:nut 0.0.0.0:*
LISTEN 0 128 0.0.0.0:ssh 0.0.0.0:*
LISTEN 0 4096 0.0.0.0:sunrpc 0.0.0.0:*
LISTEN 0 1024 *:netperf *:*
LISTEN 0 5 [::1]:44321 [::]:*
LISTEN 0 128 [::]:ssh [::]:*
LISTEN 0 4096 [::]:sunrpc [::]:*
LISTEN 0 128 *:rmc *:*
Note: The entry *:netperf in the snippet is the netserver daemon listening for connections. The ss -lt command was run on the server side of a netperf connection.
With that, we can answer both questions. It is now obvious that there can be other TCP sessions even when we are not actively running a netperf client against the server; that answers question #2.
It also answers question #1: it is very likely that one of these other TCP connections performed tcp_sendmsg() just before the tcp_recvmsg() of interest.
This brings up another interesting question – how can we distinguish between packets of different TCP connections? We will explore that next.
Perf probe
Each packet is assigned an skb (struct sk_buff) by the Linux networking stack. We can use the skb address to distinguish between TCP connections, but we need perf probe to do that. The perf probe command requires the kernel debuginfo packages to be installed, so first verify that they are present on the system:
# rpm -qa | grep kernel-debug*
kernel-debuginfo-common-ppc64le-5.14.0-427.13.1.el9_4.ppc64le
kernel-debug-debuginfo-5.14.0-427.13.1.el9_4.ppc64le
kernel-debuginfo-5.14.0-427.13.1.el9_4.ppc64le
This confirms that the necessary packages are installed.
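If they were not installed, a command along the following lines would typically pull them in on RHEL-based distributions. This is a hedged sketch; the exact package names depend on the running kernel.
# dnf debuginfo-install kernel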
Print the skb address at the entry point of the following four functions:
- napi_gro_receive()
- tcp_v4_rcv()
- __tcp_transmit_skb()
- __dev_queue_xmit()
These probe points are similar to what we used previously: one at the bottom and one at the top of the receive stack, and likewise on the transmit side.
# perf probe --add 'napi_gro_receive:0 skb' --add 'tcp_v4_rcv:0 skb' --add '__tcp_transmit_skb:0 skb' --add '__dev_queue_xmit:0 skb'
Added new events:
probe:napi_gro_receive (on napi_gro_receive with skb)
probe:tcp_v4_rcv (on tcp_v4_rcv with skb)
probe:__tcp_transmit_skb (on __tcp_transmit_skb with skb)
probe:__dev_queue_xmit (on __dev_queue_xmit with skb)
You can now use it in all perf tools, such as:
perf record -e probe:__dev_queue_xmit -aR sleep 1
#
Let us confirm the probe points are in place, as expected.
# perf probe -l
probe:__dev_queue_xmit (on __dev_queue_xmit:1@net/core/dev.c with skb)
probe:__tcp_transmit_skb (on __tcp_transmit_skb:2@net/ipv4/tcp_output.c with skb)
probe:napi_gro_receive (on napi_gro_receive:1@net/core/gro.c with skb)
probe:tcp_v4_rcv (on tcp_v4_rcv:1@net/ipv4/tcp_ipv4.c with skb)
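To capture data with these probes while netperf traffic flows, a perf record invocation along the following lines can be used. This is a hedged sketch; the capture duration is arbitrary.
# perf record -e probe:napi_gro_receive -e probe:tcp_v4_rcv -e probe:__tcp_transmit_skb -e probe:__dev_queue_xmit -aR -- sleep 5
# perf script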
When we run netperf, the probes print the skb addresses, as shown in the following snippet:
swapper 0 [011] 68284.651868: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000000df20000
swapper 0 [011] 68284.651870: probe:tcp_v4_rcv: (c000000000e18dc8) skb=0xc00000000df25600
ksoftirqd/11 71 [011] 68284.651878: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000000df27400
ksoftirqd/11 71 [011] 68284.651880: probe:tcp_v4_rcv: (c000000000e18dc8) skb=0xc00000000df27400
netserver 61468 [011] 68284.651889: probe:__tcp_transmit_skb: (c000000000e09918) skb=0xc000000067961900
netserver 61468 [011] 68284.651891: probe:__dev_queue_xmit: (c000000000cedd28) skb=0xc000000067961900
swapper 0 [000] 68284.651910: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000000df26f00
swapper 0 [000] 68284.651911: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000000df2d500
swapper 0 [000] 68284.651913: probe:tcp_v4_rcv: (c000000000e18dc8) skb=0xc00000000df26f00
swapper 0 [000] 68284.651917: probe:__tcp_transmit_skb: (c000000000e09918) skb=0xc000000006dca800
swapper 0 [000] 68284.651918: probe:__dev_queue_xmit: (c000000000cedd28) skb=0xc000000006dca800
swapper 0 [000] 68284.651925: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000000df2f400
From the skb addresses, we can state that the napi_gro_receive() entry at 68284.651910 and the tcp_v4_rcv() entry at 68284.651913, both with skb=0xc00000000df26f00, belong to the same TCP stream. The napi_gro_receive() entry between them, with skb=0xc00000000df2d500, could belong to a different connection.
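To isolate every event for one particular packet, a simple filter on the skb address over the perf script output is enough. A minimal sketch, using the skb address seen above:
# perf script | grep 'skb=0xc00000000df26f00'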
This is good progress, but can we correlate the send and receive sides of a single TCP connection? The send and receive sides use different skbs, so clearly skb addresses are not enough. Let us explore that in the next section.
Mark skb
The solution to the question in the previous section is to mark the skbs. Let us illustrate that with an example where lp3 and lp4 are different logical partitions (LPARs), as illustrated in Figure 1.
Figure 1: Logical partitions
Run the following command on the netperf server to mark packets on egress with 0xdeadbeef:
# iptables -t raw -A OUTPUT -d 9.xx.yy.249 -j MARK --set-mark 0xdeadbeef
Run the following command on the netperf server to mark packets on ingress with 0xdeadbeef:
# iptables -t raw -A PREROUTING -s 9.xx.yy.249 -j MARK --set-mark 0xdeadbeef
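To confirm that both mark rules are in place, list the raw table:
# iptables -t raw -L -n -v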
Delete the probes that were inserted previously.
# perf probe --del probe:__dev_queue_xmit --del probe:__tcp_transmit_skb --del probe:napi_gro_receive --del probe:tcp_v4_rcv
Removed event: probe:__dev_queue_xmit
Removed event: probe:__tcp_transmit_skb
Removed event: probe:napi_gro_receive
Removed event: probe:tcp_v4_rcv
#
Next, reinsert the probes on the Rx side, this time also capturing skb->mark:
# perf probe --add 'napi_gro_receive:0 skb skb->mark' --add 'tcp_v4_rcv:0 skb skb->mark'
Added new events:
probe:napi_gro_receive (on napi_gro_receive with skb mark=skb->mark)
probe:tcp_v4_rcv (on tcp_v4_rcv with skb mark=skb->mark)
You can now use it in all perf tools, such as:
perf record -e probe:tcp_v4_rcv -aR sleep 1
#
Reinsert the probes on the Tx side:
# perf probe --add '__tcp_transmit_skb:0 skb skb->mark' --add '__dev_queue_xmit skb skb->mark'
Added new events:
probe:__tcp_transmit_skb (on __tcp_transmit_skb with skb mark=skb->mark)
probe:__dev_queue_xmit (on __dev_queue_xmit with skb mark=skb->mark)
You can now use it in all perf tools, such as:
perf record -e probe:__dev_queue_xmit -aR sleep 1
#
Confirm that the inserted probes are indeed in place.
# perf probe -l
probe:__dev_queue_xmit (on __dev_queue_xmit:1@net/core/dev.c with skb mark)
probe:__tcp_transmit_skb (on __tcp_transmit_skb:2@net/ipv4/tcp_output.c with skb mark)
probe:napi_gro_receive (on napi_gro_receive:1@net/core/gro.c with skb mark)
probe:tcp_v4_rcv (on tcp_v4_rcv:1@net/ipv4/tcp_ipv4.c with skb mark)
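As before, record all four probes while the netperf client is running (a hedged sketch of the record step), and then dump the trace to a text file:
# perf record -e probe:napi_gro_receive -e probe:tcp_v4_rcv -e probe:__tcp_transmit_skb -e probe:__dev_queue_xmit -aR -- sleep 5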
# perf script > report01
#
swapper 0 [000] 72937.046385: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000006cb37900 mark=0x0
swapper 0 [000] 72937.046387: probe:tcp_v4_rcv: (c000000000e18dc8) skb=0xc00000006cb33100 mark=0xdeadbeef
swapper 0 [000] 72937.046393: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000006cb3a100 mark=0x0
swapper 0 [000] 72937.046395: probe:tcp_v4_rcv: (c000000000e18dc8) skb=0xc00000006cb3a100 mark=0xdeadbeef
netserver 63929 [001] 72937.046398: probe:__tcp_transmit_skb: (c000000000e09918) skb=0xc00000005e9eae00 mark=0x0
netserver 63929 [001] 72937.046400: probe:__dev_queue_xmit: (c000000000cedd28) skb=0xc00000005e9eae00 mark=0xdeadbeef
swapper 0 [003] 72937.046428: probe:napi_gro_receive: (c000000000d3bdb8) skb=0xc00000006cb3e400 mark=0x0
…
Let us focus on four lines in the snippet: the napi_gro_receive() and tcp_v4_rcv() entries at 72937.046393 and 72937.046395, and the __tcp_transmit_skb() and __dev_queue_xmit() entries at 72937.046398 and 72937.046400. As discussed earlier, napi_gro_receive() and tcp_v4_rcv() have the same skb address. Now look at the skb address of __dev_queue_xmit(), which carries the mark 0xdeadbeef. It is the same skb address as the preceding __tcp_transmit_skb(), so we can infer it is the same packet. Using the 0xdeadbeef mark, we can link the Rx and Tx sides of the connection. We can also assert that both the Rx and Tx latencies in this particular case were 2 us.
There is still one question that needs to be addressed. Why does the packet have mark=0x0 at napi_gro_receive() and __tcp_transmit_skb() and not 0xdeadbeef?
Recall that we used iptables rules in the raw table to mark the skb. On ingress, when the packet hits napi_gro_receive(), it has not yet travelled up to the IP layer where the PREROUTING rule runs, so the skb has not been marked and the mark is still 0. Similar logic applies on the egress side: at __tcp_transmit_skb() the packet is still at the TCP layer and has not yet reached the IP layer where the OUTPUT rule marks it.
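Putting it all together, the per-packet latencies can be extracted from report01 mechanically. The following awk sketch assumes the field layout shown in the snippet above (timestamp in field 4, skb=... in field 7); it pairs the Rx probes and the Tx probes by skb address and prints the deltas in microseconds. It is illustrative only and does not handle skb address reuse over long traces.
awk '
  { t = $4; sub(/:$/, "", t) }                               # timestamp without the trailing ":"
  /probe:napi_gro_receive:/   { rx[$7] = t }                 # skb entered the receive stack
  /probe:tcp_v4_rcv:/         { if ($7 in rx) printf "Rx %s %.0f us\n", $7, (t - rx[$7]) * 1e6 }
  /probe:__tcp_transmit_skb:/ { tx[$7] = t }                 # TCP handed the skb down for transmit
  /probe:__dev_queue_xmit:/   { if ($7 in tx) printf "Tx %s %.0f us\n", $7, (t - tx[$7]) * 1e6 }
' report01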
Summary
This blog provided techniques to identify packets from different TCP streams and to correlate the receive and transmit sides of a connection. We no longer need to make the assumptions made in the previous blog; per-packet latencies can now be asserted from the skb addresses and skb marks visible in the perf reports.