BPM, Workflow, and Case


Diagnosing a network issue in the OpenShift network layer - Business Automation Workflow

By Stephan Volz posted Tue March 16, 2021 09:25 AM

  

0) Introduction

Why would I want to debug the OpenShift network layer? During one of our tests I saw socket timeout exceptions on pods. When tracing the sending pod one could see that the packet should have been sent out, but it never showed up on the receiving pod. The best way to track this down was to enable tcpdump on the OpenShift node (the details are described below). With this information I could see packet retransmissions, and the door for further analysis had been opened.

When working with the network on OpenShift, one should be aware that network communication between pods on the same node goes through Open vSwitch, a virtual switch. To simplify testing it can be beneficial to place the pods under test on the same node only; otherwise additional components of the SDN network need to be considered during debugging as well.

1) The tools required

As already mentioned in section 0), tcpdump was an essential tool for debugging the issue. I will describe the steps we performed, in case you find a similar issue in the future and want to know what we did.

1.1) Run tcpdump on the node or the pod?

The answer depends. To get a quick overview, data collection on the node is usually better suited, especially if you do not yet know where the issue is actually located. If you want to dig deeper and have already identified where the issue is observed, it can be necessary to run tcpdump on the pods and on the OpenShift network layer.

1.2) Running tcpdump on a node

The following is a streamlined version of the Red Hat instructions, see the details here: https://access.redhat.com/solutions/5074041

First you need to connect to the node in debug mode:

$ oc get nodes
$ oc debug -t --image registry.redhat.io/rhel8/support-tools:latest node/${NODE_NAME}

Define the output file name and start the tcpdump:

# FILENAME="/host/var/tmp/${HOSTNAME}_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap"
# tcpdump -nn -s 0 -i ${INTERFACE} -w ${FILENAME} ${TCPDUMP_EXTRA_PARAMS}

Copy the pcap file outside the node before you end the debug session:

scp /var/tmp/XXXXX.pcap USER@MY_ANALYSIS_MACHINE:~/
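If you already know which pods and ports are involved, it can help to narrow the capture with a filter. The following is only an illustration; the interface name, IP address, and port are placeholders that need to be replaced with the values from your environment:

# INTERFACE="ens3"
# TCPDUMP_EXTRA_PARAMS="host 10.130.2.177 and port 9443"
# tcpdump -nn -s 0 -i ${INTERFACE} -w ${FILENAME} ${TCPDUMP_EXTRA_PARAMS}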

1.3) Running tcpdump on a pod

This is a streamlined description of the Red Hat support document; a detailed description can be found here: https://access.redhat.com/solutions/4569211

Start a debug session on the node hosting the pod:

oc debug node/<nodename>

Define environment variables:

In OpenShift 4.8 and earlier:

# NAME=<pod-name>
# NAMESPACE=<pod-namespace>
# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
# pid=$(chroot /host bash -c "runc state $pod_id | jq .pid")

In OpenShift 4.9 and later:

# NAME=<pod-name>
# NAMESPACE=<pod-namespace>
# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
# ns_path="/host/$(chroot /host bash -c "crictl inspectp $pod_id | jq '.info.runtimeSpec.linux.namespaces[]|select(.type==\"network\").path' -r")"

Check the interface names first and adjust the interface in the second command. Note that the commands below use the $pid variable from the OpenShift 4.8 and earlier variant; for 4.9 and later see the sketch after the code block:

# nsenter -n -t $pid -- tcpdump -D
# nsenter -n -t $pid -- tcpdump -nn -i ${INTERFACE} -w /host/tmp/${HOSTNAME}_$(date +\%d_%m_%Y-%H_%M_%S-%Z).pcap
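For OpenShift 4.9 and later, the ns_path variable defined above is used instead of the process ID. Assuming the nsenter --net form from the Red Hat document, the equivalent calls should look roughly like this:

# nsenter --net=${ns_path} -- tcpdump -D
# nsenter --net=${ns_path} -- tcpdump -nn -i ${INTERFACE} -w /host/tmp/${HOSTNAME}_$(date +\%d_%m_%Y-%H_%M_%S-%Z).pcap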

Press Ctrl+C when you are ready to end the capture. Before exiting the debug container, copy the packet capture out of the cluster, e.g. with scp.

1.4) Additional tools to monitor the data flows on the network

1.4.1 mpstat

The mpstat tool (part of the sysstat package) can be installed and used to monitor the system utilization. I used a quick script to log the CPU utilization during the test runs (end it with Ctrl+C):


#!/bin/bash
# Print per-CPU utilization once per second until interrupted
while true
do mpstat -P ALL 1
done
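
To keep the output for later comparison with the packet captures, the script output can simply be redirected into a log file (the script name here is just an assumed example):

$ ./mpstat_loop.sh | tee mpstat_$(date +%d_%m_%Y-%H_%M_%S).log
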
1.4.2 Must-Gather

Red Hat provides a script to collect cluster information, which was also used for collecting data during my test scenarios: 

$ oc adm must-gather
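
By default must-gather writes into a directory created in the current working directory; a destination directory can also be set explicitly (the directory name below is just an example):

$ oc adm must-gather --dest-dir=./must-gather-$(date +%d_%m_%Y)
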
1.4.3 sosreport

This is a streamlined version of the Red Hat documentation, details can be found here: https://access.redhat.com/solutions/4387261

Connect to the node in debug mode:

oc debug node/<NODENAME>
sh-4.2# chroot /host bash

Start the toolbox and run sosreport (enable the allow-system-changes option, otherwise not all features might be available):

[root@MYNODE /]# toolbox
[root@MYNODE /]# sosreport -k crio.all=on -k crio.logs=on --allow-system-changes

1.4.4 The Open vSwitch logs

The following is a script to monitor a number of parameters of special interest for the Open vSwitch component:


#!/bin/bash

echo "ovs-appctl coverage/show"
ovs-appctl coverage/show

while true
do

date >> 001_ovs-dpctl-dump-flows.txt
ovs-dpctl dump-flows -m >> 001_ovs-dpctl-dump-flows.txt

date >> 001_ovs-ofctl-dump-flows.txt
ovs-ofctl -O OpenFlow13 dump-flows br0 >> 001_ovs-ofctl-dump-flows.txt

date >> 001_ovs-appctl.txt
ovs-appctl upcall/show >> 001_ovs-appctl.txt

date >> 001_nf_conntrack_count.txt
cat /proc/sys/net/netfilter/nf_conntrack_count >> 001_nf_conntrack_count.txt

date >> 001_nf_conntrack_max.txt
cat /proc/sys/net/nf_conntrack_max >> 001_nf_conntrack_max.txt

date >> 001_nf_conntrack.txt
cat /proc/net/stat/nf_conntrack >> 001_nf_conntrack.txt

date >> 001_softnet_stat.txt
cat /proc/net/softnet_stat >> 001_softnet_stat.txt

date
sleep 1
done
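
The script appends its output to files in the current working directory once per second. It can, for example, be started in the background from the node session and stopped with kill once the test run is finished (the script name is just an assumed example):

# nohup ./ovs_monitor.sh > ovs_monitor.log 2>&1 &
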
1.5) Summary

We have now seen which tools exist and in which scenario each of them can be used. In the next section we will cover how they were used in our specific problem scenario.

2) Data collection process

In the last section we saw which tools are available; now we need to combine them to capture the complete picture. On the one hand we need the general system utilization, on the other hand we need to track the complete packet flow.

OpenShift has multiple network layers. The easiest approach, which we chose for our test, was to scale each service involved in the problem down to a single pod and to place the involved pods on the same node, as sketched below (for how to force pods onto the same node via node selectors see: https://docs.openshift.com/container-platform/4.6/nodes/scheduling/nodes-scheduler-node-selectors.html).
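As an illustration only (the deployment names and the node name are placeholders, and deployments managed by an operator may need the node selector set in the operator custom resource instead), such a setup could look roughly like this:

$ oc scale deployment/my-baw-deployment --replicas=1
$ oc patch deployment/my-baw-deployment -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"my-node"}}}}}'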

Traffic between pods on the same node is switched via Open vSwitch. This switch is a critical component and was also part of the problem I found. If the pods are placed on different nodes, the SDN layer comes into play as well, and it becomes significantly more difficult to track all involved components. It is therefore highly recommended to test on the same node for data capture if possible.

2.1 The network flow

We have seen how to capture tcpdumps on pods. But another critically involved component is the Open vSwitch. To also capture traffic at the ports of the virtual switch, one first needs to find the ports in use.

First find the node your pod is running on. This can be done with the following command:

oc get pods -o wide

Then you can connect to the corresponding node in debug mode (replace my_node with your node name):

oc debug node/my_node

In the debug session execute the following commands (they run the node's ip command inside the pod's network namespace; replace the bracketed pod-name and pod-namespace with your values):


sh-4.4# NAME=<pod-name>  
sh-4.4# NAMESPACE=<pod-namespace>
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# pid=$(chroot /host bash -c "runc state $pod_id | jq .pid")
sh-4.4# nsenter -n -t $pid -- ip a

A sample output can be seen here:


sh-4.4# nsenter -n -t $pid -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
3: eth0@if3187: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether 0a:58:0a:82:02:b1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.130.2.177/23 brd 10.130.3.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::e43e:b6ff:feeb:49a6/64 scope link 
       valid_lft forever preferred_lft forever

From the output you can see the involved interface. The @if3187 suffix of the pod's eth0 is the interface index of its peer interface on the node, so you can map it to the node network interface that needs to be captured with tcpdump. For this you run the ip command on the node itself:


sh-4.4# ip a 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
...
3187: veth5b024d2c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 7e:76:59:e3:7d:84 brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::7c76:59ff:fee3:7d84/64 scope link 
       valid_lft forever preferred_lft forever

Thus, to capture the traffic at the Open vSwitch port, you need to capture tcpdump data from the veth5b024d2c network interface on the node.
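For this example the capture on the Open vSwitch port could then look roughly as follows, reusing the pattern from section 1.2 (the interface name and output file name are of course specific to your environment):

sh-4.4# tcpdump -nn -s 0 -i veth5b024d2c -w /host/var/tmp/${HOSTNAME}_veth5b024d2c_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap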


3) The results

As part of our data collection we could identify two issues with OpenShift 4. The following subsections give a further description of the observed problems (the data analysis shared here was done by Red Hat support, special thanks to Eelco) and their solutions. You can skip to the summary paragraph at the end of each section if you are not interested in the technical details of the issue.

3.1 NAT translation issue

In our scenario we are using a BAW and a CPE pod, which are part of the Business Automation Workflow on Containers offering. We noticed the problem on the application side as a socket timeout for some calls, while other calls a few milliseconds apart succeeded without any issue. In the tcpdump we could see that almost exclusively control packets were affected by the packet retransmissions. The following is the packet flow as it was seen for a specific scenario of our tests (analysis done by Red Hat support).

CPE starting a connection to BAW service IP address:
10.128.3.202:60170 -> 172.30.249.115:9443 [S] (1713889923,0)
10.128.3.202:60170 -> 172.30.249.115:9443 [S] (1713890930,0)
10.128.3.202:60170 -> 172.30.249.115:9443 [S] (1713892981,0)
10.128.3.202:60170 -> 172.30.249.115:9443 [S] (1713897019,0)
10.128.3.202:60170 -> 172.30.249.115:9443 [S] (1713905096,0)

Note that there are only TCP SYN packets to establish a connection. On the BAW side we see:

BAW:
10.128.3.202:59906 -> 10.128.3.214:9443 [S] (1713889923,0)
10.128.3.214:9443 -> 10.128.3.202:59906 [SA] (3455876045,1713889923)
10.128.3.214:9443 -> 10.128.3.202:59906 [SA] (3455877052,1713889923)
10.128.3.202:59906 -> 10.128.3.214:9443 [S] (1713890930,0)
10.128.3.214:9443 -> 10.128.3.202:59906 [SA] (3455877053,1713889923)
10.128.3.214:9443 -> 10.128.3.202:59906 [SA] (3455879103,1713889923)
10.128.3.202:59906 -> 10.128.3.214:9443 [S] (1713892981,0)

Notice that BAW is replying to CPE, but we did not see these replies in the CPE output above. If one digs deeper on the CPE side, one can find the missing packets:

CPE:
10.128.3.214:9443 -> 10.128.3.202:60170 [SA] (3455876045,1713889923)
10.128.3.202:60170 -> 10.128.3.214:9443 [R] (-,-)
10.128.3.214:9443 -> 10.128.3.202:60170 [SA] (3455877052,1713889923)
10.128.3.202:60170 -> 10.128.3.214:9443 [R] (-,-)
10.128.3.214:9443 -> 10.128.3.202:60170 [SA] (3455877053,1713889923)
10.128.3.202:60170 -> 10.128.3.214:9443 [R] (-,-)
10.128.3.214:9443 -> 10.128.3.202:60170 [SA] (3455879103,1713889923)
10.128.3.202:60170 -> 10.128.3.214:9443 [R] (-,-)


OK, but then where are the TCP RST packets on the BAW side? Looking deeper into the BAW capture one can find them here:

BAW:
10.128.3.202.60170 > 10.128.3.214.tungsten-https: Flags [R]
10.128.3.202.60170 > 10.128.3.214.tungsten-https: Flags [R]
10.128.3.202.60170 > 10.128.3.214.tungsten-https: Flags [R]
10.128.3.202.60170 > 10.128.3.214.tungsten-https: Flags [R]
10.128.3.202.60170 > 10.128.3.214.tungsten-https: Flags [R]


In summary:

1. CPE started a connection to BAW service IP.
2. BAW received a DNAT'ed 5-tuple and replied back.
3. SYN-ACK didn't get unDNAT'ed, so it's unexpected to CPE.
4. CPE then correctly sends a TCP RST back directly to BAW (no DNAT).
5. BAW receives the TCP RST, but ignores it since the 5-tuple is unexpected.
6. CPE retransmits the TCP SYN and the whole sequence starts over.
7. The connection fails to establish.

The problem is at step #3, in which the SYN-ACK is sent directly from BAW to CPE without un-DNATing the 5-tuple. The solution is provided with OpenShift 4.7.1 in the fast channel, which includes a kernel update fixing the issue (RHEL 8.3 kernel version 4.18.0-254.el8 and above).
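Whether a node already runs a kernel at this level can be checked, for example, from a debug session (the node name is a placeholder):

$ oc debug node/my_node -- chroot /host uname -r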

The nature of the root cause also explains why the workaround described in section 3.3) can help to reduce the occurrence of the issue when the max-idle property is set (in case you are running an earlier OpenShift version, as this fix will not be backported): good flows are retained longer in the cache, which prevents the issue from happening.

3.2 Collision of service IP and pod IP

The following data was collected with a fix for the NAT issue described in the previous section already applied. Looking at the new set of traces, we no longer see the wrongly translated NAT addresses. However, we still see that setting up a TCP session sometimes needs a number of retries before it succeeds. We again have the same scenario of BAW and CPE container communication (10.128.2.10:xxxx->10.128.2.6:9443). The analysis was again done by Red Hat support.

The captured data implied ephemeral port collisions between the direct traffic and the service NATted traffic. For example:

- 10.128.2.10:36440->10.128.2.6:9443, seq=1284738543:
TX LEFT >> number:12324, time: 2021-02-15 13:53:53.103262 UTC
TX LEFT >> number:12351, time: 2021-02-15 13:53:54.129987 UTC
TX LEFT >> number:12392, time: 2021-02-15 13:53:56.177958 UTC
TX LEFT >> number:12469, time: 2021-02-15 13:54:00.212106 UTC
TX LEFT >> number:12626, time: 2021-02-15 13:54:08.593946 UTC
RX COLL >> number:8298, time: 2021-02-15 13:52:18.652697 UTC
TX ??? >> number:10056, time: 2021-02-15 13:52:18.652517 UTC: 10.128.2.10:36440->172.30.32.208:36440, seq=4259382902

In summary:
The problem has been accepted by Red Hat and a fix is planned for OpenShift 4.8, with backports planned for OpenShift 4.6 and 4.7. In the context of issue 1, a move to 4.7 is most likely more advisable.

3.3 Workaround

The following commands can be used to reduce the occurrence of the packet retransmissions (this is a node-level property). This will not completely prevent the issue; it will only reduce the impact.

First you can check what properties are defined for the Open vSwitch:


sh-4.4# ovs-vsctl list Open_vSwitch
_uuid               : aa610328-28de-4c80-a00b-7e95fc03d276
bridges             : [e8cd5666-1392-4de4-b1c9-aae7811227e1]
cur_cfg             : 126
datapath_types      : [netdev, system]
datapaths           : {}
db_version          : "8.2.0"
dpdk_initialized    : false
dpdk_version        : "DPDK 19.11.3"
external_ids        : {hostname=localhost, rundir="/var/run/openvswitch", system-id="f62337f3-7e73-4490-89ad-4aa1461e8614"}
iface_types         : [erspan, geneve, gre, internal, ip6erspan, ip6gre, lisp, patch, stt, system, tap, vxlan]
manager_options     : []
next_cfg            : 126
other_config        : {}
ovs_version         : "2.13.2"
ssl                 : []
statistics          : {}
system_type         : rhcos
system_version      : "4.6"

The following command increases the flow cache idle timeout of the Open vSwitch (the max-idle value is given in milliseconds); this is the actual workaround setting:


ovs-vsctl --no-wait set Open_vSwitch . other_config:max-idle=50000
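Afterwards you can verify that the setting is in place by querying the other_config column again; the max-idle key should now show up in the output:

ovs-vsctl get Open_vSwitch . other_config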

If you have received a fix for the two issues mentioned above, or you want to restore the default behaviour, you can remove the setting again. Important: the command below clears the complete other_config property, so first check that no other settings you want to keep have been added to it in the meantime.


ovs-vsctl --no-wait clear Open_vSwitch . other_config
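If other_config does contain additional keys that you want to keep, removing only the max-idle key should also be possible, based on the generic ovs-vsctl remove syntax for map columns:

ovs-vsctl --no-wait remove Open_vSwitch . other_config max-idle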

4) Conclusion

The described procedures allowed me to get to the bottom of issues which had not been understood before from a product perspective. Adjustments for your specific scenario may be required. As this kind of debugging is not done very often, I thought it would be useful to share.

At the time of writing (15th March 2021) the first observed issue is fixed with OpenShift 4.7.1 in the fast channel (this requires an RHEL 8.3 kernel version of 4.18.0-254.el8 or above). The second issue is planned to be fixed with OpenShift 4.8 and will receive a backport for 4.6 and 4.7 (most likely in 4.7.3 or, at the latest, 4.7.4).

Update from 14th June 2021: In a new test with OpenShift 4.7.13 both issues were confirmed to be fixed. Similar problems could also be seen with lost FIN-ACK packets, which should be fixed as part of the second issue's fix as well (this has not yet been verified though).

Special thanks to my colleagues from IBM/Red Hat for their support during the observed issues: Eelco, Lu, Manish, Marica

R) References:

https://access.redhat.com/solutions/4569211 How to use tcpdump inside OpenShift v4 Pod
https://access.redhat.com/solutions/5074041 How to provide a tcpdump from a RHEL CoreOS OpenShift node
https://access.redhat.com/solutions/3820762 How to generate a sosreport in Red Hat Enterprise Linux CoreOS 4.x with SSH access to master nodes?
https://access.redhat.com/solutions/4387261 How to generate SOSREPORT within OpenShift4 nodes without SSH
https://github.com/openshift/sdn/pull/269 Bug 1910378: networkpolicy: pass traffic through NAT to handle possible tuple collisions #269
https://github.com/openshift/sdn/pull/272 Bug 1936920: networkpolicy: pass traffic through NAT to handle possible tuple collisions #272




Comments

Wed March 24, 2021 09:40 AM

Hi Vincent, the RH case number is 02814115. A fix for the second issue is planned for 4.7.3 or, at the latest, 4.7.4.

Tue March 23, 2021 08:26 PM

Nice work Stephan, do you have RH case number for the 2 issues and/or do you know when they will address the 2 issues mentioned?