0. Introduction
When performing performance tests with OpenShift, some configuration changes might be required. In the following I will also list activities which could turn into blockers and whose solution might not be obvious.
1. Pre-reqs and configuration settings of hardware for OpenShift
1.1 What hardware is required and how one can upgrade
Sizing the system well is important in order not to hit bottlenecks or waste resources. I will add some hints and tips here in the future.
1.2 Disk space for worker nodes and evicted pods
During the tests it turned out that the hard disks of the worker nodes were not sized large enough. An indicator for this can be a large number of evicted pods on your system. We are running OpenShift on a VMware vSphere controlled hypervisor, so the solution was rather simple: shut down the OpenShift cluster as described in the documentation (I will add a more detailed description here at a later point) and increase the disk size via the VMware tools. CoreOS on the worker nodes will reflect the change in the OpenShift disk information, but the filesystem will not be grown automatically. To do this, connect to each worker node via a debug session and run chroot /host. The filesystem can only be grown while it is mounted writable, which is not the case by default. This can be achieved with the following command (check the device name with fdisk -l, it does not need to be /dev/sda4):
sudo mount -o remount,rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota /dev/sda4 /sysroot
When the filesystem is writable, you can execute xfs_growfs to grow the filesystem to the full size of the partition (be aware that shrinking it again at a later point in time is not possible):
xfs_growfs /sysroot
I did not figure out a way to get the filesystem back into read-only mode without a cluster reboot (this might be added at a later point in time).
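For completeness, this is how I open the debug session and check the device before running the commands above (the node name is only an example; the device backing /sysroot does not have to be /dev/sda4):
oc debug node/worker-0.example.com
chroot /host
# Check which device and partition back /sysroot and their current size
lsblk
df -h /sysroot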
2. Network configuration settings which matter
During high load tests I observed packet loss and TCP retransmissions. This was the trigger to dig deeper into the OpenShift network layers.
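For reference, a simple way to see whether retransmissions are happening is to compare the kernel TCP counters before and after a test run, either inside a pod's network namespace or on the node (this assumes the nstat tool from iproute2 is available):
# Print the absolute values of the retransmission related counters
nstat -az TcpRetransSegs TcpExtTCPLostRetransmit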
2.1 The Open vSwitch configuration
During stress testing of the Business Automation Workflow on Container product I observed a significant number of packet retransmissions and lost packets. I found a hint about the max-idle parameter of Open vSwitch, which can have a significant impact on reliability, especially in a benchmarking context.
So how do you set the max-idle parameter? Again, open a debug session to the worker node and execute chroot /host. First check the current settings; normally the parameter should not be set:
ovs-vsctl list Open_vSwitch
Now you can set the parameter. You might want to experiment with the value (given in milliseconds):
ovs-vsctl --no-wait set Open_vSwitch . other_config:max-idle=50000
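To verify that the value has been applied, you can read it back:
ovs-vsctl get Open_vSwitch . other_config:max-idle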
To remove the setting you can use the following command (be aware of other settings that might be stored in other_config, as this clears the whole column):
ovs-vsctl --no-wait clear Open_vSwitch . other_config
There is a potential issue with lost packets which is supposed to be fixed with a later OpenShift 4.7.z release (kernel 4.18.0-254.el8 or later is required). Be aware that there can be further causes of lost packets and packet retransmissions, so expect to read more on this topic in the future. You might also want to check the Open vSwitch FAQ.
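To check which kernel your nodes are currently running, no debug session is needed; the wide node listing already contains it:
# The KERNEL-VERSION column shows the running kernel of each node
oc get nodes -o wide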
2.2 How to debug such a packet loss scenario
After the packet loss issue was identified, I raised a ticket with OpenShift support. We agreed on the following steps for data collection. The main idea is to look at a single pod-to-pod communication which ideally happens on the same node; as soon as you leave the node, further components like the SDN get involved, which complicates things. The plan is to collect tcpdumps not only in the involved pods, but also on the involved Open vSwitch, which performs the node-internal networking in OpenShift.
For the moment I will add a number of references here to share the idea; at a later point in time I might streamline this a little.
How to perform the tcpdump
The data collection for the pods is rather straightforward and described in https://access.redhat.com/solutions/4569211:
NAME=demo-instance1-baw-server-0
NAMESPACE=demo
# Resolve the pod sandbox ID and the corresponding process ID on the node
pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
pid=$(chroot /host bash -c "runc state $pod_id | jq .pid")
# List the interfaces of the pod's network namespace, then capture on eth0
nsenter -n -t $pid -- tcpdump -D
nsenter -n -t $pid -- tcpdump -nn -i eth0 -w /host/tmp/${HOSTNAME}_baw_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap
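Since the scenario is a single pod-to-pod communication, it can also help to filter the capture on the peer pod's IP address to keep the file small (the IP address below is only an example):
nsenter -n -t $pid -- tcpdump -nn -i eth0 -w /host/tmp/${HOSTNAME}_baw_filtered_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap host 10.128.2.10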
To capture on the specific Open vSwitch port you need to figure out which network interface on the node has to be traced. This can be done in the following way:
sh-4.4# NAME=demo-instance1-baw-server-0
sh-4.4# NAMESPACE=demo
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# pid=$(chroot /host bash -c "runc state $pod_id | jq .pid")
sh-4.4# nsenter -n -t $pid -- ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0a:58:0a:80:02:0a brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.128.2.10/23 brd 10.128.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::3090:a3ff:fe26:814c/64 scope link
valid_lft forever preferred_lft forever
sh-4.4# ip a
...
19: veth8dbb30bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
link/ether 72:67:cc:ed:d2:5b brd ff:ff:ff:ff:ff:ff link-netnsid 8
inet6 fe80::7067:ccff:feed:d25b/64 scope link
valid_lft forever preferred_lft forever
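In the pod output above, eth0@if19 tells you that the peer of the pod's eth0 is the interface with index 19 on the node, which resolves to veth8dbb30bb. A quick way to look it up directly (19 being the index taken from your own pod output):
# List only the node interface with index 19, i.e. the @ifNN value seen inside the pod
ip -o link | grep '^19:'
With the veth port identified, the capture on the node can be started: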
FILENAME="/host/var/tmp/${HOSTNAME}_baw_veth_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap"
tcpdump -nn -s 0 -i veth8dbb30bb -w ${FILENAME}
Additional data to be collected
1) To keep an eye on the system utilisation, the following script can be used (it requires the mpstat command, which is part of the sysstat package):
#!/bin/bash
while (true)
do
mpstat -P ALL 1
done
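I save the script, for example as cpu_util.sh (the name is just an example), start it in the background on the node and redirect its output to a file:
chmod +x cpu_util.sh
nohup ./cpu_util.sh > 002_mpstat.txt 2>&1 &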
2) To simplify the data collection, I wrote a small script to capture data every second. It needs to be run on the node after chroot /host, and I pipe its standard output to a file. One can surely improve the script further, but it did the job for me:
#!/bin/bash
echo "ovs-appctl coverage/show"
ovs-appctl coverage/show
while (true)
do
date >> 001_ovs-dpctl-dump-flows.txt
echo "ovs-dpctl dump-flows -m :::"
ovs-dpctl dump-flows -m >> 001_ovs-dpctl-dump-flows.txt
date >> 001_ovs-ofctl-dump-flows.txt
echo "ovs-ofctl -O OpenFlow13 dump-flows br0 :::"
ovs-ofctl -O OpenFlow13 dump-flows br0 >> 001_ovs-ofctl-dump-flows.txt
date >> 001_ovs-appctl.txt
echo "ovs-appctl upcall/show :::"
ovs-appctl upcall/show >> 001_ovs-appctl.txt
date >> 001_nf_conntrack_count.txt
echo "/proc/sys/net/netfilter/nf_conntrack_count"
cat /proc/sys/net/netfilter/nf_conntrack_count >> 001_nf_conntrack_count.txt
date >> 001_nf_conntrack_max.txt
echo "/proc/sys/net/nf_conntrack_max"
cat /proc/sys/net/nf_conntrack_max >> 001_nf_conntrack_max.txt
date >> 001_nf_conntrack.txt
echo "/proc/net/stat/nf_conntrack"
cat /proc/net/stat/nf_conntrack >> 001_nf_conntrack.txt
date >> 001_softnet_stat.txt
echo "/proc/net/softnet_stat"
cat /proc/net/softnet_stat >> 001_softnet_stat.txt
date
sleep 1
done
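Saved for example as collect_ovs.sh (again, the name is just an example), I start it like this; the standard output (the initial coverage/show output plus the per-iteration labels) goes into a file, while the per-second data is appended to the 001_* files in the current directory:
chmod +x collect_ovs.sh
nohup ./collect_ovs.sh > 000_ovs-coverage-before-test.txt 2>&1 &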
After the test
When the test has completed, some additional data is required. Rerun the following command and pipe its output to a file:
ovs-appctl coverage/show
Besides the output files and the tcpdumps, also collect the OVS logs, which you can find under /var/log/openvswitch/.
Collect a sosreport as described in https://access.redhat.com/solutions/5065411 and https://access.redhat.com/solutions/4387261 (I made a small change to the command by adding the --allow-system-changes option, otherwise not all data will be collected):
sosreport -k crio.all=on -k crio.logs=on --allow-system-changes
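To get the collected files (tcpdumps, script output, OVS logs, sosreport archive) off the node, one option is to stream them through a debug pod; the node name and paths below are only examples:
# Stream the files collected under /var/tmp on the node into a local tar archive
oc debug node/worker-0.example.com -- tar cf - -C /host/var/tmp . > worker-0_var_tmp.tar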
3. Upgrade issues of OpenShift
In some cases the normally well-working upgrade between OpenShift versions will not work as expected. Here I will list the issues which came up over time.
3.1 Monitoring stays in degraded state after an upgrade
During the upgrade from OpenShift 4.5 to the latest OpenShift 4.6.9 I noticed that the upgrade did not complete successfully: the monitoring cluster operator was in a degraded state. Researching the reported error message did not give a hint at what the issue might be. But when digging a little deeper, it turned out that killing the pods in the openshift-monitoring namespace which were still running from before the upgrade was sufficient to let the upgrade succeed in the end.
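If you run into the same situation, the state can be checked and the old pods recycled with standard oc commands (a sketch; deleting the pods simply forces them to be recreated by their controllers):
# Check which cluster operators are degraded and why
oc get clusteroperators
oc describe clusteroperator monitoring
# List the monitoring pods and delete the ones still running from before the upgrade
oc get pods -n openshift-monitoring
oc delete pod <pod-name> -n openshift-monitoring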