BPM, Workflow, and Case

Considerations for OpenShift clusters with high workloads of business automation workflows

By Stephan Volz posted Thu January 07, 2021 05:26 PM


0. Introduction

When running performance tests on OpenShift, some configuration changes might be required. In the following I will also list issues that can become blockers and whose solutions are not always obvious.

1. Pre-reqs and configuration settings of hardware for OpenShift

1.1 What hardware is required and how one can upgrade

Sizing the system well is important to avoid bottlenecks without wasting resources. I will add some hints and tips here in the future.

1.2 Disk space for worker nodes and evicted pods

During the tests it turned out that the hard disks of the worker nodes were not sized large enough. An indicator for this is a large number of evicted pods on the system. We run OpenShift on a VMware vSphere controlled hypervisor, so the solution was rather simple: shut down the OpenShift cluster as described in the documentation (I will add a more detailed description here at a later point) and increase the disk size via the VMware tools. CoreOS on the worker nodes will reflect the change in the OpenShift disk information, but the filesystem will not be grown automatically. To do this, you need to connect to each worker node via a debug session and use the chroot /host command. To grow the filesystem, it first needs to be writable, which it is not by default. This can be done with the following command (check the device name, which does not need to be /dev/sda4; use fdisk -l):

sudo mount -o remount,rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota /dev/sda4 /sysroot

Once the filesystem is writable, you can run xfs_growfs to grow the filesystem to the full partition (be aware that shrinking it again later is not possible):

xfs_growfs /sysroot

I did not figure out a way to get the filesystem back into read-only mode without a cluster reboot (this might be added at a later point in time).
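Putting the two steps together, a small helper like the following can reduce the risk of running the grow command on the wrong device. This is my own sketch, not part of any OpenShift tooling: the device and mount point defaults match the example above, the remount options are simplified, and nothing is executed until CONFIRM=yes is set.

```shell
# Guarded wrapper around the remount + grow sequence; run inside a node
# debug session after `chroot /host`. Defaults to a dry run that only
# prints the commands it would execute.
grow_sysroot() {
    local device=${1:-/dev/sda4} mountpoint=${2:-/sysroot}
    if [ "${CONFIRM:-no}" != "yes" ]; then
        echo "DRY-RUN: mount -o remount,rw ${device} ${mountpoint}"
        echo "DRY-RUN: xfs_growfs ${mountpoint}"
        return 0
    fi
    # Remount writable, grow the XFS filesystem to the full partition,
    # then show the new size.
    mount -o remount,rw "${device}" "${mountpoint}"
    xfs_growfs "${mountpoint}"
    df -h "${mountpoint}"
}

grow_sysroot    # dry run with the defaults from the text above
# CONFIRM=yes grow_sysroot /dev/sda4 /sysroot   # actually apply
```

Check the dry-run output against `fdisk -l` before setting CONFIRM=yes.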

2. Network configuration settings which matter

During high-load tests I observed packet loss and TCP retransmissions. This was the trigger to dig deeper into the OpenShift network layers.

2.1 The Open vSwitch configuration

During stress testing of the Business Automation Workflow on Containers product I observed a significant number of packet retransmissions and lost packets. I found a hint about the max-idle parameter of Open vSwitch, which can have a significant impact on reliability, especially in a benchmarking context.

So how do you set the max-idle parameter? Again, open a debug session to the worker node and execute chroot /host. First check the current settings; normally the parameter should not be set:

ovs-vsctl list Open_vSwitch

Now you can set the parameter. You might want to experiment with the value (in ms):

ovs-vsctl --no-wait set Open_vSwitch . other_config:max-idle=50000

To remove the setting you can use the following command (be aware that this clears all other settings stored in other_config as well):

ovs-vsctl --no-wait clear Open_vSwitch . other_config

There is a potential cause of lost packets which is supposed to be fixed in a later OpenShift 4.7.z release (kernel 4.18.0-254.el8 or later is required). Be aware that there can be further causes of lost packets and packet retransmissions, so expect to read more on this topic in the future. You might also want to check the Open vSwitch FAQ.
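If you want to roll the setting out to all worker nodes without logging in to each one, the per-node command can be generated and reviewed first. The helper below is my own convenience sketch (the oc debug invocation and the worker label selector are standard OpenShift conventions, but verify them for your cluster); by default it only prints the command instead of running it.

```shell
# Build the per-node command for setting max-idle; print it unless APPLY=yes.
set_max_idle() {
    local node=$1 value=${2:-50000}
    local cmd="oc debug node/${node} -- chroot /host ovs-vsctl --no-wait set Open_vSwitch . other_config:max-idle=${value}"
    if [ "${APPLY:-no}" = "yes" ]; then
        eval "${cmd}"
    else
        echo "${cmd}"
    fi
}

# Dry run over all worker nodes:
# for n in $(oc get nodes -l node-role.kubernetes.io/worker -o name | cut -d/ -f2); do
#     set_max_idle "$n" 50000
# done
```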

2.2 How to debug such a packet loss scenario

After the packet loss issue was identified, I raised a ticket with OpenShift support. We agreed on the following steps for data collection. The main idea is to look at a single pod-to-pod communication which ideally happens on the same node; as soon as you leave the node, further components like the SDN get involved, which complicates things. The plan is then to collect tcpdumps not only on the involved pods, but also on the involved Open vSwitch, which handles the node-internal networking on OpenShift.

For the moment I will add a number of references here to share the idea; at a later point in time I might streamline this a little.

How to perform the tcpdump

The data collection for the pods is described in a rather straightforward way in https://access.redhat.com/solutions/4569211:

# Set NAMESPACE and NAME to the pod you want to trace first
pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
pid=$(chroot /host bash -c "runc state $pod_id | jq .pid")

# List the interfaces visible inside the pod's network namespace
nsenter -n -t $pid -- tcpdump -D

nsenter -n -t $pid -- tcpdump -nn -i eth0 -w /host/tmp/${HOSTNAME}_baw_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap

To trace a specific Open vSwitch port, you need to figure out which network interface on the node to capture on. This can be done in the following way:

sh-4.4# NAME=demo-instance1-baw-server-0
sh-4.4# NAMESPACE=demo
sh-4.4# pod_id=$(chroot /host crictl pods --namespace ${NAMESPACE} --name ${NAME} -q)
sh-4.4# pid=$(chroot /host bash -c "runc state $pod_id | jq .pid")
sh-4.4# nsenter -n -t $pid -- ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00                              
    inet scope host lo                                                     
       valid_lft forever preferred_lft forever                                         
    inet6 ::1/128 scope host                                                           
       valid_lft forever preferred_lft forever                                         
3: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group d
    link/ether 0a:58:0a:80:02:0a brd ff:ff:ff:ff:ff:ff link-netnsid 0                  
    inet brd scope global eth0                             
       valid_lft forever preferred_lft forever                                         
    inet6 fe80::3090:a3ff:fe26:814c/64 scope link                                      
       valid_lft forever preferred_lft forever     

sh-4.4# ip a
19: veth8dbb30bb@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master o
vs-system state UP group default                                                       
    link/ether 72:67:cc:ed:d2:5b brd ff:ff:ff:ff:ff:ff link-netnsid 8                  
    inet6 fe80::7067:ccff:feed:d25b/64 scope link                                      
       valid_lft forever preferred_lft forever  

FILENAME="/host/var/tmp/${HOSTNAME}_baw_veth_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap"

tcpdump -nn -s 0 -i veth8dbb30bb -w ${FILENAME} 
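During long benchmark runs an unbounded capture can fill the node disk. tcpdump's built-in rotation flags help here; the duration and file sizes below are arbitrary example values, and the interface name is the one found above.

```shell
# Rotating, time-bounded variant of the capture above: stop after 600 s,
# keep at most five files of ~100 MB each (-C is in millions of bytes).
FILENAME="/host/var/tmp/${HOSTNAME}_baw_veth_$(date +%d_%m_%Y-%H_%M_%S-%Z).pcap"
echo "capturing to ${FILENAME}"
# Run this on the node:
# timeout 600 tcpdump -nn -s 0 -i veth8dbb30bb -C 100 -W 5 -w "${FILENAME}"
```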

Additional data to be collected

1) To keep an eye on the system utilisation, the following script can be used (it requires the mpstat command, which is part of the sysstat package):


while true; do
        mpstat -P ALL 1 1
done

2) To simplify the data collection, I wrote a small script to capture data every second. It needs to be run on the node after chroot /host, and I pipe the output to files. One can surely improve the script further, but it did the job for me:


echo "ovs-appctl coverage/show"
ovs-appctl coverage/show

while true; do
   date >> 001_ovs-dpctl-dump-flows.txt
   echo "ovs-dpctl dump-flows -m  :::"
   ovs-dpctl dump-flows -m >> 001_ovs-dpctl-dump-flows.txt
   date >> 001_ovs-ofctl-dump-flows.txt
   echo "ovs-ofctl -O OpenFlow13 dump-flows br0  :::"
   ovs-ofctl -O OpenFlow13 dump-flows br0 >> 001_ovs-ofctl-dump-flows.txt

   date >> 001_ovs-appctl.txt
   echo "ovs-appctl upcall/show  :::"
   ovs-appctl upcall/show >> 001_ovs-appctl.txt

   date >> 001_nf_conntrack_count.txt
   echo "/proc/sys/net/netfilter/nf_conntrack_count"
   cat /proc/sys/net/netfilter/nf_conntrack_count >> 001_nf_conntrack_count.txt

   date >> 001_nf_conntrack_max.txt
   echo "/proc/sys/net/netfilter/nf_conntrack_max"
   cat /proc/sys/net/netfilter/nf_conntrack_max >> 001_nf_conntrack_max.txt

   date >> 001_nf_conntrack.txt
   echo "/proc/net/stat/nf_conntrack"
   cat /proc/net/stat/nf_conntrack >> 001_nf_conntrack.txt

   date >> 001_softnet_stat.txt
   echo "/proc/net/softnet_stat"
   cat /proc/net/softnet_stat >> 001_softnet_stat.txt

   sleep 1
done

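Once the files exist, there are two quick sanity checks I find useful (small helpers written for this post, not part of any tool): how close conntrack came to its limit, and whether the kernel counted softnet drops. In /proc/net/softnet_stat, the "dropped" counter is the second hexadecimal column.

```shell
# Percentage of the conntrack table in use, given count and max.
conntrack_pct() {
    echo $(( 100 * $1 / $2 ))
}

# Sum the "dropped" column (2nd field, hexadecimal) of a softnet_stat file.
softnet_drops() {
    local total=0 col1 dropped rest
    while read -r col1 dropped rest; do
        total=$(( total + 16#${dropped} ))
    done < "$1"
    echo "${total}"
}

# On a node the live values would come from:
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
#   softnet_drops /proc/net/softnet_stat
```

A count close to 100% of nf_conntrack_max means the kernel starts dropping connections, which shows up exactly as the packet loss discussed above.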

After the test

When the test has completed, some additional data is required. Rerun the following command and pipe the output to a file:

ovs-appctl coverage/show

Besides the output files and the tcpdumps, also collect the Open vSwitch logs, which you can find under /var/log/openvswitch/.

Collect a sosreport as described in https://access.redhat.com/solutions/5065411 and https://access.redhat.com/solutions/4387261 (I made a small change to the command by adding the --allow-system-changes option; otherwise not all data will be collected):

sosreport -k crio.all=on -k crio.logs=on --allow-system-changes

3. Upgrade issues of OpenShift

In some cases the usually smooth upgrade between OpenShift versions will not work as expected. Here I will list the issues which came up over time.

3.1 Monitoring stays in degraded state after an upgrade

During the upgrade from OpenShift 4.5 to the latest OpenShift 4.6.9 I noticed that the upgrade did not complete successfully. The monitoring cluster operator was in a degraded state and reported an exception. Researching the exception did not give a hint at what the issue might be. But when digging a little deeper, it turned out that killing the pods in the openshift-monitoring namespace that were still running from before the upgrade was sufficient to let the upgrade succeed in the end.
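For reference, the workaround boils down to a single command. This assumes cluster-admin rights and the default openshift-monitoring namespace; the cluster monitoring operator recreates the pods from their deployments and statefulsets afterwards.

```shell
# Delete all pods in the monitoring namespace; the monitoring operator
# recreates them, now based on the upgraded images.
CMD="oc delete pods --all -n openshift-monitoring"
echo "${CMD}"    # review, then run it: eval "${CMD}"
```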