Understanding IO Performance on Red Hat OpenShift on IBM Cloud

By Daniel McGinnes posted Tue September 19, 2023 08:39 AM

  

Introduction

IO performance is a critical part of overall application performance. This article explains the expected IO performance for PVCs in Red Hat OpenShift on IBM Cloud clusters, as well as the tools and techniques you can use when investigating performance. In this article we use Red Hat OpenShift on IBM Cloud (https://www.ibm.com/products/openshift) as the Kubernetes provider, but much of the information also applies to other Kubernetes environments.

Testing disk performance with fio

To test the IO capabilities of a configuration, we can use fio (https://fio.readthedocs.io/en/latest/fio_doc.html). Fio is a disk benchmarking tool with the flexibility to run a wide range of different IO workloads.

In some cases you might be interested in how a particular database performs, but for the purposes of this article fio is a simple way to benchmark disk IO performance without bringing in other variables that can affect IO performance in databases, such as the threading model and CPU constraints. When investigating performance problems, you may also find fio useful for checking whether you are getting the expected performance from the disks.

To begin benchmarking with fio, we first need a PersistentVolumeClaim to represent the persistent storage that our pod will mount:

vpc-block-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 300Gi
  storageClassName: ibmc-vpc-block-10iops-tier

We also need a pod that mounts the PersistentVolumeClaim:

ubuntu-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-test
  labels:
    app: ubuntu-test
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    imagePullPolicy: IfNotPresent
    command: ["/bin/sleep", "7d"]
    resources: {}
    securityContext:
      privileged: true
      runAsUser: 0
    volumeMounts:
    - name: persistent-storage-mount
      mountPath: "/var/perfps"
  volumes:
  - name: persistent-storage-mount
    persistentVolumeClaim:
      claimName: perf-pvc
  restartPolicy: Always

Once we have created these as yaml files we can install them in the cluster:

oc apply -f vpc-block-pvc.yaml

oc apply -f ubuntu-pod.yaml
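If you want to check on the progress of the volume provisioning, you can look at the status of the PVC:

oc get pvc perf-pvc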

After a short amount of time the storage will be provisioned and the pod will enter Running state:

oc get pod -o=wide
NAME          READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
ubuntu-test   1/1     Running   0          68s   172.17.57.212   10.242.0.25   <none>           <none>

Now we can exec into the pod:

oc exec ubuntu-test -it -- sh

Once inside the pod we can install fio:

apt-get update

apt-get install fio
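Before running any tests it is worth checking that the persistent volume is mounted where we expect (df is part of the standard Ubuntu image, so no extra install is needed):

df -h /var/perfps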

Running tests with fio

We are now ready to run some tests to measure the IO performance. The arguments can be tailored to be representative of your workload, but here are some examples:

# Test maximum random read IOPS with small block size:

fio --ioengine=libaio --iodepth=64 --direct=1 --size=5G --time_based --group_reporting --runtime=180 --ramp_time=10 --bs=4k --rw=randread --directory=/var/perfps --name=4k-64depth-read

# Test maximum sequential read Bandwidth with large block size:

fio --ioengine=libaio --iodepth=64 --direct=1 --size=5G --time_based --group_reporting --runtime=180 --ramp_time=10 --bs=1M --rw=read --directory=/var/perfps --name=1m-64depth-read

# Test maximum random write IOPS with small block size:

fio --ioengine=libaio --iodepth=64 --direct=1 --size=5G --time_based --group_reporting --runtime=180 --ramp_time=10 --bs=4k --rw=randwrite --directory=/var/perfps --name=4k-64depth-write

# Test maximum random read & write (50-50) IOPS with small block size:

fio --ioengine=libaio --iodepth=64 --direct=1 --size=5G --time_based --group_reporting --runtime=180 --ramp_time=10 --bs=4k --rw=randrw --rwmixread=50 --directory=/var/perfps --name=4k-64depth-read-write

# Test maximum single threaded random read IOPS with small block size:

fio --ioengine=libaio --iodepth=1 --direct=1 --size=5G --time_based --group_reporting --runtime=180 --ramp_time=10 --bs=4k --rw=randread --directory=/var/perfps --name=4k-1depth-read
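If you would rather post-process the results than read the human-readable report, fio can also write its output as JSON. Below is a minimal sketch, assuming jq has also been installed in the pod (apt-get install jq); the job and output file names are just examples:

# Run the random read test and write the results to a JSON file
fio --ioengine=libaio --iodepth=64 --direct=1 --size=5G --time_based --group_reporting --runtime=180 --ramp_time=10 --bs=4k --rw=randread --directory=/var/perfps --name=4k-64depth-read --output-format=json --output=result.json

# Extract the read IOPS from the JSON report
jq '.jobs[0].read.iops' result.json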

The annotated output from a run of one of these tests can be seen below:

Figure 1 – FIO results with 4KB block size

1.     The Read IOPS and bandwidth for the run.

2.     Latency information for the run (pay attention to the units as these can change – in this case they are shown in microseconds).

3.     Statistical information on IOPS & Bandwidth during the run, based on samples taken at different points.

4.     The Write IOPS and Bandwidth for the run – you need to add together the Read & Write values to get the totals.

5.     Information about the device that was used during the test.

Understanding expected IO performance

When you create a PersistentVolumeClaim you specify a Storage Class and a size for the storage. Both the storage class and the size of the disk can affect the IOPS and bandwidth you can expect to achieve. For more information on the available storage classes see the links under Additional Resources.

Our example above uses the ibmc-vpc-block-10iops-tier storage class with a 300Gi volume, so we can expect to get 3000 IOPS against this device. For this storage class the IOPS are based on a 256KB block size, so we can expect to reach 750 MB/sec (3000 x 256KB). Note that throughput is limited when you reach either the IOPS or the bandwidth limit: with smaller block sizes you can expect to hit the IOPS limit first, while with larger block sizes you will hit the bandwidth limit first.

Also note 3000 IOPS is the minimum IOPS limit, so even if we used a 100Gi device we would still expect to reach 3000 IOPS.
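As a quick sanity check, the limits we expect for the 300Gi volume above can be worked out with some simple arithmetic (an illustrative sketch using the figures quoted in this section):

# IOPS limit: 10 IOPS per GB for this storage class, with a 3000 IOPS minimum
echo "IOPS limit: $((300 * 10))"

# Bandwidth limit: IOPS limit x 256KB block size, converted to MB/sec
echo "Bandwidth limit: $((3000 * 256 / 1024)) MB/sec"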

Now let's run some tests to see whether we reach the expected limits.

The run shown in Figure 1 above used a 4KB block size, and we can see that we hit the expected 3000 IOPS limit. Next we will try another test with a 256KB block size, to see if we can reach the 750 MB/sec bandwidth.

Here are the results with 256KB block size:

Figure 2 – FIO results with 256KB block size

We can see we get a combined bandwidth of 200 MB/sec (around 100 MB/sec of reads and 100 MB/sec of writes), which is considerably less than the 750 MB/sec we were expecting. This can be explained by the bandwidth allocation to VSIs.

Understanding VSI Bandwidth Allocation

When looking at IO performance we also need to consider the limits of the VSI along with the limits of the storage we provision. As documented at https://cloud.ibm.com/docs/vpc?topic=vpc-capacity-performance&interface=ui#cp-storage-bandwidth-allocate, a portion of the VSI’s available bandwidth is allocated to block volume traffic, and the rest is reserved for application networking. Note that for Red Hat OpenShift on IBM Cloud (ROKS) and IBM Kubernetes Service (IKS) worker nodes it is not currently possible to adjust the bandwidth allocation ratio, so 25% of the instance’s bandwidth is always allocated to block volume traffic.

If you are using VPC File Storage, note that the bandwidth used for file storage volumes comes from the “networking” allocation rather than the “volume” bandwidth.

Refer to this article for more information on bandwidth allocation: https://www.ibm.com/cloud/blog/bandwidth-allocation-in-virtual-server-instances

After reading the article we can deduce that our 4-core nodes have a total of 8000 Mbps of network capability. 2000 Mbps of this is allocated to volumes, and 393 Mbps of that is allocated to the boot volume, which leaves 1607 Mbps available for our data volume. Converting bits to bytes (1607 / 8) gives roughly 200 MB/sec, which matches the results we see.
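The same calculation can be sketched as a couple of shell commands (a rough illustration; the 25% volume allocation and the 393 Mbps boot volume figure are taken from the linked article):

# 8000 Mbps total instance bandwidth, 25% allocated to volumes, minus the boot volume
echo "Volume bandwidth: $((8000 / 4 - 393)) Mbps"

# Convert Mbits/sec to MBytes/sec
echo "Volume bandwidth: $(( (8000 / 4 - 393) / 8 )) MB/sec"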

In order to reach our target of 750 MBytes/sec we will need 6000 Mbits/sec of volume bandwidth. If we provision a 16-core worker this will have sufficient available volume bandwidth (16 cores = 8 Gbit/sec of volume bandwidth).

So, let’s try this out by provisioning a worker with the bx2.16x64 profile and repeating the same test (note that to add a worker of a different profile to the same cluster you will need to create a new worker pool; an example is sketched after the node listing below):

kube-ci3fc6cl0837gj5c5on0-vpcstorag-default-00000139   10.242.1.80   bx2.16x64   normal   Ready   eu-gb-1   4.12.19_1546_openshift
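For reference, the new worker pool could be created with commands along these lines (a sketch only; <cluster-name> and <subnet-id> are placeholders, the pool name is arbitrary, and the zone matches the node listing above):

ibmcloud ks worker-pool create vpc-gen2 --cluster <cluster-name> --name 16core-pool --flavor bx2.16x64 --size-per-zone 1

ibmcloud ks zone add vpc-gen2 --cluster <cluster-name> --worker-pool 16core-pool --zone eu-gb-1 --subnet-id <subnet-id>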

Figure 3 – FIO results with 256KB block size on the bx2.16x64 worker

Now we can see we do get the expected bandwidth, with 387 MB/sec of reads and 388 MB/sec of writes, giving a total of 775 MB/sec, just over the expected 750 MB/sec.

We should also be aware of how adding extra volumes can affect performance, because the VSI bandwidth allocation is shared between volumes.

To demonstrate this, we add an extra PVC and pod that have the same specifications as the originals (just different names) and will run on the same node:

vpc-block-pvc2.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: perf-pvc2
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 300Gi
  storageClassName: ibmc-vpc-block-10iops-tier

We will also need a pod that mounts the PersistentVolumeClaim:

ubuntu-pod2.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-test2
  labels:
    app: ubuntu-test2
spec:
  containers:
  - name: ubuntu
    image: ubuntu:20.04
    imagePullPolicy: IfNotPresent
    command: ["/bin/sleep", "7d"]
    resources: {}
    securityContext:
      privileged: true
      runAsUser: 0
    volumeMounts:
    - name: persistent-storage-mount
      mountPath: "/var/perfps"
  volumes:
  - name: persistent-storage-mount
    persistentVolumeClaim:
      claimName: perf-pvc2
  restartPolicy: Always

Once we have created these as yaml files we can install them in the cluster:

oc apply -f vpc-block-pvc2.yaml

oc apply -f ubuntu-pod2.yaml

Now we have two pods running on the same node, each mounting its own PVC:

oc get pods -o=wide
NAME           READY   STATUS    RESTARTS   AGE     IP               NODE          NOMINATED NODE   READINESS GATES
ubuntu-test    1/1     Running   0          23m     172.17.150.121   10.242.1.81   <none>           <none>
ubuntu-test2   1/1     Running   0          5m40s   172.17.150.89    10.242.1.81   <none>           <none>

If we re-run the test in the original pod, we will see we get different results:

Figure 4 – FIO results with 256KB block size with 2 PVCs

Now we are getting a total of around 475 MB/sec. 

Based on what we learned about bandwidth allocation, we know that the total available storage bandwidth allocation for a host is divided amongst all attached volumes.

So, in our case the storage bandwidth limit for a 16-core host is 8000 Mbps. We take off the 393 Mbps allocated to the boot volume, which leaves us with 7607 Mbps to be divided evenly between the volumes attached to the host. Each volume should therefore get 3803 Mbps, which is equivalent to 475 MB/sec.

In addition, it should be noted that with multiple attached volumes the bandwidth allocations are prorated based on each volume's unattached bandwidth allocation. In the above example, if instead of 2x300GB disks we had 1x300GB and 1x600GB, the bandwidth would be allocated on a 1:2 ratio: the 300GB volume would be allocated 2536 Mbps and the 600GB volume would be allocated 5071 Mbps.
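The prorated split can be sketched as follows (illustrative only; integer division rounds the results down slightly):

# 7607 Mbps split on a 1:2 ratio between a 300GB and a 600GB volume
echo "300GB volume: $((7607 * 1 / 3)) Mbps"
echo "600GB volume: $((7607 * 2 / 3)) Mbps"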

It is important to consider all volumes attached to the host when determining if you are reaching the expected bandwidth. Also note the limits are applied when the volume is mounted, regardless of how much bandwidth each volume is using.

Note, VSI bandwidth allocation only applies to VPC worker nodes. If using classic worker nodes there is no separation of storage bandwidth, so all available bandwidth for the VSI is shared between networking & storage.

Understanding the performance differences for local disks

If your pods do not use PVCs and instead write to the pod's local filesystem, on VPC workers this storage is backed by VPC block storage. This device provides 3000 IOPS at a 16KB block size (so 393 Mbps), and all pods using local storage on a node share the same device.

If you need improved performance for pod local storage you can create a worker pool with higher IOPS for secondary storage. See https://cloud.ibm.com/docs/openshift?topic=openshift-planning_worker_nodes#hardware-options for more information, and use the --secondary-storage flag on the ibmcloud ks worker-pool create command.
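For example, a worker pool with a higher-IOPS secondary disk could be created along these lines (a sketch only; <cluster-name> is a placeholder, the pool name is arbitrary, and the available values for --secondary-storage depend on the flavor and are described in the documentation linked above):

ibmcloud ks worker-pool create vpc-gen2 --cluster <cluster-name> --name fast-local-pool --flavor bx2.16x64 --size-per-zone 1 --secondary-storage <secondary-storage-option>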

This differs from classic workers, which use local storage on the hypervisor for pod local storage, so on classic workers you may see higher, but less consistent, IO performance.

Other considerations

It is also important to note that, as described in the documentation at https://cloud.ibm.com/docs/vpc?topic=vpc-block-storage-profiles&interface=ui#block-storage-profile-overview, each storage profile has a maximum throughput. Even if you have not reached the IOPS limit of the volume or the bandwidth allocation of the VSI instance you have provisioned, throughput for a volume will be limited if you hit the maximum throughput for its storage profile.

Monitoring IO performance

Red Hat OpenShift on IBM Cloud clusters have monitoring available out of the box. This can be used to understand both the IOPS and the bandwidth being used by your applications.

To access the OpenShift console for your cluster see https://cloud.ibm.com/docs/openshift?topic=openshift-access_cluster#access_oc_console

In the console, if you navigate to Observe -> Metrics you can enter Prometheus queries. The following queries are useful for understanding the IO workload when using VPC Block storage:

irate(node_disk_writes_completed_total[2m]) – The rate of writes per second.

irate(node_disk_reads_completed_total[2m]) – The rate of reads per second.

irate(node_disk_written_bytes_total[2m]) – The rate of data being written per second.

irate(node_disk_read_bytes_total[2m]) – The rate of data being read per second.

rate(node_disk_read_time_seconds_total[2m]) / rate(node_disk_reads_completed_total[2m]) – Average latency for reads.

rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m]) – Average latency for writes.

The above metrics are reported per node and device. Usually, you will see IO against the following devices:

vda = The node’s primary device, which is used for pod local storage

vdd, vde, vdf, vdg = Attached persistent volumes
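For example, to watch the read bandwidth of a single attached volume you can filter on the device label (the device name shown here is just an example and will vary):

irate(node_disk_read_bytes_total{device="vdd"}[2m])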

Example output from one of our tests can be seen below:


The above queries break down the IO by node/device, but it may also be useful to break down the IO by pod. The following queries can be used to see this:

sum by (pod) (irate(container_fs_writes_total{pod!=""}[2m])) – The number of writes per second broken down by pod, which can be useful to determine which pods are doing significant IO.

sum by (pod) (irate(container_fs_reads_total{pod!=""}[2m])) – The number of reads per second broken down by pod.
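If bandwidth per pod is of more interest than operation counts, the equivalent byte counters can be broken down by pod in the same way (these are the standard cAdvisor counters exposed by the cluster monitoring stack):

sum by (pod) (irate(container_fs_writes_bytes_total{pod!=""}[2m])) – Bytes written per second, broken down by pod.

sum by (pod) (irate(container_fs_reads_bytes_total{pod!=""}[2m])) – Bytes read per second, broken down by pod.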


All of the above queries only work for block storage. If you are using file storage, you can use the following query to get a breakdown by node:

irate(node_nfs_requests_total[5m])

There are also some pre-defined dashboards that show useful information about the IO taking place. In the OpenShift console, navigate to Observe -> Dashboards.

The Kubernetes / Compute Resources dashboards are useful for seeing which namespaces/pods are doing significant amounts of IO:


The Node Exporter / USE Method dashboards are useful for breaking the IO down by node and device.


Note, these dashboards will only capture IO against block storage, and will not count IO against File storage-based PVCs.

Conclusion

In this article we have described the factors that affect the throughput and performance you can expect when using persistent storage on VPC clusters running Red Hat OpenShift on IBM Cloud.

The key takeaways are:

·      You can select the size and IOPS tier of your persistent storage, but there are other factors that determine whether you will reach those limits.

·      Different size worker nodes have different bandwidth allocations, which can also limit persistent storage throughput/performance.

·      Bandwidth allocation on a node is spread between all attached volumes, so additional volumes can affect throughput/performance of existing volumes.

·      We have shared tools and techniques that can be used to measure and monitor IO performance.

Additional Resources

To get started with persistent storage on Red Hat OpenShift on IBM Cloud see https://cloud.ibm.com/docs/openshift?topic=openshift-storage-plan

The documentation at https://cloud.ibm.com/docs/openshift?topic=openshift-vpc-block#vpc-block-reference (for VPC Block storage) and https://cloud.ibm.com/docs/openshift?topic=openshift-storage-file-vpc-sc-ref (for VPC File storage) describes the different storage classes.

The infrastructure documentation at https://cloud.ibm.com/docs/vpc?topic=vpc-block-storage-profiles&interface=ui (VPC Block storage) and https://cloud.ibm.com/docs/vpc?topic=vpc-file-storage-profiles&interface=ui (VPC File storage) gives more details on the IOPS and bandwidth you can expect from the different storage classes.
