
RedHat OpenShift Container Platform deployment in IBM Cloud Pak System: Best practices

By Anshu Garg, posted Mon November 15, 2021 08:28 AM

  

Red Hat OpenShift Container Platform (referred to as OCP hereafter) is a hybrid cloud platform built on containers and Kubernetes. OpenShift Container Platform simplifies and accelerates the development, delivery, and lifecycle management of a hybrid mix of applications, consistently across on-premises, public cloud, and edge environments.

IBM Cloud Pak System (referred to as CPS hereafter) speeds up the implementation of on-premises Kubernetes platforms. It comes with automated deployment and configuration of Red Hat OpenShift Container Platform.

Together, CPS and OCP radically accelerate an organization's Kubernetes adoption and simplify the application modernization journey. CPS provides a fully automated OCP user-provisioned infrastructure (UPI) deployment experience on VMware that stands up an OCP cluster in roughly 100 minutes, with the ability to deploy storage compute nodes (OpenShift Container Storage) with different resources than regular workload compute nodes.

In this article I'll cover best practices to consider before starting your OCP cluster deployment in CPS, along with some troubleshooting tips. Most of these apply to any OCP UPI installation on VMware.

  • Hardware resources
  • Storage
  • etcd defragmentation
  • Network bandwidth



First and foremost is getting the hardware resource sizing right for your business use case. Red Hat documents minimum resource requirements, but these are almost never sufficient for production workloads. The CPS OCP accelerator likewise defaults to a minimal cluster size:

Node Type        | # of VMs | Virtual CPUs | RAM (GB) | Storage (GB)
Control (master) | 3        | 4            | 16       | 120
Compute (worker) | 2        | 2            | 8        | 120


Plan your initial deployment by sizing your business application(s); the Red Hat best practices serve as an excellent starting point. Failing to do so can result in unpredictable cluster degradation, including severe performance issues that can render the cluster unusable. In such cases, vertical scaling of the nodes may help.

Refer to the Red Hat tested cluster maximums, which serve as a benchmark for OCP workload capacity, and plan your cluster size against your applications and those certified maximums before deploying the cluster in CPS.
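Once the cluster is running, it is worth checking how your sizing is holding up in practice. As a minimal sketch using standard oc commands (the worker node name below is a placeholder for your environment), you can compare live consumption and allocated requests against node capacity:

            # Current CPU and memory usage per node (requires cluster metrics to be available)
            oc adm top nodes

            # Allocatable capacity and the resources already requested/limited on a node
            oc describe node <worker-node-name> | grep -A 10 "Allocated resources"

If requested resources are consistently close to the allocatable capacity, revisit your sizing before adding more workload.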



The next critical aspect is storage. The performance of your etcd cluster is strongly influenced by the storage underneath it. etcd exports metrics that help visualize storage performance; one of them is wal_fsync_duration_seconds (etcd_disk_wal_fsync_duration_seconds in Prometheus). According to the etcd documentation, the 99th percentile of this metric should be less than 10 ms for the storage to be considered fast enough. To benchmark your underlying storage, use fio or a similar tool to measure storage performance. An example is given below:

fio --name=writeiops --rw=write --size=22m --ioengine=sync --fdatasync=1 --bs=2300 --directory=/tmp/fio/
writeiops: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1

.......
.......
write: IOPS=2545, BW=5716KiB/s (5854kB/s)(22.0MiB/3941msec)
clat (usec): min=2, max=750, avg=126.67, stdev=115.80
lat (usec): min=2, max=751, avg=126.78, stdev=115.80
clat percentiles (usec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 3],
| 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 169], 60.00th=[ 221],
| 70.00th=[ 229], 80.00th=[ 235], 90.00th=[ 245], 95.00th=[ 253],
| 99.00th=[ 359], 99.50th=[ 388], 99.90th=[ 457], 99.95th=[ 586],
| 99.99th=[ 717]
bw ( KiB/s): min= 5552, max= 5812, per=99.92%, avg=5711.57, stdev=85.54, samples=7
iops : min= 2472, max= 2588, avg=2543.14, stdev=38.14, samples=7
lat (usec) : 4=37.06%, 10=6.77%, 20=0.02%, 100=1.65%, 250=48.30%
lat (usec) : 500=6.12%, 750=0.08%, 1000=0.01%
fsync/fdatasync/sync_file_range:
sync (usec): min=187, max=2359, avg=264.61, stdev=58.87
sync percentiles (usec):
| 1.00th=[ 198], 5.00th=[ 204], 10.00th=[ 208], 20.00th=[ 215],
| 30.00th=[ 219], 40.00th=[ 227], 50.00th=[ 241], 60.00th=[ 302],
| 70.00th=[ 310], 80.00th=[ 318], 90.00th=[ 326], 95.00th=[ 334],
| 99.00th=[ 355], 99.50th=[ 392], 99.90th=[ 644], 99.95th=[ 775],
| 99.99th=[ 1237]
cpu : usr=0.76%, sys=6.95%, ctx=15661, majf=0, minf=35
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
.......
.......


On CPS Gen3/Gen4 we have benchmarked etcd storage performance with fio. Average IOPS for an OCP cluster in CPS typically varies between 1600 and 2300. The exact number varies from cluster to cluster, but this gives a useful reference point for OCP clusters running on CPS.

One important point to keep in mind: when hardware resources in the OCP cluster are insufficient, the etcd cluster may appear degraded as if the storage were slow, but that is not always the true cause. Low IOPS do indicate a degraded etcd cluster, but the underlying reason can be insufficient resources on the etcd nodes giving the impression of a storage bottleneck. To identify this situation, check the Prometheus metric etcd_server_leader_changes_seen_total, which records etcd leader switches in the recent past. If leader switches are frequent, the cause may be high etcd network latency or resource contention rather than storage.
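As an illustration (a sketch assuming the default OpenShift monitoring stack is scraping the standard etcd metrics, which is the case on a stock OCP cluster), the following PromQL queries can be pasted into the console's metrics page to check both symptoms:

            # 99th percentile of etcd WAL fsync latency per member; should stay below 10 ms (0.01)
            histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

            # Leader changes seen by each etcd member over the last hour; frequent changes point
            # to network latency or resource contention rather than slow storage
            increase(etcd_server_leader_changes_seen_total[1h])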



Another good practice for maintaining your OCP cluster in CPS is to monitor the etcd database size and perform periodic etcd defragmentation. Over time, depending on the running workload and the objects created in the cluster, the etcd database can grow and cause performance degradation. It is recommended to defragment the etcd database in such cases. Refer to this document for detailed steps; a simplified version for an OCP cluster running in CPS is given below.

  • Get the list of etcd pods. Figure 1 shows sample output from the command.
            oc get pods -n openshift-etcd -o wide | grep -v quorum-guard | grep etcd
Figure 1: etcd pods

  • Run the following command on any of the three pods listed in step 1. Figure 2 shows sample output from the command.
            oc rsh -n openshift-etcd etcd-master-3425.cps-rack-79-vm-113.rtp.raleigh.ibm.com etcdctl endpoint status --cluster -w table

Figure 2: etcdctl endpoint status table

In the output shown in Figure 2, find the member whose IS LEADER column is set to true and note its ENDPOINT value. Match that endpoint (the IP portion only) against the output from step 1 to identify the corresponding pod. In Figure 2, it is highlighted in blue.



  • Now connect to the pods that are not the leader (the first and third pods from the output of step 1 in our case, highlighted in violet in Figure 1) one by one and execute the defragmentation commands shown after this list.
  • Once step 3 has been completed for the two non-leader pods, repeat the same steps on the leader pod (the second pod in the output of step 1).
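The defragmentation commands referenced above are, as a hedged sketch based on the procedure documented by Red Hat (the pod name is a placeholder; replace it with the pod you are working on), roughly the following for each member:

            oc rsh -n openshift-etcd <etcd-pod-name>
            sh-4.4# unset ETCDCTL_ENDPOINTS
            sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
            Finished defragmenting etcd member[https://localhost:2379]

Afterwards, rerun the etcdctl endpoint status command from step 2 to confirm that the DB SIZE column has shrunk, and clear any NOSPACE alarms with etcdctl alarm disarm if they were raised.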

Last in this article, we'll look at the importance of network bandwidth for a connected deployment. For an online OCP deployment in CPS, OCP container images are pulled from the Red Hat registries and quay.io, which requires good bandwidth; otherwise the installation fails with timeouts while pulling images over a slow connection.

While there is no documented minimum speed requirement from Red Hat, we have observed that a download speed of about 50 Mbit/s works well for an online OCP deployment in CPS. You can check the speed in your CPS environment by deploying a RHEL VM and running the commands below:


wget -O speedtest-cli https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py

chmod +x speedtest-cli

./speedtest-cli

Retrieving speedtest.net configuration…
Testing from Megacable (177.229.67.255)…
Retrieving speedtest.net server list…
Selecting best server based on ping…
Hosted by TUUNET (Juchipila) [158.44 km]: 36.973 ms
Testing download speed……………………………………………………………………..
Download: 57.93 Mbit/s
Testing upload speed……………………………………………………………………………………
Upload: 93.00 Mbit/s



With these small but critical practices, you can deploy and manage OCP clusters in your CPS environment with ease.
