Decision Management (ODM, ADS)

Performance Check List of OCP for CP4BA

By Pierre-Andre Paumelle posted Thu June 03, 2021 04:23 AM

Introduction

Moving to production requires attention to a few core performance aspects, so that you avoid troubleshooting afterward. Here is a list of checkpoints we gathered when planning to use the Cloud Pak for Business Automation (CP4BA), with a focus on:

- Operational Decision Manager

- Automation Decision Service

Both send events to Business Automation Insights (BAI).

For more details on these components and capabilities, please refer to the CP4BA documentation.

This article takes the performance tester's point of view and splits the checkpoints into several categories:

Hardware
Networking
Kafka
Elastic Search
Flink
Runtimes
Storage
Performance tests

Hardware

Checkpoint:

Verify that the underlying hardware is not obsolete.

Procedure:

Open a terminal on a pod or a worker node and run:
cat /proc/cpuinfo
You should also verify that your cluster is homogeneous at the hardware level.
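To compare nodes quickly, the CPU model lines can be summarized. The following is a minimal sketch, not part of the original checklist: the helper name cpu_models is hypothetical, and it assumes the x86-style "model name" field of /proc/cpuinfo.

```shell
# Summarize the CPU models listed in a /proc/cpuinfo-formatted stream.
# Prints one line per distinct model: "<logical CPU count> x <model name>".
cpu_models() {
  awk -F': *' '/^model name/ { count[$2]++ }
               END { for (m in count) printf "%d x %s\n", count[m], m }'
}

# Run on the current machine (a pod terminal or a worker node).
if [ -r /proc/cpuinfo ]; then
  cpu_models < /proc/cpuinfo
fi
```

Running this on each worker node and comparing the results gives a rough homogeneity check: different model names or core counts point to a mixed cluster.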

Networking

Network performance is key, especially with a microservice architecture like that of CP4BA.

In this category, we target Router configuration, Route annotations and Load balancing settings.

Router

Checkpoint:

Verify that the router configuration is correctly sized. To do this, check the number of router pods and verify their CPU consumption.

Procedure:

In the OpenShift console:

Home > Search > IngressController > all projects > replicas (the default was 2; we updated it to 5)
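The same change can also be made from the command line. This is a hedged sketch rather than the author's procedure: the function names are hypothetical, and it assumes the IngressController is the one named "default" in the openshift-ingress-operator namespace, which is the usual location on OpenShift 4.

```shell
# Build the merge patch that sets the number of router replicas.
router_replicas_patch() {
  printf '{"spec":{"replicas":%d}}' "$1"
}

# Apply the patch to an IngressController (second argument, "default" if omitted).
scale_router() {
  oc patch ingresscontroller "${2:-default}" -n openshift-ingress-operator \
    --type=merge -p "$(router_replicas_patch "$1")"
}

# Example (not executed here): scale_router 5
```

After scaling, the CPU consumption of the router pods can be inspected with "oc adm top pods -n openshift-ingress", assuming cluster metrics are available.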

HAProxy

Checkpoint:

Verify the HAProxy configuration to ensure that the load is balanced in round-robin mode and that nbproc is set to the right value.
Here we check HAProxy as the load balancer, but this checkpoint applies to any load balancing solution, with a procedure adapted to it.

Procedure:

Find the HAProxy node address
ssh root@<haproxy-address>
Look into haproxy.cfg

vi /etc/haproxy/haproxy.cfg
In the ingress-https backend, change "balance source" to "balance roundrobin"
Set nbproc to 5 (the default is 1; here we advise increasing it to 5)

Restart HAProxy

systemctl daemon-reload
systemctl restart haproxy
(optional) systemctl status haproxy
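Before or after the restart, the two settings can be read back from the configuration file instead of being checked by eye. A minimal sketch, assuming the usual haproxy.cfg layout in which section keywords start at the beginning of a line; the function name haproxy_summary is hypothetical.

```shell
# Report the "balance" algorithm of each HAProxy section and the nbproc value.
# Reads haproxy.cfg content from the given file(s) or from standard input.
haproxy_summary() {
  awk '
    /^(global|defaults|frontend|backend|listen)/ { section = ($2 != "") ? $2 : $1 }
    $1 == "balance" { print section ": balance " $2 }
    $1 == "nbproc"  { print section ": nbproc " $2 }
  ' "$@"
}

# Example (not executed here): haproxy_summary /etc/haproxy/haproxy.cfg
```

The expected lines for this checkpoint are "ingress-https: balance roundrobin" and "global: nbproc 5". Running "haproxy -c -f /etc/haproxy/haproxy.cfg" before the restart also validates the syntax of the edited file.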

Route Annotation for ADS and ODM runtimes

Checkpoint:

Verify that the load is balanced across all runtimes.
This annotation is key when the application that calls the decision services runs on a limited set of addresses.

Procedure:

Open the routes of ADS and ODM
Add the annotation haproxy.router.openshift.io/balance: roundrobin
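The annotation can also be set with "oc annotate". The sketch below only prints the commands so that they can be reviewed before being run; the function name annotate_cmds is hypothetical, and the route names of your ADS and ODM runtimes have to be supplied by you.

```shell
# For each route name read from standard input, print the command that sets
# round-robin balancing on it. Nothing is executed; review the output first.
annotate_cmds() {
  while read -r route; do
    if [ -n "$route" ]; then
      printf 'oc annotate route %s haproxy.router.openshift.io/balance=roundrobin --overwrite\n' "$route"
    fi
  done
}

# Example (not executed here), with a placeholder route name:
#   echo "<odm-runtime-route>" | annotate_cmds
```

Piping the printed commands to "sh" applies them once you are satisfied with the list.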

Kafka

Kafka stores messages in files. These files can be split into partitions.

This split enables concurrent access to the messages, which is key for BAI performance.

Checkpoint:

Verify the number of partitions and the retention duration of every Kafka topic used by BAI.
The horizontal scalability of BAI depends on the Kafka topic configuration.
The main point is the number of partitions per topic.
By default, each topic has a single partition, which limits the performance of BAI, and the retention duration is set to one week.
In our case, we need 12 partitions and limit the retention duration to 2 hours (7,200,000 ms). Here is how we declare it to OpenShift, with an extract of kafka-topic.yaml:

apiVersion: ibmevents.ibm.com/v1beta1
kind: KafkaTopic
metadata:
  name: icp4ba-bai-ingress
spec:
  config:
    retention.ms: 7200000
  partitions: 12
  replicas: 3
  topicName: icp4ba-bai-ingress
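Since the same settings have to be repeated for every BAI topic, the file can be generated instead of being written by hand. This is a sketch under one assumption that the article does not state explicitly: all topics share the same partition count, retention and replica count as the extract above. The function and variable names are hypothetical.

```shell
# Retention of 2 hours expressed in milliseconds, as expected by retention.ms.
TWO_HOURS_MS=$((2 * 60 * 60 * 1000))

# Print one KafkaTopic document per topic name.
# Usage: kafka_topic_yaml <partitions> <retention-ms> <topic>...
kafka_topic_yaml() {
  kt_partitions=$1
  kt_retention=$2
  shift 2
  for kt_topic in "$@"; do
    cat <<EOF
---
apiVersion: ibmevents.ibm.com/v1beta1
kind: KafkaTopic
metadata:
  name: $kt_topic
spec:
  config:
    retention.ms: $kt_retention
  partitions: $kt_partitions
  replicas: 3
  topicName: $kt_topic
EOF
  done
}
```

For example, "kafka_topic_yaml 12 "$TWO_HOURS_MS" icp4ba-bai-ingress icp4ba-bai-egress > kafka-topic.yaml" writes two topic definitions; the remaining topic names used by your BAI installation can be appended to the argument list.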

Procedure:

Delete the Kafka pods, then delete and re-create the topics from kafka-topic.yaml. The output should look like this, showing that everything is OK:

pod "iaf-system-kafka-0" deleted
pod "iaf-system-kafka-1" deleted
pod "iaf-system-kafka-2" deleted
kafkatopic.ibmevents.ibm.com "icp4ba-bai-ingress" deleted
kafkatopic.ibmevents.ibm.com "icp4ba-bai-egress" deleted
kafkatopic.ibmevents.ibm.com "icp4ba-bai-service" deleted
kafkatopic.ibmevents.ibm.com "icp4ba-bai-ads-decision-execution-common-data" deleted
kafkatopic.ibmevents.ibm.com "icp4ba-bai-bvt-ingress" deleted
kafkatopic.ibmevents.ibm.com/icp4ba-bai-ingress created
kafkatopic.ibmevents.ibm.com/icp4ba-bai-egress created
kafkatopic.ibmevents.ibm.com/icp4ba-bai-service created
kafkatopic.ibmevents.ibm.com/icp4ba-bai-ads-decision-execution-common-data created
kafkatopic.ibmevents.ibm.com/icp4ba-bai-bvt-ingress created

Restart Flink Task Managers

Delete all the task manager pods to force a re-initialization.

Elastic Search

Elasticsearch stores time series in shards.

By default, BAI creates each time series with a single shard. From a performance standpoint, increasing the number of shards per time series improves scalability and insert throughput.

Shards

Checkpoint:

When a time series in Elasticsearch contains millions of documents, it needs several shards.
The default BAI configuration is limited to a single shard per time series.

Procedure:

Add Elasticsearch shards to improve the reliability and performance.

Here is a sample which increases the number of shards to 12.

export EK_USER=$(oc get secret icp4ba-es-auth -o json | jq -cr '.data.username' | base64 --decode)
export EK_PASS=$(oc get secret icp4ba-es-auth -o json | jq -cr '.data.password' | base64 --decode)

oc get routes | grep iaf-system-es | awk '{print $2}'
curl -ku "$EK_USER:$EK_PASS" https://<cluster-Address>:<port>
curl -ku "$EK_USER:$EK_PASS" -X POST https://<cluster-Address>:<port>/<timeseries>/_rollover -d'{"aliases":{"<timeseries>":{}},"settings":{"index.number_of_shards":12}}' -H 'Content-Type: application/json'
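The rollover call above can be wrapped so that the alias and the shard count become parameters, and so that the result can be checked afterwards. A minimal sketch: the function names are hypothetical, EK_USER and EK_PASS are the variables exported above, and the Elasticsearch URL is whatever the route lookup returned.

```shell
# Build the JSON body of a rollover request for <alias> with <shards> primary shards.
rollover_body() {
  printf '{"aliases":{"%s":{}},"settings":{"index.number_of_shards":%d}}' "$1" "$2"
}

# Roll over <alias> on the Elasticsearch endpoint <url>.
# Usage: rollover <url> <alias> <shards>
rollover() {
  curl -ku "$EK_USER:$EK_PASS" -X POST "$1/$2/_rollover" \
    -H 'Content-Type: application/json' -d "$(rollover_body "$2" "$3")"
}

# List the shards of the indices matching <alias>* to confirm the new count.
# Usage: list_shards <url> <alias>
list_shards() {
  curl -ku "$EK_USER:$EK_PASS" "$1/_cat/shards/$2*?v"
}
```

Keep in mind that a rollover creates a new index: the new shard count applies to documents written after the rollover, while the existing index keeps its single shard.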

Memory configuration

Checkpoint:

The Elasticsearch pod created by IAF has a memory request of 5 GB.
The heap size used by Elasticsearch is 1 GB by default. It should be smaller than the memory request of the pod.
In our case the memory request is 5 GB, so we should configure Elasticsearch to use 4 GB of heap.

Procedure:

You can follow the documented procedure here: https://www.ibm.com/docs/en/cloud-paks/1.0?topic=configuration-operational-datastore#jvm-options

Use the OpenShift console to access the stateful sets.
In the stateful set of Elasticsearch Data, add the following environment variable:

Name: ES_JAVA_OPTS
Value: -Xms4g -Xmx4g

Flink

To improve BAI performance, you should increase the parallelism of the Flink task managers per pillar.

The horizontal scalability of Flink is enabled by the partitions of the Kafka topics.

Install a route to the Flink UI to verify the status of the Flink task managers.

Procedure:

Create a route to the service whose name contains bai-event and jobmanager.
The route should be passthrough.
The user and password for the Flink web UI are available in the secret whose name contains bai-event and admin-user.
With this UI, you can see the status of the task managers.
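The steps above can be sketched from the command line as follows. This is an illustration, not the documented procedure: the function names and the route name flink-ui are hypothetical, and the exact service and secret names depend on your installation, which is why they are looked up by the name fragments given above.

```shell
# Keep only the names that contain both given fragments (one name per line on stdin).
names_with_both() {
  grep -- "$1" | grep -- "$2"
}

# Create a passthrough route to the given service.
# Usage: create_flink_route <service-name> [route-name]
create_flink_route() {
  oc create route passthrough "${2:-flink-ui}" --service="$1"
}

# Examples (not executed here):
#   oc get svc -o name    | names_with_both bai-event jobmanager
#   oc get secret -o name | names_with_both bai-event admin-user
```

If the service exposes several ports, "oc create route" may also need a --port option to select the web UI port.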

Runtimes

To get the best performance from the ODM and ADS runtimes, you have to consider CPU and memory settings.

We advise checking both the CPU/memory requests and limits, and their values.

Checkpoint:

Verify that the CPU/memory requests and limits of the ODM and ADS runtimes are equal.

Procedure:

These values are set in the custom resource (CR).
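Once the CR has been applied, the effective values can be read back from the running pods. A minimal sketch with hypothetical function names; it compares the values as text, so equivalent notations such as "1" and "1000m" are reported as different.

```shell
# Print NAME, CPU request, CPU limit, memory request and memory limit of each pod
# in the current project.
list_pod_resources() {
  oc get pods --no-headers -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
}

# Read lines of "NAME CPU_REQ CPU_LIM MEM_REQ MEM_LIM" and print the pods whose
# requests and limits differ or are missing ("<none>").
mismatched_resources() {
  awk '$2 != $3 || $4 != $5 || $2 == "<none>" || $4 == "<none>" { print $1 }'
}

# Example (not executed here): list_pod_resources | mismatched_resources
```

An empty result means that every listed pod has equal requests and limits; filter the list with grep first if you only want the ODM and ADS runtime pods.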

Storage

Kafka and Elasticsearch need high-performance storage.

Using SSDs is a must.

Monitoring the Kafka and Elasticsearch disks is mandatory.

A disk-full exception can cause data loss.
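A simple way to watch disk usage is to parse the output of "df -P" from inside the pods. The sketch below is an illustration only: the function name disks_over is hypothetical, and the 80% threshold in the example is an assumption, not a value from this checklist.

```shell
# Read `df -P` output and print the mount points used at or above a threshold (percent).
# Usage: df -P | disks_over <percent>
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6 " " $5
  }'
}

# Example (not executed here), using one of the Kafka pods:
#   oc exec iaf-system-kafka-0 -- df -P | disks_over 80
```

Mount points that contain spaces are not handled by this sketch.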

Performance tests

You might want to verify that your performance results are sufficient, and for this you might already have invested in a performance tool like JMeter.

JMeter is good at "injecting" requests to load the runtimes.

In this case, pay attention to the behavior of the JMeter process and to network latency, both of which can distort the performance results.

JMeter

Checkpoint:

Verify the resource usage (CPU/RAM) of the JMeter process.
CPU usage should not reach 80%.
Network usage should not reach 75%.

Procedure:

Run "top" to check CPU/RAM usage while JMeter is running.

Latency

Checkpoint:

Verify the latency between the machine where JMeter is executed and the OpenShift cluster under test

Procedure:

Ping the cluster from the bench machine.
Check that the average round trip is under 50 ms.
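The check can be automated by extracting the average from the summary line that ping prints. A minimal sketch with hypothetical function names; it assumes the "min/avg/max" summary format printed by the common Linux and macOS ping implementations.

```shell
# Extract the average round-trip time (ms) from ping output read on standard input.
avg_rtt_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# Succeed when the given average is below the limit in ms (50 if omitted).
# Usage: rtt_ok <average-ms> [limit-ms]
rtt_ok() {
  awk -v avg="$1" -v max="${2:-50}" 'BEGIN { exit !(avg + 0 < max + 0) }'
}

# Example (not executed here), with a placeholder host:
#   avg=$(ping -c 10 <cluster-address> | avg_rtt_ms)
#   rtt_ok "$avg" && echo "latency OK: $avg ms"
```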

Take Away

Useful troubleshooting documentation is available at: https://www.ibm.com/docs/en/cloud-paks/cp-biz-automation/21.0.x?topic=specifics-troubleshooting

The SLA depends on the tuning of your installation and on the hardware performance (SSD disks, processors, and so on).

Add partitions to every Kafka topic and set the retention duration to a value other than one week:

oc get pods | grep kafka | awk '{print $1}' | xargs oc delete pods && oc delete -f kafka-topic.yaml && oc create -f kafka-topic.yaml

Check the output: every Kafka pod and topic should be reported as deleted, and every topic as created.

#CloudPakforBusinessAutomation
#migration
#OperationalDecisionManager(ODM)

Permalink

https://community.ibm.com/community/user/blogs/pierre-andre-paumelle1/2021/06/03/performance-check-list-of-ocp-for-cp4ba

Comments

Pierre-Andre Paumelle

Thu June 03, 2021 04:44 AM

This check list is based on 21.0.1 version of CP4BA
