Performance Check List of OCP for CP4BA

By Pierre-Andre Paumelle posted Thu June 03, 2021 04:23 AM


Moving to production requires taking care of some core performance aspects, so that you avoid troubleshooting afterward. Here is a list of checkpoints we gathered when planning to use the Cloud Pak for Business Automation (CP4BA), with a focus on:
  - Operational Decision Manager
  - Automation Decision Service
Both send events to Business Automation Insights (BAI).

For more details on those components and capabilities, please refer to the CP4BA documentation.

This article takes the performance tester's point of view and splits the checkpoints into several categories:
  • Hardware
  • Networking
  • Kafka
  • Elastic Search
  • Flink
  • Runtimes
  • Storage
  • Performance tests

Hardware

  • Verify that the underlying hardware is not obsolete.
  1. Open a terminal on a pod or a worker node
  2. cat /proc/cpuinfo
  • Also verify that your cluster is homogeneous at the hardware level.
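As a rough sketch of the check above, the following shell function summarizes the CPU models found in a cpuinfo file, so that the output from several worker nodes can be compared. The name cpu_models is illustrative, and the function assumes the x86-style "model name" field (other architectures use different field names).

```shell
# Summarize CPU models and how many logical CPUs report each one.
# cpu_models is a hypothetical helper; it assumes x86-style "model name" lines.
cpu_models() {
  cpuinfo="${1:-/proc/cpuinfo}"
  grep '^model name' "$cpuinfo" | cut -d: -f2- | sed 's/^ *//' | sort | uniq -c
}

# Usage on a pod or worker node:
#   cpu_models
```

If two worker nodes print different model names, the cluster is not homogeneous at the hardware level.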

Networking

Network performance is key, especially with a microservice architecture like that of CP4BA.
In this category, we target router configuration, route annotations and load balancing settings.
  • Verify that the router configuration is correctly sized. To do this, check the number of router pods and the CPU consumption of those pods.
  1. In the OpenShift console, go to Home > Search > IngressController > all projects and check replicas (the default was 2; we updated it to 5)
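The same sizing can be sketched from the command line. This assumes the router is the default IngressController in the openshift-ingress-operator namespace and that cluster metrics are available for "oc adm top"; the function names are illustrative.

```shell
# Scale the default IngressController to a given number of router replicas.
# Assumes the default controller in the openshift-ingress-operator namespace.
scale_routers() {
  replicas="${1:-5}"
  oc patch ingresscontroller/default -n openshift-ingress-operator \
    --type=merge -p "{\"spec\":{\"replicas\":${replicas}}}"
}

# Show CPU and memory consumption of the router pods (requires cluster metrics).
router_usage() {
  oc adm top pods -n openshift-ingress
}

# Usage:
#   scale_routers 5
#   router_usage
```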
  • Verify the HAProxy configuration to ensure that the load is balanced in round-robin mode and that nbproc has the right value.
  • Here we check HAProxy as the load balancer, but this checkpoint applies to any load balancing solution, using its own procedure.
  1. Find the HAProxy node address
  2. ssh root@<haproxy address>
  3. Look into haproxy.cfg
    1. vi /etc/haproxy/haproxy.cfg
    2. In the ingress-https backend, change "balance source" to "balance roundrobin"
    3. Set nbproc to 5 (the default is 1; we advise raising it to 5)
  4. Restart HAProxy
    1. systemctl daemon-reload
    2. systemctl restart haproxy
    3. (optional) systemctl status haproxy
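Before and after editing, the two settings can be read back without opening an editor. The sketch below assumes the section is declared as "backend ingress-https" (or "listen ingress-https") with the section keyword at the start of the line; haproxy_check is a hypothetical helper name.

```shell
# Print the nbproc value and the balance mode of the ingress-https section.
# Assumes section keywords (global, backend, ...) start at column 0.
haproxy_check() {
  cfg="${1:-/etc/haproxy/haproxy.cfg}"
  awk '
    /^[[:space:]]*nbproc[[:space:]]/ { print "nbproc=" $2 }
    /^(global|defaults|frontend|backend|listen)/ {
      in_backend = (($1 == "backend" || $1 == "listen") && $2 == "ingress-https")
    }
    in_backend && $1 == "balance" { print "balance=" $2 }
  ' "$cfg"
}

# Usage on the HAProxy node:
#   haproxy_check
```

If no nbproc line is printed, the setting is absent from the file and the default applies.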
Route Annotation for ADS and ODM runtimes
  • Verify that the load is balanced across all runtimes.
  • This annotation is key when the application that calls the decision services runs on a limited set of addresses.
  1. Open the routes of ADS and ODM
  2. Add the roundrobin load-balancing annotation
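From the command line, this can be sketched with the OpenShift route annotation haproxy.router.openshift.io/balance. The route names are installation-specific, so they are passed as arguments; annotate_roundrobin is an illustrative name.

```shell
# Set round-robin balancing on one or more routes.
# Route names differ per installation and are passed as arguments.
annotate_roundrobin() {
  for route in "$@"; do
    oc annotate route "$route" \
      haproxy.router.openshift.io/balance=roundrobin --overwrite
  done
}

# Usage (placeholders, not real route names):
#   annotate_roundrobin <odm-runtime-route> <ads-runtime-route>
```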

Kafka

Kafka stores messages in files, and these files can be split into partitions.
This split enables concurrent access to the messages, which is key for BAI performance.
  • Verify the number of partitions and the retention duration on every Kafka topic used by your BAI.
  • The horizontal scalability of BAI depends on the Kafka topic configuration.
  • The main point is the number of partitions per topic.
  • By default there is a single partition, which limits the performance of BAI, and the retention duration is set to one week.
  • In our case, we need 12 partitions, and the retention duration is limited to 2 hours. Here is how we tell this to OpenShift, with an extract of kafka-topic.yaml:
    apiVersion:
    kind: KafkaTopic
    metadata:
      name: icp4ba-bai-ingress
    spec:
      config:
        retention.ms: 7200000
      partitions: 12
      replicas: 3
      topicName: icp4ba-bai-ingress
Deleting the Kafka pods and re-creating the topics from kafka-topic.yaml produces output like this:
    pod "iaf-system-kafka-0" deleted
    pod "iaf-system-kafka-1" deleted
    pod "iaf-system-kafka-2" deleted
    "icp4ba-bai-ingress" deleted
    "icp4ba-bai-egress" deleted
    "icp4ba-bai-service" deleted
    "icp4ba-bai-ads-decision-execution-common-data" deleted
    "icp4ba-bai-bvt-ingress" deleted
    created
    created
    created
    created
    created
  1. Restart the Flink task managers
    1. Delete all the task manager pods to force a re-initialization
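The retention value and the follow-up checks can be sketched as below. The helper names are illustrative. The sketch assumes that the topic resources can be read with "oc get kafkatopic" and that the Flink task manager pod names contain "taskmanager"; neither is confirmed by this article.

```shell
# Convert a retention duration in hours to the millisecond value used by retention.ms.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# Read back the partition count of a topic resource after re-creating it.
topic_partitions() {
  oc get kafkatopic "$1" -o jsonpath='{.spec.partitions}'
}

# Delete the Flink task manager pods so that they are re-created.
# Assumption: the pod names contain "taskmanager".
restart_taskmanagers() {
  oc get pods | grep taskmanager | awk '{print $1}' | xargs oc delete pods
}

# Usage:
#   hours_to_ms 2                        # value for a 2-hour retention
#   topic_partitions icp4ba-bai-ingress
```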
Elastic Search 

Elastic Search stores time series in shards.
By default, BAI creates each time series with a single shard. From a performance standpoint, increasing the number of shards per time series improves scalability and insert throughput.
  • When a time series inside Elastic Search contains millions of documents, it is mandatory to have several shards.
  • The default configuration of BAI is limited to a single shard per time series.
  1. Add Elasticsearch shards to improve reliability and performance. Here is a sample that increases the number of shards to 12:
    1. export EK_USER=$(oc get secret icp4ba-es-auth -o json  | jq -cr '.data.username' | base64 --decode)
    2. export EK_PASS=$(oc get secret icp4ba-es-auth -o json  | jq -cr '.data.password' | base64 --decode)
    3. oc get routes | grep iaf-system-es | awk '{print $2}'
    4. curl -ku "$EK_USER:$EK_PASS" https://<cluster-Address>:<port>
    5. curl -ku "$EK_USER:$EK_PASS" -X POST https://<cluster-Address>:<port>/<timeseries>/_rollover -d'{"aliases":{"<timeseries>":{}},"settings":{"index.number_of_shards":12}}'  -H 'Content-Type: application/json'
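When several time series need the same treatment, it may help to generate the rollover body instead of editing it by hand, and to list the shards afterwards. In this sketch ES_URL is a hypothetical variable holding https://<cluster-Address>:<port>, and the function names are illustrative. The _cat/shards endpoint is the standard Elasticsearch API for listing shards.

```shell
# Build the JSON body for a rollover that sets the shard count of the new index.
rollover_body() {
  alias_name="$1"
  shards="${2:-12}"
  printf '{"aliases":{"%s":{}},"settings":{"index.number_of_shards":%s}}' \
    "$alias_name" "$shards"
}

# List the shards behind a time series to confirm the count after the rollover.
# ES_URL is an assumed variable, for example https://<cluster-Address>:<port>.
list_shards() {
  curl -ku "$EK_USER:$EK_PASS" "$ES_URL/_cat/shards/$1?v"
}

# Usage:
#   curl -ku "$EK_USER:$EK_PASS" -X POST "$ES_URL/<timeseries>/_rollover" \
#     -H 'Content-Type: application/json' -d "$(rollover_body <timeseries> 12)"
```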
Memory configuration 
  • The Elastic Search pod created by IAF has a memory request of 5 GB.
  • The heap size used by Elastic Search defaults to 1 GB. It should be smaller than the memory request of the pod.
  • In our case the memory request is 5 GB, so we should configure Elastic Search to use 4 GB of heap.
  1. In the OpenShift console, open the stateful sets
  2. In the stateful set of Elastic Search Data, add the following environment variable
    1. ES_JAVA_OPTS with the value -Xms4g -Xmx4g
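A command-line equivalent could look like the sketch below. The 1 GB headroom simply mirrors the 5 GB request and 4 GB heap used in this article and is not a general rule. The stateful set name is installation-specific and passed as an argument, and both function names are illustrative.

```shell
# Derive heap options that leave 1 GB of headroom below the pod memory request (in GB).
# The 1 GB headroom mirrors the 5 GB request / 4 GB heap example; adjust as needed.
heap_opts() {
  request_gb="$1"
  heap_gb=$(( request_gb - 1 ))
  echo "-Xms${heap_gb}g -Xmx${heap_gb}g"
}

# Set ES_JAVA_OPTS on a stateful set; the stateful set name is installation-specific.
set_es_heap() {
  oc set env "statefulset/$1" ES_JAVA_OPTS="$(heap_opts "$2")"
}

# Usage (placeholder name):
#   set_es_heap <elasticsearch-data-statefulset> 5
```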

Flink

To improve BAI performance, you should increase the parallelism of the Flink task managers per pillar.
The horizontal scalability of Flink is enabled by the partitions of the Kafka topics.
  • Install a Flink UI route to verify the status of the Flink task managers.
  1. Create a route to the service whose name contains bai-event and jobmanager
  2. The route should be passthrough
  3. The user and password for the Flink web UI are available in a secret whose name contains bai-event and admin-user
  4. With this UI you can see the status of the task managers
Flink dashboard
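The route creation can be sketched as follows. The route name flink-ui is hypothetical, the function names are illustrative, and the sketch takes the first matching service. If the service exposes several ports, you may also need to pass --port to select the web UI port.

```shell
# Find the job manager service (its name contains bai-event and jobmanager).
flink_service() {
  oc get svc -o name | grep bai-event | grep jobmanager | head -n 1
}

# Expose the service through a passthrough route; "flink-ui" is a hypothetical name.
create_flink_route() {
  svc="$(flink_service)"
  oc create route passthrough flink-ui --service="${svc#service/}"
}

# Usage:
#   create_flink_route
#   oc get route flink-ui
```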

Runtimes

To get the best performance from the ODM and ADS runtimes, you have to consider CPU and memory settings.
We advise checking both the CPU/memory requests and limits, and their values.
  • Verify that the CPU/memory requests and limits for the ODM and ADS runtimes are equal.
  1. These are set inside the CR (custom resource)
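One way to verify the result on a running pod is to read its Kubernetes QoS class: a pod is reported as "Guaranteed" only when every container has CPU and memory requests equal to its limits. The sketch below relies on that standard status field; the function names are illustrative and the pod name is a placeholder.

```shell
# Print the QoS class of a pod ("Guaranteed", "Burstable" or "BestEffort").
pod_qos() {
  oc get pod "$1" -o jsonpath='{.status.qosClass}'
}

# Report whether a pod meets the "requests equal limits" recommendation.
check_qos() {
  if [ "$(pod_qos "$1")" = "Guaranteed" ]; then
    echo "$1: requests equal limits"
  else
    echo "$1: requests and limits differ or are missing"
  fi
}

# Usage (placeholder pod name):
#   check_qos <odm-or-ads-runtime-pod>
```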

Storage

Kafka and Elastic Search need high-performance storage.
Using SSDs is a must.
Monitoring the Kafka and Elastic Search disks is mandatory.
A "disk full" exception could cause data loss.
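A minimal disk check, to be run inside the Kafka or Elastic Search pods (for example through "oc exec"), might look like this. The 80% threshold is an assumption chosen for illustration, not a value from this article, and the function names are hypothetical.

```shell
# Print the used percentage of the filesystem that holds the given path.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage reaches the limit (80% by default; an assumed threshold).
disk_alert() {
  pct="$(disk_used_pct "$1")"
  limit="${2:-80}"
  if [ "$pct" -ge "$limit" ]; then
    echo "WARNING: $1 is ${pct}% full"
  else
    echo "OK: $1 is ${pct}% full"
  fi
}

# Usage (placeholder path):
#   disk_alert <data-directory>
```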
Performance tests

You might want to verify that your performance results are sufficient, and for this you might already have invested in a performance tool like JMeter.
JMeter is good at "injecting" requests to load the runtimes.
In this case, pay attention to the behavior of the JMeter process and to network latency, both of which could distort the performance results.
  • Verify JMeter process usage (CPU/RAM)
  • CPU usage should not reach 80%
  • Network usage should not reach 75%
  1. Run "top" to check CPU/RAM usage while JMeter is running
  • Verify the latency between the machine where JMeter runs and the OpenShift cluster under test
  1. Ping the cluster from the bench machine
  2. Check that the average round trip is under 50 ms
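The latency check can be scripted by parsing the summary line that ping prints at the end (the "min/avg/max" line). This sketch reads the ping output from standard input; the function names are illustrative.

```shell
# Extract the average round trip (ms) from the summary line of ping output.
avg_rtt() {
  awk -F/ '/min\/avg\/max/ { print $5 }'
}

# Succeed when the average is under the limit (50 ms by default).
rtt_ok() {
  awk -v avg="$1" -v limit="${2:-50}" 'BEGIN { exit !(avg + 0 < limit + 0) }'
}

# Usage from the bench machine (placeholder host):
#   avg=$(ping -c 10 <cluster host> | avg_rtt)
#   rtt_ok "$avg" && echo "latency ok: ${avg} ms"
```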
Take Away
The SLA depends on the tuning of your installation and on the hardware performance (SSD disks, processors, and so on).
  1. Add partitions to every Kafka topic and set the retention duration to a value other than one week.
    1. oc get pods | grep kafka | awk '{print $1}' | xargs oc delete pods && oc delete -f kafka-topic.yaml && oc create -f kafka-topic.yaml
  2. Check the output to confirm that the Kafka pods and topics were deleted and that the topics were re-created.

1 comment



Thu June 03, 2021 04:44 AM

This checklist is based on version 21.0.1 of CP4BA.