Decision Management (ODM, ADS)

Performance Check List of OCP for CP4BA 23.0.1

By NICOLAS PEULVAST posted Thu July 20, 2023 06:01 AM

  
Target audience: Performance Tester with Administrator role
Estimated duration: 120 minutes

A downloadable version of this Blog Post is available here.

Moving to production requires taking care of some core performance aspects, so that you avoid troubleshooting afterward.
Here is a list of checkpoints we've gathered when planning to use Cloud Pak for Business Automation 23.0.1, with a focus on:
  - Operational Decision Manager (ODM)
  - Automation Decision Service (ADS)
In version 23.0.1, only ODM sends events to Business Automation Insights (BAI).
For more details on these components and capabilities, please refer to the CP4BA documentation.
This article takes the performance tester's point of view and splits the checkpoints into several categories:
  • Hardware
  • Networking
  • Kafka
  • Elasticsearch
  • Flink
  • Runtimes
  • Storage
  • Performance tests

Hardware

Checkpoint

  • Verify that the underlying hardware is not obsolete.

Procedure

  1. Open a terminal on a pod or a worker node
  2. Launch the command: cat /proc/cpuinfo
  3. You should also verify that your cluster is homogeneous at the hardware level
  4. The bogomips value can be a good reference for comparing your environments (see the additional command below)

Kubectl command line example

kubectl exec icp4adeploy-odm-decisionserverruntime-69b8d46c77-jqfv4 -n dba2301 -- cat /proc/cpuinfo
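
To compare the bogomips values across environments, the same command can be narrowed down (the pod name is the same illustrative one as above; reuse your own pod):

kubectl exec icp4adeploy-odm-decisionserverruntime-69b8d46c77-jqfv4 -n dba2301 -- grep -i bogomips /proc/cpuinfo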

Networking

Network performance is key, especially with a microservice architecture like that of CP4BA.
In this category, we target the router configuration, route annotations, and load-balancing settings.

Router

Checkpoint

  • Verify that the router configuration is correctly sized. To do this, check the number of router pods and verify the CPU consumption of the router pods.

Procedure

  1. In the OpenShift console
  2. Open: Home > Search > IngressController > all projects > default > replicas (the default is 2)
  3. Update to 5 replicas if needed

Kubectl command line example

Get the number of replicas for the default Ingress Controller:

kubectl get IngressController default -n openshift-ingress-operator -o=jsonpath='Replicas: {.status.availableReplicas}{"\n"}'

Increase the number of replicas for the default Ingress Controller:

kubectl patch IngressController default -n openshift-ingress-operator --type=json -p '[{ "op": "replace", "path": "/spec/replicas", "value": 5 }]'
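
To check the CPU consumption of the router pods (this assumes the cluster metrics server is available):

kubectl top pods -n openshift-ingress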

HAProxy

Checkpoint

  • Verify the HAProxy configuration to ensure that the load is balanced in round-robin mode and with the right nbproc value.
  • Here we use HAProxy as the load balancer, but this checkpoint applies to whatever load-balancing solution you use, with the procedure adapted accordingly.

Procedure

  1. Find the HAProxy node address
  2. Use the command line: ssh root@<haproxy-address>
  3. Look into haproxy.cfg
    1. Use the command line: vi /etc/haproxy/haproxy.cfg
    2. In the ingress-https backend, change "balance source" to "balance roundrobin"
    3. Set the nbproc value to 5 (the default is 1; we advise increasing it to 5); see the extract after this list
  4. Restart HAProxy
    1. Use the command line: systemctl daemon-reload
    2. Use the command line: systemctl restart haproxy
    3. (optional) systemctl status haproxy
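
A minimal haproxy.cfg extract illustrating these settings (the backend name and server addresses are illustrative and depend on your own configuration):

global
    # nbproc increased from the default of 1 to 5
    nbproc 5

backend ingress-https
    # balance the load in round-robin instead of by source address
    balance roundrobin
    mode tcp
    server worker0 192.0.2.10:443 check
    server worker1 192.0.2.11:443 check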

Route Annotation for ADS and ODM runtimes

Checkpoint

  • Verify that the load is balanced between all runtimes.
  • This annotation is key when the application that calls the decision services runs from a limited set of addresses.

Procedure

  1. Open the routes of ADS and ODM
  2. Verify that the annotation haproxy.router.openshift.io/balance: roundrobin is defined on the cpd route

Kubectl command line example

Get Route annotation from the Zen route:

kubectl get Route cpd -n dba2301 -o=jsonpath='Balance: {.metadata.annotations.haproxy\.router\.openshift\.io/balance}{"\n"}'

Upgrade the annotation if needed:

kubectl annotate route cpd -n dba2301 --overwrite haproxy.router.openshift.io/balance='roundrobin'

Kafka 

Kafka stores messages in files. These files can be split into partitions.
This split enables concurrent access to the messages, which is key for BAI performance.

Checkpoint

  • Verify the number of partitions and the retention duration for every Kafka topic used by your BAI.
  • The horizontal scalability of BAI depends on the configuration of the Kafka topics.
  • The main point is the number of partitions per topic.
  • By default, there is a single partition, which limits the performance of BAI, and the retention duration is set to one week.
  • In our case, we need 12 partitions and the retention duration is limited to 2 hours. Here is how to declare this in OpenShift, with an extract of kafka-topic.yaml:
    apiVersion: ibmevents.ibm.com/v1beta1
    kind: KafkaTopic
    metadata:
      name: icp4ba-bai-ingress
    spec:
      config:
        retention.ms: 7200000
      partitions: 12
      replicas: 3
      topicName: icp4ba-bai-ingress

Procedure

  1. Restart the Flink task managers
    1. Delete all the task manager pods to force a re-initialization

Kubectl command line example

Get Kafka key information for ODM:

kubectl get KafkaTopic icp4ba-bai-odm-ingress -n dba2301 -o=jsonpath='Partitions: {.spec.partitions}{"\n"}Retention: {.spec.config.retention\.ms}{" ms\n"}'

Get Kafka key information for other products:

kubectl get KafkaTopic icp4ba-bai-ingress -n dba2301 -o=jsonpath='Partitions: {.spec.partitions}{"\n"}Retention: {.spec.config.retention\.ms}{" ms\n"}'

Restart all Flink Task Managers:

for pod in $(kubectl get pods -n dba2301 | grep taskmanager | awk '{print $1}'); do kubectl delete pod $pod -n dba2301 --force --grace-period 0 ; done
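
If you prefer to update an existing topic in place rather than applying a new kafka-topic.yaml, a kubectl patch along these lines should work (a sketch; adjust the topic name and namespace to your deployment):

kubectl patch KafkaTopic icp4ba-bai-ingress -n dba2301 --type=merge -p '{"spec":{"partitions":12,"config":{"retention.ms":7200000}}}'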

Elasticsearch 

Elasticsearch stores time series in shards.
By default, BAI creates time series with a single shard; from a performance standpoint, increasing the number of shards per time series improves the scalability and the insert throughput.

Shards

Checkpoint

  • When a time series inside Elasticsearch contains millions of documents, it is mandatory to have several shards.
  • The default configuration of BAI is limited to a single shard per time series.

Procedure

  1. Add Elasticsearch shards to improve reliability and performance.
  2. Here is a sample that increases the number of shards to 12.
    1. Get the user: export EK_USER=$(oc get secret iaf-system-elasticsearch-es-default-user -o json | jq -cr '.data.username' | base64 --decode)
    2. Get the password: export EK_PASS=$(oc get secret iaf-system-elasticsearch-es-default-user -o json | jq -cr '.data.password' | base64 --decode)
    3. Get the URL: export CLUSTER_ADDRESS=$(oc get routes | grep iaf-system-es | awk '{print $2}')
    4. Verify the version used: curl -ku "$EK_USER:$EK_PASS" https://$CLUSTER_ADDRESS
    5. Change the number of shards: curl -vgku "$EK_USER:$EK_PASS" -X POST "https://$CLUSTER_ADDRESS/icp4ba-bai-odm-timeseries-write-ibm-bai/_rollover?pretty" -d'{"aliases":{"icp4ba-bai-odm-timeseries-ibm-bai":{}},"settings":{"index.number_of_shards":12}}' -H 'Content-Type: application/json'

Additional documentation is available here.

Kubectl command line example

Get the username for Elasticsearch:

kubectl get secret iaf-system-elasticsearch-es-default-user -n dba2301 -o=go-template='{{.data.username|base64decode}}{{"\n"}}'

Get the password for Elasticsearch:

kubectl get secret iaf-system-elasticsearch-es-default-user -n dba2301 -o=go-template='{{.data.password|base64decode}}{{"\n"}}'

Get the route information for Elasticsearch:

kubectl get routes -n dba2301 | grep iaf-system-es | awk '{print $2}'
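
To verify the shard layout after the rollover, the standard Elasticsearch _cat API can be used (assuming the credentials and route obtained above):

curl -ku "$EK_USER:$EK_PASS" "https://$CLUSTER_ADDRESS/_cat/shards/icp4ba-bai-odm-timeseries-*?v"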

Memory configuration 

Checkpoint

  • The Elasticsearch pod created by IAF has a memory request of 5 GB.
  • The heap size used by Elasticsearch is 1 GB by default. It should be smaller than the memory request of the pod.
  • In our case, the memory request is 5 GB, so we should modify the Elasticsearch configuration to use 4 GB of heap.

Procedure

  1. Use the OpenShift console and access the stateful sets (or use the kubectl patch example below)
  2. Add the following environment variable to the StatefulSet of the Elasticsearch data pods
    1. ES_JAVA_OPTS = -Xms4g -Xmx4g

Kubectl command line example

Add the environment variable to the Elasticsearch data StatefulSet (note that the iaf-system operator may revert this change):

kubectl patch StatefulSets iaf-system-elasticsearch-es-data -n dba2301 --type=json -p '[{ "op": "add", "path": "/spec/template/spec/containers/0/env/0", "value": {"name": "ES_JAVA_OPTS", "value": "-Xms4g -Xmx4g"} }]'
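
To verify that the variable was added to the container spec (assuming it landed at index 0 as in the patch above):

kubectl get StatefulSets iaf-system-elasticsearch-es-data -n dba2301 -o=jsonpath='{.spec.template.spec.containers[0].env[0]}{"\n"}'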

Flink

To improve BAI performance, you should increase the parallelism of the Flink task managers per pillar.
The horizontal scalability of Flink is enabled by the use of partitions in the Kafka topics.

Checkpoint

  • Install the Flink UI route to verify the Flink task manager status.

Procedure

  1. Change the Custom Resource of CP4BA following the documentation here.
  2. For example, for a route to the Flink web interface to be automatically created, set the spec.bai_configuration.flink.create_route parameter to true in the custom resource.
  3. The user and password used to access the Flink web UI are available in a secret whose name contains bai-event and admin-user (the secret name can be retrieved with the command below)
  4. With this UI, you can see the status of the task managers
Figure: Flink dashboard

Kubectl command line example

Changing the CR to create a route for the Flink Web UI:

kubectl patch ICP4AClusters dba2301bai -n dba2301 --type=merge -p '{"spec":{"bai_configuration":{"flink":{"create_route": true}}}}'

Get the secret that contains the credentials to connect to the Flink Web UI:

kubectl get InsightsEngine dba2301bai -n dba2301 -o jsonpath='{.status.components.flinkUi.endpoints[?(@.scope=="External")].authentication.secret.secretName}{"\n"}'

Get the username:

kubectl get secret dba2301bai-insights-engine-flink-admin-user -n dba2301 -o=go-template='{{.data.username|base64decode}}{{"\n"}}'

Get the password:

kubectl get secret dba2301bai-insights-engine-flink-admin-user -n dba2301 -o=go-template='{{.data.password|base64decode}}{{"\n"}}'

Get the URI of the Flink Web UI (you have to login to Zen CPD first and then use the Flink credentials):

kubectl get InsightsEngine dba2301bai -n dba2301 -o jsonpath='{.status.components.flinkUi.endpoints[?(@.scope=="External")].uri}{"\n"}'
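
To check how many Flink task manager pods are currently running (same namespace as in the previous commands):

kubectl get pods -n dba2301 | grep taskmanager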

Runtimes

To get the best performance from the ODM and ADS runtimes, you must consider CPU and memory settings.
We advise checking the CPU/memory requests AND limits and aligning their values.
Go to the relevant folder in cert-kubernetes/descriptors/patterns to find all of the templates:
  • For ODM, use the fully customizable decisions template, ibm_cp4a_cr_production_FC_decisions.yaml, to copy lines from and paste them into your CR file.
  • For ADS, use the fully customizable decisions template, ibm_cp4a_cr_production_FC_decisions_ads.yaml, to copy lines from and paste them into your CR file.
For more information about downloading cert-kubernetes, see Preparing a client to connect to the cluster.

Checkpoint

  • Verify that the CPU/memory requests AND limits for the ODM and ADS runtimes are equal

Procedure

  1. This is set inside the CR (an illustrative extract follows)
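
An illustrative CR extract with aligned requests and limits for the Decision Server Runtime (the values and the exact resources structure below are assumptions; check the field names in the fully customizable template mentioned above):

odm_configuration:
  decisionServerRuntime:
    resources:
      requests:
        cpu: "2"        # illustrative values; keep requests and limits aligned
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi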

Kubectl command line example

Consult the current Decision Server Runtime configuration:

kubectl get ICP4AClusters dba2301bai -n dba2301 -o=jsonpath='Decision Server Runtime Config: {.spec.odm_configuration.decisionServerRuntime}{"\n"}'

Consult the current Decision Server Console configuration:

kubectl get ICP4AClusters dba2301bai -n dba2301 -o=jsonpath='Decision Server Console Config: {.spec.odm_configuration.decisionServerConsole}{"\n"}'

Consult the current Decision Center configuration:

kubectl get ICP4AClusters dba2301bai -n dba2301 -o=jsonpath='Decision Center Console Config: {.spec.odm_configuration.decisionCenter}{"\n"}'

Consult the current Decision Runner configuration:

kubectl get ICP4AClusters dba2301bai -n dba2301 -o=jsonpath='Decision Runner Config: {.spec.odm_configuration.decisionRunner}{"\n"}'

Same patterns apply for ADS.

Storage

Kafka and Elasticsearch need storage with a high level of performance.
Using SSDs is a must.
Monitoring the Kafka and Elasticsearch disks is mandatory: a disk-full condition can cause data loss.
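
A quick way to check which storage classes back the Kafka and Elasticsearch persistent volumes (namespace assumed as in the previous examples):

kubectl get pvc -n dba2301 -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,CAPACITY:.status.capacity.storage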

Performance tests

You might want to verify that you get sufficient performance results, and for this you might already have invested in a performance tool like JMeter.
JMeter is good at "injecting" requests to put load on the runtimes.
In this case, you should take care of the JMeter process behavior and of any network latency that could skew the performance results.

JMeter

Checkpoint

  • Verify the JMeter process usage (CPU/RAM)
  • CPU usage should not reach 80%
  • Network usage should not reach 75%

Procedure

  1. Run "top" or "htop" for cpu/ram usage when JMeter is running to control that the limit reached is not on the JMeter side: you do not have to reach 100% of your CPU limit.
  2. If you want to test your application without having the overhead of the network (session creation and authentication), prefer using:
    1. The Basic Authentication method
    2. The "Same user on each iteration" option checked on the Thread group in combination with an HTTP Cookie Manager
    3. The "Use KeepAlive" option checked on the HTTP Request

Latency

Checkpoint

  • Verify the latency between the machine where JMeter is executed and the OpenShift cluster under test

Procedure

  1. Ping the cluster from the bench machine (see the example below)
  2. Check that the average round trip is under 50 ms
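
For example, pinging the host of the cpd route used earlier (this assumes ICMP is allowed between the bench machine and the cluster):

ping -c 20 $(kubectl get route cpd -n dba2301 -o=jsonpath='{.spec.host}')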

Take Away



#CloudPakforBusinessAutomation #migration #OperationalDecisionManager(ODM)
