Going to production with Business Automation Insights 20.0.3

By Anthony Damiano posted Fri January 22, 2021 04:31 AM

  

Introduction

So, you've been testing BAI in your demo environment and now you want to go to production with it? 
There are several things that you may have to consider, such as:
  • High availability
  • Security
  • Large data storage requirements
In this article, we demonstrate step by step how to achieve good performance under heavy loads, such as ODM sending 1,000 events of 10 KB each per second to BAI.

Architecture of BAI

Let's quickly review BAI architecture for version 20.0.3 to better understand what we can do to make BAI even better.
The pipeline is as follows: each CP4A component sends its own events to Kafka topics. The events are then processed (and transformed if necessary) by Flink, which puts the results into an Elasticsearch instance (or into your own data lake such as HDFS, or into another Kafka using Kafka egress). Finally, you can visualize the Elasticsearch data in Business Performance Center (or Kibana).

This article will help you to address the potential bottlenecks in this pipeline.
For instance, if you see a lot of "offset lag" (the difference between the latest offset written to a Kafka topic and the offset that the consumer has already processed, that is, the number of messages that remain to be processed), you can tune Flink and Elasticsearch to reduce this lag. If you see that events are delayed in Kafka itself, then you should review your Kafka configuration.
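To make the notion of offset lag concrete, here is a minimal Python sketch. It does not talk to Kafka; the offsets are hypothetical numbers that you would normally read from your Kafka monitoring tooling, and the function name is ours.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: latest offset in the topic minus the consumer's committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset is treated as fully unprocessed.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


# Hypothetical offsets for a topic with 3 partitions.
end_offsets = {0: 1500, 1: 1480, 2: 1520}
committed_offsets = {0: 1500, 1: 1200, 2: 1000}

lag = consumer_lag(end_offsets, committed_offsets)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A total that keeps growing over time suggests that the consumers (Flink, and Elasticsearch behind it) are not keeping up with the producers.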

Steps to Install and Configure your Environment

1) Install Kafka

If you are using IBM Event Streams, you should pick the "Production 3 brokers" sample when provisioning the Kafka cluster via the Web Console of OpenShift.
This will allow you to have 3 Kafka brokers that use secure connections.
It's best to install Event Streams in the same namespace as BAI. This will result in a simpler configuration.
Keep in mind that you can also bring your own Kafka if you want. The same requirements apply; you'll need multiple brokers and secure connections for a production-ready environment.

2) Choose the optimal Flink parallelism

For each component, you can manually set the number of Flink processing jobs that read the related Kafka topics. This "parallelism" is constrained by the number of partitions in the Kafka topics: it must be less than or equal to the number of partitions.
It is configured in each component's CR file.
Here is an example for ODM:
bai_configuration:
  odm:
    # The number of parallel instances (task slots) to use for running the processing job.
    # For High Availability, use at least 2 parallel instances.
    parallelism: 4
In our case, we chose a parallelism value of 4 so that each Flink job handles around 250 transactions per second (1,000 events per second divided by 4).
If you are changing parallelism after installing BAI, you must delete the old Kubernetes job so that it can be recreated with the new configuration.
There is one job per component. For ODM the job is named <cr_instance_name>-bai-odm; delete it and wait for the CP4A operator to recreate it.

You will then see new pods named <cr_instance_name>-bai-flink-taskmanager-<number_of_the_pod>.
Having multiple Flink jobs running in parallel on the same Kafka topic helps speed up the ingestion of events from Kafka.
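The parallelism choice in this step can be sketched as a small calculation. This is only an illustration under our own assumptions: the function name is hypothetical, and the 250 events per second per job is the target we used in our tests, not a limit documented for Flink.

```python
import math


def required_parallelism(events_per_second: float,
                         per_job_capacity: float,
                         partitions: int,
                         ha_minimum: int = 2) -> int:
    """Smallest parallelism that meets the target rate per job and the HA minimum."""
    needed = max(math.ceil(events_per_second / per_job_capacity), ha_minimum)
    # Parallelism cannot exceed the number of Kafka topic partitions.
    if needed > partitions:
        raise ValueError(
            f"parallelism {needed} exceeds {partitions} partitions; add partitions first"
        )
    return needed


# Our case: 1,000 events per second, about 250 per Flink job, 12 partitions.
parallelism = required_parallelism(1000, 250, partitions=12)
print(parallelism, 1000 / parallelism)
```

With these inputs the result is 4, which matches the value in the CR example above.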

3) Set the number of Kafka partitions per topic

We recommend manually creating the Kafka topics before installing BAI. This way, you have full control over the topic properties instead of relying on the BAI predefined settings.

The topic names must match those in your CR file (probably the same names as in your current demo environment). Here is where you can set them:

bai_configuration:
  settings:
    # If not set, topics with default names as below are created.
    ingressTopic: "{{ meta.name }}-ibm-bai-ingress"
    egressTopic: "{{ meta.name }}-ibm-bai-egress"
    serviceTopic: "{{ meta.name }}-ibm-bai-service"
  • Partitions: use at least the number of brokers multiplied by the parallelism value. In our case, we pick 12 partitions (3 brokers x 4 Flink jobs).
  • Data retention: establish your disk capacity requirements, and choose a retention period that minimizes the risk of exceeding that capacity.
  • Replicas: use 3, so that if one Kafka broker is down, the other 2 can handle the load.
  • Each topic should have the same settings as the others.
By using these settings, you'll improve performance for Kafka clients, which in our case are the CP4A components.
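The topic settings listed above can be written down as a small planning helper. This is a sketch only: the `TopicPlan` class, the `plan_topics` function, and the CR name `mycr` are hypothetical, and the code merely computes the values; you still create the topics with your own Kafka tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopicPlan:
    name: str
    partitions: int
    replicas: int


def plan_topics(cr_name: str, brokers: int, parallelism: int,
                replicas: int = 3) -> List[TopicPlan]:
    """Apply the same partition and replica settings to the three BAI topics."""
    if replicas > brokers:
        raise ValueError("a topic cannot have more replicas than there are brokers")
    partitions = brokers * parallelism  # one partition per broker, times the parallelism
    suffixes = ("ingress", "egress", "service")
    return [TopicPlan(f"{cr_name}-ibm-bai-{s}", partitions, replicas) for s in suffixes]


plans = plan_topics("mycr", brokers=3, parallelism=4)
for plan in plans:
    print(plan)
```

Keeping the three topics identical makes it easier to reason about throughput and to change the parallelism later.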

4) Connect to an external Elasticsearch 

In some circumstances, you may want to reuse your own Elasticsearch infrastructure, which is tailored to your needs and over which you have full control.
During our tests, we found that an external Elasticsearch performed better. The improvement comes from its attached disks being faster than our OCP cluster storage, which relies on NFS.
In our case, we picked a cluster of 3 nodes, each acting as master, data, and client node, with 2 HAProxy load balancers in front of it, accessible via a virtual IP.
This allows us to have HA for the Elasticsearch cluster. Each data node has a large disk capacity to store our events for a long period of time.
Configure your CR file in the following manner to enable BAI to use an external Elasticsearch:
bai_configuration:
  elasticsearch:
    externalKibanaUrl: 'http://kibana.mycorp.org:5601'
    install: false
    username: admin
    password: password
    serverCertificate: >-
      LS0tLS1CRUQ0FURS0tLS0t
    url: 'https://myloadbalancer.mycorp.org:9200'
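The shortened serverCertificate value in the example above looks like a base64-encoded PEM certificate ("LS0tLS1C" is how the base64 encoding of "-----B" begins). Assuming that this is the expected format (please check the Knowledge Center for your version), a value of that shape can be produced as sketched below. The PEM text here is a placeholder, not a real certificate.

```python
import base64


def encode_certificate(pem_text: str) -> str:
    """Base64-encode a PEM certificate so that it fits on a single YAML value."""
    return base64.b64encode(pem_text.encode("ascii")).decode("ascii")


# Placeholder PEM: a real certificate carries its encoded data between the two markers.
pem = "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n"

encoded = encode_certificate(pem)
print(encoded)
```

Decoding the value again with `base64.b64decode` is a quick way to check that you pasted the complete certificate into the CR.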

5) Configure your CR to use Kafka correctly and launch BAI install

Since you have multiple Kafka brokers, you have to configure BAI to use all of them.
You can either configure the full list of Kafka brokers or configure a single bootstrap URL. This allows the event emitters to take advantage of Kafka horizontal scaling without issues.
If you have installed IBM Event Streams in the same namespace as BAI, you can use the following script to automatically configure your CR with the Kafka information.

6) Set the number of shards per index 

Now that you have installed BAI, you have to increase the number of shards for the Elasticsearch indexes.
The number of shards depends on the number of events per day that you expect to process (event rate).
In our case, ODM sends 1000 events per second during a typical 8-hour workday, and each message is 10 KB. This gives us a total of ~28.8 GB per day, multiplied by 2 to account for replicas (nb replicas - 1 = 2). According to the Elasticsearch reference, it's best to keep each active shard under 20 GB. We also want at least 1 shard per data node so that all nodes are used when querying the data, so we pick 3 shards in our case.
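The sizing reasoning above can be sketched as follows. The function name and its parameters are ours; the daily volume, the replica factor, and the 20 GB limit are simply the figures quoted in the previous paragraph, taken as inputs.

```python
import math


def shards_per_index(daily_index_gb: float,
                     replica_factor: int,
                     max_shard_gb: float,
                     data_nodes: int) -> int:
    """Enough shards to stay under the size limit, and at least one per data node."""
    total_gb = daily_index_gb * replica_factor
    by_size = math.ceil(total_gb / max_shard_gb)
    return max(by_size, data_nodes)


# Figures from our case: ~28.8 GB per day, factor of 2, 20 GB per shard, 3 data nodes.
shards = shards_per_index(28.8, replica_factor=2, max_shard_gb=20, data_nodes=3)
print(shards)
```

If your own daily volume is larger, the size constraint takes over and the result grows beyond the number of data nodes.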
To do this, after BAI has been installed, you must edit the index template that was generated by the BAI installation.
To get the list of templates, you can query your cluster with the following command:
curl -k -X GET https://admin:password@mycluster:9200/_template
You should see one template per component.
For each component that you are using, you must save the component template to a file.
Example: 
curl -k -X GET https://admin:password@mycluster:9200/_template/odm-timeseries > template.json
It should look like this:
{"odm-timeseries":{"order":0,"index_patterns":["odm-timeseries*"],"settings":{},"mappings":{"_meta":{"updateDate":"2020-12-03T16:41:32.007Z","version":"3.0.4"},"dynamic":true,"properties":{"duration":{"type":"long"},"odmType":{"type":"keyword"},"trace":{"properties":{"task":{"properties":{"names":{"type":"text","fields":{"keyword":{"ignore_above":256,"type":"keyword"}}},"durations":{"type":"long"}}},"rule":{"properties":{"names":{"type":"text","fields":{"keyword":{"ignore_above":256,"type":"keyword"}}},"durations":{"type":"long"}}}}},"rulesetPath":{"type":"keyword"},"id":{"type":"keyword"},"type":{"type":"keyword"},"version":{"type":"keyword"},"errors":{"type":"text","fields":{"keyword":{"ignore_above":256,"type":"keyword"}}},"timestamp":{"type":"date"}}},"aliases":{}}}
Then edit the file template.json, and set the number of shards in the "settings" part.
Example:
"settings": {
      "number_of_shards": 3
    },
Also remove the template name wrapper at the start of the file ({"odm-timeseries": in this example) and the matching '}' at the end.
It should look like this:
{"order":0,"index_patterns":["odm-timeseries*"],"settings":{"number_of_shards": 3},"mappings":{"_meta":{"updateDate":"2020-12-03T16:41:32.007Z","version":"3.0.4"},"dynamic":true,"properties":{"duration":{"type":"long"},"odmType":{"type":"keyword"},"trace":{"properties":{"task":{"properties":{"names":{"type":"text","fields":{"keyword":{"ignore_above":256,"type":"keyword"}}},"durations":{"type":"long"}}},"rule":{"properties":{"names":{"type":"text","fields":{"keyword":{"ignore_above":256,"type":"keyword"}}},"durations":{"type":"long"}}}}},"rulesetPath":{"type":"keyword"},"id":{"type":"keyword"},"type":{"type":"keyword"},"version":{"type":"keyword"},"errors":{"type":"text","fields":{"keyword":{"ignore_above":256,"type":"keyword"}}},"timestamp":{"type":"date"}}},"aliases":{}}
Finally, push the new template into your Elasticsearch using your own admin credentials:
curl -H "Content-Type: application/json" -k -X PUT https://admin:password@mycluster:9200/_template/odm-timeseries -d@template.json
If everything went well, you should see the following message: 
{"acknowledged":true}
If something went wrong, it's probably because the JSON being sent is malformed.
From now on, the new indexes created by BAI have the correct number of shards. Note that, due to an Elasticsearch limitation, you cannot change the number of shards of an existing index. So either delete the existing indexes, if you don't need the data that they contain, or reindex them.
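The manual edit of template.json described in this step (remove the template name wrapper, then set the number of shards) can also be scripted, which reduces the risk of sending malformed JSON. Here is a minimal sketch; the function name is ours, and the sample input is a shortened version of the template shown earlier.

```python
import json


def prepare_template(raw: str, template_name: str, shards: int) -> str:
    """Unwrap the template returned by GET _template and set number_of_shards."""
    wrapped = json.loads(raw)
    template = wrapped[template_name]  # drops the {"<template name>": ...} wrapper
    template.setdefault("settings", {})["number_of_shards"] = shards
    return json.dumps(template)


# Shortened sample of what GET _template/odm-timeseries returns.
raw = ('{"odm-timeseries":{"order":0,"index_patterns":["odm-timeseries*"],'
       '"settings":{},"mappings":{},"aliases":{}}}')

body = prepare_template(raw, "odm-timeseries", 3)
print(body)
```

You can write the result to template.json and push it with the same curl PUT command as above.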

7) Configure Business Performance Center with UMS teams

You can set fine-grained permissions on the data presented in BPC. By creating a team of administrator users, you can then use an admin account to set the permissions for other teams.

a) Create a team in UMS following the Knowledge Center procedure: https://www.ibm.com/support/knowledgecenter/en/SSYHZ8_20.0.x/com.ibm.dba.offerings/topics/con_ums_teams_manage.html

b) Modify your CR to set the UMS team as the BPC administration team, and make sure that access is not granted to all users:

bai_configuration:
  businessPerformanceCenter:
    # The UUID identifier, which is taken from UMS, of the team that you nominate to be the administration team
    # for Business Performance Center.
    adminTeam: "<the UMS team uuid>"
    allUsersAccess: false
After applying the new CR, you should see the BPC pod restart after the operator reconciliation.
This means that the UMS team is now the administrator team for BPC.
c) Finally, log in to BPC with an administrator account and set the permissions for the other users by following the Knowledge Center procedure: https://www.ibm.com/support/knowledgecenter/en/SSYHZ8_20.0.x/com.ibm.dba.bai/bpc_topics/tsk_bai_bpc_permissions.html

Summary

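In summary, here is what we've done to tune our pipeline:
  • Installed Kafka with 3 brokers and secure connections
  • Set the Flink parallelism to 4 for ODM
  • Created the Kafka topics manually, with 12 partitions and 3 replicas each
  • Connected BAI to an external Elasticsearch cluster
  • Configured the CR to use all the Kafka brokers
  • Set the number of shards per index to 3
  • Restricted access to Business Performance Center with a UMS administrator team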
We hope that this guide will help you achieve your BAI performance and scalability goals in your production environment.
I couldn't have written this article without the collective effort of Johanne Sebaux, Nicolas Peulvast, Pierre-Andre Paumelle, Lionel Peron, Jose De freitas, Peter Gilliver and the whole BAI team.
If you still have questions or issues, you can post them here, and we'll try to answer them.

Anthony Damiano.

#BusinessAutomationInsights(BAI)
#CloudPakforAutomation