
How to investigate a memory leak with CP4BA 22.0.1?

By NICOLAS PEULVAST posted Tue October 25, 2022 12:10 PM

  

Target audience: 
CP4BA user with Cluster Administrator role 
Estimated duration: 30 minutes


A downloadable version of this Blog Post is available here.

This blog post grew out of a memory leak that occurred on the pre-production environment of our ODM on Cloud offering on May 10, 2022.

The goal is to propose an approach that gives you some insight into the possible root cause of a memory leak occurring in Cloud Pak for Business Automation.

In this article we focus on the ADS runtime pod, but this approach works for every pod based on the official Liberty base image.

Tip: to know whether a Cloud Pak pod uses the Liberty base image, you can run the following command with your own pod name:

> oc exec -ti dba2201ads-ads-rest-api-64cfbf658c-b7klr -- [ -d "/opt/ibm/wlp/usr/servers/defaultServer" ] && echo "yey" || echo "nope"

If yey appears, then you can continue this article.

If nope appears, then this article will not work for you ... but you can read it anyway! ;)
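If you have several candidate pods, you can run the same check in a loop. Here is a minimal sketch (the grep pattern is only an example, adapt it to your own pod names):

> for p in $(oc get pods -o name | grep ads); do oc exec $p -- [ -d "/opt/ibm/wlp/usr/servers/defaultServer" ] && echo "$p: yey" || echo "$p: nope"; done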

Context

The visible evidence of a memory leak is memory consumption that keeps increasing over hours or days, sometimes very slowly.
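If the cluster metrics are available, a quick way to observe this trend from the command line is to sample the pod memory regularly over a few hours (the pod name below is the one used later in this article) and watch for a MEMORY value that keeps climbing while the workload stays flat:

> oc adm top pod dba2202ads-ads-runtime-service-7d7b7777db-pnxf4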


In our case, there was no special activity registered on the Pre-Production environment during the leak.

Procedure for the Heap Dump


As the memory consumption increases steadily, take two snapshots separated by at least 3 hours, or more if possible. You do not need to reach an OutOfMemoryError on your pod during this timeframe.

1- First, log in to the cluster using the oc login command.

2- Get the name of your Pod that revealed the memory leak.

> oc get pods | grep ads-runtime-service
dba2202ads-ads-runtime-service-7d7b7777db-pnxf4 1/1 Running 0 7d18h
dba2202ads-ads-runtime-service-7d7b7777db-z97h4 1/1 Running 0 7d18h

 
Use the correct pod name in the commands that follow (see the tip just below).
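Tip: to limit copy/paste mistakes, you can store the pod name in a shell variable and reuse it; this is only a convenience, and the rest of this article spells the name out in full:

> POD=dba2202ads-ads-runtime-service-7d7b7777db-pnxf4
> oc exec -ti $POD -- /opt/ibm/wlp/bin/server version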

3- Then take a snapshot of the heap using the following command:


> oc exec -ti dba2202ads-ads-runtime-service-7d7b7777db-pnxf4 -- /opt/ibm/wlp/bin/server dump --include=heap
Defaulted container "runtime-service" out of: runtime-service, tls-init (init)

Dumping server defaultServer.
Server defaultServer dump complete in /opt/ibm/wlp/output/defaultServer/defaultServer.dump-22.10.25_15.22.29.zip.
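If you lose track of the exact file name (you will need it for the copy in the next step), you can list the Liberty output directory, assuming the usual ls tool is present in the image, which is the case for the standard Liberty images:

> oc exec -ti dba2202ads-ads-runtime-service-7d7b7777db-pnxf4 -- ls -lh /opt/ibm/wlp/output/defaultServer/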


4- Copy the generated file from the pod to your local disk in order to ease the analysis.

> oc cp dba2202ads-ads-runtime-service-7d7b7777db-pnxf4:/opt/ibm/wlp/output/defaultServer/defaultServer.dump-22.10.25_15.22.29.zip defaultServer.dump-22.10.25_15.22.29.zip

Defaulted container "runtime-service" out of: runtime-service, tls-init (init)
tar: Removing leading `/' from member names


5- Finally, remove this large ZIP file from the pod, to make sure the pod does not consume too much disk storage and risk being evicted from the cluster.

> oc exec -ti dba2202ads-ads-runtime-service-7d7b7777db-pnxf4 -- rm /opt/ibm/wlp/output/defaultServer/defaultServer.dump-22.10.25_15.22.29.zip

Defaulted container "runtime-service" out of: runtime-service, tls-init (init)

Tip for Ephemeral Storage


To take the snapshot, you must be careful to stay within the pod's predefined ephemeral-storage limits, if any are defined.

If your pod gets evicted during the snapshot, you can remove this constraint.
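Before removing anything, you can check whether such a limit is actually set on the deployment. Here is a sketch using jq (also needed in step 2 below); the deployment name is the one from the outputs shown later in this section, adapt it to yours. A null result means no ephemeral-storage limit is set on the containers:

> oc get deployment dba2202ads-ads-runtime-service -o json | jq '.spec.template.spec.containers[].resources.limits["ephemeral-storage"]'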

1- First, scale down the CP4BA Operator:

> oc scale deployment ibm-cp4a-operator --replicas=0
deployment.apps/ibm-cp4a-operator scaled


2- Then remove the ephemeral-storage constraint using the following command (we assume the jq command is available on your machine):

> oc get deployment dba2202ads-ads-runtime-service -o json | jq 'walk(if type=="object" then del(."ephemeral-storage") else . end)' | oc replace --force -f -
deployment.apps "dba2202ads-ads-runtime-service" deleted
deployment.apps/dba2202ads-ads-runtime-service replaced
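Note that oc replace --force deletes and recreates the deployment, so the pods come back with new names; re-run the command from step 2 of the procedure above to pick up the new pod name before taking the snapshot again:

> oc get pods | grep ads-runtime-service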


When the snapshots are done, scale the CP4BA Operator back up:

> oc scale deployment ibm-cp4a-operator --replicas=1
deployment.apps/ibm-cp4a-operator scaled

Analysis of the Heap Dumps


At the end of the procedure described above, you obtain two heap dump ZIPs that have to be compared to find the growing objects and classes.

To do that, do a diff using the Eclipse heap dump analyzer "Memory Analyzer (MAT)": https://www.eclipse.org/mat/

You will probably need an additional extension (the DTFJ and IBM extensions) to read dumps produced by the IBM JDK: https://www.ibm.com/support/pages/eclipse-memory-analyzer-tool-dtfj-and-ibm-extensions
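If you prefer to script a first pass of the analysis, MAT also ships a command-line parser that can pre-generate a "leak suspects" report for each dump. A minimal sketch, assuming the ZIP contains an IBM heapdump.*.phd file, that MAT with the DTFJ extension is installed locally, and that you run it from the MAT installation directory:

> unzip defaultServer.dump-22.10.25_15.22.29.zip -d dump1
> ./ParseHeapDump.sh dump1/heapdump.*.phd org.eclipse.mat.api:suspects

The comparison of the two snapshots itself (for example with the histogram's "Compare to another Heap Dump" action) is then done in the MAT UI.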

In our case, the comparison reveals that the DB2 driver is involved (com.ibm.db2.*) and that the crypto layer (sun.security.*, com.ibm.crypto.* and javax.crypto.*) is part of the issue as well.


You can either track down the corresponding issue in your custom code, look for a possible known cause on the web, such as https://www.ibm.com/support/pages/apar/IJ39517, or involve the team in charge of these packages (ODM team, DB2 team or JDK team).



#CP4BA
#OperationalDecisionManager(ODM)
#performance
