
How to investigate a memory leak with CP4BA 23.0.1?

By NICOLAS PEULVAST posted Tue October 25, 2022 12:10 PM

  
Updated on 9 January 2024 for the 23.0.2 release

Target audience: 
CP4BA user with Cluster Administrator role 
Estimated duration: 30 minutes


A downloadable version of this Blog Post is available here.

The idea for this blog post comes from a memory leak that occurred on the pre-production environment of our ODM on Cloud offering on 10 May 2022.

The goal is to propose a way to gain some insight into the possible root cause of a memory leak occurring in the Cloud Pak for Business Automation.

In this article, we focus on the ADS Runtime Pod, but this solution works for all Pods based on the official Liberty base image.

Tip: to find out whether a Pod of the Cloud Pak uses the Liberty base image, you can run the following commands with your own pod name:

> oc login [...]
> oc project 
> oc exec -ti dba2302ads-ads-rest-api-6757765dd7-sf6w8 -- [ -d "/opt/ibm/wlp/usr/servers" ] && echo "yey" || echo "nope"

If yey appears, then you can continue this article.

If nope appears, then this article will not work for you ... but you can read it anyway! ;)
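To run the same check on several Pods at once, the test can be wrapped in a small loop. This is only a sketch: the function name check_liberty_pods is made up for illustration, and it assumes that you are already logged in and in the right project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads pod names on stdin, one per line, and prints
# "<pod> yey" or "<pod> nope" depending on whether the Liberty servers
# directory exists inside the pod.
check_liberty_pods() {
  while IFS= read -r pod; do
    [ -n "$pod" ] || continue
    # </dev/null keeps oc exec from swallowing the rest of the pod list.
    if oc exec "$pod" -- test -d /opt/ibm/wlp/usr/servers </dev/null >/dev/null 2>&1; then
      echo "$pod yey"
    else
      echo "$pod nope"
    fi
  done
}

# Possible usage:
# oc get pods -o name --field-selector=status.phase=Running | sed 's|^pod/||' | check_liberty_pods
```

A Pod that cannot be reached at all is also reported as nope, so a nope is worth double-checking by hand.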

Context

The visible evidence of a memory leak is memory consumption that keeps increasing over hours or days, sometimes very slowly.


In our case, there was no special activity registered on the Pre-Production environment during the leak.

Procedure for the Heap Dump


As the memory consumption increases steadily, take two snapshots at least 3 hours apart, and further apart if possible. You do not have to reach an OutOfMemory error on your pod during this timeframe.

1- First, log in to the cluster using the oc login command.

2- Get the name of the Pod that exhibits the memory leak.

> oc get pods | grep ads-runtime-service
dba2302ads-ads-runtime-service-8698b98f78-6vwkw              1/1     Running         0 (8d ago)       13d
dba2302ads-ads-runtime-service-8698b98f78-gz4k6              1/1     Running         0 (8d ago)       13d

 
Use the name of your own Pod in the commands that follow.

3- Then, do a snapshot of your Heap using the following command line:

 

> oc exec -ti dba2302ads-ads-runtime-service-8698b98f78-6vwkw -- /opt/ibm/wlp/bin/server dump --include=heap
Defaulted container "parsing-service" out of: parsing-service, tls-init (init)

Dumping server defaultServer.
Server defaultServer dump complete in /opt/ibm/wlp/output/defaultServer/defaultServer.dump-24.01.09_16.35.01.zip.

If you do not specify a server name, defaultServer is used. In some cases, however, you must specify the name of the server explicitly, as with ads-runtime here:

 

> oc exec -ti dba2302ads-ads-runtime-service-8698b98f78-6vwkw -- /opt/ibm/wlp/bin/server dump ads-runtime --include=heap
Defaulted container "runtime-service" out of: runtime-service, tls-init (init), folder-prepare-container-ads (init)
Dumping server ads-runtime.
Server ads-runtime dump complete in /opt/ibm/wlp/output/ads-runtime/ads-runtime.dump-24.01.09_16.35.01.zip.


4- Copy the generated file from the pod to your local disk to ease the analysis.

> oc cp dba2302ads-ads-runtime-service-8698b98f78-6vwkw:/opt/ibm/wlp/output/defaultServer/defaultServer.dump-24.01.09_16.35.01.zip defaultServer.dump-24.01.09_16.35.01.zip
Defaulted container "runtime-service" out of: runtime-service, tls-init (init)
tar: Removing leading `/' from member names

or 

> oc cp dba2302ads-ads-runtime-service-8698b98f78-6vwkw:/opt/ibm/wlp/output/ads-runtime/ads-runtime.dump-24.01.09_16.35.01.zip ads-runtime.dump-24.01.09_16.35.01.zip
Defaulted container "runtime-service" out of: runtime-service, tls-init (init), folder-prepare-container-ads (init)
tar: Removing leading `/' from member names


5- Finally, remove this large zip file from the Pod, so that the Pod does not use too much disk storage and risk being evicted from the cluster.

> oc exec -ti dba2302ads-ads-runtime-service-8698b98f78-6vwkw -- rm /opt/ibm/wlp/output/defaultServer/defaultServer.dump-24.01.09_16.35.01.zip
Defaulted container "runtime-service" out of: runtime-service, tls-init (init)

 or

> oc exec -ti dba2302ads-ads-runtime-service-8698b98f78-6vwkw -- rm /opt/ibm/wlp/output/ads-runtime/ads-runtime.dump-24.01.09_16.35.01.zip
Defaulted container "runtime-service" out of: runtime-service, tls-init (init), folder-prepare-container-ads (init)
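Steps 3 to 5 can be chained in a small script so that the archive never stays on the Pod longer than needed. The following is a sketch rather than an official tool: the function names extract_dump_path and snapshot_heap are made up for illustration, and the parsing assumes that the server command reports the archive in the "dump complete in ... .zip." form shown in the outputs above.

```shell
#!/usr/bin/env bash
# Extract the archive path from the "dump complete in <path>.zip." line
# printed by the Liberty server dump command (read on stdin).
extract_dump_path() {
  sed -n 's/^.*dump complete in \(.*\.zip\)\.[[:space:]]*$/\1/p'
}

# Hypothetical wrapper: take the dump, copy it locally, then delete it on the pod.
#   $1 = pod name
#   $2 = Liberty server name (defaultServer when omitted)
snapshot_heap() {
  local pod="$1" server="${2:-defaultServer}" remote
  remote="$(oc exec "$pod" -- /opt/ibm/wlp/bin/server dump "$server" --include=heap | extract_dump_path)"
  [ -n "$remote" ] || { echo "no dump archive reported for $pod" >&2; return 1; }
  # Delete the remote archive only if the copy succeeded.
  oc cp "$pod:$remote" "./$(basename "$remote")" || return 1
  oc exec "$pod" -- rm "$remote"
}

# Possible usage, with the pod and server names from this article:
# snapshot_heap dba2302ads-ads-runtime-service-8698b98f78-6vwkw ads-runtime
```

Calling the wrapper twice, a few hours apart, produces the two archives needed for the comparison.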

Tip for Ephemeral Storage


When taking the snapshot, be careful to stay within the ephemeral storage limits defined on the Pod, if any.

If your Pod is evicted during the snapshot, you can remove this constraint.

1- First, scale down the CP4BA Operator:

> oc scale deployment ibm-cp4a-operator --replicas=0
deployment.apps/ibm-cp4a-operator scaled

And, if needed, all sub-operators as well.

> oc scale deployment ibm-ads-operator --replicas=0
deployment.apps/ibm-ads-operator scaled

2- Then remove the ephemeral storage constraint using the following command (we assume that the jq command is available on your machine):

> oc get deployment dba2302ads-ads-runtime-service -o json | jq 'walk(if type=="object" then del(."ephemeral-storage") else . end)' | oc replace --force -f -
deployment.apps "dba2302ads-ads-runtime-service" deleted
deployment.apps/dba2302ads-ads-runtime-service replaced

When the snapshots are done, you must scale the CP4BA Operator back up.

> oc scale deployment ibm-cp4a-operator --replicas=1
deployment.apps/ibm-cp4a-operator scaled

And, if needed, all sub-operators as well.

> oc scale deployment ibm-ads-operator --replicas=1
deployment.apps/ibm-ads-operator scaled
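Because oc replace --force deletes and recreates the deployment, it can be reassuring to preview what the jq filter removes before applying it. The sketch below wraps the same filter in a function; the name strip_ephemeral_storage and the sample values in the comment are made up for illustration.

```shell
#!/usr/bin/env bash
# Same jq filter as in step 2: drop every "ephemeral-storage" key,
# at any depth of the JSON document read on stdin.
strip_ephemeral_storage() {
  jq 'walk(if type=="object" then del(."ephemeral-storage") else . end)'
}

# Possible preview on a made-up fragment (illustrative values only):
# echo '{"limits":{"cpu":"1","ephemeral-storage":"1Gi"}}' | strip_ephemeral_storage
#
# Possible preview on the real deployment, without changing anything:
# oc get deployment dba2302ads-ads-runtime-service -o json | strip_ephemeral_storage | less
```

The filter removes the key from both requests and limits, and leaves every other resource setting untouched.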

Analysis of the Heap Dumps


At the end of the procedure described above, you have two zip archives containing heap dumps, which must be compared to find the growing objects and classes.

To do that, do a diff using the Eclipse heap dump analyzer named "Memory Analyzer (MAT)": https://www.eclipse.org/mat/

You will probably need an additional extension for the IBM JDK: https://www.ibm.com/support/pages/eclipse-memory-analyzer-tool-dtfj-and-ibm-extensions
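Before opening the dumps in MAT, each archive has to be unpacked and the heap dump file located inside it. The following sketch assumes that the unzip command is available locally and that heap dumps use the .phd extension (IBM JDK) or .hprof (HotSpot-based JDK). The function name unpack_dump is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack one server dump archive into a directory named
# after it, then print the heap dump files found inside.
#   $1 = path to the archive, for example ads-runtime.dump-24.01.09_16.35.01.zip
unpack_dump() {
  local zip="$1" dir="${1%.zip}"
  mkdir -p "$dir" || return 1
  unzip -q -o "$zip" -d "$dir" || return 1
  # .phd is assumed for the IBM JDK, .hprof for a HotSpot-based JDK.
  find "$dir" -type f \( -name '*.phd' -o -name '*.hprof' \)
}

# Possible usage, once per snapshot:
# unpack_dump ads-runtime.dump-24.01.09_16.35.01.zip
```

Keeping each snapshot in its own directory makes it easier to pick the right pair of files when comparing them in MAT.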

In our case, the difference reveals the following Class Names:


In this case, you can see that the DB2 driver is responsible (com.ibm.db2.*) and that the crypto layer (sun.security.*, com.ibm.crypto.* and javax.crypto.*) is also involved in the issue.


You can then either find the corresponding issue in your custom code, look for a possible cause on the web, such as
https://www.ibm.com/support/pages/apar/IJ39517, or involve the team in charge of these packages (ODM team, DB2 team or JDK team).

