Problem Determination guidance for ACE in containers

By AMAR SHAH posted Thu March 16, 2023 08:39 AM

This article provides general guidelines for problem determination of IBM App Connect Enterprise (ACE) running in containers. It applies to ACE running in Cloud Pak for Integration (CP4I) or in any supported Kubernetes environment.

For each commonly reported symptom, we list the potential reasons, describe what to look at to determine the cause, and suggest possible actions to mitigate the situation or to perform advanced diagnostics.

We will discuss the following symptoms:

  • Low message rate/throughput
  • High CPU usage by Integration Server (IS) pods
  • High memory usage by IS pods, or pod restarts due to OOMKilled errors
  • Slow startup of IS pods/containers

Symptom: Low message throughput

  • Potential reason: Inadequate allocation of CPU limits to the pod.
    How to determine the cause: Check OpenShift metrics / pod metrics for CPU utilization.
    Possible actions: If CPU utilization is near the upper limit, consider allocating more CPU cores to the pod (a CR sketch follows this table).

  • Potential reason: Poorly designed message flows.
    How to determine the cause: Collect message flow accounting and statistics (Acc & Stats) and look for potential hotspots in the flow and node statistics.
    Possible actions: Refer to IBM Docs for code design tips: https://www.ibm.com/docs/en/app-connect/12.0?topic=performance-code-design

  • Potential reason: Large number of message flows deployed to a single pod.
    How to determine the cause: Look at the number of message flows deployed to the pod.
    Possible actions: Deploy a small number of message flows per container, typically one application or a set of related applications.

  • Potential reason: Inadequate number of pod replicas.
    How to determine the cause: See if container CPU usage is getting close to its limits.
    Possible actions: Increase the number of replicas of the pod for horizontal scaling.

  • Potential reason: Insufficient JVM heap size tuning for Java-based workloads.
    How to determine the cause: Observe the JVM resource statistics to see if there is excessive garbage collection or heap usage approaching the JVM maximum heap size.
    Possible actions: Increase the JVM maxHeapSize if garbage collection is kicking in too often.

  • Potential reason: Inadequate number of additional instances on the BAR file or message flow.
    How to determine the cause: Collect message flow Acc & Stats and check how often the maximum number of instances is hit.
    Possible actions: Tune the additional instances in combination with the other parameters described above.
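
For reference, the following is a minimal sketch of where the CPU and memory requests/limits and the replica count sit in an IntegrationServer custom resource. The name and numbers are placeholders, and the exact schema can vary between operator versions, so treat this as illustrative rather than definitive.

    # Illustrative IntegrationServer CR fragment; names and values are placeholders.
    apiVersion: appconnect.ibm.com/v1beta1
    kind: IntegrationServer
    metadata:
      name: my-integration-server        # placeholder name
    spec:
      replicas: 2                        # horizontal scaling: number of IS pods
      pod:
        containers:
          runtime:
            resources:
              requests:
                cpu: "1"                 # CPU guaranteed to the runtime container
                memory: 512Mi
              limits:
                cpu: "2"                 # raise if pod CPU usage sits near this ceiling
                memory: 1Gi              # raise if the pod is OOMKilled at this ceiling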



Symptom: High CPU usage

  • Potential reason: Large number of message flows deployed to a single pod.
    How to determine the cause: Check OpenShift metrics / pod metrics.
    Possible actions: Allocate more CPU to the ACE containers. IS CR values to tune: spec.pod.containers.runtime.resources.limits.cpu and spec.pod.containers.runtime.resources.requests.cpu.

  • Potential reason: Poorly designed message flows.
    How to determine the cause: Collect Acc & Stats (flow statistics and node statistics) and check CPU metrics such as average CPU time and total CPU time (a server.conf.yaml sketch for enabling these statistics follows this table).
    Possible actions: a) Ensure the message flow conforms to coding best practices: https://www.ibm.com/docs/en/app-connect/12.0?topic=performance-code-design  b) Identify the hotspot flows and nodes from the Acc & Stats data and investigate them further, checking coding practices, message sizes, and so on.
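
The message flow accounting and statistics (Acc & Stats) referred to above can be enabled through the integration server's server.conf.yaml (in CP4I this is typically supplied as a server.conf.yaml configuration object). The sketch below assumes snapshot statistics in JSON format; verify the property names and values against the server.conf.yaml shipped with your ACE version.

    # Illustrative server.conf.yaml fragment for enabling statistics;
    # check the property names against your ACE version's server.conf.yaml.
    Statistics:
      Resource:
        reportingOn: true            # JVM and resource-level statistics
      Snapshot:
        publicationOn: 'active'      # turn on snapshot accounting and statistics
        outputFormat: 'json'         # emit the data in JSON format
        nodeDataLevel: 'basic'       # include per-node statistics to find hotspots
        threadDataLevel: 'none'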



Symptom: High memory usage or OOMKilled

  • Potential reason: Large number of message flows deployed to a single pod.
    How to determine the cause: Observe the number of BAR files and message flows deployed to the pod.
    Possible actions: Allocate more memory to the ACE containers. IS CR values to tune: spec.pod.containers.runtime.resources.limits.memory and spec.pod.containers.runtime.resources.requests.memory.

  • Potential reason: Poorly designed message flows.
    How to determine the cause: Observe the overall size of messages and the complexity of the message flows.
    Possible actions: Ensure the message flow conforms to coding best practices: https://www.ibm.com/docs/en/app-connect/12.0?topic=performance-code-design

  • Potential reason: Native memory leak.
    How to determine the cause: Memory usage of the pod continues to grow over time for the same workload.
    Possible actions: Try isolating the problem to a specific message flow or application. Try reproducing the problem independently in an on-premises environment, if available, and collect the memory leak mustgather, or contact IBM Support.

  • Potential reason: Java memory leak.
    How to determine the cause: Check whether there are any Java OOM errors or javacores.
    Possible actions: Increase the Java maxHeapSize by 25-50% and see whether that stops the Java OOM errors; the heap size might need to be tuned over multiple iterations (see the server.conf.yaml sketch after this table).
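
Where the JVM heap needs to be tuned (for Java-based workloads or suspected Java memory pressure), the integration server's JVM settings live under ResourceManagers in server.conf.yaml. The values below are placeholders, and the sizes are assumed to be in bytes; double-check the comments in your own server.conf.yaml before applying them.

    # Illustrative server.conf.yaml fragment for JVM heap tuning;
    # values are placeholders and assumed to be in bytes.
    ResourceManagers:
      JVM:
        jvmMinHeapSize: 268435456    # 256 MB initial heap
        jvmMaxHeapSize: 536870912    # 512 MB maximum heap; increase in 25-50% steps if Java OOM persists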



Symptom: Slow startup of containers

  • Potential reason: Inadequate allocation of CPU limits to the pod.
    How to determine the cause: Pod restarts due to liveness/readiness check failures.
    Possible actions: Increase the CPU requests/limits for the container.

  • Potential reason: Large number of message flows or BAR files deployed to a single pod.
    How to determine the cause: Check the number of BAR files allocated to the integration server, or the number of message flows in a single BAR file.
    Possible actions: Consider splitting the flows across multiple integration servers.

  • Potential reason: Un-optimized server (i.e. ibmint optimize server was not run, typically in a custom image).
    How to determine the cause: Observe the pod console log.
    Possible actions: The ACE certified container (ACEcc) images run ibmint optimize server by default, but if you build your own image, ensure that ibmint optimize server is run just before server startup (see the sketch after this table).
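
If you deploy a custom-built runtime image, the sketch below shows, hypothetically, how such an image might be referenced from the IntegrationServer CR; the image name is a placeholder and the field path should be confirmed against your operator's reference documentation. The key point from the table above is that the custom image build itself should run ibmint optimize server, so the optimization cost is not paid on every pod startup.

    # Illustrative fragment: pointing the runtime container at a custom image.
    # The image is a placeholder and is assumed to have been built with
    # "ibmint optimize server" run as its final build step.
    spec:
      pod:
        containers:
          runtime:
            image: image-registry.example.com/ace/my-custom-ace:1.0.0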
