Introducing monitoring for App Connect Integration Runtimes on our managed service

By Adam Roberts posted Tue March 19, 2024 07:00 AM

  

Running out of resources is never fun – that’s why we’ve built an in-product monitoring solution to help customers manage their Integration Runtimes. 
This new feature provides an enhanced visual experience which can be configured to display what is relevant to you (be it tracking one runtime's resource usage over time, or gaining a high-level overview of what's happening inside your instance).

You may be wondering why we didn’t expose a view into an existing solution. I will touch on this later, and I’ll also answer some of the questions you might have. One benefit of building this into the product is that you do not require any additional services.

Let’s get straight into it and see what you can now do. You can also check out our full IBM App Connect Monitoring Resources documentation which will be kept up to date should anything change.

Prerequisites

An IBM App Connect Enterprise as a Service VPC plan instance is required. This feature is available for both trial and paid customers.

Getting started

Through our handy Monitoring page (currently available from the left nav bar), you are presented with the All Runtimes page. This page provides an insight into what your biggest and smallest consumers of CPU and memory are.

This is a test environment where I've created a variety of different integration runtimes with a script.

Monitoring page summary - a variety of runtimes, nothing major standing out

We’ll show you the top five consumers of CPU and memory by default. Usage is measured on a “per container” basis, because if any one container is nearing its limits, you’ll want to know. Scrolling down reveals the table, which includes every integration runtime for this instance.

Note that the table is sortable by clicking on the headers so you can easily identify your resources by name or maximum resource usage. 
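To make the "top five, per container" idea concrete, here is a minimal sketch of that kind of ranking. It is not how the Monitoring page is implemented: the `ContainerUsage` type, the `top_consumers` function, the runtime names and the percentages are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerUsage:
    runtime: str            # integration runtime name (hypothetical)
    container: str          # e.g. "runtime", "designerflows"
    max_cpu_percent: float  # peak CPU as a percentage of that container's limit


def top_consumers(rows: list[ContainerUsage], n: int = 5) -> list[ContainerUsage]:
    """Return the n containers with the highest peak CPU percentage."""
    return sorted(rows, key=lambda r: r.max_cpu_percent, reverse=True)[:n]


# Invented sample data: one row per container, not per runtime.
rows = [
    ContainerUsage("runtime-a", "runtime", 12.0),
    ContainerUsage("runtime-a", "designerflows", 35.5),
    ContainerUsage("runtime-b", "runtime", 8.0),
    ContainerUsage("runtime-c", "runtime", 61.0),
    ContainerUsage("runtime-c", "designereventflows", 104.0),
    ContainerUsage("runtime-d", "runtime", 3.0),
]

for row in top_consumers(rows):
    print(f"{row.runtime}/{row.container}: {row.max_cpu_percent}%")
```

Ranking by container rather than by runtime means a single busy container is not hidden behind quieter containers in the same runtime.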

Monitoring page summary - a variety of runtimes, nothing major standing out, scroll for table

We can click on a particular integration runtime (let's say it's the "awesome-runtime"), to view CPU and memory statistics over the selected time period.

Awesome runtime starting up

We can see the integration runtime starting up and then becoming idle and steady. All good so far.

However - from that first example, a particular integration runtime was reported to be using over 100% of the CPU. How can this be?

In the next example, I've logged onto a different instance where a similar situation is observed and will explore this further.

App Connect monitoring UI showing Runtimes (CPU and memory summary)

Let's examine what we see on this screenshot and identify what's happening that appears odd, in case you didn't spot it already.

  • There are at least five integration runtimes running in this instance (as our "highest five" view is telling us)
  • There's a mixture of flows being run - they all involve the "runtime" container, some involve the "designerflows" container, and one involves a "designereventflows" container.
  • The flow running a designereventflows container has exceeded its CPU limits, but its memory amount is OK. What's going on here?

The designereventflows container was created with a 500 millicores CPU limit and a 512 MB memory limit. It's important to keep this in mind (and you can determine this from the UI by hovering over its bar in the chart).
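As a rough illustration of how a figure above 100% can come about, the sketch below expresses measured usage as a percentage of the container's limit. The 500 millicores limit comes from the example above. The 600 millicores reading and the function name `cpu_percent_of_limit` are assumptions for illustration only.

```python
def cpu_percent_of_limit(usage_millicores: int, limit_millicores: int) -> float:
    """Express measured CPU usage as a percentage of the container's CPU limit."""
    if limit_millicores <= 0:
        raise ValueError("limit_millicores must be positive")
    return 100 * usage_millicores / limit_millicores


# Hypothetical reading of 600 millicores against the 500 millicores limit.
print(cpu_percent_of_limit(600, 500))
# An idle reading of 50 millicores against the same limit.
print(cpu_percent_of_limit(50, 500))
```

Any sample that lands above the limit, even briefly, pushes the percentage past 100.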

If we scroll down to look at the table view, we can again see the same information but in a different format.

App Connect monitoring UI showing Runtimes (CPU and memory summary) - scroll down to reveal table

Each row is clickable, and we provide a per container breakdown so you can gain an insight into what’s been happening in that container over your selected time period.

Let's do that, focusing on the container that has exceeded its maximum CPU usage percentage.

App Connect monitoring UI showing Runtimes (CPU and memory summary) - table view

We can see that the runtime "batch-ctip-appc-p-vir" has exceeded its maximum CPU usage (and we noticed that from the chart).

What might this flow be doing? The average core amount isn't especially high so it must be relatively idle most of the time.

By clicking on its hyperlinked name, we are taken to the single runtime view.

App Connect monitoring UI showing a single runtime, batch IR, all containers

We can gain useful insights from this immediately. We know the "designereventflows" container seems to be particularly interesting from the CPU chart and we can see it is using 0.5 cores frequently.

Is there enough data here to form a pattern? The memory usage looks OK, too. We can at least see that it's relatively idle most of the time.

If we click on the "designereventflows" icon under the chart, we can focus specifically on this particularly interesting container:

App Connect monitoring UI showing a single runtime, batch IR, designereventflows container only

Suppose we look at this data over a three-hour period instead?

App Connect monitoring UI showing a single runtime, batch IR, designereventflows container only - 3h

Now we can see a pattern emerging. We can see from this view that every 30 minutes, this flow runs and exceeds its CPU usage.

Note that currently what's shown in the summary on the side refers to all of the containers' limits added up.
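If you had the raw samples to hand, a recurring peak like the 30-minute one above could be confirmed with a few lines of code. The sketch below uses invented data (three hours of one-minute samples with a short burst every 30 minutes) and hypothetical helper names. There is currently no API to export this data, so treat it as a way of reasoning about the chart rather than something to run against the service.

```python
def spike_starts(samples: list[tuple[int, float]], limit_cores: float) -> list[int]:
    """Return the minute at which usage first reaches the limit in each burst."""
    starts = []
    above = False
    for minute, cores in samples:
        if cores >= limit_cores and not above:
            starts.append(minute)
        above = cores >= limit_cores
    return starts


def spike_intervals(starts: list[int]) -> list[int]:
    """Return the gaps, in minutes, between consecutive bursts."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Invented data: idle at 0.05 cores, with a two-minute burst at 0.55 cores
# at the start of every half hour, sampled once a minute for three hours.
samples = [(m, 0.55 if m % 30 < 2 else 0.05) for m in range(180)]

starts = spike_starts(samples, limit_cores=0.5)
print(starts)
print(spike_intervals(starts))
```

If every interval comes out the same, you are almost certainly looking at a scheduled job rather than random load.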

What's going on?

Your Integration Runtimes consist of one or more containers. We run these on OpenShift, which is built on Kubernetes. In Kubernetes, a container that reaches its memory limit is terminated with an Out of Memory event. CPU is treated differently: a container is not terminated for using too much CPU, and its usage can exceed the limit before Kubernetes "throttles" it. For this particular scenario, we therefore suggest allocating more resources to this container. It runs a batch process that retrieves 1000 documents from Cloudant every 30 minutes as part of a continuous test on production, and the integration runtime has conveniently been named to indicate that, too.
The resizing can be achieved through the App Connect user interface or by using the public API which I have blogged about previously. In this case, it is recommended to increase the CPU amount to just over the limit being reached: 0.75 CPU (or 750 millicores) should prevent the throttling. It is important that you size your runtimes based on their peaks and not their averages.
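Here is a small sketch of what "size on peaks, not averages" can look like in practice. The 25% headroom, the rounding to steps of 250 millicores, the sample values and the function name are all my own assumptions for illustration. They are not a sizing rule that App Connect applies.

```python
import math


def suggested_cpu_limit(samples_millicores: list[int],
                        headroom: float = 1.25,
                        step: int = 250) -> int:
    """Suggest a CPU limit: the observed peak plus headroom, rounded up to a step."""
    peak = max(samples_millicores)
    return math.ceil(peak * headroom / step) * step


# Invented samples: mostly idle, with short bursts just over 500 millicores.
samples = [50, 40, 550, 45, 50, 60, 540, 55]

average = sum(samples) / len(samples)
print(f"average: {average:.0f} millicores")
print(f"peak:    {max(samples)} millicores")
print(f"suggest: {suggested_cpu_limit(samples)} millicores")
```

With these numbers the average sits well below the existing 500 millicores limit, which is exactly why sizing on the average would miss the problem. Sizing from the peak lands on 750 millicores.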
Without this view, users of the App Connect instance would have no way to see historic activity for their Integration Runtimes. We simply wouldn't know there's a problem without checking our internal metrics, which users cannot see for themselves (there is a metric for throttling, which we used to validate this theory).

What else could this be used for, apart from "rightsizing"?

  1. Problem determination - is there a memory leak with certain flows that are running?
  2. Analysing footprint - you can see how much memory and CPU is required when running your flows so you have a rough estimate of how much this may eventually cost.
    Note that this information should not be used for precise billing calculations as we do perform optimisations when we render your charts and retrieve the data - we do not process every second of data, for example, when looking at a large time period.
  3. Identifying "laggards" - which runtimes do you have that you think really should be doing work but are actually idle?
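For the problem determination case, one simple way to think about a possible memory leak is to look at the trend of memory usage over time. The sketch below fits a straight line through invented hourly samples and reports the slope. The function name and the data are hypothetical, and a steady upward slope is only a hint to investigate, not proof of a leak.

```python
def slope_mb_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of memory (MB) against time (hours)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("need samples at two or more distinct times")
    return covariance / variance


# Invented data: (hours since start, memory in MB).
steady = [(h, 200.0) for h in range(6)]
growing = [(h, 200.0 + 15.0 * h) for h in range(6)]

print(slope_mb_per_hour(steady))   # flat usage gives a slope of zero
print(slope_mb_per_hour(growing))  # usage that climbs every hour gives a positive slope
```

A container whose memory keeps climbing while the workload stays the same is worth a closer look before it reaches its memory limit.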

Frequently asked questions

Is there an API for this? And how can I export my data?

No, but we understand that may be useful to you - let us know what you think and what format of data you would be expecting.

Can I rely on this for my billing data?

Check our in-product disclaimer and official docs. We carry out various amounts of rounding of figures and data aggregation to give you a useful user experience – we do not plot every single data point over a large time period for example.

What data “aggregation” do you do? I don’t see thousands of points over a long time period, so do you “smooth things out”?

The longer the time period, the bigger the step interval will be between the points. We'll process and display less data when looking at six hours compared to one hour for example. This is to avoid displaying potentially thousands of data points on what can be a small chart.
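To illustrate the idea of a step interval that grows with the time period, here is a minimal sketch. The cap of 120 points, the 15-second minimum step, and the choice to keep the maximum value in each bucket are assumptions I've picked for the example. They are not the values or the aggregation the service actually uses.

```python
import math


def step_seconds(window_seconds: int, max_points: int = 120, min_step: int = 15) -> int:
    """Pick a step, in multiples of min_step, that keeps the chart within max_points."""
    raw = window_seconds / max_points
    return max(min_step, math.ceil(raw / min_step) * min_step)


def downsample_max(samples: list[tuple[int, float]], step: int) -> list[tuple[int, float]]:
    """Group (timestamp, value) samples into step-sized buckets, keeping each bucket's maximum."""
    buckets: dict[int, float] = {}
    for timestamp, value in samples:
        bucket = timestamp - timestamp % step
        buckets[bucket] = max(value, buckets.get(bucket, value))
    return sorted(buckets.items())


one_hour = step_seconds(60 * 60)
six_hours = step_seconds(6 * 60 * 60)
print(one_hour, six_hours)

# Invented samples at 0s, 10s and 30s, grouped with the one-hour step.
print(downsample_max([(0, 1.0), (10, 3.0), (30, 2.0)], step=one_hour))
```

The longer window gets a coarser step, so each plotted point stands for more raw data. This is also why the charts are not suitable for precise billing calculations.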

Can I provide a custom time range?

Not yet, but given that we do not display every single data point over a range, this would be useful so that you can “drill down” into a particular timeframe with a high level of detail. Let us know if you would find this useful.

Why can I go back one day and not any longer?

Our systems currently persist this monitoring data for one day only. If you would find it useful to go back further (how much further?), please let us know.

Could I use Instana, Grafana, or any other monitoring solution instead in this way?

As this feature is available through our SaaS offering, we have our own Grafana dashboards which allow us to solve customer problems. However, those are for App Connect Engineers and Support to use. Instead, we’ve provided a solution that is specific to you, and we’ve added our own “in-product” quirks so that it feels more like an “App Connect native” solution instead of one that’s “off the shelf”. This lets us display only what's relevant for you.

What about data for my running integrations?

We’d like to be able to provide this data to you as well, so watch this space. What metrics would you like to know about? We think “per node latency”, “per node invocation count”, and then "per flow latency" and "per flow invocation count" metrics would be of use.

Closing thoughts

Let us know what you think, whether you found any of this useful, and what you would be interested in seeing next.

There are many ways we could have implemented this, but we felt that having an in-product feature is the most beneficial experience for you. 

A special thanks to the IBM Hursley Design team and the App Connect Development team for their contribution to delivering our first in-product historic monitoring and observability solution for IBM App Connect Enterprise as a Service.

Need more proof or simply excited to get started? Try the service today by signing up to the free trial and see for yourself on the Amazon Marketplace. You can also submit your own ideas and suggestions via our ideas.ibm.com page.
