Cloud Pak for Business Automation

Introduction to CPU and Memory Profiling in BTS-Operator (Go-based Operators)

By Benjamin Wende posted yesterday

  


Introduction to CPU and Memory Profiling in Go-Based Operators

Profiling is a key technique for understanding how your software consumes system resources, especially CPU and memory. In Kubernetes environments, operators like the IBM Business Teams Service (BTS) Operator, written in Go, manage application lifecycles and can themselves become performance bottlenecks if not properly tuned.

Go provides built-in support for profiling through the net/http/pprof package. This tool allows developers and operators to collect data on CPU usage, memory allocations, goroutine states, and more. Profiling helps diagnose performance issues, detect memory leaks, and optimize resource usage. When enabled, pprof exposes an HTTP server that can be queried to retrieve profiling information at runtime.

This blog post provides a hands-on walkthrough for enabling, collecting, and analyzing profiling data for the BTS-Operator and similar Go-based operators, following guidance from IBM's documentation and Go's official tooling.

Enabling Profiling in the BTS-Operator

To start profiling, the operator must be explicitly configured to expose its profiling endpoints. This is done with an environment variable on the operator deployment.

Steps to Enable Profiling:

Edit the BTS-Operator subscription resource and add the environment variable ENABLE_PROFILING_SERVER:

$ oc edit subscription ibm-bts-operator

Add the environment variable to the spec.config section:

...
spec:
  config:
    env:
    - name: ENABLE_PROFILING_SERVER
      value: "true"

Save the changes. This triggers the creation of a new BTS-Operator pod with the profiling endpoints enabled. The process is also documented in the BTS Knowledge Center [1].

The profiling endpoints are now available on the BTS-Operator pod, but they still have to be made accessible from outside the cluster. An easy way to do this is port forwarding. Use the following command to forward the profiling port from the pod to your local machine:

$ oc port-forward pod/ibm-bts-operator-controller-manager-6bcc8d5cfb-45ztt 8082
Forwarding from 127.0.0.1:8082 -> 8082
Forwarding from [::1]:8082 -> 8082

The profiling endpoints are now reachable on port 8082 of localhost on your local machine.

Generating and Collecting Profiling Data

To generate CPU and memory profiles from the running BTS-Operator, the pprof HTTP endpoints must be queried.

Here is an example of creating a CPU and a memory profile using curl:

$ curl -s "http://127.0.0.1:8082/debug/pprof/heap" > ~/heap-profile.out
$ curl -s "http://127.0.0.1:8082/debug/pprof/profile" > ~/cpu-profile.out

Each command retrieves the profiling data from the pprof server and saves it to a local file. The heap endpoint returns an immediate snapshot of the current memory allocations, while the profile endpoint samples CPU usage for 30 seconds by default before responding (the duration can be changed with the ?seconds=N query parameter).

To diagnose memory leaks and Out-Of-Memory (OOM) errors, it makes sense to generate these files periodically and keep them for later analysis: as the heap slowly grows, it is impossible to predict exactly when the OOM error will occur.

The following shell script takes a CPU and a memory dump at a pre-defined interval:

#!/bin/bash

# Check if period and target directory are provided
if [ -z "$1" ] || [ -z "$2" ]; then
  echo "Usage: $0 <period-in-seconds> <target-directory>"
  exit 1
fi

PERIOD="$1"
LOG_DIR="$2"

# Create target directory if it doesn't exist
mkdir -p "$LOG_DIR"

echo "Starting profiling loop with a period of $PERIOD seconds..."
echo "Profiles will be saved to $LOG_DIR"

while true; do
  TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
  echo "Collecting profiles at $TIMESTAMP..."

  curl -s "http://127.0.0.1:8082/debug/pprof/heap" > "$LOG_DIR/heap-profile-$TIMESTAMP.out"
  curl -s "http://127.0.0.1:8082/debug/pprof/profile" > "$LOG_DIR/cpu-profile-$TIMESTAMP.out"

  sleep "$PERIOD"
done

It can be used like this:

$ ./go-profile.sh 60 ~/bts-operator-profile 
Starting profiling loop with a period of 60 seconds...
Profiles will be saved to /Users/bwende/logs/bts-operator-profile
Collecting profiles at 20250627-153532...
Collecting profiles at 20250627-153704...

This takes a CPU and a memory dump every 60 seconds until an issue occurs, for example the pod being OOM-killed. The latest profile data can then be used to analyze the OOM condition. It is also possible to compare dumps with each other (for example with go tool pprof -base, which shows only the difference between two profiles) to see whether certain objects accumulate over time.

Analyzing Profile Data

When optimizing the performance of a Go application, profiling is one of the most powerful tools available. Go makes this especially developer-friendly through its built-in pprof tool, which can visualize CPU and memory (heap) usage in both text-based and graphical formats. This section focuses on the graphical user interface provided by the go tool pprof -http command, which brings the collected data to life through interactive visualizations.

So let's jump right into opening the profiling user interface:

go tool pprof -http=:8080 ~/logs/bts-operator-profile/heap-profile-20250630-083751.out

This command opens a new browser window showing the pprof UI:

pprof Memory Profile

The screenshot shows a heap profile call graph generated by Go's pprof web UI, providing a visual breakdown of memory allocations in the BTS-Operator executable. Each box represents a function, with its size and color intensity indicating how much memory it has allocated — darker red means higher memory usage. At the center, we see that (*ConfigMap).Unmarshal is the top memory consumer, responsible for over 31% of total heap allocations, making it a primary candidate for optimization. The arrows between nodes illustrate function call relationships, showing how memory usage propagates through the call stack. This graph makes it easy to trace high-memory paths and pinpoint the exact sources of inefficient allocations in the code.

pprof CPU Profile

This screenshot shows a CPU profile call graph from the pprof web UI for the BTS-Operator, visualizing how the application consumed processor time during execution. Each box represents a function, with arrows indicating call relationships and labels showing how much CPU time (in milliseconds) was spent in each function. The highlighted box controller.(*ConfigMapReconciler).Reconcile is at the root, showing it initiated most of the observed activity. Functions like runtime.futex, json stateT, and client.(*Client).Do appear as notable contributors to CPU time, indicating possible targets for optimization or deeper inspection. This graphical view makes it easy to identify the hottest code paths, understand execution flow, and optimize CPU-heavy logic within the operator.

pprof Memory Flame Graph

The memory flame graph in Go's pprof web interface is one of the most powerful visual tools for analyzing heap usage and memory allocations in your application. It helps developers quickly identify which functions are responsible for the majority of memory consumption and how allocations flow through the call stack.

Each horizontal bar in the flame graph represents a function, and its width indicates the amount of memory allocated by that function and its callees. The wider the bar, the more memory it consumed. Bars are stacked vertically to show the call hierarchy: functions at the top called those below. This layout gives you a top-down view of how memory usage propagates through the application.

By default, the graph is sorted so that the most memory-intensive call paths appear in the center. You can hover over any bar to see exact byte counts and percentages, and click to zoom into specific branches of the call tree. This makes it easy to drill down into complex allocation paths and isolate problems like memory bloat, excessive allocations, or potential leaks.

An in-depth discussion of (CPU) flame graphs can be found in [2].

Summary

This article introduced CPU and memory profiling for Go-based Kubernetes operators, focusing on the BTS-Operator. It explained how to enable profiling with pprof, how to collect profiling data via HTTP, and how to automate periodic dumps with a shell script to diagnose memory leaks and performance issues. The analysis section demonstrated how the pprof web interface can be used to examine heap call graphs, CPU execution paths, and memory flame graphs to identify and understand optimization opportunities.

References

[1] https://www.ibm.com/docs/en/cloud-paks/foundational-services/4.13.0?topic=service-troubleshooting

[2] https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
