IBM Software Hub performance troubleshooting in action

By Hongwei Jia posted Mon May 19, 2025 02:11 PM

IBM® Software Hub is a cloud-native solution that you can use to install, manage, and monitor IBM solutions on Red Hat® OpenShift® Container Platform (OCP). Numerous Cloud Pak for Data and Watson services can run on top of this platform. Performance troubleshooting for the services running on Software Hub is a challenging task because it involves different layers, including the application layer, the IBM Software Hub layer, the OpenShift cluster layer, and the infrastructure layer. This article walks through a typical real-world case that showcases IBM Software Hub performance troubleshooting in practice.

Background

A client successfully upgraded their IBM® Software Hub and Watson Discovery service from 5.1.0 to 5.1.1. Two days after the upgrade, end users complained that Watson Discovery performance had degraded: pages across the UI were slow to load, taking about 7 seconds on average to open, and sometimes a blank page was displayed. IBM was engaged to help the client resolve the performance problem.

Performance troubleshooting in action

1.Collect and investigate with a HAR file

First, we needed to understand the performance problem in the customer environment. Because the end users complained about UI slowness, reproducing the problem and collecting a HAR file was a good starting point for the investigation. A HAR file records all HTTP requests and responses made by the browser during a web session, including:

  • Request/response headers and bodies
  • DNS, SSL, TCP timings
  • Status codes and errors
  • Redirection chains
  • Load times for each resource

The collected HAR file can be imported into a local web browser (e.g., Chrome) for analysis. When we reviewed the HAR file, we confirmed the UI slowness that the client complained about: in the Network tab, the response times of some Watson Discovery service API calls exceeded 9 seconds, and the response times of repeated calls to the same URL (e.g., the collections API) fluctuated widely.
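Besides reviewing the HAR file interactively in the browser, it can also be scanned with a short script. The following is a minimal Python sketch (not part of any IBM tooling): the file name discovery-ui.har and the 9-second threshold are assumptions for illustration, and the time field is the total elapsed time of each request in milliseconds, as defined by the HAR 1.2 format.

import json

THRESHOLD_MS = 9000  # illustrative threshold; matches the ~9 s slowness we observed

with open("discovery-ui.har", encoding="utf-8") as f:  # assumed export file name
    har = json.load(f)

slow = []
for entry in har["log"]["entries"]:
    total_ms = entry["time"]  # total elapsed time of the request, in milliseconds (HAR 1.2)
    if total_ms > THRESHOLD_MS:
        slow.append((total_ms, entry["request"]["method"], entry["request"]["url"]))

# Print the slowest requests first.
for total_ms, method, url in sorted(slow, reverse=True):
    print(f"{total_ms / 1000:7.2f} s  {method:4s} {url}")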

2.Investigate from the application layer

With the slow API calls identified through the HAR file analysis, we investigated the logs of the corresponding API calls at the application layer (in this article, the application is the Watson Discovery service) and made comparisons like the one below.

1). From HAR:

Time taken: 13.89 s

Request URL: https://cpd-cpd.apps.xxx.com/v2/projects/ae04a336-964e-47f6-902e-8bf2bc1ffde3/collections
date: Thu, 01 May 2025 16:47:00 GMT
x-global-transaction-id: TOOLING-qwg27zogzwk2

2). From wd-discovery-gateway logs:

Time taken: < 1s

API_ENTRY :

wd-discovery-gateway-866f886d9-zz988_api.log:2025-05-01T16:47:00.893136836Z {"@timestamp":"2025-05-01T16:47:00.892Z","@version":"1","message":"API_ENTRY GET /v2/projects/ae04a336-964e-47f6-902e-8bf2bc1ffde3/collections Map(http-query-params -> {version=[2023-03-31]}, user-agent -> Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36, fe-request-body-length -> 0)","logger_name":"c.i.w.r.f.r.LogRequests","thread_name":"Default Executor-thread-70","level":"INFO","level_value":20000,"txn-id":"TOOLING-qwg27zogzwk2","start-time":"1746118020892","payload-txn-id":"gateway-generated-26ca577c-2470-459e-9fdd-50f2cc98b980","req-start-time":"1746118020892","tenant-id":"00000000-0000-0000-1727-200308481569","plan-id":"cec95e99-75b8-4e2f-a176-8687f31597fd","http-method":"GET","region-id":"us-south","org-id":"ba4ab788-68a9-492b-87da-9179cb1e6541","http-path":"/v2/projects/ae04a336-964e-47f6-902e-8bf2bc1ffde3/collections"}

API_EXIT :

wd-discovery-gateway-866f886d9-zz988_api.log:2025-05-01T16:47:00.916507856Z {"@timestamp":"2025-05-01T16:47:00.916Z","@version":"1","message":"API_EXIT 200","logger_name":"c.i.w.r.f.r.LogResponses","thread_name":"Default Executor-thread-70","level":"INFO","level_value":20000,"txn-id":"TOOLING-qwg27zogzwk2","api-version":"2023-03-31","start-time":"1746118020892","payload-txn-id":"gateway-generated-26ca577c-2470-459e-9fdd-50f2cc98b980","req-start-time":"1746118020892","tenant-id":"00000000-0000-0000-1727-200308481569","plan-id":"cec95e99-75b8-4e2f-a176-8687f31597fd","http-method":"GET","region-id":"us-south","org-id":"ba4ab788-68a9-492b-87da-9179cb1e6541","http-path":"/v2/projects/ae04a336-964e-47f6-902e-8bf2bc1ffde3/collections","req-elapsed-ms":24,"http-status":200}

As we can see, there is a big difference between the response time recorded at the application layer and the response time of the same call as seen from the UI in the HAR file (13.89 s vs. less than 1 s; the gateway reported a req-elapsed-ms of 24). We made similar comparisons for several other API calls and found equally large differences.
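To repeat this comparison across many transactions without grepping by hand, the server-side elapsed time can be extracted from the wd-discovery-gateway API_EXIT entries with a short script. The following is a minimal sketch, assuming the pod logs were exported to a local file named gateway_api.log (the file name is an assumption):

import json
import re

# Each exported line looks like "<pod log file>:<timestamp> {json payload}",
# so we parse everything from the first "{" onward.
JSON_START = re.compile(r"\{")

def parse_payload(line):
    match = JSON_START.search(line)
    if not match:
        return None
    try:
        return json.loads(line[match.start():])
    except json.JSONDecodeError:
        return None

with open("gateway_api.log", encoding="utf-8") as f:  # assumed export file name
    for line in f:
        payload = parse_payload(line)
        if payload and payload.get("message", "").startswith("API_EXIT"):
            # req-elapsed-ms is the server-side elapsed time reported by the gateway pod.
            print(payload.get("txn-id"),
                  payload.get("http-path"),
                  str(payload.get("req-elapsed-ms")) + " ms")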

Why is there such a big difference? This led us to the next step.

3.Analyze the API call response time with breakdown

With that question in mind, we analyzed the API call response times in the HAR file again, this time with a breakdown of the response time. To break down API response time in Chrome DevTools, navigate to the Network tab, select the corresponding HTTP request (API call), and open the Timing chart. This reveals details such as the time spent on "Waiting for server response" or on the connection phases, including "Stalled" and "DNS Lookup", providing insight into where the time actually went.
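The same breakdown is available programmatically: each HAR entry carries a timings object whose blocked field roughly corresponds to the "Stalled"/"Queueing" phase shown by DevTools, and whose wait field is the "Waiting for server response" time. The following is a minimal sketch under the same assumptions as the earlier HAR script (the thresholds mirror what we observed):

import json

with open("discovery-ui.har", encoding="utf-8") as f:  # same assumed HAR export as above
    entries = json.load(f)["log"]["entries"]

for entry in entries:
    timings = entry["timings"]
    stalled = max(timings.get("blocked", 0), 0)  # DevTools shows this phase as "Stalled"/"Queueing"
    waiting = max(timings.get("wait", 0), 0)     # "Waiting for server response" (time to first byte)
    # Thresholds below mirror this investigation: > 9 s stalled, > 3 s waiting.
    if stalled > 9000 or waiting > 3000:
        print(entry["request"]["url"])
        print(f"  stalled={stalled / 1000:.2f} s  waiting={waiting / 1000:.2f} s  "
              f"total={entry['time'] / 1000:.2f} s")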

During our analysis of the timing breakdown, we noticed many HTTP requests stuck in the "Stalled" status for more than 9 s. One example looks like the following.

Then we had another question: why were the HTTP requests stuck in the "Stalled" status for so long (more than 9 s)?

After doing some research, we learned that a web application can be choked by Chrome's HTTP/1.1 limit of six connections per domain. If the frontend sends 6 requests concurrently on 6 connections, all subsequent requests are stalled and queued until a connection becomes available (that is, until a response is received). This can lead to performance and latency issues.
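To make the queueing effect concrete, here is a purely illustrative Python simulation (it is not browser code): a semaphore of six plays the role of the per-domain connection limit, and a fixed 3-second sleep stands in for a slow backend. Requests 7 through 12 "stall" for roughly one full backend round trip before they even start, which matches the pattern we saw in the Timing chart.

import asyncio
import time

CONNECTION_LIMIT = 6   # Chrome's HTTP/1.1 per-domain connection limit
SERVER_DELAY_S = 3.0   # assumed backend response time, loosely based on what we observed later
NUM_REQUESTS = 12      # illustrative number of concurrent UI requests

async def fetch(i, pool, start):
    async with pool:                          # request 7+ stalls here until a slot frees up
        stalled = time.monotonic() - start
        await asyncio.sleep(SERVER_DELAY_S)   # stand-in for the server producing a response
        finished = time.monotonic() - start
        print(f"request {i:2d}: stalled {stalled:4.1f} s, finished at {finished:4.1f} s")

async def main():
    pool = asyncio.Semaphore(CONNECTION_LIMIT)
    start = time.monotonic()
    await asyncio.gather(*(fetch(i, pool, start) for i in range(1, NUM_REQUESTS + 1)))

asyncio.run(main())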

Woo! It seemed we had found the root cause: too many concurrent HTTP requests.

4.Review and discuss with Product / Development team

We brought the potential root cause we had identified to the Watson Discovery (application) development team. However, the development team told us there had been no code changes when upgrading the application to the new version, and they were quite sure about this. If there was no performance problem in the old version, there should be no such problem in the new version.

Even though we doubted whether that was true, we thought it through more deeply together with the development team. Could the slowness of the earlier HTTP requests be what caused the "Stalled" status of the HTTP requests sent after them?

Looking at the same Timing breakdown chart (see the example below), we found that the time spent on "Waiting for server response" exceeded 3 s.

We then enabled access logging on the IBM Software Hub API gateway, and after reproducing the problem in the client environment, we did observe the slowness by checking the elapsed_time metric.

2025-05-05T18:22:24.408954458Z 137.188.11.249 - https://cpd-cpd.apps.xxx.com/discovery/cpd-wd/instances/00000000-0000-0000-1727-200308481569/projects/b95dfc49-ead0-4904-aafd-1a2484661d60/collections/3739f076-03a3-0057-0000-019657a655b1/manage_fields - [05/May/2025:18:22:24 +0000] \" GET /discovery/cpd-wd/api/instances/00000000-0000-0000-1727-200308481569/v2/enrichments HTTP/1.1\" 200 bytes=16575 txn_id=TOOLING-3m2wmxqm65now elapsed_time=3.984 

2025-05-05T18:23:35.503958717Z 137.188.11.249 - https://cpd-cpd.apps.xxx.com/discovery/cpd-wd/instances/00000000-0000-0000-1727-200308481569/projects/b95dfc49-ead0-4904-aafd-1a2484661d60/collections/3739f076-03a3-0057-0000-019657a655b1/manage_fields - [05/May/2025:18:23:35 +0000] \" GET /discovery/cpd-wd/api/instances/00000000-0000-0000-1727-200308481569/v2/projects/b95dfc49-ead0-4904-aafd-1a2484661d60/collections/3739f076-03a3-0057-0000-019657a655b1/annotations/labels HTTP/1.1\" 200 bytes=1937 txn_id=TOOLING-r19vevlmmx1o9 elapsed_time=3.871
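For larger log volumes, the elapsed_time values can be filtered with a short script. The following is a minimal sketch, assuming the access log lines above were saved to a local file named gateway_access.log (the file name, the exact field layout, and the 3-second threshold are assumptions based on the sample lines):

import re

# Pull the request line, txn_id, and elapsed_time (seconds) out of each access log entry.
LINE = re.compile(
    r'"\s*(?P<method>[A-Z]+)\s+(?P<path>\S+).*?'
    r'txn_id=(?P<txn>\S+)\s+elapsed_time=(?P<elapsed>[\d.]+)'
)

with open("gateway_access.log", encoding="utf-8") as f:  # assumed export file name
    for line in f:
        m = LINE.search(line)
        if m and float(m.group("elapsed")) > 3.0:  # illustrative threshold
            print(f'{m.group("elapsed")} s  {m.group("method")} {m.group("path")}  '
                  f'txn_id={m.group("txn")}')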

This meant the API calls themselves performed poorly, and the large number of concurrent HTTP requests made a bad situation worse. However, we could not take the concurrency limit as the root cause.

5.Investigate from the IBM Software Hub layer

The API call workflow of the Watson Discovery application is as follows:

HTTP request: 1. client > 2. Software Hub API gateway > 3. Watson Discovery API server

HTTP response: 4. Watson Discovery API server > 5. Software Hub API gateway > 6. client

In step 2 (Investigate from the application layer), we confirmed that there was no latency in the API call at the application layer (the Watson Discovery API server). As a result, the slowness was potentially caused by a performance problem in the Software Hub layer.

Just at that time, the client happened to tell us that new workloads had been running in the cluster since the upgrade. Woo. With this news, we felt there was a high chance the performance problem was caused by the newly added workloads, so checking the resource utilization of the cluster, with a particular focus on the Software Hub layer, became our top-priority task.

To check the resource utilization, we can use the Dashboards in the OpenShift web console.

On the Dashboards page, select the metrics as shown below.

Note: remember to change the Time Range to a value that covers the time period you are troubleshooting.

On this page, we can check whether any pods have high utilization. We focused on the CPU Quota and Memory Quota metrics. When reviewing these metrics, it is recommended to sort the CPU Limits and Memory Limits columns in descending order, which helps surface the pods with the highest utilization.
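If the web console is not convenient, a similar high-level picture can be obtained from the command line. The following is a minimal Python sketch that simply wraps the standard oc adm top pods command; the namespace cpd-instance is an assumption and should be replaced with the project where Software Hub is installed, and the --sort-by flag requires a reasonably recent oc/kubectl client.

import subprocess

NAMESPACE = "cpd-instance"   # assumption: replace with the project where IBM Software Hub runs

# "oc adm top pods" reads live usage from the cluster metrics API; sorting by CPU
# puts the heaviest consumers at the top, similar to sorting the dashboard columns.
result = subprocess.run(
    ["oc", "adm", "top", "pods", "-n", NAMESPACE, "--sort-by=cpu"],
    capture_output=True, text=True, check=True,
)

# Print the header line plus the ten busiest pods.
print("\n".join(result.stdout.splitlines()[:11]))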

When we reviewed the CPU Limits metric, we found that the CPU utilization of the zen-metastore-edb-2 pod was 98%.

CPU throttling was also detected.

In addition, we found that the zen-metastore-edb-2 pod had restarted 6 times. This was strong evidence that we needed to scale up the resources for this pod. At this point, we believed we had figured out the root cause.
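To double-check findings like the restart count and current usage outside the console, a couple of standard oc queries are enough. The following is a minimal sketch; the namespace is an assumption, and the pod name is the one identified in this investigation.

import subprocess

NAMESPACE = "cpd-instance"        # assumption: the project where IBM Software Hub runs
POD = "zen-metastore-edb-2"       # the pod identified in this investigation

# Restart count per container, read from the pod status with a JSONPath query.
restarts = subprocess.run(
    ["oc", "get", "pod", POD, "-n", NAMESPACE,
     "-o", "jsonpath={.status.containerStatuses[*].restartCount}"],
    capture_output=True, text=True, check=True,
).stdout
print(f"{POD} container restarts: {restarts}")

# Current CPU/memory usage of the pod, from the cluster metrics API.
usage = subprocess.run(
    ["oc", "adm", "top", "pod", POD, "-n", NAMESPACE],
    capture_output=True, text=True, check=True,
).stdout
print(usage)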

6.Resolution 

With the root cause identified, we applied the solution (scaling up the resources for the affected pod) in the customer environment. Once that was done, the customer confirmed that the performance problem was resolved and that they were happy with the resolution.

Summary

IBM Software Hub performance troubleshooting is a complex task. As this typical real-world story shows, troubleshooting across different layers using different techniques may be required. We hope the performance troubleshooting practice introduced in this article serves as a useful reference.
