Enhanced Observability in IBM Software Hub with Instana Support

By Yongli An

  

Introduction  

In the world of software applications, observability allows you to understand what’s happening inside your applications and infrastructure by examining the most important signals from telemetry data: logs, metrics, traces, and application flows.  

Instana is IBM’s real-time full-stack observability solution. This enterprise-ready platform has gained widespread adoption and trust among organizations worldwide due to its comprehensive monitoring capabilities.

As IBM Software Hub (SWH) deployments grow in scale and complexity, understanding application behavior becomes critical. With increasing customer demand for better observability, integrating Instana support into SWH and its services became a strategic priority. Having completed Instana integration across our core services, we now officially offer Instana support in SWH 5.3.

Prerequisites

To leverage Instana integration, you need:

  1. Instana Server Access: Typically centralized and shared across your organization. IBM recommends using Instana SaaS for convenience.
  2. Instana Agent Installation: Deploy the Instana Agent on your OpenShift cluster where SWH runs.
    Setting up and maintaining the Instana server and agents is beyond this blog’s scope. This guide assumes you have access to an existing Instana server. For agent installation instructions, see the Instana agent installation documentation.
  3. Enabling Instana Integration: Once the prerequisites are met, enable metric collection by adding this flag to your service Custom Resource (see the sketch after this list):
    enableInstanaMetricCollection: true
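
For reference, here is a minimal sketch of what that change can look like in a service CR. Only the enableInstanaMetricCollection flag comes from this release; the API version, kind, and names below are hypothetical placeholders, and where exactly the flag sits may vary by service, so check the documentation for the service you are enabling.

    # Hypothetical service CR sketch; replace apiVersion, kind, and names
    # with the values for the service you are enabling.
    apiVersion: example.ibm.com/v1        # placeholder API group/version
    kind: ExampleService                  # placeholder CR kind
    metadata:
      name: exampleservice-cr             # placeholder CR name
      namespace: zen                      # your SWH instance namespace may differ
    spec:
      # Standard flag introduced in SWH 5.3; set to false to disable collection
      enableInstanaMetricCollection: true

Setting the flag back to false disables metric collection again; as described under “Simple configuration” below, the system handles the rest of the configuration automatically.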

What’s included in this release  

 

In SWH 5.3, about 60 services now support Instana, offering comprehensive coverage and a standardized implementation to ensure a consistent user experience. 

Simple configuration

Using a newly introduced standard flag, you can enable or disable Instana metric collection via the service Custom Resource (CR). The system handles all configuration automatically—no complex setup required.

Rich technology support

Instana supports most technologies used by SWH services; the rare exceptions are noted in the SWH documentation. Supported technologies include:

  • Node.js, Python, Go, Java
  • PostgreSQL/EDB, MongoDB
  • Nginx

Observability Coverage

Instana provides insights into:

  • Service-level metrics and overall performance
  • API call flows with detailed latency analysis
  • Internal dependencies and component relationships
  • Database query efficiency and performance
  • Resource utilization trends and patterns

Getting started 

Instana complements existing monitoring tools—such as Prometheus queries, OpenShift dashboards, and the SWH monitoring console—by providing broader visibility across all system layers. This unified approach delivers a holistic view of your entire environment, from infrastructure to application-level metrics. 

This section provides UI examples demonstrating how to use Instana for observability across different layers, from infrastructure to application level:

  • Cluster level monitoring
    View infrastructure health, resource utilization, and overall cluster performance.
  • Service and API endpoint level analysis
    Monitor individual services, API performance, and request flows.
  • Cluster level events
    Track cluster events, alerts, and anomalies across your environment.

Cluster level monitoring  

After logging into Instana, from the Instana home page, follow these steps to view your cluster level metrics:

  1. Click Platforms in the left navigation pane
  2. Select Kubernetes to open the Kubernetes monitoring page
  3. (Optional) Switch to table view by clicking the list icon in the upper-right corner
  4. Locate your cluster using the search bar or by browsing the list

Important: The cluster name displayed in Instana matches the name or label defined in your Instana Agent configuration YAML, which may differ from your OpenShift cluster name.

Example: As shown below, if you configured your Instana Agent with the cluster tag d107-DEMO, search for d107 to locate your cluster in the list.

Tip: If you're unsure of your cluster's Instana name, check the cluster.name or zone.name field in your Agent configuration YAML.
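
For illustration, here is a hedged sketch of those fields in an agent configuration, assuming a Helm-values-style layout. Only the cluster.name and zone.name fields and the d107-DEMO example appear in this post; the zone value is a hypothetical placeholder, so verify the exact structure against the Instana documentation for your installation method.

    # Agent configuration fields that control how the cluster appears in Instana
    cluster:
      name: d107-DEMO        # the name you search for on the Kubernetes page
    zone:
      name: d107-demo-zone   # hypothetical zone label; often mirrors the cluster name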

[Screenshot: Kubernetes clusters list in Instana, with the d107-DEMO cluster located via search]


Cluster list view

The Kubernetes platform page displays all clusters connected to your Instana server. In this example, 31 clusters are monitored by the shared Instana instance. The list view provides high-level information for each cluster:

  • Number of nodes
  • Number of services
  • Number of pods
  • Health status indicator

Cluster detail view

Click on your cluster name to access the detailed Kubernetes cluster overview. This view displays:

Resource allocation metrics (percentages):

  • CPU requests
  • Memory requests
  • CPU limits
  • Memory limits
  • Pod allocation

Historical trends (absolute values):

Time-series graphs show the same metrics over time, allowing you to identify patterns and potential resource constraints.

Tip: Use the percentage metrics to quickly assess resource utilization, and refer to the historical graphs to understand trends and plan capacity.

[Screenshot: Kubernetes cluster detail view showing resource allocation percentages and historical graphs]

Exploring cluster details

Additional tabs at the top of the page provide deeper insights into your cluster. For example, the Nodes tab displays detailed metrics for each node, including resource utilization, capacity, and health status.

Other available tabs include Namespaces, Deployments, Pods, Services, and Events—each offering specific views into different aspects of your cluster.

[Screenshot: Cluster detail page tabs, with the Nodes tab selected]

Viewing service pods

The Pods tab lists all running pods with status, resource usage, and restart counts. To examine a specific service in SWH:

  1. Filter by the Namespace column
  2. Select your SWH namespace (e.g., zen)
  3. Click a service pod to view its details

Note: The SWH namespace name may vary based on your installation configuration.

[Screenshot: Pods tab filtered by the SWH namespace]

Persistent Volumes tab: storage monitoring

The Persistent Volumes tab provides critical storage metrics for your cluster:

  • Capacity: Total storage allocated to each PV
  • Usage: Current storage consumed (in MiB/GiB) 
  • Utilization: Percentage of capacity used

Monitoring storage health

Sort by the Utilization column (descending) to quickly identify volumes approaching capacity limits. High utilization (>80%) may indicate the need for:

  • Storage expansion
  • Data cleanup or archiving
  • Investigation of unexpected growth

Why this matters for SWH
Many services (databases, logging, data stores) in SWH rely on persistent storage. Monitoring PV utilization helps prevent service disruptions due to storage exhaustion.

 

[Screenshot: Persistent Volumes tab showing capacity, usage, and utilization for each volume]

Service and API endpoint level analysis  

Service-level observability provides detailed insights into individual service performance, including API response times, error rates, throughput, and dependencies. This granular view helps identify bottlenecks and troubleshoot issues that may not be visible at the cluster level.

Navigate to service details

Instana offers multiple paths to service-level metrics. The quickest method:

  1. Click Applications in the left navigation bar
  2. Filter by your cluster name to narrow the view
  3. Select the Summary tab

The Summary view displays:

  • Service call rates and response times
  • Error rates and types
  • Top endpoints by traffic and latency

[Screenshot: Applications Summary view showing call rates, latency, and error rates]

The Summary page shown above (top portion of the page) displays key performance metrics for all services in your cluster, including:

  • Call rates (requests per second)
  • Response times and latency
  • Error rates

The Calls section offers two view modes:

  • HTTP status codes: Shows request distribution by status (2xx, 4xx, 5xx) to identify errors
  • Call count: Displays total requests per second for each service

As shown below, switch to Call count view to:

  • Evaluate overall system throughput
  • Monitor load levels across services
  • Identify high-traffic services that may need scaling
  • Establish baseline performance metrics

[Screenshot: Calls section switched to Call count view]

Key performance metrics

Instana tracks three critical service-level metrics:

  1. Latency: Response time for service calls
  2. Calls: Request volume and throughput
  3. Error Rate: Percentage of failed requests

Top services ranking

The Top services section, in the lower part of the page, displays services ranked by your selected metric. Use the metric selector to switch between:

  • Top services by Latency: Identifies slow services affecting user experience
  • Top services by Calls: Shows highest-traffic services consuming resources
  • Top services by Error Rate: Highlights services with reliability issues

 

[Screenshot: Top services ranked by the selected metric]

Viewing all services

Access the complete services list via either:

  • Click View all services in the Summary view, or
  • Select the Services tab at the top of the page

Sorting by metrics: the Services tab displays multiple metrics including latency, call volume, and error counts. Click any column header to sort by that metric (click again to reverse order).

Example: to find services experiencing the most failures, click the Erroneous calls column header to sort in descending order; the services with the highest error counts appear at the top.

In the example shown in the screenshot below, runtime-manager-api-container has the highest error count, indicating that it requires investigation.

 

[Screenshot: Services tab sorted by Erroneous calls, with runtime-manager-api-container at the top]

Click the name of the service you are interested in from the view above. A summary view opens for that single service, similar to the overall service-level summary but scoped to runtime-manager-api-container, as shown below:

[Screenshot: Summary view for the runtime-manager-api-container service]

 

Endpoint level analysis

For deeper insights, Instana provides endpoint-level metrics and latency breakdowns showing performance across the entire call stack. Click Analyze Calls (marked by the red arrow in the screenshot above) to see a list of sections grouped by endpoint name for the selected service. In this example, runtime-manager-api-container shows 10 endpoint groups, sortable by:

  • Call volume
  • Latency (mean, p95, p99)
  • Error rate

The error rate graph (center, top section) reveals that the GetJob endpoint experienced 100% errors between 11:00 AM and 12:00 PM, indicating complete service failure during that interval. These failures contribute to the 8.17% overall error rate shown in the endpoint list, calculated across the entire time window.

[Screenshot: Analyze Calls view grouped by endpoint, showing the GetJob error rate]


Now let's drill down to the individual endpoint calls. Click the `GetJob` endpoint group to expand it and view all individual API calls, so we can investigate the root cause of the 8.17% error rate. Sorting by timestamp (descending) reveals a clear pattern:

  • After 11:37:09: All calls exceeded 60 seconds, indicating timeout failures (likely a 60-second timeout configuration)
  • Before 11:37:09: Calls completed with normal latency (typically under 5 seconds)

This pattern suggests a sudden performance degradation starting at 11:37:09, causing all subsequent requests to time out.

[Screenshot: Individual GetJob calls sorted by timestamp, showing calls exceeding 60 seconds after 11:37:09]

To further investigate the root cause of the sudden performance degradation:

  1. Click any timed-out call to view its trace
  2. Examine the call stack to identify slow components or any error details

The example below shows “Internal Server Error” in the “Logs” section, further confirming that some services are no longer available and are causing the failures. Potential causes include the following, though additional investigation is needed to confirm the actual cause:

  • Database query delays
  • External service dependencies
  • Resource contention (CPU, memory, I/O)
  • Network issues

[Screenshot: Call trace showing “Internal Server Error” in the Logs section]

Next, the service call flow visualization maps dependencies and communication patterns between services, making it easy to:

  • Trace request paths across multiple services
  • Identify performance bottlenecks in service chains
  • Understand cascading failure impacts

This view can be reached from the service level summary view for a given service (which was shown earlier). To help you recognize where to start, here is the top portion of the UI page: 

[Screenshot: Top portion of the service summary page, showing the Summary and Flow tabs]

To view the service dependency diagram, switch from the Summary tab to the Flow tab at the top of the page. The Flow view displays an interactive visualization showing:

  • Service dependencies and relationships
  • Call direction and volume
  • Latency at each service hop
  • Error rates across the service chain

This diagram helps you trace request paths and identify where delays or failures occur in multi-service transactions.  

[Screenshot: Service dependency flow diagram]

This example shows a simple service flow. In practice, some services in SWH have far more complex dependency and flow graphs with many more interconnected services.

Cluster level events  

You can access the cluster events view by clicking Events in the left sidebar, which opens the Incidents dashboard displaying alerts and anomalies across all monitored clusters. If you monitor multiple clusters, use the filter dropdown at the top of the page to narrow the view to your specific cluster. This helps you focus on relevant incidents without noise from other environments.

[Screenshot: Incidents dashboard filtered by cluster]

There are other tabs for different events that may be helpful to explore, but they are beyond the scope of this introductory blog post. 

Want to see Instana in action? Check out our quick start demo.

Conclusion 

Instana provides comprehensive real-time observability for IBM Software Hub deployments, from infrastructure monitoring to API-level performance analysis. While Instana offers extensive advanced features, this guide focused on essential capabilities to help you get started:

  • Cluster-level monitoring: Infrastructure health and resource utilization
  • Service-level analysis: Performance metrics, error rates, call flows, and dependencies
  • Endpoint diagnostics: API latency breakdowns and call tracing
  • Event tracking: Incident detection and alerting

Instana integration enables SWH customers to:

  • Proactively identify issues before they impact users
  • Quickly diagnose bottlenecks and root causes
  • Reduce mean time to resolution (MTTR)
  • Minimize downtime and operational costs
  • Gain unified visibility across the entire stack
