Enhanced Observability in IBM Software Hub with Instana Support

By Yongli An

  

Introduction  

In the world of software applications, observability allows you to understand what’s happening inside your applications and infrastructure by examining the most important signals from telemetry data: logs, metrics, traces, and application flows.  

Instana is IBM’s real-time full-stack observability solution. This enterprise-ready platform has gained widespread adoption and trust among organizations worldwide due to its comprehensive monitoring capabilities.

As IBM Software Hub (SWH) deployments grow in scale and complexity, understanding application behavior becomes critical. With increasing customer demand for better observability, integrating Instana support into SWH and its services became a strategic priority. Having completed Instana integration across our core services, we now officially offer Instana support in SWH 5.3.

Prerequisites

To leverage Instana integration, you need:

  1. Instana Server Access: Typically centralized and shared across your organization. IBM recommends using Instana SaaS for convenience.
  2. Instana Agent Installation: Deploy the Instana Agent on your OpenShift cluster where SWH runs.
    Setting up and maintaining the Instana server and agents is beyond this blog’s scope. This guide assumes you have access to an existing Instana server. For agent installation instructions, see the Instana agent installation documentation.
  3. Enabling Instana Integration: Once the prerequisites are met, enable metric collection by adding this flag to your service Custom Resource (see the sketch after this list):
    enableInstanaMetricCollection: true
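
For reference, here is a minimal sketch of what that change can look like in a service CR. Only the enableInstanaMetricCollection flag comes from this release; the API version, kind, and names below are hypothetical placeholders, and where exactly the flag sits may vary by service, so check the documentation for the service you are enabling.

    # Hypothetical service CR sketch; replace apiVersion, kind, and names
    # with the values for the service you are enabling.
    apiVersion: example.ibm.com/v1        # placeholder API group/version
    kind: ExampleService                  # placeholder CR kind
    metadata:
      name: exampleservice-cr             # placeholder CR name
      namespace: zen                      # your SWH instance namespace may differ
    spec:
      # Standard flag introduced in SWH 5.3; set to false to disable collection
      enableInstanaMetricCollection: true

Setting the flag back to false disables metric collection again; as described under “Simple configuration” below, the system handles the rest of the configuration automatically.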

What’s included in this release  

 

In SWH 5.3, about 60 services now support Instana, offering comprehensive coverage and a standardized implementation to ensure a consistent user experience. 

Simple configuration

Using a newly introduced standard flag, you can enable or disable Instana metric collection via the service Custom Resource (CR). The system handles all configuration automatically—no complex setup required.

Rich technology support

Instana supports most technologies used by SWH services; the rare exceptions are noted in the SWH documentation. Supported technologies include:

  • Node.js, Python, Go, Java
  • PostgreSQL/EDB, MongoDB
  • Nginx

Observability Coverage

Instana provides insights into:

  • Service-level metrics and overall performance
  • API call flows with detailed latency analysis
  • Internal dependencies and component relationships
  • Database query efficiency and performance
  • Resource utilization trends and patterns

Getting started 

Instana complements existing monitoring tools—such as Prometheus queries, OpenShift dashboards, and the SWH monitoring console—by providing broader visibility across all system layers. This unified approach delivers a holistic view of your entire environment, from infrastructure to application-level metrics. 

This section provides UI examples demonstrating how to use Instana for observability across different layers, from infrastructure to application level:

  • Cluster level monitoring
    View infrastructure health, resource utilization, and overall cluster performance.
  • Service and API endpoint level analysis
    Monitor individual services, API performance, and request flows.
  • Cluster level events
    Track cluster events, alerts, and anomalies across your environment.

Cluster level monitoring  

After logging into Instana, from the Instana home page, follow these steps to view your cluster level metrics:

  1. Click Platforms in the left navigation pane
  2. Select Kubernetes to open the Kubernetes monitoring page
  3. (Optional) Switch to table view by clicking the list icon in the upper-right corner
  4. Locate your cluster using the search bar or by browsing the list

Important: The cluster name displayed in Instana matches the name or label defined in your Instana Agent configuration YAML, which may differ from your OpenShift cluster name.

Example: As shown below, if you configured your Instana Agent with the cluster tag d107-DEMO, search for d107 to locate your cluster in the list.

Tip: If you're unsure of your cluster's Instana name, check the cluster.name or zone.name field in your Agent configuration YAML.
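
For illustration, here is a hedged sketch of those fields in an agent configuration, assuming a Helm-values-style layout. Only the cluster.name and zone.name fields and the d107-DEMO example appear in this post; the zone value is a hypothetical placeholder, so verify the exact structure against the Instana documentation for your installation method.

    # Agent configuration fields that control how the cluster appears in Instana
    cluster:
      name: d107-DEMO        # the name you search for on the Kubernetes page
    zone:
      name: d107-demo-zone   # hypothetical zone label; often mirrors the cluster name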

[Screenshot: Kubernetes clusters list in Instana, with the d107-DEMO cluster located via search]


Cluster list view

The Kubernetes platform page displays all clusters connected to your Instana server. In this example, 31 clusters are monitored by the shared Instana instance. The list view provides high-level information for each cluster:

  • Number of nodes
  • Number of services
  • Number of pods
  • Health status indicator

Cluster detail view

Click on your cluster name to access the detailed Kubernetes cluster overview. This view displays:

Resource allocation metrics (percentages):

  • CPU requests
  • Memory requests
  • CPU limits
  • Memory limits
  • Pod allocation

Historical trends (absolute values):

Time-series graphs show the same metrics over time, allowing you to identify patterns and potential resource constraints.

Tip: Use the percentage metrics to quickly assess resource utilization, and refer to the historical graphs to understand trends and plan capacity.

[Screenshot: Kubernetes cluster detail view showing resource allocation percentages and historical graphs]

Exploring cluster details

Additional tabs at the top of the page provide deeper insights into your cluster. For example, the Nodes tab displays detailed metrics for each node, including resource utilization, capacity, and health status.

Other available tabs include Namespaces, Deployments, Pods, Services, and Events—each offering specific views into different aspects of your cluster.

[Screenshot: Cluster detail page tabs, with the Nodes tab selected]

Viewing service pods

The Pods tab lists all running pods with status, resource usage, and restart counts. To examine a specific service in SWH:

  1. Filter by the Namespace column
  2. Select your SWH namespace (e.g., zen)
  3. Click a service pod to view its details

Note: The SWH namespace name may vary based on your installation configuration.

[Screenshot: Pods tab filtered by the SWH namespace]

Persistent Volumes tab: storage monitoring

The Persistent Volumes tab provides critical storage metrics for your cluster:

  • Capacity: Total storage allocated to each PV
  • Usage: Current storage consumed (in MiB/GiB) 
  • Utilization: Percentage of capacity used

Monitoring storage health

Sort by the Utilization column (descending) to quickly identify volumes approaching capacity limits. High utilization (>80%) may indicate the need for:

  • Storage expansion
  • Data cleanup or archiving
  • Investigation of unexpected growth

Why this matters for SWH
Many services (databases, logging, data stores) in SWH rely on persistent storage. Monitoring PV utilization helps prevent service disruptions due to storage exhaustion.

 

[Screenshot: Persistent Volumes tab showing capacity, usage, and utilization for each volume]

Service and API endpoint level analysis  

Service-level observability provides detailed insights into individual service performance, including API response times, error rates, throughput, and dependencies. This granular view helps identify bottlenecks and troubleshoot issues that may not be visible at the cluster level.

Navigate to service details

Instana offers multiple paths to service-level metrics. The quickest method:

  1. Click Applications in the left navigation bar
  2. Filter by your cluster name to narrow the view
  3. Select the Summary tab

The Summary view displays:

  • Service call rates and response times
  • Error rates and types
  • Top endpoints by traffic and latency

[Screenshot: Applications Summary view showing call rates, latency, and error rates]

The Summary page shown above (top portion of the page) displays key performance metrics for all services in your cluster, including:

  • Call rates (requests per second)
  • Response times and latency
  • Error rates

The Calls section offers two view modes:

  • HTTP status codes: Shows request distribution by status (2xx, 4xx, 5xx) to identify errors
  • Call count: Displays total requests per second for each service

As shown below, switch to Call count view to:

  • Evaluate overall system throughput
  • Monitor load levels across services
  • Identify high-traffic services that may need scaling
  • Establish baseline performance metrics

[Screenshot: Calls section switched to Call count view]

Key performance metrics

Instana tracks three critical service-level metrics:

  1. Latency: Response time for service calls
  2. Calls: Request volume and throughput
  3. Error Rate: Percentage of failed requests

Top services ranking

The Top services section, in the lower part of the page, displays services ranked by your selected metric. Use the metric selector to switch between:

  • Top services by Latency: Identifies slow services affecting user experience
  • Top services by Calls: Shows highest-traffic services consuming resources
  • Top services by Error Rate: Highlights services with reliability issues

 

[Screenshot: Top services ranked by the selected metric]

Viewing all services

Access the complete services list via either:

  • Click View all services in the Summary view, or
  • Select the Services tab at the top of the page

Sorting by metrics: the Services tab displays multiple metrics including latency, call volume, and error counts. Click any column header to sort by that metric (click again to reverse order).

Example: to find services experiencing the most failures, click the Erroneous calls column header to sort in descending order; the services with the highest error counts appear at the top.

In the example shown in the screenshot below, runtime-manager-api-container has the highest error count, indicating that it requires investigation.

 

[Screenshot: Services tab sorted by Erroneous calls, with runtime-manager-api-container at the top]

Click the name of the service you are interested in from the view above. A summary view opens for that single service, similar to the overall service-level summary but scoped to runtime-manager-api-container, as shown below:

[Screenshot: Summary view for the runtime-manager-api-container service]

 

Endpoint level analysis

For deeper insights, Instana provides endpoint-level metrics and latency breakdowns showing performance across the entire call stack. Click Analyze Calls (marked by the red arrow in the screenshot above) to see a list of sections grouped by endpoint name for the selected service. In this example, runtime-manager-api-container shows 10 endpoint groups, sortable by:

  • Call volume
  • Latency (mean, p95, p99)
  • Error rate

The error rate graph (center, top section) reveals that the GetJob endpoint experienced 100% errors between 11:00 AM and 12:00 PM, indicating complete service failure during that interval. These failures contribute to the 8.17% overall error rate shown in the endpoint list, calculated across the entire time window.

[Screenshot: Analyze Calls view grouped by endpoint, showing the GetJob error rate]


Now let's drill down to the individual endpoint calls. Click the `GetJob` endpoint group to expand it and view all individual API calls, so we can investigate the root cause of the 8.17% error rate. Sorting by timestamp (descending) reveals a clear pattern:

  • After 11:37:09: All calls exceeded 60 seconds, indicating timeout failures (likely a 60-second timeout configuration)
  • Before 11:37:09: Calls completed with normal latency (typically under 5 seconds)

This pattern suggests a sudden performance degradation starting at 11:37:09, causing all subsequent requests to time out.

[Screenshot: Individual GetJob calls sorted by timestamp, showing calls exceeding 60 seconds after 11:37:09]

To further investigate the root cause of the sudden performance degradation:

  1. Click any timed-out call to view its trace
  2. Examine the call stack to identify slow components or any error details

The example below shows “Internal Server Error” in the “Logs” section, further confirming that some services are no longer available and are causing the failures. Potential causes include the following, though additional investigation is needed to confirm the actual cause:

  • Database query delays
  • External service dependencies
  • Resource contention (CPU, memory, I/O)
  • Network issues

[Screenshot: Call trace showing “Internal Server Error” in the Logs section]

Next, the service call flow visualization maps dependencies and communication patterns between services, making it easy to:

  • Trace request paths across multiple services
  • Identify performance bottlenecks in service chains
  • Understand cascading failure impacts

This view can be reached from the service level summary view for a given service (which was shown earlier). To help you recognize where to start, here is the top portion of the UI page: 

[Screenshot: Top portion of the service summary page, showing the Summary and Flow tabs]

To view the service dependency diagram, switch from the Summary tab to the Flow tab at the top of the page. The Flow view displays an interactive visualization showing:

  • Service dependencies and relationships
  • Call direction and volume
  • Latency at each service hop
  • Error rates across the service chain

This diagram helps you trace request paths and identify where delays or failures occur in multi-service transactions.  

[Screenshot: Service dependency flow diagram]

This example shows a simple service flow. In practice, some services in SWH have far more complex dependency and flow graphs with many more interconnected services.

Cluster level events  

You can access the cluster events view by clicking Events in the left sidebar, which opens the Incidents dashboard displaying alerts and anomalies across all monitored clusters. If you monitor multiple clusters, use the filter dropdown at the top of the page to narrow the view to your specific cluster. This helps you focus on relevant incidents without noise from other environments.

[Screenshot: Incidents dashboard filtered by cluster]

There are other tabs for different events that may be helpful to explore, but they are beyond the scope of this introductory blog post. 

Want to see Instana in action? Check out our quick start demo.

Conclusion 

Instana provides comprehensive real-time observability for IBM Software Hub deployments, from infrastructure monitoring to API-level performance analysis. While Instana offers extensive advanced features, this guide focused on essential capabilities to help you get started:

  • Cluster-level monitoring: Infrastructure health and resource utilization
  • Service-level analysis: Performance metrics, error rates, call flows, and dependencies
  • Endpoint diagnostics: API latency breakdowns and call tracing
  • Event tracking: Incident detection and alerting

Instana integration enables SWH customers to:

  • Proactively identify issues before they impact users
  • Quickly diagnose bottlenecks and root causes
  • Reduce mean time to resolution (MTTR)
  • Minimize downtime and operational costs
  • Gain unified visibility across the entire stack
