What are the most common metrics to monitor from your WAS servers?

By PRASHANTH GUNAPALASINGAM posted Fri January 13, 2023 01:57 PM

  
By: Don Bourne and Prashanth Gunapalasingam

See also: What are the most common metrics to monitor from your Liberty servers?

Recently, we were asked which metrics ops teams should pay attention to on application servers in WebSphere Application Server (WAS). The answer depends on what your applications do: for example, WAS has metrics for the SIBus and the dynamic object cache that are relevant only if you use those capabilities. It also depends on your monitoring objectives. Your goal could be, for instance, to ensure you are meeting response time objectives for your web applications.

This blog focuses on the metrics that are relevant for most web applications, and the conditions to watch for, with the general goal of ensuring that the users of your apps never notice a problem. Users are likely to notice slow response times, hangs, and errors. We'll call these key problems, which directly affect your users, "primary" events.

There are also problems that you can detect from metrics data that indicate something isn't right, but that may not be impacting your users. For example, your Java heap utilization may be high for an extended period of time. These problems may not necessarily mean your users are aware of an issue, but you may want to know about them before they cause bigger problems. We'll call these problems "secondary" events.

The table below lists the metrics that are relevant to watch for most web applications. These metrics can be monitored with JMX clients, which rely on MBeans exposed by the WAS Performance Monitoring Infrastructure (PMI), or with monitoring tools that connect to a `/metrics` endpoint to gather data in the Prometheus text-based exposition format. To expose a `/metrics` endpoint, deploy the metrics.ear application, as stated here. The metrics.ear application gets its data from PMI, so if you deploy it you also need to enable the specific PMI metrics, as stated here, that correspond to the metrics you want to appear in the `/metrics` output.
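As a rough illustration of what a monitoring tool sees on a `/metrics` endpoint, the Python sketch below parses a few lines of Prometheus text exposition format. The sample payload, its `servlet` label, and all values are made up for illustration; real output from metrics.ear will differ, and a production tool would normally use an existing Prometheus client or scraper instead of hand-rolled parsing.

```python
import re

# Hypothetical sample payload; label names and values are illustrative only.
SAMPLE = """\
# HELP was_servlet_requests_total The total number of requests that a servlet processed.
# TYPE was_servlet_requests_total counter
was_servlet_requests_total{servlet="ExampleServlet"} 1200
was_servlet_response_time_seconds_total{servlet="ExampleServlet"} 84.5
jvm_memory_used_bytes 4.2e+08
"""

# metric_name, optional {labels}, then the sample value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


parsed = parse_metrics(SAMPLE)
for name, labels, value in parsed:
    print(name, labels, value)
```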

WebSphere Application Server Metrics

Note: These are the most common metrics to monitor. Depending on the programming models used in your applications, you may want to track other metrics as well. For a complete list of metrics, see here.

| Category | Prometheus Metric Name | PMI Module Name | PMI Counter Name & ID | Description |
| --- | --- | --- | --- | --- |
| Connection Pool | `was_connectionpool_inUse_time_seconds_total` | connectionPoolModule | UseTime (ID=12) | The total time (in seconds) that a connection was used. The total time is the difference between the time at which the connection is allocated and the time at which it is returned. This value includes the JDBC operation time. |
| Connection Pool | `was_connectionpool_wait_time_seconds_total` | connectionPoolModule | WaitTime (ID=13) | The total wait time (in seconds) until a connection was granted. |
| EJB | `was_ejb_response_time_seconds_total` | beanModule | MethodResponseTime (ID=12) | The total response time (in seconds) on the remote methods of the bean. |
| EJB | `was_ejb_responses_total` | beanModule | MethodResponseTime (ID=12) | The number of times that the remote methods of the bean were processed. |
| JVM | `jvm_memory_used_bytes` | jvmRuntimeModule | UsedMemory (ID=3) | The used memory (in bytes) in the Java virtual machine runtime. |
| JVM | `jvm_memory_max_bytes` | jvmRuntimeModule | HeapSize (ID=1) | The total memory (in bytes) in the Java virtual machine runtime. |
| JVM | `jvm_gc_duration_seconds_total` | jvmRuntimeModule | GCTime (ID=13) | The total time (in seconds) spent in garbage collection. |
| JVM | `process_cpu_utilization` | jvmRuntimeModule | ProcessCpuUsage (ID=5) | The CPU usage (in percent) of the Java virtual machine. |
| Security | `was_authentication_authentications_total` | SecurityAuthenticationStats | WebAuthenticationCount (ID=1) | The total number of authentications processed by type. |
| Security | `was_authentication_authentication_time_seconds_total` | SecurityAuthenticationStats | WebAuthenticationTime (ID=10) | The total response time (in seconds) for the specified authentication type. |
| Security | `was_authorization_authorizations_total` | SecurityAuthorizationStats | WebAuthorizationTime (ID=1) | The total number of authorizations performed for a given authorization type. |
| Security | `was_authorization_authorization_time_seconds_total` | SecurityAuthorizationStats | EJBAuthorizationTime (ID=2) | The total response time (in seconds) for a given authorization type. |
| Thread Pool | `was_threadpool_declaredThreadHungs_total` | threadPoolModule | DeclaredThreadHungCount (ID=6) | The number of threads that were declared hung. |
| Transaction Manager | `was_transactionmanager_rolledback_total` | transactionModule | RolledbackCount (ID=16) | The number of transactions that were rolled back. |
| Transaction Manager | `was_transactionmanager_timedout_total` | transactionModule | GlobalTimeoutCount (ID=18) | The number of transactions that timed out. |
| Web Application (Servlet) | `was_servlet_response_time_seconds_total` | webAppModule.servlets | ServiceTime (ID=13) | The total response time (in seconds) to process servlet requests. |
| Web Application (Servlet) | `was_servlet_requests_total` | webAppModule.servlets | RequestCount (ID=11) | The total number of requests that a servlet processed. |
| Web Service | `was_jaxws_response_time_seconds_total` | webServicesModule.services | ResponseTime (ID=14) | The response time (in seconds) between the receipt of a request and the return of the reply. |
| Web Service | `was_jaxws_requests_total` | webServicesModule.services | RequestReceivedEndpoint (ID=30) | The total number of requests. |
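Most of the `_total` metrics in this table appear to be cumulative counters, so a useful reading usually comes from the change between two scrapes rather than from the raw value. The Python sketch below shows that arithmetic for a time counter and its matching request counter. The function name and the sample numbers are hypothetical and only illustrate the calculation.

```python
def windowed_average(time_start, count_start, time_end, count_end):
    """Average seconds per request between two scrapes of cumulative counters.

    Returns None when no requests completed in the window, or when either
    counter went backwards (for example after a server restart).
    """
    delta_count = count_end - count_start
    delta_time = time_end - time_start
    if delta_count <= 0 or delta_time < 0:
        return None
    return delta_time / delta_count


# Hypothetical scrapes of was_servlet_response_time_seconds_total and
# was_servlet_requests_total taken a few minutes apart.
avg = windowed_average(time_start=84.5, count_start=1200,
                       time_end=96.5, count_end=1300)
print(avg)  # 12 seconds of service time over 100 requests
```

Dividing the raw totals instead (96.5 / 1300 in this example) gives the average since the counters started, which reacts slowly to a recent slowdown.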

 

Primary Events

The primary events table helps you determine whether there is a problem likely to affect the users of the application. It lists common issues that your users will see, along with the corresponding WAS metric conditions and thresholds to watch for. The conditions are in PromQL format.

Note: The thresholds provided for the conditions are for example purposes. The thresholds should be adjusted as appropriate for your service level objectives.

| Name | Condition | Duration | Description |
| --- | --- | --- | --- |
| Threads hanging | `rate(was_threadpool_declaredThreadHungs_total[5m]) > 0` | > 0 sec | Indicates that threads in the application server are hanging. |
| Transactions rolling back | `rate(was_transactionmanager_rolledback_total[5m]) > 0` | > 0 sec | Indicates transactions are being rolled back. |
| Transactions timing out | `rate(was_transactionmanager_timedout_total[5m]) > 0` | > 0 sec | Indicates transactions are timing out. |
| Servlets responding slowly | `avg_over_time((was_servlet_response_time_seconds_total/was_servlet_requests_total)[5m:5m]) > 1` | 30 sec | Indicates servlets are responding slowly. |
| Web services responding slowly | `avg_over_time((was_jaxws_response_time_seconds_total/was_jaxws_requests_total)[5m:5m]) > 1` | 30 sec | Indicates JAX-WS web services are responding slowly. |
| EJBs responding slowly | `avg_over_time((was_ejb_response_time_seconds_total/was_ejb_responses_total)[5m:5m]) > 1` | 30 sec | Indicates EJBs are responding slowly. |
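One plausible reading of the Duration column is that a condition must stay true for at least that long before an event is raised, similar to the `for` clause in Prometheus alerting rules. The Python sketch below illustrates that behavior on a series of already-evaluated values. The function name, the sample series, and the 15-second spacing are assumptions made for the example.

```python
def first_firing_time(samples, threshold, hold_seconds):
    """Return the first timestamp at which value > threshold has held
    continuously for at least hold_seconds, or None if it never does.

    samples is a list of (timestamp_seconds, value) pairs in time order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
            if timestamp - breach_start >= hold_seconds:
                return timestamp
        else:
            breach_start = None  # condition cleared, start over
    return None


# Hypothetical average servlet response times (seconds), one every 15 s.
series = [(0, 0.4), (15, 1.2), (30, 1.3), (45, 0.9),
          (60, 1.5), (75, 1.6), (90, 1.7)]

fired_at = first_firing_time(series, threshold=1.0, hold_seconds=30)
print(fired_at)
```

With a hold of 30 seconds, the short spike at 15 to 30 seconds does not raise an event because the value drops back below the threshold at 45 seconds; only the sustained breach that starts at 60 seconds does.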

 

Secondary Events

The secondary events table helps you understand the cause of problems. It lists common issues that may not affect users yet but can lead to bigger problems, along with the corresponding WAS metric conditions and thresholds to watch for. The conditions are in PromQL format.

Note: The thresholds provided for the conditions are for example purposes. The thresholds should be adjusted as appropriate for your service level objectives.

| Name | Condition | Duration | Description |
| --- | --- | --- | --- |
| Long use of connections | `avg_over_time((was_connectionpool_inUse_time_seconds_total/was_connectionpool_inUse_total)[5m:5m]) > 1` | 30 sec | Indicates connections are being used for a long time before being returned to the pool. |
| Connection pool blocking | `rate(was_connectionpool_wait_time_seconds_total[5m]) > 0` | > 0 sec | Indicates that threads are having to wait to get connections from a connection pool. |
| Available heap memory low | `avg_over_time((jvm_memory_used_bytes/jvm_memory_max_bytes)[5m:5m]) > 0.8` | 30 sec | Indicates the JVM has been low on available memory for an extended period of time. May indicate a memory leak or insufficient maximum heap size. |
| Garbage collection activity is high | `rate(jvm_gc_duration_seconds_total[5m]) > 0.1` | 30 sec | Indicates the portion of time spent doing garbage collection is unusually high. |
| CPU utilization is high | `avg_over_time(process_cpu_utilization[10m]) > 0.9` | 10 mins | Indicates the CPU consumed by the process has been high for an extended period of time. |
| Authentications are slow | `avg_over_time((was_authentication_authentication_time_seconds_total/was_authentication_authentications_total)[5m:5m]) > 1` | 30 sec | Indicates authentications are taking longer than expected. |
| Authorizations are slow | `avg_over_time((was_authorization_authorization_time_seconds_total/was_authorization_authorizations_total)[5m:5m]) > 1` | 30 sec | Indicates authorizations are taking longer than expected. |
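Two of the conditions above reduce to simple arithmetic: heap utilization is used bytes divided by maximum bytes, and the share of time spent in garbage collection is the growth of the cumulative GC time counter divided by the length of the window. The Python sketch below works through both with made-up numbers. The function names and the sample values are hypothetical, and the 0.8 and 0.1 thresholds are the example thresholds from the table.

```python
def heap_utilization(used_bytes, max_bytes):
    """Fraction of the heap in use, or None if the maximum is unknown."""
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes


def gc_time_fraction(gc_seconds_start, gc_seconds_end, window_seconds):
    """Share of wall-clock time spent in GC between two scrapes of a
    cumulative GC time counter, or None if the counter reset."""
    delta = gc_seconds_end - gc_seconds_start
    if window_seconds <= 0 or delta < 0:
        return None
    return delta / window_seconds


# Hypothetical readings of jvm_memory_used_bytes and jvm_memory_max_bytes.
heap = heap_utilization(used_bytes=1_700_000_000, max_bytes=2_000_000_000)

# Hypothetical readings of jvm_gc_duration_seconds_total, 5 minutes apart.
gc_share = gc_time_fraction(gc_seconds_start=100.0, gc_seconds_end=145.0,
                            window_seconds=300)

print(f"heap utilization: {heap:.2f}, over 0.8: {heap > 0.8}")
print(f"GC time share: {gc_share:.2f}, over 0.1: {gc_share > 0.1}")
```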

 

Let us know in the comments if you have other metrics you think are super important to monitor!