What are the most common metrics to monitor from your WAS servers?

By PRASHANTH GUNAPALASINGAM posted Fri January 13, 2023 01:57 PM

By: Don Bourne and Prashanth Gunapalasingam

See also: What are the most common metrics to monitor from your Liberty servers?

Recently, we were asked which metrics ops teams should pay attention to from application servers in WebSphere Application Server (WAS). That depends on what your applications do: for example, WAS has metrics for the SIBus and the dynamic object cache that are relevant only if you use those capabilities. It also depends on your monitoring objectives. For example, your goal could be to ensure that your web applications meet their response time objectives.

This blog focuses on the metrics that are relevant to most web applications, and on the conditions to watch for, with the general goal of ensuring that the users of your apps are not aware of problems. Users are likely to notice when they are affected by slow response times, hangs, or errors. We'll call the key problems that directly affect your users "primary" events.

There are also problems that you can detect from metrics data that indicate something isn't right, but that may not be impacting your users. For example, your Java heap utilization may be high for an extended period of time. These problems may not necessarily mean your users are aware of an issue, but you may want to know about them before they cause bigger problems. We'll call these problems "secondary" events.

The table below lists the metrics that are relevant to watch for most web applications. These metrics can be monitored using JMX clients, which rely on MBeans exposed by the WAS performance monitoring infrastructure (PMI), or with monitoring tools that connect to a `/metrics` endpoint to gather data in Prometheus text-based exposition format. To expose a `/metrics` endpoint, deploy the metrics.ear application, as stated here. The metrics.ear application gets its data from PMI, so if you choose to deploy it you will also need to enable the specific PMI metrics (as stated here) that correspond to the metrics you want to appear in the `/metrics` output.
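To illustrate what a monitoring tool sees when it reads the `/metrics` endpoint, here is a minimal Python sketch that parses sample lines in the Prometheus text-based exposition format. The sample text, the `servlet` label, and the helper name `parse_samples` are made up for illustration. The labels that metrics.ear actually emits may differ, and the sketch does not handle escaping inside label values.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+\d+)?$'
)

# Hypothetical scrape output; the label name "servlet" is an assumption.
SAMPLE_TEXT = """\
# HELP was_servlet_requests_total The total number of requests that a servlet processed.
# TYPE was_servlet_requests_total counter
was_servlet_requests_total{servlet="demo"} 42.0
jvm_memory_used_bytes 1048576
"""


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore anything that does not look like a sample.
        name = match.group(1)
        labels = match.group(2) or ""
        samples.append((name, labels, float(match.group(3))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE_TEXT):
        print(name, labels, value)
```

In practice a Prometheus server or a similar tool does this parsing for you; the sketch is only meant to show the shape of the data.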

WebSphere Application Server Metrics

Note: These are the most common metrics to monitor. However, depending on the programming models used in your applications, you may want to track other metrics as well. For a complete list of metrics, see here.

Category	Prometheus Metric Name	PMI ModuleName	PMI Counter Name & ID	Description
Connection Pool	was_connectionpool_inUse_time_seconds_total	connectionPoolModule	UseTime (ID=12)	The total time (in seconds) that a connection was used. The total time is the difference between the time at which the connection is allocated and the time at which it is returned. This value includes the JDBC operation time.
Connection Pool	was_connectionpool_wait_time_seconds_total	connectionPoolModule	WaitTime (ID=13)	The total wait time (in seconds) until a connection was granted.
EJB	was_ejb_response_time_seconds_total	beanModule	MethodResponseTime (ID=12)	The total response time in seconds on the remote methods of the bean.
EJB	was_ejb_responses_total	beanModule	MethodResponseTime (ID=12)	The number of times that the remote methods of the bean were processed.
JVM	jvm_memory_used_bytes	jvmRuntimeModule	UsedMemory (ID=3)	The used memory (in bytes) in the Java virtual machine runtime.
JVM	jvm_memory_max_bytes	jvmRuntimeModule	HeapSize (ID=1)	The total memory (in bytes) in the Java virtual machine runtime.
JVM	jvm_gc_duration_seconds_total	jvmRuntimeModule	GCTime (ID=13)	The total time (in seconds) spent in garbage collection.
JVM	process_cpu_utilization	jvmRuntimeModule	ProcessCpuUsage (ID=5)	The CPU Usage (in percent) of the Java virtual machine.
Security	was_authentication_authentications_total	SecurityAuthenticationStats	WebAuthenticationCount (ID=1)	The total number of authentications processed by type.
Security	was_authentication_authentication_time_seconds_total	SecurityAuthenticationStats	WebAuthenticationTime (ID=10)	The total response time (in seconds) for the specified authentication type.
Security	was_authorization_authorizations_total	SecurityAuthorizationStats	WebAuthorizationTime (ID=1)	The total number of authorizations performed for a given authorization type.
Security	was_authorization_authorization_time_seconds_total	SecurityAuthorizationStats	EJBAuthorizationTime (ID=2)	The total response time (in seconds) for a given authorization type.
Thread Pool	was_threadpool_declaredThreadHungs_total	threadPoolModule	DeclaredThreadHungCount (ID=6)	The number of threads that were declared hung.
Transaction Manager	was_transactionmanager_rolledback_total	transactionModule	RolledbackCount (ID=16)	The number of transactions that were rolled back.
Transaction Manager	was_transactionmanager_timedout_total	transactionModule	GlobalTimeoutCount (ID=18)	The number of transactions that timed out.
Web Application (Servlet)	was_servlet_response_time_seconds_total	webAppModule.servlets	ServiceTime (ID=13)	The total response time (in seconds) to process servlet requests.
Web Application (Servlet)	was_servlet_requests_total	webAppModule.servlets	RequestCount (ID=11)	The total number of requests that a servlet processed.
Web Service	was_jaxws_response_time_seconds_total	webServicesModule.services	ResponseTime (ID=14)	The response time, in seconds, between the receipt of a request and the return of the reply.
Web Service	was_jaxws_requests_total	webServicesModule.services	RequestReceivedEndpoint (ID=30)	The total number of requests.
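Several of the metrics above come in pairs: a cumulative time counter and a cumulative count (for example, `was_servlet_response_time_seconds_total` and `was_servlet_requests_total`). Dividing one by the other gives an average since the counters started, while dividing the changes between two scrapes gives the average for just that interval. The following Python sketch shows both calculations. The function names and the sample numbers are hypothetical, and the sketch ignores counter resets.

```python
def lifetime_average(total_time_s, total_count):
    """Average seconds per operation since the counters started."""
    if total_count == 0:
        return None  # Nothing processed yet, so no average exists.
    return total_time_s / total_count


def interval_average(prev, curr):
    """Average seconds per operation between two scrapes.

    prev and curr are (total_time_seconds, total_count) tuples taken
    from an earlier and a later scrape of the same metric pair.
    """
    delta_time = curr[0] - prev[0]
    delta_count = curr[1] - prev[1]
    if delta_count <= 0:
        return None  # No new operations (or a counter reset) in the interval.
    return delta_time / delta_count


if __name__ == "__main__":
    earlier = (50.0, 100)   # 50 s spent on 100 requests so far
    later = (80.0, 110)     # 30 s spent on the 10 most recent requests
    print(lifetime_average(*later))
    print(interval_average(earlier, later))
```

With these made-up numbers the lifetime average stays well under one second even though the most recent requests averaged three seconds each, which is why the interval-based view can surface a recent slowdown sooner.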

Primary Events

The primary events table will help you understand if there is a problem likely to affect the user of the application. It consists of common issues that your users will see, and the corresponding WAS metrics conditions with thresholds to watch out for. The conditions are in PromQL format.

Note: The thresholds provided for the conditions are for example purposes. The thresholds should be adjusted as appropriate for your service level objectives.

Name	Conditions	Duration	Description
Threads hanging	rate(was_threadpool_declaredThreadHungs_total[5m]) > 0	> 0 sec	Indicates that threads in the application server are hanging.
Transactions rolling back	rate(was_transactionmanager_rolledback_total[5m]) > 0	> 0 sec	Indicates transactions are being rolled back.
Transactions timing out	rate(was_transactionmanager_timedout_total[5m]) > 0	> 0 sec	Indicates transactions are timing out.
Servlets responding slowly	avg_over_time((was_servlet_response_time_seconds_total/was_servlet_requests_total)[5m:5m]) > 1	30 sec	Indicates servlets are responding slowly.
JAX-WS web services responding slowly	avg_over_time((was_jaxws_response_time_seconds_total/was_jaxws_requests_total)[5m:5m]) > 1	30 sec	Indicates JAX-WS web services are responding slowly.
EJBs responding slowly	avg_over_time((was_ejb_response_time_seconds_total/was_ejb_responses_total)[5m:5m]) > 1	30 sec	Indicates EJBs are responding slowly.
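The `rate(...[5m]) > 0` conditions in the table above fire whenever a counter has increased within the window. As a rough illustration of the idea, the Python sketch below computes a per-second increase from two counter samples and treats a drop in value as a counter reset (for example, after a server restart). It is a simplification and not the PromQL `rate` implementation, which works over all samples in the range and extrapolates. The function name and the numbers are hypothetical.

```python
def per_second_increase(prev_value, prev_ts, curr_value, curr_ts):
    """Approximate per-second increase of a counter between two samples.

    Timestamps are in seconds. A lower current value is treated as a
    counter reset, so the increase is counted from zero.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in ascending time order")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value  # Counter restarted from zero.
    return delta / elapsed


if __name__ == "__main__":
    # Hypothetical samples of was_threadpool_declaredThreadHungs_total,
    # taken five minutes (300 seconds) apart.
    increase = per_second_increase(10, 0, 13, 300)
    if increase > 0:
        print("threads were declared hung during the window")
```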

Secondary Events

The secondary events table will help you understand the cause of problems. It lists common issues that might lead to bigger problems, along with the corresponding WAS metrics conditions and thresholds to watch for. The conditions are in PromQL format.

Note: The thresholds provided for the conditions are for example purposes. The thresholds should be adjusted as appropriate for your service level objectives.

Name	Conditions	Duration	Description
Long use of connections	avg_over_time((was_connectionpool_inUse_time_seconds_total/was_connectionpool_inUse_total)[5m:5m]) > 1	30 sec	Indicates connections are being used for a long time before being returned to the pool.
Connection pool blocking	rate(was_connectionpool_wait_time_seconds_total[5m]) > 0	> 0 sec	Indicates that threads are having to wait to get connections from a connection pool.
Available heap memory low	avg_over_time((jvm_memory_used_bytes/jvm_memory_max_bytes)[5m:5m]) > 0.8	30 sec	Indicates the JVM has been low on available memory for an extended period of time. May indicate a memory leak or insufficient maximum heap size.
Garbage collection activity is high	rate(jvm_gc_duration_seconds_total[5m]) > 0.1	30 sec	Indicates the portion of time spent doing garbage collection is unusually high.
CPU utilization is high	avg_over_time(process_cpu_utilization[10m]) > 0.9	10 mins	Indicates the CPU consumed by the process has been high for an extended period of time.
Authentications are slow	avg_over_time((was_authentication_authentication_time_seconds_total/was_authentication_authentications_total)[5m:5m]) > 1	30 sec	Indicates authentications are taking longer than expected.
Authorizations are slow	avg_over_time((was_authorization_authorization_time_seconds_total/was_authorization_authorizations_total)[5m:5m]) > 1	30 sec	Indicates authorizations are taking longer than expected.
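The Duration column means that a condition should hold continuously for that long before it is treated as an event. The Python sketch below illustrates that idea for the heap condition (`jvm_memory_used_bytes / jvm_memory_max_bytes > 0.8` for 30 seconds). The function name, the 15-second scrape spacing, and the sample values are assumptions made for illustration; monitoring tools typically provide this behavior themselves.

```python
def sustained_breach(samples, threshold, duration_s):
    """Return True if the value has exceeded the threshold continuously
    for at least duration_s seconds, up to and including the last sample.

    samples is a list of (timestamp_seconds, value) in ascending time order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # Start of the current breach.
        else:
            breach_start = None  # Condition cleared, so start over.
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s


if __name__ == "__main__":
    # Hypothetical heap utilization ratios (used / max), one every 15 seconds.
    heap_ratio = [(0, 0.70), (15, 0.85), (30, 0.90), (45, 0.88)]
    print(sustained_breach(heap_ratio, threshold=0.8, duration_s=30))
```

Requiring a sustained breach avoids raising an event for a single brief spike, such as heap usage just before a garbage collection.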

Let us know in the comments if you have other metrics you think are super important to monitor!

