WebSphere Application Server & Liberty

What are the most common metrics to monitor from your Liberty servers?

By PRASHANTH GUNAPALASINGAM posted Wed January 18, 2023 10:46 AM

By: Don Bourne and Prashanth Gunapalasingam

See also: What are the most common metrics to monitor from your WAS servers?

Recently, we were asked which metrics ops teams should pay attention to from Liberty servers. Clearly that depends on what your applications do. For example, there are metrics for MicroProfile Fault Tolerance on Liberty that are only relevant if you are using those capabilities. It also depends on what your objectives are for monitoring. Your goal could be to maximize the number of Liberty containers that you can fit in your Kubernetes cluster, or it could be to ensure you are meeting response time objectives from your web applications.

This blog focuses on the metrics that are relevant for most web applications, and on the conditions to watch for so that the users of your applications are not aware of problems. Users are likely to notice when they are affected by slow response times, hangs, or errors. We'll call these key problems, the ones that directly affect your users, "primary" events.

There are also problems that you can detect from metrics data that indicate something isn't right, but that may not be impacting your users. For example, your Java heap utilization may be high for an extended amount of time. These problems may not necessarily mean your users are aware of an issue, but you may want to know about them before they cause bigger problems. We'll call these "secondary" events.

The table below lists the metrics that are relevant to watch for most web applications. See the Liberty documentation for how to enable and configure the MicroProfile Metrics feature in your server configuration. Metrics are returned from your server’s /metrics endpoint in Prometheus format. By default, all monitoring components are enabled. If your server is collecting more metrics data than you need, you can improve server performance by collecting only those vendor metrics that you intend to use. To report only a subset of vendor metrics, specify the components that you want to monitor in the “filter” attribute of the “monitor” configuration element in your “server.xml” file. The following “filter” attribute specifies the components that must be monitored to produce the Liberty metrics listed in the table below.

<monitor filter="ConnectionPool,RequestTiming,REST,WebContainer"/>

Liberty Metrics
Note: These are the most common metrics to monitor. Depending on the programming models used in your applications, you may want to track other metrics as well. The Liberty documentation lists all the available Liberty metrics.

Category	MicroProfile Metrics 3.x-4.x Prometheus Metric Name	Description
Connection Pool	vendor_connectionpool_queuedRequests_total	The total number of connection requests that waited for a connection because of a full connection pool since the start of the server. This metric is a counter.
Connection Pool	vendor_connectionpool_inUseTime_total_seconds	The total time that all connections are in-use since the start of the server. This metric is a gauge.
Connection Pool	vendor_connectionpool_usedConnections_total	The total number of connection requests that waited because of a full connection pool or did not wait since the start of the server. Any connections that are currently in use are not included in this total. This metric is a counter.
JVM	base_cpu_processCpuTime_seconds	The CPU time for the JVM process. This metric is a gauge.
JVM	base_cpu_availableProcessors	The number of processors available to the JVM. This metric is a gauge.
JVM	base_memory_usedHeap_bytes	The amount of used heap memory. This metric is a gauge.
JVM	base_memory_maxHeap_bytes	The maximum amount of heap memory that can be used for memory management. This metric displays -1 if the maximum heap memory size is undefined. This amount of memory is not guaranteed to be available for memory management if it is greater than the amount of committed memory. This metric is a gauge.
JVM	base_gc_time_seconds	The approximate accumulated garbage collection elapsed time. This metric displays -1 if the garbage collection elapsed time is undefined for this collector. This metric is a gauge.
Request Timing	vendor_requestTiming_hungRequestCount	The number of servlet requests that are currently running but are hung. This metric is a gauge.
Restful Web Services	base_REST_request_elapsedTime_seconds	The total response time of this RESTful resource method since the server started. The metric doesn’t record the elapsed time if an unmapped exception occurs.
Restful Web Services	base_REST_request_total	The number of invocations of this RESTful resource method since the server started. The metric doesn’t record the count of invocations if an unmapped exception occurs.
Restful Web Services	base_REST_request_unmappedException_total	The total number of unmapped exceptions that occur from this RESTful resource method since the server started. This metric is a counter.
Web Container	vendor_servlet_responseTime_total_seconds	The total of the servlet response time since the start of the server. This metric is a gauge.
Web Container	vendor_servlet_request_total	The total number of visits to this servlet since the start of the server. This metric is a counter.
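
As a rough illustration of how the totals in the table combine into something meaningful, here is a minimal Python sketch that parses a small piece of Prometheus-format text and derives an average servlet response time and a heap utilization ratio. The sample text, its values, and the servlet label value are made up for illustration, and the parser is deliberately simplified (it does not handle label values that contain a closing brace). In practice you would let Prometheus scrape the /metrics endpoint rather than parse it yourself.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of /metrics output. The numbers and the servlet
# label value are invented for this example.
SAMPLE = """\
# TYPE vendor_servlet_request_total counter
vendor_servlet_request_total{servlet="demo_app_servlet"} 200
# TYPE vendor_servlet_responseTime_total_seconds gauge
vendor_servlet_responseTime_total_seconds{servlet="demo_app_servlet"} 50.0
# TYPE base_memory_usedHeap_bytes gauge
base_memory_usedHeap_bytes 400000000
# TYPE base_memory_maxHeap_bytes gauge
base_memory_maxHeap_bytes 1000000000
"""

# metric name, optional {labels} block, then the sample value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {metric_name: {label_string: value}} for each sample line."""
    metrics = defaultdict(dict)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            metrics[name][labels or ""] = float(value)
    return metrics


metrics = parse_metrics(SAMPLE)
labels = '{servlet="demo_app_servlet"}'

# Average response time since server start = total time / total requests
avg_response = (metrics["vendor_servlet_responseTime_total_seconds"][labels]
                / metrics["vendor_servlet_request_total"][labels])

# Heap utilization = used heap / maximum heap
heap_ratio = (metrics["base_memory_usedHeap_bytes"][""]
              / metrics["base_memory_maxHeap_bytes"][""])

print(f"average servlet response time: {avg_response:.3f} s")
print(f"heap utilization: {heap_ratio:.0%}")
```

Note that both ratios are computed per label set, which mirrors how the PromQL conditions later in this post divide one metric by another.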

Primary Events
The primary events table will help you understand if there is a problem likely to affect the users of the application. It consists of common issues that your users will see, and the corresponding Liberty metrics conditions with thresholds to watch out for. The conditions are in PromQL format.

Note: The thresholds provided for the conditions are for example purposes. These thresholds should be adjusted as appropriate for your service level objectives.

Name	Conditions	Duration	Description
Servlets responding slowly	avg_over_time((vendor_servlet_responseTime_total_seconds/vendor_servlet_request_total)[5m:5m]) > 1	30 sec	Indicates servlets are responding slowly.
Servlet requests hanging	avg_over_time((vendor_requestTiming_hungRequestCount[5m])) > 0	30 sec	Indicates servlets are hanging.
Restful WS responding slowly	avg_over_time((base_REST_request_elapsedTime_seconds/base_REST_request_total)[5m:5m]) > 1	30 sec	Indicates restful web services are responding slowly.
Restful WS exceptions	rate(base_REST_request_unmappedException_total[5m]) > 0	> 0 sec	Indicates restful web services are throwing unmapped exceptions.
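
To make the Conditions and Duration columns more concrete, here is a minimal Python sketch of the two ideas involved: a per-second increase of a counter (a simplified stand-in for PromQL's rate(), without counter-reset handling or extrapolation) and a check that a condition has held continuously for a given duration. It assumes the Duration column means the condition must stay true for that long before an alert fires. The function names and sample values are hypothetical.

```python
def per_second_increase(samples):
    """Simplified stand-in for PromQL rate(): (last - first) / elapsed seconds.

    samples is a list of (timestamp_seconds, counter_value) pairs in time
    order. Unlike rate(), counter resets are not handled.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def condition_held(samples, predicate, duration_seconds):
    """True if predicate(value) has been continuously true, up to and
    including the latest sample, for at least duration_seconds."""
    held_since = None
    for timestamp, value in samples:
        if predicate(value):
            if held_since is None:
                held_since = timestamp  # start of the current true streak
        else:
            held_since = None  # streak broken, start over
    if held_since is None:
        return False
    return samples[-1][0] - held_since >= duration_seconds


# Hypothetical scrapes of base_REST_request_unmappedException_total
exception_samples = [(0, 3), (60, 3), (120, 9)]
exception_rate = per_second_increase(exception_samples)

# Hypothetical average servlet response times (seconds), one per scrape
response_samples = [(0, 0.8), (15, 1.2), (30, 1.3), (45, 1.4)]
slow = condition_held(response_samples, lambda v: v > 1, 30)

print(f"unmapped exceptions per second: {exception_rate:.3f}")
print(f"servlets responding slowly: {slow}")
```

In a real deployment this logic would normally live in your monitoring system's alerting rules rather than in application code.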

Secondary Events
The secondary events table will help you understand the cause of the problems. It consists of common issues that might lead to bigger problems, and the corresponding Liberty metrics conditions with thresholds to watch out for. The conditions are in PromQL format.

Note: The thresholds provided for the conditions are for example purposes. These thresholds should be adjusted as appropriate for your service level objectives.

Name	Conditions	Duration	Description
CPU utilization is high	avg_over_time((rate(base_cpu_processCpuTime_seconds[5m])/base_cpu_availableProcessors)[5m:5m]) > 0.9	30 sec	Indicates the CPU consumed by the process has been high for an extended period of time.
Available heap memory low	avg_over_time((base_memory_usedHeap_bytes / base_memory_maxHeap_bytes)[5m:5m]) > 0.8	30 sec	Indicates the JVM has been low on available memory for an extended period of time. May indicate a memory leak or insufficient maximum heap size.
Garbage collection activity is high	rate(base_gc_time_seconds[5m]) > 0.1	30 sec	Indicates the portion of time spent doing garbage collection is unusually high.
Connection pool blocking	rate(vendor_connectionpool_queuedRequests_total[5m]) > 0	> 0 sec	Indicates that threads are having to wait to get connections from a connection pool.
Long use of connections	avg_over_time((vendor_connectionpool_inUseTime_total_seconds/vendor_connectionpool_usedConnections_total)[5m:5m]) > 1	30 sec	Indicates connections are being used for a long time before being returned to the pool.
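
The arithmetic behind the CPU and heap conditions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Prometheus evaluates the expressions: the function names and sample values are hypothetical, and the CPU calculation uses just two scrapes instead of a 5-minute window. The heap helper also covers the case noted in the metrics table where base_memory_maxHeap_bytes reports -1 because the maximum is undefined, in which case the ratio is not meaningful.

```python
def cpu_utilization(cpu_seconds_start, cpu_seconds_end, elapsed_seconds, processors):
    """Fraction of all available CPU consumed by the JVM between two scrapes.

    CPU time is cumulative, so the difference divided by wall-clock time
    gives CPUs in use; dividing by the processor count normalizes it.
    """
    if elapsed_seconds <= 0 or processors <= 0:
        raise ValueError("elapsed_seconds and processors must be positive")
    return (cpu_seconds_end - cpu_seconds_start) / elapsed_seconds / processors


def heap_utilization(used_bytes, max_bytes):
    """Used/max heap ratio, or None when the maximum is undefined (-1)."""
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes


# Hypothetical values: two scrapes of base_cpu_processCpuTime_seconds taken
# 60 seconds apart on a JVM with 2 available processors.
cpu = cpu_utilization(100.0, 220.0, 60, 2)
heap = heap_utilization(850_000_000, 1_000_000_000)
undefined_heap = heap_utilization(850_000_000, -1)

# Example thresholds from the table: 0.9 for CPU, 0.8 for heap
print(f"CPU utilization high: {cpu > 0.9}")
print(f"available heap low: {heap is not None and heap > 0.8}")
print(f"heap ratio when max is undefined: {undefined_heap}")
```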

Let us know in the comments if you have other metrics you think are super important to monitor!

#OpenLiberty
#WebSphereLiberty #MicroProfile

Permalink

https://community.ibm.com/community/user/blogs/prashanth-gunapalasingam1/2023/01/13/what-are-the-common-metrics-to-monitor-in-your-lib