What are the most common metrics to monitor from your Liberty servers?

By PRASHANTH GUNAPALASINGAM posted Wed January 18, 2023 10:46 AM

  
By: Don Bourne and Prashanth Gunapalasingam

See also: What are the most common metrics to monitor from your WAS servers?

Recently, we were asked which metrics ops teams should pay attention to on Liberty servers. That depends on what your applications do: for example, Liberty provides metrics for MicroProfile Fault Tolerance that are relevant only if you use those capabilities. It also depends on your monitoring objectives. Your goal could be to maximize the number of Liberty containers that fit in your Kubernetes cluster, or it could be to ensure that your web applications meet their response time objectives.

This blog focuses on the metrics that are relevant for most web applications, and on the conditions to watch for, with the general goal of ensuring that problems do not reach the users of your applications. Users are most likely to notice slow response times, hangs, and errors. We'll call these key problems, the ones that directly affect your users, "primary" events.

There are also problems that you can detect from metrics data that indicate something isn't right, but that may not be impacting your users. For example, your Java heap utilization may be high for an extended amount of time. These problems may not necessarily mean your users are aware of an issue, but you may want to know about them before they cause bigger problems. We'll call these "secondary" events.

The table below lists the metrics that are relevant to watch for most web applications. To collect them, enable and configure the MicroProfile Metrics feature in your server configuration. Metrics are returned from your server’s /metrics endpoint in Prometheus format.

By default, all monitoring components are enabled. If your server collects more metrics data than you need, you can improve server performance by collecting only the vendor metrics that you intend to use. To report only a subset of vendor metrics, list the components that you want to monitor in the “filter” attribute of the “monitor” configuration element in your “server.xml” file. The following “filter” attribute enables the components needed for the Liberty metrics listed in the table below.

<monitor filter="ConnectionPool,RequestTiming,REST,WebContainer"/>
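Because the /metrics endpoint returns plain text in Prometheus format, it is easy to inspect outside of Prometheus as well. The following is a minimal sketch of reading such a response; the sample text, the servlet label value, and the numbers are invented for illustration, and the simplified parser assumes label values contain no closing braces.

```python
import re

# Hypothetical excerpt of a /metrics response. The label value and the
# numbers are made up for illustration and will differ on a real server.
SAMPLE = """\
# TYPE vendor_servlet_request_total counter
vendor_servlet_request_total{servlet="demo_app_OrderServlet"} 120
# TYPE vendor_servlet_responseTime_total_seconds gauge
vendor_servlet_responseTime_total_seconds{servlet="demo_app_OrderServlet"} 30.0
# TYPE base_memory_usedHeap_bytes gauge
base_memory_usedHeap_bytes 52428800
"""

# metric name, optional {labels} block, then the sample value
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {(name, labels): value}, skipping comment and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name = match.group(1)
            labels = match.group(2) or ""
            samples[(name, labels)] = float(match.group(3))
    return samples


metrics = parse_metrics(SAMPLE)
for (name, labels), value in sorted(metrics.items()):
    print(name, labels, value)
```

In practice Prometheus does this scraping and parsing for you; the sketch is only meant to show what the raw data looks like.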

Liberty Metrics
Note: These are the most common metrics to monitor. Depending on the programming models used in your applications, you may want to track other metrics as well. The Liberty documentation lists all available metrics.

| Category | Prometheus Metric Name (MicroProfile Metrics 3.x-4.x) | Description |
| --- | --- | --- |
| Connection Pool | vendor_connectionpool_queuedRequests_total | The total number of connection requests that waited for a connection because of a full connection pool since the start of the server. This metric is a counter. |
| Connection Pool | vendor_connectionpool_inUseTime_total_seconds | The total time that all connections are in-use since the start of the server. This metric is a gauge. |
| Connection Pool | vendor_connectionpool_usedConnections_total | The total number of connection requests that waited because of a full connection pool or did not wait since the start of the server. Any connections that are currently in use are not included in this total. This metric is a counter. |
| JVM | base_cpu_processCpuTime_seconds | The CPU time for the JVM process. This metric is a gauge. |
| JVM | base_cpu_availableProcessors | The number of processors available to the JVM. This metric is a gauge. |
| JVM | base_memory_usedHeap_bytes | The amount of used heap memory. This metric is a gauge. |
| JVM | base_memory_maxHeap_bytes | The maximum amount of heap memory that can be used for memory management. This metric displays -1 if the maximum heap memory size is undefined. This amount of memory is not guaranteed to be available for memory management if it is greater than the amount of committed memory. This metric is a gauge. |
| JVM | base_gc_time_seconds | The approximate accumulated garbage collection elapsed time. This metric displays -1 if the garbage collection elapsed time is undefined for this collector. This metric is a gauge. |
| Request Timing | vendor_requestTiming_hungRequestCount | The number of servlet requests that are currently running but are hung. This metric is a gauge. |
| RESTful Web Services | base_REST_request_elapsedTime_seconds | The total response time of this RESTful resource method since the server started. The metric doesn’t record the elapsed time if an unmapped exception occurs. |
| RESTful Web Services | base_REST_request_total | The number of invocations of this RESTful resource method since the server started. The metric doesn’t record the count of invocations if an unmapped exception occurs. |
| RESTful Web Services | base_REST_request_unmappedException_total | The total number of unmapped exceptions that occur from this RESTful resource method since the server started. This metric is a counter. |
| Web Container | vendor_servlet_responseTime_total_seconds | The total of the servlet response time since the start of the server. This metric is a gauge. |
| Web Container | vendor_servlet_request_total | The total number of visits to this servlet since the start of the server. This metric is a counter. |
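Some of these metrics are only meaningful in combination. As a small worked example, heap utilization is the ratio of base_memory_usedHeap_bytes to base_memory_maxHeap_bytes, and the -1 value that the maximum can report needs special handling. The function name and the sample values below are hypothetical; this is a sketch, not part of Liberty.

```python
def heap_utilization(used_heap_bytes, max_heap_bytes):
    """Fraction of the maximum heap in use, or None if the maximum is undefined.

    base_memory_maxHeap_bytes reports -1 when the maximum heap size is
    undefined, so a plain division would yield a meaningless negative ratio.
    """
    if max_heap_bytes <= 0:
        return None
    return used_heap_bytes / max_heap_bytes


MIB = 1024 ** 2
print(heap_utilization(400 * MIB, 512 * MIB))  # 0.78125
print(heap_utilization(400 * MIB, -1))         # None
```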


Primary Events
The primary events table will help you understand whether a problem is likely to affect the users of the application. It lists common issues that your users will see, along with the corresponding Liberty metric conditions and thresholds to watch for. The conditions are written in PromQL.

Note: The thresholds provided for the conditions are for example purposes. These thresholds should be adjusted as appropriate for your service level objectives.

| Name | Conditions | Duration | Description |
| --- | --- | --- | --- |
| Servlets responding slowly | avg_over_time((vendor_servlet_responseTime_total_seconds/vendor_servlet_request_total)[5m:5m]) > 1 | 30 sec | Indicates servlets are responding slowly. |
| Servlet requests hanging | avg_over_time((vendor_requestTiming_hungRequestCount[5m])) > 0 | 30 sec | Indicates servlets are hanging. |
| RESTful WS responding slowly | avg_over_time((base_REST_request_elapsedTime_seconds/base_REST_request_total)[5m:5m]) > 1 | 30 sec | Indicates RESTful web services are responding slowly. |
| RESTful WS exceptions | rate(base_REST_request_unmappedException_total[5m]) > 0 | > 0 sec | Indicates RESTful web services are throwing unmapped exceptions. |
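The response time conditions divide a total time by a total count, and both totals accumulate from server start. The sketch below works through that arithmetic with invented numbers, and also shows a windowed variant based on the difference between two scrapes, which is roughly what dividing rate() of the two metrics would give in PromQL. The function names and sample values are hypothetical.

```python
def lifetime_average(total_seconds, request_count):
    """Average response time since server start (total time / total requests)."""
    if request_count == 0:
        return None
    return total_seconds / request_count


def window_average(earlier, later):
    """Average response time for requests completed between two scrapes.

    Each argument is a (total_seconds, request_count) pair. Returns None when
    no new requests were counted, or when the totals went backwards because
    the server restarted.
    """
    delta_seconds = later[0] - earlier[0]
    delta_count = later[1] - earlier[1]
    if delta_count <= 0 or delta_seconds < 0:
        return None
    return delta_seconds / delta_count


# Invented scrapes taken five minutes apart: 1000 fast requests so far,
# then 20 more requests that took 60 seconds in total.
earlier = (100.0, 1000)
later = (160.0, 1020)

print(lifetime_average(*later))        # roughly 0.157 seconds, under a 1 second threshold
print(window_average(earlier, later))  # 3.0 seconds for the most recent requests
```

On a long-running server the lifetime average moves slowly, so it may be worth comparing both views when tuning thresholds.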


Secondary Events
The secondary events table will help you understand the cause of problems. It lists common issues that might lead to bigger problems, along with the corresponding Liberty metric conditions and thresholds to watch for. The conditions are written in PromQL.

Note: The thresholds provided for the conditions are for example purposes. These thresholds should be adjusted as appropriate for your service level objectives.

| Name | Conditions | Duration | Description |
| --- | --- | --- | --- |
| CPU utilization is high | avg_over_time((rate(base_cpu_processCpuTime_seconds[5m])/base_cpu_availableProcessors)[5m:5m]) > 0.9 | 30 sec | Indicates the CPU consumed by the process has been high for an extended period of time. |
| Available heap memory low | avg_over_time((base_memory_usedHeap_bytes / base_memory_maxHeap_bytes)[5m:5m]) > 0.8 | 30 sec | Indicates the JVM has been low on available memory for an extended period of time. May indicate a memory leak or insufficient maximum heap size. |
| Garbage collection activity is high | rate(base_gc_time_seconds[5m]) > 0.1 | 30 sec | Indicates the portion of time spent doing garbage collection is unusually high. |
| Connection pool blocking | rate(vendor_connectionpool_queuedRequests_total[5m]) > 0 | > 0 sec | Indicates that threads are having to wait to get connections from a connection pool. |
| Long use of connections | avg_over_time((vendor_connectionpool_inUseTime_total_seconds/vendor_connectionpool_usedConnections_total)[5m:5m]) > 1 | 30 sec | Indicates connections are being used for a long time before being returned to the pool. |
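The CPU condition turns an accumulated CPU time into a utilization figure: the growth in base_cpu_processCpuTime_seconds over an interval, divided by the interval length and by base_cpu_availableProcessors. Here is a minimal sketch of that calculation with invented sample values; the function name is hypothetical.

```python
def cpu_utilization(prev_cpu_seconds, curr_cpu_seconds, interval_seconds, processors):
    """Fraction of the available CPU capacity used by the JVM over an interval.

    Mirrors rate(base_cpu_processCpuTime_seconds[...]) divided by
    base_cpu_availableProcessors: CPU seconds consumed per wall-clock second,
    spread across the available processors.
    """
    if interval_seconds <= 0 or processors <= 0:
        raise ValueError("interval_seconds and processors must be positive")
    return (curr_cpu_seconds - prev_cpu_seconds) / interval_seconds / processors


# Invented samples: 570 CPU seconds consumed in 300 seconds on 2 processors.
utilization = cpu_utilization(1200.0, 1770.0, 300.0, 2)
print(utilization > 0.9)  # True: about 0.95, above the example threshold
```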


Let us know in the comments if you have other metrics you think are super important to monitor!
