WebSphere Intelligent Management provides a Health Management feature that takes a policy-driven approach to monitoring and responding to the status of application servers. A Health Controller monitors user-defined health policies and can trigger recovery and diagnostic actions when a policy is breached, preventing service disruptions by reacting to indications of failing server health.
Using Health Management is fairly straightforward, and starting with WebSphere 8.5, Health Management is enabled by default. Health policies, which are made up of monitoring and response definitions, can be applied without any changes to existing topology or disruption of service. Health Management can augment your current procedures for problem determination and recovery; there is no requirement to replace them.
This blog post will cover key concepts in understanding and configuring Health Management in WebSphere; namely, configuring the Health Controller and understanding Health Policies, Health Actions, and policy scope.
The Health Controller is the "brain" of health management in WebSphere, controlling health monitoring and management.
The controller (and resultant monitoring) can be enabled or disabled via the console or through scripting via wsadmin. From the console, navigate to Operational Policies -> Autonomic Managers -> Health Controller (as shown in the following image; this pane is also where control cycle length and restart parameters are set).
The health controller runs on a control cycle; a control cycle length defines the amount of time between environment checks initiated by the health controller. At the end of the control cycle, the health controller evaluates health policies and generates runtime tasks to resolve any breaches in policy. The cycle time is configurable via the console (Operational Policies -> Autonomic Managers -> Health Controller) and via wsadmin. To configure the controller with wsadmin, refer to the following Knowledge Center link:
The health controller is a singleton service, managed by the high availability (HA) manager, that runs within the deployment manager or a node agent process in a cell. If for any reason the health controller fails, the HA manager starts the health controller on another running node agent or deployment manager. The health controller location can be found from the administrative console by navigating to Runtime Operations -> Component Stability -> Core Components view; alternatively, you can use wsadmin with the checkHmmLocation.jacl script (located in install_root/bin/) that ships with the product.
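To make the control cycle described above more concrete, here is a minimal Python sketch of a single cycle: every policy is evaluated against every monitored server, and a runtime task is generated for each action of each breached policy. This is an illustrative model only; `HealthPolicy`, `RuntimeTask`, `run_control_cycle`, and the `uptime_hours` metric are hypothetical names, not WebSphere APIs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model for illustration; these are not WebSphere classes.
@dataclass
class HealthPolicy:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # returns True when breached
    actions: List[str]                             # kept in the order listed

@dataclass
class RuntimeTask:
    policy: str
    server: str
    action: str

def run_control_cycle(policies, server_metrics):
    """Evaluate each policy once per server and return the tasks to run."""
    tasks = []
    for policy in policies:
        for server, metrics in server_metrics.items():
            if policy.condition(metrics):
                # One task per action, preserving the policy's action order.
                for action in policy.actions:
                    tasks.append(RuntimeTask(policy.name, server, action))
    return tasks

policies = [
    HealthPolicy("max-age", lambda m: m["uptime_hours"] > 168, ["restart server"]),
]
metrics = {
    "server1": {"uptime_hours": 200.0},
    "server2": {"uptime_hours": 12.0},
}
tasks = run_control_cycle(policies, metrics)
for task in tasks:
    print(task)
```

In a real cell the controller repeats this evaluation once per control cycle; the sketch shows one pass only.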
For more information on configuring the Health Controller, visit the following Knowledge Center link:
Health policies define the health conditions to be monitored and the health actions to take if these conditions are not met. Health policies can be accessed and displayed in the console by navigating to Operational Policies -> Health Policies (see the following image).
Health policies can also be managed with wsadmin; the following Knowledge Center page has more information:
The core of a health policy is the condition being monitored. There are two options when creating a health policy: using a predefined condition or creating a custom health condition. Predefined conditions include the following:
- Age-based condition – Tracks the amount of time that the server has been running. If the amount of time exceeds the defined threshold, the health actions run.
- Memory condition: excessive memory usage – Tracks the memory usage for a member. When the memory usage exceeds a percentage of the heap size for a specified time, health actions run to correct this situation.
- Memory condition: memory leak – Tracks consistent downward trends in free memory that is available to a server in the Java™ heap. When the Java heap approaches the maximum configured size, you can perform either heap dumps or server restarts.
- Garbage collection percentage – Monitors a Java virtual machine (JVM) or set of JVMs to determine whether they spend more than a defined percentage of time in garbage collection during a specified time period.
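As an illustration of how a condition such as "excessive memory usage" combines a threshold with a duration, the following Python sketch checks whether heap usage has stayed above a percentage of the heap size for a minimum time. It is a simplified model under stated assumptions, not the product's algorithm: the function name, the sample format, and the numbers are all hypothetical.

```python
def excessive_memory_breached(samples, heap_max_mb, threshold_pct, min_duration_s):
    """Return True if usage stays above the threshold for at least min_duration_s.

    samples: list of (timestamp_seconds, used_mb) tuples in time order
             (a hypothetical format chosen for this sketch).
    """
    limit_mb = heap_max_mb * threshold_pct / 100.0
    breach_start = None
    for ts, used_mb in samples:
        if used_mb > limit_mb:
            if breach_start is None:
                breach_start = ts          # first sample above the limit
            if ts - breach_start >= min_duration_s:
                return True                # sustained long enough
        else:
            breach_start = None            # usage dropped; start over
    return False

# Sustained usage above 90% of a 1024 MB heap for two minutes.
sustained = [(0, 700), (60, 950), (120, 960), (180, 970)]
# A brief dip below the limit resets the timer.
with_dip = [(0, 950), (60, 800), (120, 950), (180, 960)]

print(excessive_memory_breached(sustained, 1024, 90, 120))
print(excessive_memory_breached(with_dip, 1024, 90, 120))
```

Requiring the breach to persist for a period, rather than reacting to a single sample, keeps short spikes from triggering health actions.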
The following pre-defined conditions are also available if your topology includes an On Demand Router (ODR):
Note: If you are using the Intelligent Management enabled plugin, only the Excessive response time and Excessive request timeout conditions are available.
- Excessive response time condition – Tracks the amount of time that requests take to complete. If the time exceeds the defined response time threshold, the health actions run.
- Excessive request timeout condition – Specifies a percentage of HTTP requests that can time out. When the percentage of requests exceeds the defined value, the health actions run. The timeout value depends on your environment configuration. For more information about the excessive request timeout health condition, see the Knowledge Center topic "Excessive request timeout health policy target timeout value".
- Storm drain detection - Tracks requests that have a significantly decreased response time. This policy relies on change point detection on given time series data.
- Workload-based condition – Specifies a number of requests that are serviced before policy members restart to clean out memory and cache data.
Note: A flavor of On Demand Router is not required to use Health Management, unless you need to use the above ODR conditions or ODR-specific metrics.
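The request-based conditions above come down to simple threshold checks on request counters. The Python sketch below shows the arithmetic for the excessive request timeout and workload conditions; the function names and sample numbers are hypothetical, and the real conditions are evaluated by the health controller from ODR or plugin statistics rather than by user code.

```python
def timeout_percentage(total_requests, timed_out_requests):
    """Percentage of requests that timed out (0.0 when there was no traffic)."""
    if total_requests == 0:
        return 0.0
    return 100.0 * timed_out_requests / total_requests

def excessive_timeouts(total_requests, timed_out_requests, threshold_pct):
    """True when the timeout percentage exceeds the configured threshold."""
    return timeout_percentage(total_requests, timed_out_requests) > threshold_pct

def workload_reached(requests_served, max_requests):
    """True when a member has serviced the configured number of requests."""
    return requests_served >= max_requests

print(timeout_percentage(200, 10))
print(excessive_timeouts(200, 11, 5))
print(workload_reached(100000, 100000))
```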
Custom health conditions allow for the creation of more complex condition rules. Selecting "Custom Health Condition" in policy creation will take you to the build subexpression utility, where rule conditions can be created from subexpressions by using AND, OR, NOT and parenthetical grouping. The subexpression builder validates the rule when applied, and alerts you to mismatched parentheses and unsupported logic operators. Broadly, the statistics available for custom health conditions include:
- PMI (Performance Monitoring Infrastructure) statistics
- ODR (On Demand Router) statistics (available if the topology includes an ODR)
- MBean-based conditions/statistics
- URL return code metrics
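To show how subexpressions combine with AND, OR, NOT, and grouping, here is a small Python sketch that builds a rule from simple metric checks. It models the logic only and does not use the expression syntax of the subexpression builder; the metric names `heap_used_pct`, `gc_time_pct`, and `url_return_code` are made up for the example.

```python
# Each subexpression is a function that takes a dict of metrics and returns a bool.
def metric_gt(name, limit):
    return lambda metrics: metrics[name] > limit

def metric_eq(name, value):
    return lambda metrics: metrics[name] == value

def AND(*conds):
    return lambda metrics: all(c(metrics) for c in conds)

def OR(*conds):
    return lambda metrics: any(c(metrics) for c in conds)

def NOT(cond):
    return lambda metrics: not cond(metrics)

# (heap_used_pct > 85 AND gc_time_pct > 10) OR NOT (url_return_code == 200)
rule = OR(
    AND(metric_gt("heap_used_pct", 85), metric_gt("gc_time_pct", 10)),
    NOT(metric_eq("url_return_code", 200)),
)

healthy = {"heap_used_pct": 90, "gc_time_pct": 5, "url_return_code": 200}
memory_pressure = {"heap_used_pct": 90, "gc_time_pct": 15, "url_return_code": 200}
bad_url = {"heap_used_pct": 50, "gc_time_pct": 5, "url_return_code": 503}

for sample in (healthy, memory_pressure, bad_url):
    print(rule(sample))
```

Nesting the function calls plays the role of parenthetical grouping: the inner `AND` is evaluated as a unit before the outer `OR`.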
A wide variety of metrics are available in each of these statistic categories; for a detailed breakdown of metrics, visit the following Knowledge Center page:
After specifying the health condition for a policy, the next step is choosing what actions are taken in response to a policy breach, covered in the next section. For more information on health policies, visit the following Knowledge Center links:
Health actions are issued by the health controller in response to the triggering of a health policy. A health policy can employ more than one health action; actions are executed in the order listed in the policy.
Before diving into health actions, note that two reaction modes are available for health actions: supervised and automatic. Automatic tasks are initiated by the health controller automatically when a health policy violation is detected. Supervised tasks require user approval before any action is taken.
The health controller creates a runtime task for supervised actions, which a system administrator can approve or deny from the Runtime Tasks pane, viewable in the console by navigating to System Administration -> Task Management -> Runtime Tasks. See the following image for an example.
To make management of supervised tasks easier, email notifications can be enabled. With email notifications, an email will be sent whenever a task is generated. The process for enabling email notifications can be found at the following Knowledge Center link:
If you're concerned that health actions could disrupt your environment, use supervised mode to screen pending actions. This can be useful if you're new to using health management or when tuning health policies for your environment; once policies are tuned and the recommended actions have been verified as correct for the health problems detected, you can transition to automatic mode.
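The difference between the two reaction modes can be sketched in a few lines of Python. In this hypothetical model, an automatic task runs immediately, while a supervised task waits until an administrator approves or denies it; `Task`, `dispatch`, `approve`, and `deny` are illustrative names, not WebSphere APIs.

```python
from dataclasses import dataclass

@dataclass
class Task:
    server: str
    action: str
    status: str = "pending"

def dispatch(server, action, reaction_mode, execute):
    """Create a task and either run it now or hold it for approval."""
    task = Task(server, action)
    if reaction_mode == "automatic":
        execute(task)
        task.status = "completed"
    elif reaction_mode == "supervised":
        task.status = "awaiting approval"   # nothing runs yet
    else:
        raise ValueError(f"unknown reaction mode: {reaction_mode}")
    return task

def approve(task, execute):
    """Administrator approves a held task, which then runs."""
    if task.status != "awaiting approval":
        raise ValueError("only tasks awaiting approval can be approved")
    execute(task)
    task.status = "completed"

def deny(task):
    """Administrator denies a held task; the action never runs."""
    if task.status != "awaiting approval":
        raise ValueError("only tasks awaiting approval can be denied")
    task.status = "denied"

executed = []
run = lambda t: executed.append((t.server, t.action))

auto_task = dispatch("server1", "restart server", "automatic", run)
held_task = dispatch("server2", "take thread dumps", "supervised", run)
print(executed)          # only the automatic action has run so far
approve(held_task, run)
print(executed)          # the supervised action runs after approval
```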
A health policy can use pre-defined health actions or custom health actions.
Available predefined actions are:
- Restart server
- Take thread dumps
- Take Java virtual machine (JVM) heap dumps
- Generate a Simple Network Management Protocol (SNMP) trap
- Place server in maintenance mode
- Place server in maintenance mode and break affinity to the server
- Take server out of maintenance mode
Maintenance mode is a WebSphere feature that can prevent the disruption of client requests by routing client traffic that is targeted for a server or node that is in maintenance mode to another server or node. By default, maintenance mode will still accept requests with affinity unless the "break affinity" option is chosen. Selecting "Break affinity" will stop all traffic to the server. The "Take server out of maintenance mode" action can only be used in a policy where a previous health action has placed the server in maintenance mode.
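The routing behavior described above can be summarized as a small decision function. This Python sketch is a simplified model of that description, not the router's implementation; the state names are hypothetical labels chosen for the example.

```python
def can_route(server_state, has_affinity):
    """Decide whether a request may be sent to a server.

    server_state: "normal", "maintenance", or "maintenance_break_affinity"
                  (hypothetical labels for this sketch).
    has_affinity: True if the request has affinity to this server.
    """
    if server_state == "normal":
        return True
    if server_state == "maintenance":
        # Default maintenance mode still accepts requests with affinity.
        return has_affinity
    if server_state == "maintenance_break_affinity":
        # Break affinity stops all traffic to the server.
        return False
    raise ValueError(f"unknown server state: {server_state}")

for state in ("normal", "maintenance", "maintenance_break_affinity"):
    print(state, can_route(state, has_affinity=True), can_route(state, has_affinity=False))
```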
With a custom health action, you define an executable file to run when the health condition is breached. Custom actions must be defined before they can be used in a health policy.
For more information on creating custom health actions, visit the following Knowledge Center page: https://www.ibm.com/docs/was-nd/9.0.5?topic=policies-creating-health-policy-custom-actions
After defining the condition and actions for a health policy, the next step is defining the policy scope. Policy scope defines which WebSphere resources are monitored under the health policy; for example, a policy could apply to a specific server or to the servers in a cell.
A health policy can be defined at one of the following scopes:
- Server/node scope
- Cluster scope
- Dynamic Cluster scope
- On Demand Router scope
- Cell scope
Once the scope of the health policy has been defined, health policy creation is complete. The policy will take effect and be evaluated during the next health controller cycle.
With Health Management enabled and policies in effect, monitoring and responding to server health issues can reduce service disruptions and help keep your server environment healthy, easing the burden of system administration. Health Management's ability to identify and treat ailing servers before they cause service outages has clear value, and it complements existing procedures for handling server health issues without requiring changes to topology or procedures and without disrupting service.
For more information on health management in WebSphere, visit this Knowledge Center page: