IBM TechXchange Virtual WebSphere z/OS User Group

Liberty z/OS Post #61 - WLM Health API

By David Follis posted Thu August 15, 2024 08:33 AM

This post is part of a series exploring the unique aspects and capabilities of WebSphere Liberty when running on z/OS.
We'll also explore considerations when moving from WebSphere traditional on z/OS to Liberty on z/OS.

The next post in the series is here.

To start at the beginning, follow this link to the first post.

---------------

This Liberty z/OS feature is basically the same as in WebSphere traditional on z/OS.  If you’re familiar with that, take this week off and go read somebody else’s blog…

Ok, so it seems like forever ago that there was work going on in z/OS to try to address problems with storm drains.  A storm drain is when a server is able to receive requests, but can’t successfully process them.  Basically every bit of work it takes in fails immediately.  Because failing immediately is generally faster than actually doing the work, the work routing mechanisms in front of a set of clustered servers tend to send more work to the one where things are failing…because it yields better response times.  This is, of course, bad.  Various things were done to help the work routers upstream understand that things weren’t going well in a server with a problem and to stop sending it even more things to do.

One such item was an API provided by WLM that allowed a server to indicate how healthy it was, on a scale of 0 to 100.  Presumably WLM would use this information when it provides advice to work routers (like Sysplex Distributor) about where to send work.  WebSphere traditional was asked to find a way to exploit this new service.

And so meetings were held to try to figure out how to monitor the ‘health’ of a WAS server and translate that into a value between 0 and 100 in a way that reflected the server’s ability to handle work correctly.  It turned out to be really hard to do that.

But along the way we realized that we could use it to indicate a server was ‘warming up’.  When we talked to our friends in performance, they explained that before they do a measurement they spend about 5 minutes running work through the server just to get the caches populated, let the JIT do its thing, and so on.  Then the measurement captures data for ‘production’ execution instead of ‘startup’.  It seemed to us that during that warmup phase you might want the server to take less work.  You can’t have it take no work at all, because that won’t warm it up, but you might want to limit the volume until it is actually ready to take its full share of the production workload.

And so we decided to start the server off with a health of zero and increment it over time as we warmed up until we got to 100 percent healthy.  Awesome.  So…how often should we increment the health and by how much?  Hmm…good question.  We had no idea and it probably varies with the environment and applications and a lot of other things. 

Thus we created two new configuration values to let you set the interval and the increment amount to whatever you like.  So you could have the health increment by 10 every 30 seconds, or by 50 every 60 seconds, or whatever fits your environment.
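Just to make the ramp concrete, here’s a minimal sketch of the idea in Java.  This is not Liberty’s actual implementation: the setWlmHealth() method, the class, and the constant names are all made up for illustration.  In the real server the health value is reported to WLM through its native health service, and the interval and increment come from the server configuration.

```java
// A minimal sketch of the warm-up ramp idea -- not Liberty's actual code.
// setWlmHealth(), the constants, and their values are illustrative only;
// in the real server the health value goes to WLM through its native
// service and the interval/increment come from server configuration.
public class HealthRampSketch {

    static final int INTERVAL_SECONDS = 30;  // hypothetical "interval" setting
    static final int INCREMENT_PERCENT = 10; // hypothetical "increment" setting

    public static void main(String[] args) throws InterruptedException {
        int health = 0;
        setWlmHealth(health);                 // start out 0% healthy
        while (health < 100) {
            Thread.sleep(INTERVAL_SECONDS * 1000L);
            health = Math.min(100, health + INCREMENT_PERCENT);
            setWlmHealth(health);             // report the new value to WLM
        }
        // With these values the server reaches 100% after ten steps, about 5 minutes.
    }

    static void setWlmHealth(int percent) {
        // Placeholder for the native WLM health service call.
        System.out.println("Reporting health " + percent + "% to WLM");
    }
}
```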

How to know?  Well, probably the first thing to do is to get some idea how much ‘warming up’ your server needs in your environment.  Start it up normally, under load, and watch the response times.  If they get better over time, that’s your warm-up process in action.  How long does that take?  Then experiment with using the health value configuration to restrict the amount of work coming in and see if the server still warms up about as fast (hopefully it doesn’t need a heavy load to warm up).  Remember the goal isn’t to shorten the warm-up time, but to limit the amount of work going to the server until it is warmed up and yielding its best response times.  You want to limit it to just enough work to get it warmed up.
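Once you have an observed warm-up time, turning it into settings is just a division.  Here’s a tiny illustration with made-up numbers; the variable names aren’t Liberty configuration properties.

```java
// Back-of-the-envelope sizing: given an observed warm-up time and a chosen
// increment, derive the interval between steps. Illustrative numbers only.
public class RampSizingSketch {
    public static void main(String[] args) {
        int observedWarmupSeconds = 300;   // response times leveled off after ~5 minutes
        int incrementPercent = 10;         // chosen step size
        int steps = 100 / incrementPercent;                   // 10 steps to reach 100
        int intervalSeconds = observedWarmupSeconds / steps;  // 30 seconds per step
        System.out.println("Increment by " + incrementPercent + " every "
                + intervalSeconds + " seconds to reach full health in about "
                + observedWarmupSeconds + " seconds.");
    }
}
```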

Anyway, that was all in WAS traditional on z/OS and eventually we ported it to Liberty.
