WebSphere Application Server & Liberty


JSR-352 (Java Batch) Post #164: Kubernetes Jobs - Restart, Backoff, and Deadline

By David Follis posted Thu December 09, 2021 08:05 AM

This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application.

To start at the beginning, follow the link to the first post.

The next post in the series is here.

This series is also available as a podcast on iTunes, Google Play, and Stitcher, or use the link to the RSS feed.

As a reminder, we’re taking a casual stroll through Kubernetes Jobs.  This week we’ll look at three configuration values you can specify that influence how a Job behaves.

The first one is the restart policy.  We said last week that Job processing depends a lot on the return value (the exit code) from the command you execute.  As part of the YAML you can also specify a restartPolicy value of Never or OnFailure in the Pod template.

Never means the container is never restarted in place, and OnFailure means it is restarted in the same Pod when it fails.  (Even with Never, the Job controller can still start a replacement Pod after a failure, up to the backoff limit covered below.)  But what is a failure?  Well, whenever your command returns a non-zero exit code.  Or if some other failure occurs (like the Pod crashes or is stopped through some other Kubernetes processing).  If your application just blew up and so the whole thing failed, then OnFailure will spin it back up again and you can pick up where you left off.  But if the whole thing died you might not have any of the little footprints you left about where you were in processing (depending on where you kept those).  As we’ve seen in earlier posts, restart processing is tricky.  Just having Kubernetes Jobs restart your command is only the beginning.
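To make that concrete, here is a minimal sketch of a Job definition that sets the restart policy.  The Job name, container name, image, and command are made-up placeholders rather than anything from a real application; the part that matters is where restartPolicy sits, inside the Pod template’s spec.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-batch-job            # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: batch                 # hypothetical container name
        image: example.com/batch/sample-batch:1.0   # placeholder image
        command: ["/opt/batch/run.sh"]              # placeholder command; its exit code decides success or failure
      restartPolicy: OnFailure      # for a Job this must be Never or OnFailure
```

Switching the value to Never would leave a failed container alone instead of restarting it inside the same Pod.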

Backoff is another interesting configuration option, set with the backoffLimit value.  This ties into the restart processing.  With backoffLimit you can configure the number of times Kubernetes will retry before it marks the Job as failed (the default is 6).  That’s a good thing so you don’t end up endlessly retrying something that won’t ever work.  This is similar to the retry limits in the JSR-352 specification, although in this case we’re potentially restarting your whole environment.  There was an interesting note in the documentation about backoff processing also adding exponentially increasing delays between restarts (10 seconds, 20 seconds, 40 seconds, and so on, capped at six minutes).  I guess the idea is that maybe a failure left some stuff that needs cleaning up, or maybe you’re hoping some problem will get resolved, so it waits longer and longer between retries hoping things will sort themselves out (maybe it needs some other automation to restart something else).
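As a rough sketch, the retry limit is a single value at the top level of the Job’s spec.  The Job name and the value of 3 are just illustrative choices, and the Pod template is left out to keep the focus on the one setting.

```yaml
# Fragment of a Job definition; the Pod template is left out for brevity.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-batch-job   # hypothetical name
spec:
  backoffLimit: 3          # illustrative value: mark the Job failed after 3 retries
```

Picking a value is a judgment call: too low and a transient glitch fails the Job, too high and a permanently broken Job keeps retrying with longer and longer waits.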

Finally, I wanted to look at the activeDeadlineSeconds configuration value.  This value applies to the whole Job, no matter how many Pods or containers got started or restarted.  This is basically a timeout value to get the job done.  When you exceed this value all the running Pods are terminated and the Job fails with a reason of DeadlineExceeded.  Bang…over.  No gentle quiescing or anything.  So while it seems like a good idea to have this drop-dead time value set to keep things from just running away, it sounds like it is a pretty violent end if you go over it.  So set it high.
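Here is a small sketch of where the deadline goes.  The Job name and the two-hour value are only assumptions for illustration, and the Pod template is again left out.

```yaml
# Fragment of a Job definition; the Pod template is left out for brevity.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-batch-job        # hypothetical name
spec:
  activeDeadlineSeconds: 7200   # illustrative value: two hours for the whole Job, retries included
```

Note that this sits directly under the Job’s spec.  A Pod spec has a field with the same name, but that one applies to the individual Pod rather than to the Job as a whole.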

Alright, so all that’s kind of interesting, but next week we’ll start into the good stuff and look at our first model for running a job.