JSR-352 (Java Batch) Post #165: Kubernetes Jobs – One Pod, One Job

By David Follis posted Thu January 06, 2022 09:04 AM

This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application.

To start at the beginning, follow the link to the first post.

The next post in the series is here.

This series is also available as a podcast on iTunes, Google Play, and Stitcher, or use the link to the RSS feed.

There are two values that control concurrency and completion for a job.  These are spec.completions and spec.parallelism.  For our discussion this week we’re going to let them both default to one. 

That means you’re going to get one pod started to run the command you specified.  If it finishes successfully, then you’re done.  If it doesn’t, then (depending on the restart setting we talked about last week) it might get restarted until it works.
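As a rough sketch of what that looks like, here is a small Python snippet that builds a Job manifest with both values written out explicitly and prints it as JSON (which kubectl accepts as well as YAML). The job name, image, and command are made up for illustration, and restartPolicy is my assumption about which restart setting applies here.

```python
import json


def build_job_manifest(name, image, command, completions=1, parallelism=1):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Both values default to 1 when omitted; shown here for clarity.
            "completions": completions,
            "parallelism": parallelism,
            "template": {
                "spec": {
                    # For a Job this has to be "OnFailure" or "Never".
                    "restartPolicy": "OnFailure",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


manifest = build_job_manifest(
    name="nightly-batch",                  # hypothetical job name
    image="example.com/batch-runner:1.0",  # hypothetical image
    command=["/opt/batch/run-job.sh"],     # hypothetical command
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and handing it to kubectl apply -f should give you one pod running that one command.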

This model is a very traditional way of looking at things and so it makes sense to default to this.  In the following weeks we’ll look at some interesting variations where we change these two configuration values.

So this means that whatever command you issue, it has to start up something that is going to do the entire work for the job, even if it consists of multiple ‘steps’ (whether those are JSR-352 steps or just different things the script you run will do).  Success or failure is based entirely on the value the command returns (its exit code), which has to cover the whole thing.
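To make the "one return value covers everything" idea concrete, here is a minimal sketch of a single entry point that runs several steps and boils them down to one exit code. The step names and the run_steps helper are invented for this example; they are not part of Kubernetes or JSR-352.

```python
import sys


def run_steps(steps):
    """Run (name, callable) pairs in order.

    Returns 0 if every step succeeds, or 1 as soon as one raises.
    """
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            print(f"step {name} failed: {exc}", file=sys.stderr)
            return 1
    return 0


# Hypothetical steps standing in for the real work.
ran = []


def extract():
    ran.append("extract")


def transform():
    raise ValueError("bad record")


def load():
    ran.append("load")


exit_code = run_steps(
    [("extract", extract), ("transform", transform), ("load", load)]
)
print(exit_code, ran)
# A real script would finish with sys.exit(exit_code) so the container's
# exit status reflects the whole run.
```

Notice that once transform fails, load never runs, and the only thing Kubernetes would see is the single non-zero result.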

This is just like the exit status from a JSR-352/Jakarta Batch job.

However, if you’re going to say the job failed, it might get restarted.  Of course, restarting a JSR-352 job includes remembering how steps that have run completed and checkpoint data if a chunk step was in-flight when it failed.  If you’ve just got a script doing a bunch of stuff, you’re going to have to leave yourself some notes about what worked and what you want to retry. 
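Here is one way "leaving yourself some notes" might look for a plain script: record each finished step in a small file and skip those steps on the next attempt. This is only a sketch. The run_with_notes helper, the step names, and the progress.json file are all hypothetical, and in a real pod the file would need to live on storage that outlives the container (a mounted volume, say), not in a temp directory.

```python
import json
import os
import tempfile


def run_with_notes(steps, notes_path):
    """Run (name, callable) pairs, skipping steps already recorded as done.

    Returns 0 when every step is done, or 1 at the first failure.
    """
    done = []
    if os.path.exists(notes_path):
        with open(notes_path) as f:
            done = json.load(f)

    for name, step in steps:
        if name in done:
            continue  # finished on an earlier attempt
        try:
            step()
        except Exception:
            return 1
        done.append(name)
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a half-written notes file behind.
        tmp_path = notes_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, notes_path)
    return 0


# Demo: "load" fails on its first attempt and works on the second.
calls = []
attempts = {"load": 0}


def extract():
    calls.append("extract")


def load():
    attempts["load"] += 1
    calls.append("load")
    if attempts["load"] == 1:
        raise RuntimeError("simulated failure on first attempt")


steps = [("extract", extract), ("load", load)]
notes_path = os.path.join(tempfile.mkdtemp(), "progress.json")

first = run_with_notes(steps, notes_path)   # extract succeeds, load fails
second = run_with_notes(steps, notes_path)  # extract skipped, load retried
print(first, second, calls)
```

This only tracks whole steps. It does nothing like the mid-step checkpointing a JSR-352 chunk step gets from the Job Repository, which is the part you would have to build yourself.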

Or you can just fail and leave it to somebody (or some automation) to figure out.  Maybe some higher-level thing runs a different job if this one fails, to try to clean things up?

I’m pretty traditional so I’m usually in favor of large multi-step jobs that do a bunch of things, implementing an entire process perhaps.  But done this way, where the application is responsible for so much of the error handling, it seems like maybe it would be better to break it up a little and let something external handle some of the things that seem more infrastructure than application.

Or just spin up a Liberty server and let it do it, but you need to be sure things like checkpoint data in the Job Repository survive the failure and restart of the pod running the server that’s running the job.