WebSphere Application Server & Liberty

JSR-352 (Java Batch) Post #29: To Checkpoint Now, or Not To Checkpoint Now, and if Not Now, When?

By David Follis posted Wed February 13, 2019 10:02 AM

This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application.

To start at the beginning, follow the link to the first post.

The next post in the series is here.

Let’s cover the easy stuff first.  If you don’t specify anything, a chunk step checkpoints after every ten items are processed.  That means the chunk reader and processor will each be called ten times to read and process ten items, the writer will be called once with the results of that processing, and the transaction wrapping all that will be committed.
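To make that default flow concrete, here is a minimal sketch of the loop in plain Java.  It is a toy model, not the batch runtime’s actual code: the class name ChunkLoopSketch and the run method are made up for illustration, the upper-casing just stands in for whatever your processor does, and the "commit" is only simulated by recording how many items were in each chunk.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of a chunk step (hypothetical class, not the batch runtime's code). */
final class ChunkLoopSketch {
    static final int DEFAULT_ITEM_COUNT = 10; // the default item-count described above

    /** Runs the read/process/write loop and returns the size of each committed chunk. */
    static List<Integer> run(Iterator<String> reader, int itemCount) {
        List<Integer> committedChunkSizes = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        while (reader.hasNext()) {
            String item = reader.next();      // stands in for ItemReader.readItem()
            buffer.add(item.toUpperCase());   // stands in for ItemProcessor.processItem()
            if (buffer.size() == itemCount) {
                // Stands in for ItemWriter.writeItems() followed by the commit.
                committedChunkSizes.add(buffer.size());
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            // The leftover partial chunk is written and committed when the data runs out.
            committedChunkSizes.add(buffer.size());
        }
        return committedChunkSizes;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            items.add("record-" + i);
        }
        System.out.println(run(items.iterator(), DEFAULT_ITEM_COUNT)); // prints [10, 10, 5]
    }
}
```

With 25 input records and the default of ten, the sketch commits three times: two full chunks and one partial chunk at the end.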

Is ten the right value?  Maybe, but probably not.  Before we talk about how often you should checkpoint (here’s a hint: “It depends...”), we should take a look at the different ways you can control when a checkpoint is taken.

First of all, you can override the default value of ten items by specifying an item-count value on the chunk element of the JSL.  You can also specify a time-limit (in seconds).  That causes a check after each pass through the reader/processor to determine whether the time limit for this chunk has expired and, if so, to commit.  With a time limit you could easily process a different number of items in each chunk, depending on how things are running.

I should mention a quirk of JSL here.  You can specify both an item-count and a time-limit, and they both apply.  The checkpoint will happen when the item count is reached or the time limit expires, whichever comes first.  The default for time-limit is zero, which means no time limit applies.  However, if you specify only a time-limit value, the default item-count value of ten still applies, and it might hit before your time-limit does.  If you just want a time-limit, you should probably configure a really large item-count value to keep it out of the way.  The specification doesn’t say what happens if you specify an item-count of zero, so it’s probably safest not to count on that turning it off, even if some implementations behave that way.
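The "whichever comes first" rule is easy to get wrong in your head, so here is a small sketch of the decision as described above.  Again, this is only a model under my own assumptions, not how any particular runtime implements it: the class CheckpointDecisionSketch, its method, and its parameter names are all made up, and the 1,000,000 is just an arbitrary "really large" item-count.

```java
/** Toy model of the item-count / time-limit decision (hypothetical, not a runtime's code). */
final class CheckpointDecisionSketch {
    /**
     * @param itemsInChunk   items read and processed since the last checkpoint
     * @param elapsedSeconds seconds since the current chunk started
     * @param itemCount      the JSL item-count value (ten if not specified)
     * @param timeLimit      the JSL time-limit value in seconds (zero means no limit)
     */
    static boolean shouldCheckpoint(int itemsInChunk, long elapsedSeconds,
                                    int itemCount, long timeLimit) {
        boolean countReached = itemsInChunk >= itemCount;
        // A time-limit of zero never triggers a checkpoint on its own.
        boolean timeExpired = timeLimit > 0 && elapsedSeconds >= timeLimit;
        return countReached || timeExpired;
    }

    public static void main(String[] args) {
        // Only time-limit="60" specified, so the default item-count of 10 still applies.
        System.out.println(shouldCheckpoint(10, 2, 10, 60));         // prints true
        // item-count="1000000" time-limit="60": the count stays out of the way...
        System.out.println(shouldCheckpoint(500, 2, 1_000_000, 60)); // prints false
        // ...until the time limit expires.
        System.out.println(shouldCheckpoint(500, 60, 1_000_000, 60)); // prints true
    }
}
```

The first call is the quirk in action: two seconds into a sixty-second limit, the default count of ten has already forced a checkpoint.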

Should you checkpoint often?  Or not?  The argument for frequent checkpoints is that they minimize the amount of work that has to be redone if a failure happens and the work since the last checkpoint is lost.  And if locks are being held by chunk processing, smaller checkpoint intervals minimize the time those locks are held.

On the other hand, frequent checkpoints can add a lot to the elapsed time for the job.  Suppose (and I’m just making this number up) doing a checkpoint takes half a second.  For a job processing one million records and checkpointing every ten items, the cost of checkpointing is (1,000,000/10)*0.5 = 50,000 seconds, which is almost 14 hours.  If we checkpoint every 1,000 records instead, it only adds (1,000,000/1,000)*0.5 = 500 seconds, a bit over eight minutes.  Again, I made up the half-second number, but you can see that any cost multiplied out by a lot of checkpoints is going to add up.
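If you want to play with that arithmetic for your own numbers, a few lines of Java will do it.  This is just the calculation above wrapped in a made-up class (CheckpointCostSketch); the half-second cost is still the invented figure from the example, so substitute whatever you actually measure.

```java
import java.util.Locale;

/** Back-of-the-envelope checkpoint overhead (hypothetical helper, made-up cost figure). */
final class CheckpointCostSketch {
    /** Total seconds spent checkpointing for a job of totalItems records. */
    static double overheadSeconds(long totalItems, long itemCount, double secondsPerCheckpoint) {
        // Ceiling division: a partial final chunk still needs its own checkpoint.
        long checkpoints = (totalItems + itemCount - 1) / itemCount;
        return checkpoints * secondsPerCheckpoint;
    }

    public static void main(String[] args) {
        long totalItems = 1_000_000L;
        double secondsPerCheckpoint = 0.5; // invented number, as in the text
        for (long itemCount : new long[] {10, 100, 1_000, 10_000}) {
            double seconds = overheadSeconds(totalItems, itemCount, secondsPerCheckpoint);
            System.out.println(String.format(Locale.ROOT,
                    "item-count=%d -> %.0f seconds (%.2f hours)",
                    itemCount, seconds, seconds / 3600.0));
        }
    }
}
```

For an item-count of 10 this gives the 50,000 seconds from the example, and for 1,000 it gives 500 seconds.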

Choose wisely… (or at least have a solid-sounding explanation for your choice :-)