JSR-352 (Java Batch) Post #109: Batch Performance – Checkpoint Data Size

By David Follis posted Wed September 23, 2020 08:30 AM

This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application.

This series is also available as a podcast on iTunes, Google Play, and Stitcher, or through the RSS feed.

For the next experiment I wanted to explore how changing the size of the serializable checkpoint data returned by the reader or writer impacts the elapsed time of the job. At every checkpoint the batch container gets this information from the reader and writer and updates the relevant row in the Job Repository table as part of the chunk commit scope. Even a minor change in the time that update takes, multiplied by a large number of checkpoints, could add up to a noticeable change in the runtime of the job.

In theory the state data required to restart from a checkpoint shouldn’t have to be very big.  But if you’re using a non-transactional resource you might have to do some rollback-type cleanup yourself and that might require more data. 

And, of course, sometimes objects just accumulate information.  I’ve seen HTTP session state objects that were several megabytes. 

The question we’re looking at here is how much it matters. Of course, there are a lot of other factors at play. Checkpoint data is hardened in the Job Repository, which is just a database. The main consideration is how quickly that database row update can take place. In my experiments I was on z/OS accessing a local DB2 instance over a type-4 connection. With a remote database the checkpoint data has to flow over a physical network, which is going to slow things down significantly.

Note that while the default size for the BLOB that contains the checkpoint data is 2GB, your DBA might have pruned that down (mine did) and you’ll get SQL errors trying to save very large checkpoint data (I did).  Remember that both the reader and writer checkpoint data are placed together into one column.  It is easy enough to alter the table to make it as big as you need, assuming you can convince your DBA you need the space.
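If you want a rough idea of how big your checkpoint data actually is before the database complains, you can serialize the objects yourself and add up the bytes. The sketch below assumes the container stores the checkpoint using standard Java serialization, so treat the number as an estimate rather than the exact size of what lands in the row. The class name, the helper methods, and the 1 MB column limit are all made up for illustration; substitute whatever limit your DBA actually set.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of the JSR-352 API.
final class CheckpointSizeProbe {

    // Serializes the object with plain Java serialization and returns the byte count.
    static int serializedSize(Serializable checkpoint) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(checkpoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Reader and writer checkpoint data share one column, so compare the sum to the limit.
    static boolean fitsInColumn(Serializable readerCheckpoint,
                                Serializable writerCheckpoint,
                                long columnLimitBytes) {
        long total = (long) serializedSize(readerCheckpoint) + serializedSize(writerCheckpoint);
        return total <= columnLimitBytes;
    }

    public static void main(String[] args) {
        // Stand-ins for real checkpoint objects: 1024 bytes of payload each.
        Serializable readerCheckpoint = new byte[1024];
        Serializable writerCheckpoint = new byte[1024];

        // Assumed limit for illustration only; ask your DBA for the real one.
        long columnLimitBytes = 1024L * 1024L;

        System.out.println("Reader checkpoint bytes: " + serializedSize(readerCheckpoint));
        System.out.println("Writer checkpoint bytes: " + serializedSize(writerCheckpoint));
        System.out.println("Fits in column: "
                + fitsInColumn(readerCheckpoint, writerCheckpoint, columnLimitBytes));
    }
}
```

Serialization adds some overhead on top of the raw payload, so the reported size will be a bit larger than the data you put in.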

For our runs we began with the same baseline as before, processing 10 million records in 1,000-record chunks. I updated the checkpoint data size for both the reader and writer to the same value. Our baseline run uses a 1,024-byte object for each, so the total checkpoint data was 2,048 bytes. We scaled that up and down by factors of 10. For each run, as before, we determined the reader and writer time from the chunk listener and subtracted that from the elapsed time for the step to determine the time spent in the container (which includes the time spent updating the checkpoint information in the Job Repository).
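The arithmetic behind that setup is simple enough to sketch. The class and method names below are made up for illustration, and the timing numbers in `main` are placeholders, not measurements from these runs. The checkpoint count is approximate, since a container may commit one more, empty chunk when the reader discovers it has run out of data.

```java
// Hypothetical helper for the bookkeeping described above.
final class ChunkMath {

    // Approximate number of checkpoints: one per chunk, rounding up for a partial last chunk.
    static long checkpointCount(long records, long chunkSize) {
        return (records + chunkSize - 1) / chunkSize;
    }

    // Container time is whatever is left of the step's elapsed time
    // after removing the time spent in the reader and the writer.
    static long containerTimeMs(long stepElapsedMs, long readerMs, long writerMs) {
        return stepElapsedMs - readerMs - writerMs;
    }

    // Average container overhead attributable to each checkpoint.
    static double containerMsPerCheckpoint(long containerMs, long checkpoints) {
        return (double) containerMs / checkpoints;
    }

    public static void main(String[] args) {
        long checkpoints = checkpointCount(10_000_000L, 1_000L);
        System.out.println("Checkpoints: " + checkpoints);

        // Total checkpoint data per commit: reader plus writer, 1,024 bytes each.
        System.out.println("Baseline checkpoint bytes: " + (1_024 + 1_024));

        // Placeholder timings for illustration only.
        long containerMs = containerTimeMs(100_000L, 30_000L, 20_000L);
        System.out.println("Container time (ms): " + containerMs);
        System.out.println("Per checkpoint (ms): "
                + containerMsPerCheckpoint(containerMs, checkpoints));
    }
}
```

With roughly 10,000 checkpoints in the job, an extra millisecond per checkpoint adds about ten seconds to the step, which is why the per-checkpoint cost is the number worth watching.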

The results:

[Table: Data Size (bytes) | Container Time (ms)]

These results suggest that you can have pretty large checkpoint data objects without significantly impacting the elapsed time for a chunk step. Only at really large sizes (2 MB and up) did it start to matter. In the end, don’t put stuff in the checkpoint data that doesn’t need to be there, but don’t obsess about it.