JSR-352 (Java Batch) Post #107: Batch Performance – Checkpoint Intervals

By David Follis posted Wed September 09, 2020 09:23 AM

This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application.

To start at the beginning, follow the link to the first post.

The next post in the series is here.

This series is also available as a podcast on iTunes, Google Play, and Stitcher, or use the link to the RSS feed.

From last week, we have a chunk step that reads 10 million records from a flat file and then alternately inserts and deletes records from a DB2 database.  Using an item count of 1000, the step took four minutes and 33 seconds to run on our performance test system.

What we want to know is what happens if we increase or decrease that item count value.  For an item count value of 1000, our step takes 10,000 checkpoints.  We dropped the item count down to 500, then 250, then 125, then 75, and finally 25.  For that final test the step took 400,000 checkpoints.  You would think that would make the step take a lot longer to run…and it does.
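The checkpoint counts quoted above follow from dividing the record count by the item count. Here is a minimal sketch of that arithmetic, assuming one checkpoint per chunk and a final partial chunk that still gets its own checkpoint. The class and method names are made up for illustration, and a real batch runtime may count slightly differently at end of data.

```java
// Illustrative only: estimates checkpoints for a chunk step.
// Assumption: one checkpoint per chunk, including a trailing partial chunk.
final class CheckpointCounts {

    // Ceiling division: full chunks plus one more if records are left over.
    static long checkpoints(long records, long itemCount) {
        if (records < 0 || itemCount <= 0) {
            throw new IllegalArgumentException("records must be >= 0 and itemCount > 0");
        }
        return (records + itemCount - 1) / itemCount;
    }

    public static void main(String[] args) {
        long records = 10_000_000L; // record count from the test described in this post
        long[] itemCounts = {25, 75, 125, 250, 500, 1000, 2000, 5000, 7000};
        for (long itemCount : itemCounts) {
            System.out.println("item count " + itemCount + " -> "
                    + checkpoints(records, itemCount) + " checkpoints");
        }
    }
}
```

Each halving of the item count doubles the number of checkpoints, which is why the small values get expensive so quickly.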

Lowering the item count to 500 (twice as many checkpoints) added about 30 seconds to the elapsed time to run the step.  Dropping it to 250 added another full minute.  By the time we got down to 25, the step took almost 24 minutes to complete. 

Clearly, increasing the number of checkpoints taken can greatly increase the elapsed time for the job.  But what happens if we go the other way?  If we increase the item count to 2000, will the step run faster?  It turns out that it does, but the gain is smaller than the penalty for cutting the item count in half.  Doubling the item count to 2000 made the step run about 16 seconds faster, while halving it to 500 had cost about 30 seconds.

We kept increasing the item count up to 7000 and saw progressively smaller improvements.  At 7000 the step took just over 1400 checkpoints and ran in about four minutes and four seconds.
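One rough way to read these numbers is to divide the change in elapsed time by the change in checkpoint count between two runs. The sketch below does that with rounded figures from this post: 273 seconds at an item count of 1000, about 30 seconds more at 500, about 16 seconds less at 2000, and about 244 seconds at 7000. The class and method names are hypothetical, the 1,429 checkpoints at 7000 assume a final partial chunk is counted, and all of the timings are approximations, so treat the output as a ballpark only.

```java
// Illustrative only: back-of-the-envelope cost per additional checkpoint.
// Assumption: elapsed time grows roughly linearly with checkpoint count.
final class CheckpointCost {

    // Milliseconds of elapsed time per extra checkpoint between run A and run B.
    static double msPerCheckpoint(double secondsA, long checkpointsA,
                                  double secondsB, long checkpointsB) {
        if (checkpointsA == checkpointsB) {
            throw new IllegalArgumentException("runs must differ in checkpoint count");
        }
        return (secondsB - secondsA) * 1000.0 / (checkpointsB - checkpointsA);
    }

    public static void main(String[] args) {
        double baseSeconds = 273.0;   // 4 min 33 s at item count 1000
        long baseCheckpoints = 10_000L;

        // Rounded comparison runs taken from the text: {seconds, checkpoints}.
        double[][] runs = {
            {303.0, 20_000},  // item count 500: about 30 s slower
            {257.0, 5_000},   // item count 2000: about 16 s faster
            {244.0, 1_429},   // item count 7000: about 4 min 4 s
        };
        for (double[] run : runs) {
            double ms = msPerCheckpoint(baseSeconds, baseCheckpoints, run[0], (long) run[1]);
            System.out.println((long) run[1] + " checkpoints: ~" + ms + " ms per checkpoint");
        }
    }
}
```

On these rounded inputs the estimates land in the neighborhood of 3 milliseconds per checkpoint. That figure is specific to this application and test system and says nothing about where the time actually goes.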

At this point I should emphasize that this data is only relevant for the application I was running and in my particular environment.  Your results may (and probably will) vary quite a bit. 

That said, it does seem safe to conclude that cutting the item count value way down dramatically increases the number of checkpoints taken, and those do come at a cost.  We can also see that there is some limit to the benefit of increasing the item count to reduce the number of checkpoints.  A happy value for my situation seemed to be around 5000.

Another factor to consider when choosing a checkpoint interval is how hard it will be to recover from a failure.  If my chunk step were to fail on item 4999 of a 5000-item-count chunk, all of the updates made in that chunk would have to be rolled back.  DB2 is going to do that automatically for me, but if your application and environment require some manual intervention to roll back changes…well…that’s a lot to roll back by hand.  You might want to consider accepting a somewhat longer elapsed time for the step to reduce the pain when something bad happens (the key word there being ‘when’ instead of ‘if’).

Next time we’ll look a bit closer at where our chunk step is spending that time.