This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application. To start at the beginning, follow the link to the first post.
The next post in the series is here.
This series is also available as a podcast on iTunes, Google Play, Stitcher, or use the link to the RSS feed.
We’ve touched on this in earlier posts, but since we’ve been covering restart processing lately, I thought it was worth going through again. The idea is that you have a batch job that was running a chunk step when the job was either stopped or it failed (threw an unhandled exception). When you restart the job, processing goes through earlier steps in the job as we discussed last time and ends up back in the chunk step. What happens now?
The checkpoint data for the step is retrieved from the Job Repository from the last execution, and the open methods for the ItemReader and the
ItemWriter are called, each receiving its serialized checkpoint data. This is exactly like the processing that open has to do in handling a retry-with-rollback scenario. The reader and writer both have to orient themselves so they are ready to pick up where they were at the last checkpoint. This might mean getting a database cursor positioned properly, or maybe getting positioned in a file at the appropriate record. It depends on the data source being used.
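To make that concrete, here is a minimal sketch of a reader following the open/checkpointInfo contract from the spec (modeled on javax.batch.api.chunk.ItemReader, but written as a standalone class so it runs on its own). The List-backed "data source" and the field names are my own illustrative assumptions, not anything from the spec:

```java
import java.io.Serializable;
import java.util.List;

// Standalone sketch of a reader's restart behavior. In a real application
// this class would implement javax.batch.api.chunk.ItemReader and the
// runtime would call these methods; here we just model the contract.
class PositionRestoringReader {
    private final List<String> records;  // stand-in for a file or cursor
    private int nextIndex = 0;           // position of the next record to read

    PositionRestoringReader(List<String> records) {
        this.records = records;
    }

    // On restart, the runtime passes the checkpoint saved by the last
    // successful checkpointInfo() call; null means a fresh start.
    public void open(Serializable checkpoint) {
        if (checkpoint != null) {
            nextIndex = (Integer) checkpoint;  // reposition past committed work
        }
    }

    // Returns the next item, or null when input is exhausted.
    public String readItem() {
        return nextIndex < records.size() ? records.get(nextIndex++) : null;
    }

    // Called at each checkpoint; the runtime persists this value
    // in the Job Repository.
    public Serializable checkpointInfo() {
        return nextIndex;
    }
}
```

With a checkpoint of 2, a restarted reader resumes at the third record rather than re-reading items already committed in the last execution.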
Once that is done, processing proceeds as normal for a chunk step. Unlike the retry-with-rollback scenario, we do not “creep up” on the failing record one item at a time. A restarted chunk step just uses whatever checkpoint interval or algorithm is normal for the step.
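The "whatever interval or algorithm is normal" part is worth a sketch too. Item-count checkpointing is the default policy, and the spec also lets you plug in a custom checkpoint algorithm. Below is an illustrative count-based policy loosely modeled on the javax.batch.api.chunk.CheckpointAlgorithm idea (standalone, not implementing the real interface; the names are my assumptions):

```java
// Sketch of a count-based checkpoint policy: after a restart the step
// just applies this same policy again; there is no special one-item-at-a-time
// processing like retry-with-rollback uses.
class CountingCheckpointPolicy {
    private final int itemCount;          // e.g. <chunk item-count="10">
    private int itemsSinceCheckpoint = 0;

    CountingCheckpointPolicy(int itemCount) {
        this.itemCount = itemCount;
    }

    // Called after each item; true means "commit a chunk now".
    boolean isReadyToCheckpoint() {
        return ++itemsSinceCheckpoint >= itemCount;
    }

    // Called when a new chunk begins, resetting the count.
    void beginCheckpoint() {
        itemsSinceCheckpoint = 0;
    }
}
```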
While a step can have a limit on the number of retries (or skipped records) it will allow before failing the job, a failed job can be restarted any number of times.
Remember that for a partitioned step containing chunk processing, each partition will normally go through this exact processing, although you can change that by redefining how the partitioning is done on restart.
If the job didn’t fail during write processing and your reader handles receiving checkpoint data appropriately, restart processing is almost magical. But if the failure occurred during write processing and the writes aren’t to a transactional resource, then it can get complicated. In this case the open method for the writer is going to have to undo updates that were made after the last checkpoint, because they won’t have rolled back automatically.
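Here is one way that compensation could look, as a sketch. A StringBuilder stands in for a non-transactional target like a flat file, and the checkpoint is the committed output length; the class, names, and length-based checkpoint scheme are all illustrative assumptions, not something prescribed by the spec:

```java
import java.io.Serializable;
import java.util.List;

// Standalone sketch of a writer against a non-transactional target.
// On restart, open() must itself discard output written after the last
// checkpoint, because nothing rolled those writes back automatically.
class TruncatingWriter {
    private final StringBuilder target;  // simulated non-transactional output

    TruncatingWriter(StringBuilder target) {
        this.target = target;
    }

    // The checkpoint is the committed length of the output; anything
    // beyond it was written after the last checkpoint and is discarded.
    public void open(Serializable checkpoint) {
        if (checkpoint != null) {
            target.setLength((Integer) checkpoint);
        }
    }

    public void writeItems(List<?> items) {
        for (Object item : items) {
            target.append(item).append('\n');
        }
    }

    // Recorded at each checkpoint by the runtime.
    public Serializable checkpointInfo() {
        return target.length();
    }
}
```

The same idea applies to a real file: record the committed position at each checkpoint, and truncate back to it in open on restart before writing resumes.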
Being able to restart a job, and especially a job with chunk steps, can be critically important. Carefully consider what your application is doing, what resources it is accessing, and how your code will react when called in a restart (with checkpoint data) to be sure it does the right thing. Operations will tend to assume a failed job can be restarted. Don’t surprise them….
Naturally our song this week is “Surprise, Surprise” by the Rolling Stones.