This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application. To start at the beginning, follow the link to the first post. The next post in the series is here.
This series is also available as a podcast on iTunes, Google Play, Stitcher, or use the link to the RSS feed.
Went with a movie title this week…. There’s probably a song in here somewhere too.
Up to now we’ve talked about partitioned steps that were acting independently. Maybe a bunch of batchlets running concurrently to copy a list of files. Or more commonly a chunk step with each partition reading and processing a different range of a large data set. But what if the point of each partition is to build towards a single result for the whole data range?
Suppose we want a step that analyzes a large set of data looking for records that match a certain pattern (and assume this isn’t something we can do with some clever SQL – because it makes the example work). Running without a partition, the chunk step will read every record, process it to determine if the record meets the criteria and, if it does, increment a counter. That’s it. The writer doesn’t do anything. When we run out of data, the result of the step is the final value in the counter. That’s all we want.
To do this faster, we’d like to run this step using partitions. We set up the PartitionPlan so each partition knows what range of data to read and process. But as each partition ends, each one has its own partial result. How do we bring those together and add them up to the final answer we want? The Collector/Analyzer is the answer.
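In the job's JSL, the collector and analyzer are wired into the step's partition element alongside the plan. The element names below come from the spec's JSL schema, but the step id, chunk settings, and `ref` values are made-up names for our matching-records example:

```xml
<step id="findMatches">
    <chunk item-count="1000">
        <reader ref="recordReader"/>
        <processor ref="patternMatcher"/>
        <writer ref="noopWriter"/>
    </chunk>
    <partition>
        <!-- Four partitions, each reading its own range of the data -->
        <plan partitions="4"/>
        <!-- Runs on each partition thread at the end of every chunk -->
        <collector ref="matchCollector"/>
        <!-- Runs on the step's main thread as collector data arrives -->
        <analyzer ref="matchAnalyzer"/>
    </partition>
</step>
```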
The PartitionCollector gets control at the end of each chunk. It doesn’t get anything passed to it, so any result it needs to work with has to be hung on the thread or attached to the StepContext. The collector returns a serializable object, which can be anything you like. In our example it is probably an object that contains the count of matching records from this chunk (remember to reset the count after each chunk completes).
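A collector for our example might look like the sketch below. The `collectPartitionData` signature comes from the spec's `javax.batch.api.partition.PartitionCollector` interface, but to keep the sketch self-contained that interface is stubbed locally, and the class name and counter-sharing approach are assumptions (a real app would more likely share the counter via the StepContext's transient user data and annotate the class for CDI):

```java
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicLong;

// Local stub mirroring javax.batch.api.partition.PartitionCollector
// so this sketch compiles on its own.
interface PartitionCollector {
    Serializable collectPartitionData() throws Exception;
}

// Hypothetical collector for the matching-records example. The item
// processor on this partition's thread calls recordMatch() for every
// record that meets the criteria; the collector snapshots and resets
// the count at the end of every chunk.
class MatchCollector implements PartitionCollector {
    private final AtomicLong matchedThisChunk = new AtomicLong();

    // Called by the processor whenever a record matches the pattern.
    public void recordMatch() {
        matchedThisChunk.incrementAndGet();
    }

    @Override
    public Serializable collectPartitionData() {
        // Return this chunk's count and reset it for the next chunk.
        return matchedThisChunk.getAndSet(0);
    }
}
```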
As chunks complete on the running partitions and the collectors provide data, the analyzeCollectorData method gets control back on the main thread for the step and is passed the serialized data provided by a collector. This is where you would add the result into the running total for the whole step. A transaction is wrapped around all of the analyzer processing, so whatever it does will commit (or roll back) when all the partitions end.
The Analyzer also has a method called analyzeStatus that gets the final batch status and exit status values for each partition. This allows the analyzer to know that a partition has finished and how it turned out. If something went wrong with one partition, this is how the analyzer finds out. The batch status might be just fine, but something subtle could have happened that the partition can communicate to the analyzer via the exit status string.
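An analyzer for our example might look like the sketch below. The two method signatures come from the spec's `javax.batch.api.partition.PartitionAnalyzer` interface, stubbed locally (along with a minimal BatchStatus enum) so the sketch compiles on its own; the class name and fields are assumptions for illustration:

```java
import java.io.Serializable;

// Minimal local stand-ins for javax.batch.runtime.BatchStatus and
// javax.batch.api.partition.PartitionAnalyzer so this sketch
// compiles on its own.
enum BatchStatus { STARTED, COMPLETED, FAILED, STOPPED, ABANDONED }

interface PartitionAnalyzer {
    void analyzeCollectorData(Serializable data) throws Exception;
    void analyzeStatus(BatchStatus batchStatus, String exitStatus) throws Exception;
}

// Hypothetical analyzer that runs on the step's main thread and folds
// each collector's per-chunk count into a single running total.
class MatchAnalyzer implements PartitionAnalyzer {
    private long totalMatches;
    private boolean anyPartitionFailed;

    @Override
    public void analyzeCollectorData(Serializable data) {
        // Each payload is the per-chunk count one collector returned.
        totalMatches += (Long) data;
    }

    @Override
    public void analyzeStatus(BatchStatus batchStatus, String exitStatus) {
        // Called once per partition as it ends; the batch status says
        // whether it completed, and the exit status string can carry
        // any partition-specific detail the partition wants to report.
        if (batchStatus != BatchStatus.COMPLETED) {
            anyPartitionFailed = true;
        }
    }

    public long getTotalMatches() { return totalMatches; }
    public boolean anyPartitionFailed() { return anyPartitionFailed; }
}
```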
Be careful with collectors and analyzers. If a chunk completes and commits but something bad happens afterwards, there is no guarantee the collector data from that chunk will make it to the analyzer. Bear this in mind if you are trying to make a job that uses a collector/analyzer restartable.