This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application. To start at the beginning, follow the link to the first post.
The next post in the series is here.
This series is also available as a podcast on iTunes, Google Play, Stitcher, or use the link to the RSS feed.
One thing some batch applications do to reduce execution time is "pipe" the output of one step into another and run the two concurrently.
The idea is that one step processes records and produces results, and a second step needs those results to do its own processing. Why wait until the first step has finished processing all of its input before starting up the second step to feed on the results produced so far?
Well, the batch spec doesn't specifically support this, because steps run in order. But with a split/flow you can get different steps running concurrently without needing separate jobs. Yes, you could do what I'm describing with separate jobs too; this just seemed cooler.
The idea is pretty simple. The job has a split. The first flow in the split is a chunk step that reads data from some source and does some processing. At checkpoints the resulting records are put in a queue (whether a real message-queue or just some file-based implementation that acts like one). The second flow in the split has a reader that reads records from the queue, does its own processing, and writes its results to a second queue. A third flow reads from the queue being filled by the second flow, etc.
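A job along those lines could be declared in JSL roughly as sketched below. This is just an illustration of the shape of the split; all the ids and the reader/writer artifact names (sourceReader, queue1Writer, and so on) are hypothetical, and the queue-backed readers and writers would be batch artifacts you supply yourself.

```xml
<job id="pipelineJob" xmlns="http://xmlns.jcp.org/xml/ns/javaee" version="1.0">
  <split id="pipelineSplit">
    <!-- First flow: reads the real input, writes results to queue 1 -->
    <flow id="producerFlow">
      <step id="produceStep">
        <chunk item-count="100">
          <reader ref="sourceReader"/>
          <writer ref="queue1Writer"/>
        </chunk>
      </step>
    </flow>
    <!-- Second flow: reads queue 1, writes its own results to queue 2 -->
    <flow id="middleFlow">
      <step id="transformStep">
        <chunk item-count="100">
          <reader ref="queue1Reader"/>
          <writer ref="queue2Writer"/>
        </chunk>
      </step>
    </flow>
    <!-- Third flow: reads queue 2 and writes the final output -->
    <flow id="consumerFlow">
      <step id="consumeStep">
        <chunk item-count="100">
          <reader ref="queue2Reader"/>
          <writer ref="finalWriter"/>
        </chunk>
      </step>
    </flow>
  </split>
</job>
```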
At each checkpoint in each flow, the results from that chunk are pushed to the next flow, which can start processing them. It isn't quite a smooth pipeline, since records move between the flows in chunk-sized piles. But it beats waiting for each step to complete before moving along.
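In JSR-352 terms, the upstream flow's ItemWriter gets each chunk's results in one writeItems() call, so that is the natural hand-off point. Here is a minimal standalone sketch of the hand-off, using a java.util.concurrent.BlockingQueue in place of a real message queue; the class and method names are my own, not anything from the spec.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Stand-in for the queue between two flows. In a real job this might be
// a JMS queue or a file-based implementation that acts like one.
class ChunkPipe {
    private final BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();

    // Called from the upstream flow's ItemWriter at each checkpoint:
    // the whole chunk's worth of results is published at once.
    public void publishChunk(List<String> chunkResults) throws InterruptedException {
        queue.put(chunkResults);
    }

    // Called from the downstream flow's ItemReader when it has drained
    // its current pile and needs the next chunk of input. Blocks until
    // the upstream flow reaches its next checkpoint.
    public List<String> takeChunk() throws InterruptedException {
        return queue.take();
    }
}
```

The blocking take() is what gives you the "chunk-sized piles" behavior: the downstream flow simply waits whenever it gets ahead of the upstream one.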
You will need to think carefully about how this ends. There needs to be some sort of all-done message placed on the queue to tell the downstream flows to complete. Be sure that message still gets passed along if an error causes one flow to shut down early. Also consider how a restart of a failed job will handle messages left in the queue from the previous run if the flow that consumes them died early.
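One common way to implement that all-done message is a sentinel ("poison pill") that the upstream flow puts on the queue both on normal completion and from its error handling. When the downstream reader sees it, it returns null, which is the ItemReader contract's signal that the step's input is exhausted. A standalone sketch under those assumptions (the sentinel value and class names are illustrative, not from the spec):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SentinelQueue {
    // A dedicated marker value, chosen so no real record can be mistaken for it.
    public static final String ALL_DONE = "__ALL_DONE__";

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void put(String record) throws InterruptedException {
        queue.put(record);
    }

    // The upstream flow calls this on normal completion AND from its error
    // handling, so the downstream flow is never left blocked forever on an
    // abandoned queue.
    public void signalDone() throws InterruptedException {
        queue.put(ALL_DONE);
    }

    // The downstream flow's ItemReader.readItem() would delegate here.
    // Returning null tells the batch runtime this step's input is exhausted.
    public String nextRecord() throws InterruptedException {
        String record = queue.take();
        return ALL_DONE.equals(record) ? null : record;
    }
}
```

On restart you would also want to decide whether leftover records from the failed run should be drained, discarded, or reprocessed before the flows start pumping again.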
The general flow of this could be pretty generic. Might make a nice sample. Check out IBM Techdoc WP102784!