This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application. To start at the beginning, follow the link to the first post. The next post in the series is here.
This series is also available as a podcast on iTunes, Google Play, Stitcher, or use the link to the RSS feed.
-----
You’ve got a lot of data to process and you’ve decided to try to speed things up by using a partitioned step. Partitioned steps allow you to run multiple concurrent copies of the same step, on different threads. If the amount of data is constant or rarely changes, then you might statically partition by including partitioning instructions directly in the JSL of the job. If you want to partition by US states, for example, you’re pretty safe with static partitioning in the JSL (trivia – Hawaii became a state in 1959 and is the most recent state added).
But if your data source can change, then you’ll need to write a partition mapper to break things up on the fly. The first step in doing that is to figure out how much data you’re working with. You might do a SELECT COUNT(*) from a database to find out how many rows you’ll have to process. Or maybe you can do something clever with file size. In our sample, we’re going to be processing all the files in a specified directory, so we’ll just go get the list of files in that directory and the length of the list is our count.
Once you know how much data you have, you need to figure out how to split it up. In our case, the simplest thing to do is to have one partition per file we found. But suppose we find thousands of files? We probably don’t want thousands of partitions. If we’re going to assign multiple files to each partition, then the code in the step has to expect a list of files, not just one file per instance of the step that runs.
What if you are reading from a single flat file? The first partition will start at the beginning of the file, but other partitions might have to read through the early records until they get to their starting position. It might be faster to use a utility to split the file into partitions and then have each batch partition instance read its own file from the beginning. Maybe.
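If you do let each partition seek into a shared flat file, the usual trick is to jump to the partition's byte offset and then skip forward to the next record boundary so you don't start mid-record. A hedged sketch using `RandomAccessFile`, assuming newline-delimited records:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FlatFilePartition {
    // Position the file at the first full record at or after 'startOffset'.
    // Records are assumed to be newline-delimited.
    public static void seekToRecordStart(RandomAccessFile file, long startOffset)
            throws IOException {
        file.seek(startOffset);
        if (startOffset == 0) {
            return; // partition 0 starts at the beginning of the file
        }
        // Read through the (possibly partial) current record; readLine
        // leaves the file positioned at the start of the next record.
        file.readLine();
    }
}
```

The partition that owns the byte range ending just past its last record start simply reads until it crosses its end offset, so every record is processed exactly once.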
On the other hand, if you are processing records from a database, you can assign each partition a range of rows, using ranges of the primary key value to split things up. But you still have to decide how many rows per partition, and thus how many partitions.
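Computing the per-partition key ranges is simple arithmetic once you know the minimum and maximum key. A sketch, assuming a numeric primary key (gaps in the key space just mean some partitions process fewer rows):

```java
public class KeyRanges {
    // Inclusive [low, high] primary-key range for one partition.
    public static long[] rangeFor(int partition, int partitionCount,
                                  long minKey, long maxKey) {
        long span = maxKey - minKey + 1;
        long perPartition = span / partitionCount;
        long low = minKey + partition * perPartition;
        // The last partition absorbs any remainder from the division
        long high = (partition == partitionCount - 1)
                ? maxKey
                : low + perPartition - 1;
        return new long[] { low, high };
    }
}
```

Each pair would go into that partition's `Properties` and end up in the reader's query, along the lines of `WHERE ID BETWEEN ? AND ?`.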
How do you decide? Partitioning is most useful when there is no contention between the different instances of the step. Another factor to consider, though, is how many instances will be allowed to run at once. You might create 100 partitions, but only allow 10 to run at one time. That’s another thing you can set as part of the partition map. The batch container won’t necessarily run that many at once, but it won’t run more than that.
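In JSL the concurrency cap is the `threads` attribute on the `plan` element; a partition mapper sets the same thing on the `PartitionPlan` it returns (via `setThreads`). A static JSL sketch, with made-up step and artifact names:

```xml
<step id="processFiles">
    <chunk item-count="100">
        <reader ref="fileReader"/>
        <writer ref="fileWriter"/>
    </chunk>
    <partition>
        <!-- 100 partitions, but at most 10 run concurrently -->
        <plan partitions="100" threads="10"/>
    </partition>
</step>
```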
In the end you’ll likely just run some experiments and see how it goes. Remember that your goal is to reduce elapsed time for the job. Contention problems might make it faster to just run the step unpartitioned. System resources might limit how many partitions can reasonably run at once.
I’ve given you a lot of things to think about, but we haven’t talked about how to code this. That’s for next time.