This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application. To start at the beginning, follow the link to the first post.
The next post in the series is here.
This series is also available as a podcast on iTunes, Google Play, Stitcher, or use the link to the RSS feed.
This week we’ll take a look at the performance of our ItemWriter. Remember that it alternates: inserting rows in one chunk, then deleting those same rows in the next chunk. This was mostly to avoid having to remember to clean out the database between runs, but it also gave me a chance to explore the performance difference between doing a bulk insert and doing one-by-one deletes. Of course we’re comparing apples and oranges here because insert and delete processing are different; I was more interested in how the two approaches scaled.
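The alternating behavior can be sketched roughly like this. This is a minimal sketch with a hypothetical class name; a real JSR-352 writer would extend javax.batch.api.chunk.AbstractItemWriter and do the JDBC work inside writeItems, but the batch API dependency is left out so the sketch stands on its own:

```java
import java.util.List;

// Minimal sketch of the alternating writer pattern (hypothetical class name).
// A real JSR-352 writer would extend javax.batch.api.chunk.AbstractItemWriter;
// that dependency and the JDBC work are omitted to keep this self-contained.
public class AlternatingWriter {
    private boolean insertChunk = true; // first chunk inserts, the next deletes

    // The batch runtime calls this once per chunk with up to item-count items.
    public String writeItems(List<Object> items) {
        String action = insertChunk ? "INSERT" : "DELETE";
        // ...JDBC insert or delete work for this chunk would go here...
        insertChunk = !insertChunk; // flip the mode for the next chunk
        return action; // returned only so the sketch is easy to exercise
    }
}
```

Because the mode flips once per writeItems call, every insert chunk is followed by a delete chunk that removes the same rows, which is what keeps the table empty between runs.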
Let’s take a look first at the conventional approach, where we did the deletes one by one. We’re processing 10 million records and deleting half the time, so in total the step deletes 5 million records from DB2. The commit size varies with our item-count value. At an item-count of 25, the total time spent doing deletes was about 185 seconds. With an item-count of 1000, we spent 209 seconds doing deletes. And scaling all the way up to 7000 records per commit, we spent a total of 213 seconds doing deletes.
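A one-at-a-time delete loop for a single chunk looks something like the sketch below. The table and column names are hypothetical, and the small helper just mirrors the arithmetic in the paragraph above (how many chunk transactions it takes to cover the deletes at a given item-count):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class DeleteChunk {
    // One DELETE round trip per item; the chunk commit happens afterwards,
    // so item-count controls how many of these land in each transaction.
    // (DEMO.RESULTS and its ID key are hypothetical names.)
    static void deleteOneByOne(Connection conn, List<Long> ids) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("DELETE FROM DEMO.RESULTS WHERE ID = ?")) {
            for (Long id : ids) {
                ps.setLong(1, id);
                ps.executeUpdate(); // a separate database interaction per row
            }
        }
    }

    // Number of chunk transactions needed to cover totalRows at a given item-count.
    static long commits(long totalRows, int itemCount) {
        return (totalRows + itemCount - 1) / itemCount; // ceiling division
    }
}
```

At an item-count of 25, covering 5 million deletes takes 200,000 chunk transactions; at 1000 it takes only 5,000. Yet the delete time barely moved, which suggests the per-row round trips dominate the cost, not the commits.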
Increasing the commit size actually caused the delete processing to run a little bit longer. Going from an item-count of 25 to 1000 added 24 seconds, which isn’t a lot, and pushing it farther barely mattered (just 4 more seconds). In the end I think it doesn’t matter much how many deletes you do in each chunk. Deleting 5 million records one at a time is going to take some time.
What about the bulk inserts? That’s a different story entirely. In the alternating chunks we’re going to insert a total of 5 million records, but each chunk goes to the database in a single JDBC interaction that inserts however many records our item-count allows. Starting again at 25, I found that the step spent about 42 seconds doing inserts. Scaling up to 1000 items at a time, the inserts cost 20 seconds. That’s a huge improvement. Ramping up again to 7000 items, the elapsed time only dropped to 18 seconds. Clearly there’s a limit to how much bulk inserts buy you.
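The bulk insert side uses standard JDBC batching (addBatch/executeBatch), so the whole chunk goes to the database in one round trip. Again a sketch with hypothetical table and column names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class InsertChunk {
    // All item-count rows are queued client-side and sent in one JDBC
    // interaction. (DEMO.RESULTS and its ID column are hypothetical names.)
    static int[] insertBulk(Connection conn, List<Long> ids) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO DEMO.RESULTS (ID) VALUES (?)")) {
            for (Long id : ids) {
                ps.setLong(1, id);
                ps.addBatch();        // queue the row locally
            }
            return ps.executeBatch(); // one round trip for the whole chunk
        }
    }

    // Round trips needed: one executeBatch per chunk instead of one per row.
    static long roundTrips(long totalRows, int itemCount) {
        return (totalRows + itemCount - 1) / itemCount; // ceiling division
    }
}
```

At an item-count of 1000, 5 million inserts need only 5,000 round trips instead of 5 million, which is why the insert side scales so much better than the one-by-one deletes.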
Looking at the data for the entire step, it was clear that by far most of the elapsed time for the job was spent in the writer. Remember the whole step took four minutes and 33 seconds. It depends on the item-count value, but in general our writer spent around 200 seconds (3 minutes 20 seconds) doing deletes, while inserting rows took only around 20 seconds once we got away from low item-count values.
The breakdown at an item-count of 25 is interesting. Remember that run took about 24 minutes. As noted above, we spent 185 seconds doing deletes and 42 seconds doing inserts. The rest of the time was spent in ‘container overhead’ (around 1207 seconds). The container overhead dropped to just 41 seconds at an item-count of 1000.
Bulk interactions with the database are clearly better. Container overhead also decreases dramatically once you get past small item-count values. In general (for my application!) an item-count value of around 5000 seems pretty good.
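In JSR-352 the tuning knob discussed here is the item-count attribute on the chunk element in the job XML (JSL). A configuration along these lines would set the roughly-optimal value found above; the step id and the reader/writer ref names are hypothetical:

```xml
<step id="writeResults">
    <chunk item-count="5000">
        <reader ref="resultReader"/>
        <writer ref="alternatingWriter"/>
    </chunk>
</step>
```

The default item-count in the specification is 10, so a batch job that never touches this attribute is likely paying far more container overhead than it needs to.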