JSR-352 (Java Batch) Post #26: Processing Data with an ItemProcessor

By David Follis posted Wed January 23, 2019 09:24 AM

  


This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application.

------

The first thing to remember about the ItemProcessor is that it is optional. You can define a chunk step with just a reader and a writer. That makes sense because you may have cases where you are just reading data from here and writing it over there and there’s no real “processing” to do on the data. So it is OK to leave the processor out.

 

If you have a processor, you should carefully consider what data will be provided to it. Obviously it is the record that the ItemReader returned from its readItem method. But what is that, specifically? You might just think that it is the record you read. That might be a data object built from a row in a database or it might be a String (maybe JSON). But remember that this object is how you communicate between the reader and the processor. Creating an object to wrap the read data allows you to include other attributes (flags, etc.) that might be handy. Not sure what those might be? It couldn’t hurt to have a wrapper class anyway, so you have a place to add things later when you realize you need something.
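To make the wrapper idea concrete, here is a minimal sketch of what such a class might look like. The class name `InputRecord` and its fields are made up for illustration; nothing in JSR-352 defines them, and your own wrapper would carry whatever attributes your reader and processor need to share.

```java
// Hypothetical wrapper returned by the reader's readItem method.
// The class and field names are illustrative; JSR-352 does not define them.
final class InputRecord {
    private final String payload;   // the data as read, e.g. a JSON string
    private final long lineNumber;  // extra attribute: where the record came from
    private boolean needsReview;    // extra attribute: a flag the reader can set

    InputRecord(String payload, long lineNumber) {
        this.payload = payload;
        this.lineNumber = lineNumber;
    }

    String getPayload() { return payload; }
    long getLineNumber() { return lineNumber; }
    boolean isNeedsReview() { return needsReview; }
    void setNeedsReview(boolean needsReview) { this.needsReview = needsReview; }
}
```

The reader would return an `InputRecord` instead of the raw String, and the processor would cast the incoming `Object` back to `InputRecord`. Adding another attribute later only touches this one class.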

 

There’s also a temptation to write a ‘generic’ reader that reads from some data source you use a lot and provides its input to a bunch of different processors.  That’s pretty handy and allows you to build new batch jobs by putting together existing readers and processors.  But remember that making these artifacts generically usable for different purposes will sometimes mean code executes in one use that is only needed in some other use.  Might not seem like a big deal, but remember that the reader and processor will execute over and over for every record you process.  A few extra bits of code, executed a few million times unnecessarily, can substantially add to the elapsed execution time for the step.  So remember to think about cost when you are trying to write reusable batch artifacts.  

 

What are we talking about?  Oh right, the ItemProcessor.  It just has one method, processItem.  And it does whatever processing you need it to do.  There’s not much to say about that.  You know what you need to do to the data and this is where you do it. Call a rules engine, do some data transformation, ask Watson for some analysis, do fraud detection, check inventory…whatever.  This is where the action is.
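As a rough sketch of the shape of a processor, the class below does a trivial data transformation. The method signature mirrors `processItem` from `javax.batch.api.chunk.ItemProcessor`; the class name `NormalizeNameProcessor` and the assumption that the reader hands over a plain String are mine. To keep the sketch compilable without the batch API on the classpath, the `implements` clause is left as a comment.

```java
import java.util.Locale;

// In a real batch application this class would declare
// "implements javax.batch.api.chunk.ItemProcessor"; that is omitted here
// so the sketch compiles without the batch API jar.
class NormalizeNameProcessor {

    // Called once for every item the reader returns.
    // Assumes (for illustration) that the reader produced a String.
    public Object processItem(Object item) {
        String name = (String) item;
        // The "processing": trim whitespace and upper-case the value.
        return name.trim().toUpperCase(Locale.ROOT);
    }
}
```

Whatever this method returns is what the writer eventually sees, so a call such as `new NormalizeNameProcessor().processItem("  smith ")` hands back the cleaned-up value for writing.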

 

When you are done, if you have something to write, you return an object that will get passed to the ItemWriter. If there is nothing to write for this item, return null and the item is filtered out. We’ll see that the writer gets all the objects to write at once at the next checkpoint, so these objects pile up in memory until then. Don’t get crazy with extra data in the write-object if your checkpoints could involve a lot of processed data.
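Here is a small sketch of that flow. In JSR-352, a processor that returns null tells the runtime there is nothing to write for that item. Everything else below is assumed for illustration: the `FraudAlert` and `LargeTransferProcessor` names, the "accountId,amount" record format, the threshold, and the `ChunkSimulation` loop, which only imitates the way results accumulate between checkpoints and is not how any batch runtime is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lean write-object: only the fields the writer needs.
final class FraudAlert {
    final String accountId;
    final double amount;

    FraudAlert(String accountId, double amount) {
        this.accountId = accountId;
        this.amount = amount;
    }
}

// In a real application this would implement javax.batch.api.chunk.ItemProcessor.
class LargeTransferProcessor {
    private static final double THRESHOLD = 10_000.0; // illustrative value

    // Expects a record of the form "accountId,amount" (an assumed format).
    public Object processItem(Object item) {
        String[] fields = ((String) item).split(",");
        double amount = Double.parseDouble(fields[1].trim());
        if (amount < THRESHOLD) {
            return null; // nothing to write for this record
        }
        return new FraudAlert(fields[0].trim(), amount);
    }
}

// Rough imitation of what happens between two checkpoints.
final class ChunkSimulation {
    static List<Object> processChunk(List<String> records) {
        LargeTransferProcessor processor = new LargeTransferProcessor();
        List<Object> pending = new ArrayList<>(); // held in memory until the checkpoint
        for (String record : records) {
            Object result = processor.processItem(record);
            if (result != null) {
                pending.add(result);
            }
        }
        return pending; // a real runtime would pass this list to the writer
    }
}
```

Every non-null result stays in the pending list until the checkpoint, which is why a small object like `FraudAlert` is friendlier to memory than passing the whole input record along.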

 

We’ll look at the writer next…
