This post is part of a series delving into the details of the JSR-352 (Java Batch) specification. Each post examines a very specific part of the specification and looks at how it works and how you might use it in a real batch application. To start at the beginning, follow the link to the first post. The next post in the series is here.
This series is also available as a podcast on iTunes, Google Play, Stitcher, or use the link to the RSS feed.
-----
The Job Repository is really just a set of tables living in a database. Multiple Liberty servers can easily be configured to use the same set of tables. They can just as easily be configured to use separate sets of tables. When should servers share a repository (if ever) and when should they stay separate?
The key, as we mentioned last time, is the numbers (identifiers) that represent the jobs. Every time a new job is submitted it receives an instance id and an execution id. Those identifiers are unique within the Job Repository. So, servers sharing a repository will get unique identifiers from one set. Servers using a different repository can get the same identifier values, but they obviously represent different jobs.
So when someone says to you, “I need the output from job 75348,” will you be able to figure out which actual job that is? Is it obvious from the context of the question which repository they mean, and thus which job?
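To make the ambiguity concrete, here is a minimal sketch in plain Java. It does not use the batch runtime at all: `FakeRepository`, its `submit` and `qualify` methods, and the repository names PROD and TEST are all made up for illustration, and a simple counter stands in for however the real repository generates its identifiers.

```java
import java.util.concurrent.atomic.AtomicLong;

public class RepositoryIds {

    /** Hypothetical stand-in for a Job Repository; it only hands out execution ids. */
    static final class FakeRepository {
        private final String name;
        private final AtomicLong nextExecutionId;

        FakeRepository(String name, long firstId) {
            this.name = name;
            this.nextExecutionId = new AtomicLong(firstId);
        }

        /** Simulates submitting a job: the id is unique within this repository only. */
        long submit() {
            return nextExecutionId.getAndIncrement();
        }

        /** Prefixes the id with the repository name so it is unambiguous everywhere. */
        String qualify(long executionId) {
            return name + ":" + executionId;
        }
    }

    public static void main(String[] args) {
        // Servers A and B share one repository; server C uses its own.
        FakeRepository prod = new FakeRepository("PROD", 75348);
        FakeRepository test = new FakeRepository("TEST", 75348);

        long fromServerA = prod.submit(); // 75348
        long fromServerB = prod.submit(); // 75349, never collides with server A
        long fromServerC = test.submit(); // 75348 again, but a different job

        System.out.println(fromServerA == fromServerB); // false
        System.out.println(fromServerA == fromServerC); // true
        System.out.println(prod.qualify(fromServerA));  // PROD:75348
        System.out.println(test.qualify(fromServerC));  // TEST:75348
    }
}
```

If you do run more than one repository, some convention like the qualified form above (in tickets, emails, and log searches) is one way to keep "job 75348" from meaning two things.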
Consider it from another direction. If you make a REST request to a server asking for a list of jobs meeting some criteria, that list is generated from the Job Repository being used by that server. If that repository is shared by other servers, the returned list might include jobs run in those other servers. Does that make sense to you?
Probably development, test, and production should have separate repositories. The earlier stages might have quite a few different repositories (each developer might have their own or be using an in-memory repository). But in production, is one enough? Should you partition them by Line of Business or some other organization boundary? Or just have one big production repository that is used by everything?
Most factors favor a single repository for a production environment, although there are certainly reasons to separate them. I’ll just bring up one here.
Remember the repository doesn’t just keep track of the jobs that have run and their final status. Lots of in-flight information about running jobs is kept in the repository. For a chunk step, every checkpoint requires updating checkpoint information in the repository. If the database containing the repository tables is physically distant from the server running the job, the latency of pushing that update to a remote database at EVERY checkpoint (of thousands, maybe millions) could significantly increase the elapsed time of the job.
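A rough back-of-the-envelope calculation shows how quickly this adds up. The sketch below is only arithmetic: the record count and the round-trip times are invented numbers, and it assumes each checkpoint costs exactly one database round trip, which a real runtime may well exceed. The item-count of 10 is the default the specification gives for a chunk step.

```java
public class CheckpointLatency {

    /** Number of checkpoints for a chunk step that checkpoints every itemCount items. */
    static long checkpoints(long totalItems, int itemCount) {
        return (totalItems + itemCount - 1) / itemCount; // round up for a partial last chunk
    }

    /** Elapsed seconds spent purely on repository round trips (one per checkpoint assumed). */
    static double addedSeconds(long checkpoints, double roundTripMillis) {
        return checkpoints * roundTripMillis / 1000.0;
    }

    public static void main(String[] args) {
        long totalItems = 10_000_000L; // hypothetical number of records in the step
        int itemCount = 10;            // default chunk item-count in JSR-352

        long n = checkpoints(totalItems, itemCount);

        double nearbyMillis = 1.0;   // assumed round trip to a database close to the server
        double distantMillis = 30.0; // assumed round trip to a physically distant database

        System.out.printf("checkpoints: %d%n", n);
        System.out.printf("nearby repository:  %.0f seconds%n", addedSeconds(n, nearbyMillis));
        System.out.printf("distant repository: %.0f seconds%n", addedSeconds(n, distantMillis));
    }
}
```

With these made-up numbers, a million checkpoints cost about 1,000 seconds (roughly 17 minutes) against the nearby database and 30,000 seconds (over eight hours) against the distant one. Raising the item-count reduces the number of checkpoints, at the price of more work to redo after a restart.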
Best practice? Whatever works best for you :-)