IBM Z and LinuxONE

Introducing IBM Z Batch Resiliency


When looking at the modern Z environment, we see a huge transformation taking place.  The z15 is a technological wonder: the unsung hero of the modern enterprise, serving as the system of record for many of the world’s largest companies and institutions.  Without it, countless activities in our daily lives would be severely restricted, as customer-facing applications account for a significant part of the IBM Z workload.  What helps to make the IBM Z platform so essential?  Transaction processing and database systems such as CICS, Db2 and IMS can process huge volumes of data in a single day.  What many people don’t stop to consider is the huge amount of batch processing that underpins reconciliation applications and drives updates to business-critical databases, such as those for payments and accounting.

Why do we need to consider batch management?
A key challenge for IT Operations teams, especially on IBM Z, is embracing an increasingly complex and unpredictable workload, where the risk of outages grows as operational complexity increases.

With the mainframe more interconnected than ever before, data and workloads are exposed to an increased level of risk and threats, yet application availability is a key requirement. While logs and journals may give databases a comprehensive view of applications, the same may not be the case with batch. Traditional batch processing still largely relies on processes and procedures that were originally devised decades ago, and the staff who understood the nuances of the scheduling, and who helped design the systems in the first place, are often no longer available to provide that expertise.  Batch processing cycles in large Z environments often run for longer than 24 hours, well into the next day, and 36-hour production cycles are not uncommon. If a problem occurs, your scheduler may only restart the jobs that need to be re-run, without restoring the data correctly to ensure the continued availability of critical business applications.

Resiliency means the ability to complete essential business functions and meet business SLAs.

For many, the challenge of managing this results in one of the following strategies:

  • Ignore the problem – hope outages don’t happen (or that someone else will address it)
  • Apply a patchwork approach – create a solution based on multiple tools that are incompatible with one another, or are dependent on the knowledge of a small set of domain experts
  • Overcompensate with additional redundancy factored in – back up everything multiple times, causing confusion, disorganization and slower recovery as you hope one of the backups contains the key data.

None of the above ‘fixes’ truly addresses the risks, and each ultimately wastes time, money and resources.

How can IBM Z Batch Resiliency address these concerns and risks?

Wouldn’t it be better to be able to adopt an informed, analytical approach to recovery, with a high level of confidence, by having the information in one place and the ability to recover from a single panel?  How about the ability to view, in real time, the subsequent jobs and data sets affected by the job that experienced the original issue?  Knowing where the latest and previous backups reside is also paramount for performing an effective and timely recovery.  These are all concerns that your resiliency plans should address, using a tool that derives the information through analytics and maintains it all in one place.

IBM Z Batch Resiliency allows you to track backups, ensuring that critical data sets are recoverable at any point during the batch cycle.  It can highlight exposures in the overall business resiliency process and ensure the information is at hand to recover either a specific data set or an entire application.  Using SMF data, it can also identify where data sets are used across multiple applications, enabling selection of the most appropriate backup.  Using the built-in TimeLiner feature, downstream jobs can be easily identified, allowing remedial action to be performed quickly.  The Cascade and Reverse Cascade reporting features provide simple navigation, allowing you to jump between jobs linked by one or more data sets, both forwards and backwards in time, to pinpoint the right point in time to restore from and which jobs the scheduler needs to re-run.
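The cascade idea can be sketched conceptually as a traversal of a job-to-data-set dependency graph. The sketch below is purely illustrative (the job and data set names are hypothetical, and this is not IZBR's actual implementation or API): a forward cascade walks from a failed job through every data set it writes to every downstream job that reads those data sets, which is the set of jobs a scheduler would need to re-run.

```python
from collections import defaultdict

# Hypothetical model: each batch job reads some data sets and writes others.
# IZBR derives this kind of linkage from SMF data; here it is hard-coded.
jobs = {
    "PAYIN01": {"reads": {"PROD.INPUT.TXNS"},    "writes": {"PROD.STAGED.TXNS"}},
    "PAYPOST": {"reads": {"PROD.STAGED.TXNS"},   "writes": {"PROD.LEDGER.MASTER"}},
    "RECON01": {"reads": {"PROD.LEDGER.MASTER"}, "writes": {"PROD.RECON.RPT"}},
}

def cascade(start_job):
    """Forward cascade: all jobs downstream of start_job via shared data sets."""
    readers = defaultdict(set)  # data set name -> jobs that read it
    for name, job in jobs.items():
        for ds in job["reads"]:
            readers[ds].add(name)
    affected, frontier = set(), [start_job]
    while frontier:
        current = frontier.pop()
        for ds in jobs[current]["writes"]:
            for downstream in readers[ds]:
                if downstream not in affected:
                    affected.add(downstream)
                    frontier.append(downstream)
    return affected

print(sorted(cascade("PAYIN01")))  # → ['PAYPOST', 'RECON01']
```

A reverse cascade is the same walk in the opposite direction, following a job's reads back to the jobs that wrote those data sets, to locate the point in time to restore from.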

Time is critical when you need to identify the most appropriate backup to restore and restart batch workloads

Combined with your monitoring, automation and scheduling tooling, your IT Operations team is now in a position to respond to data corruption events that affect batch workloads: by understanding the dependencies, they can execute the recovery process in a timely fashion and minimize the impact and the potential for a system outage.

Key Takeaways and Next Steps
This is just the first step in improving the ability of operations to embrace and manage modern hybrid cloud workloads. Hopefully this has given you insight into best practices for managing batch workloads and data. You may want to evaluate your current process against these, including how you plan recovery from data corruption. If this has generated questions, we would be pleased to talk further with you on this topic.

If you want to learn more about IBM Z Batch Resiliency contact your IBM representative to set up a technical discussion meeting. Please also review the Announcement Letter and Knowledge Center for more details.