Recover quickly with surgical precision
Resiliency is an integral part of the IBM Z platform. The IBM Z Cyber Vault solution is a concept for addressing data corruption, whether through malicious intent, human error, or failure of hardware or software. It provides an air-gapped backup of the operating system and all data. The Safeguarded copy process is designed to take copies at a user-specified interval. This blog does not describe or explain the details of that process; rather, it discusses one part of the software stack that brings added value to this resiliency solution: IBM Z Batch Resiliency. It provides specific functions and capabilities that make the recovery process simpler, faster, and less manual.
IBM Z Batch Resiliency and cyber resiliency
IBM Z Batch Resiliency (IZBR) is designed to ensure that all the non-database managed data has the same resiliency as the database managed data. Db2 and IMS have recovery logs and tools, but these recovery capabilities do not exist for data such as sequential and VSAM files. Until now, it has been up to each application to be sure files are backed up and that they can be recovered. This can become a very complex and convoluted process, which often leads to too many backups and wasted resources. In addition, when a failure occurs (and it is when rather than if a failure occurs), it is often difficult to know the best backup version to use for recovery. IZBR has been developed to solve this problem by understanding the use of the data, monitoring all the backups, and ensuring that the right backups are taken. When an error occurs, the most appropriate backup can be used to restore the data in minutes. IZBR uses extensive analytics of various data sources, such as SMF data and input from the scheduler, to store an inventory of all critical data sets and their backups.
The latest release of IBM Z Batch Resiliency has been enhanced to provide some of these capabilities in support of cyber resiliency strategies. One of the key values is that IZBR provides information on the status of the non-database managed data at the time of the Safeguarded copy. While the database management tools can process “fuzzy” backups, non-database managed data sets are considered unreliable if they are open for output at the time of a backup. Because the Safeguarded copy is done at a volume level, there is no consideration of the state of the data sets, so it is very likely that some non-database managed data sets will be open for output.
IZBR provides two different ways to get this information. First, the IZBR Cyber Vault Health Check report provides a list of all data sets that are open for output at the time of the Safeguarded copy. This report could be used to identify the best Safeguarded copy to use (that is, the one with the fewest data sets open for write) or it might be used to find alternative Safeguarded copies for selected data sets that are unusable from the primary Safeguarded copy that is being used.
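The selection logic described above can be sketched in a few lines of Python. This is only an illustration, not IZBR's report format or API: the CopyHealth record, the copy identifiers, and the data set names are all hypothetical, and it assumes the health check information has already been reduced to "which data sets were open for output in each copy".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyHealth:
    """Hypothetical summary of one Safeguarded copy (not an IZBR structure)."""
    copy_id: str
    open_for_output: frozenset  # data sets open for output when the copy was taken


def best_copy(reports):
    """Return the copy with the fewest data sets open for output."""
    return min(reports, key=lambda r: len(r.open_for_output))


def alternatives_for(dataset, reports):
    """Return the ids of copies in which the given data set was not open for output."""
    return [r.copy_id for r in reports if dataset not in r.open_for_output]


reports = [
    CopyHealth("SGC-01", frozenset({"PAY.MASTER", "GL.LEDGER"})),
    CopyHealth("SGC-02", frozenset({"GL.LEDGER"})),
    CopyHealth("SGC-03", frozenset({"PAY.MASTER", "GL.LEDGER", "AR.OPEN"})),
]

primary = best_copy(reports)
usable_for_pay_master = alternatives_for("PAY.MASTER", reports)
```

Here the primary copy would be the one with a single open data set, and the second function answers the other question from the paragraph: which copies could supply a usable version of a specific data set.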
The second way is through IZBR’s surgical recovery mechanism, which provides a data set list capability similar to ISPF 3.4. From this list, one or more data sets can be selected. IZBR will display which Safeguarded copies contain that data set and whether the data set was open or closed. In addition, with a recently added capability, IZBR also keeps track of the volume where the data set was located over time. This new feature, called the 3D Virtual Katalog, is invaluable in today’s dynamic storage environments, where a data set may not always be on the same volume. It saves the analyst from a time-consuming hunt when, in past Safeguarded copies, the data set is not on the volume where the catalog says it is today. In addition to recovering non-database managed data, the surgical recovery capabilities can be used to recover any data set from the Safeguarded copies. This is a key capability that can greatly simplify and speed recovery. Surgical recovery is described in greater detail below.
Of particular interest to the IBM Z Cyber Vault solution are IZBR’s Timeliner reports. One of them, the Reverse Cascade Report, looks backwards from a particular point in time for a data set and shows all the jobs that have used or influenced it. It displays these jobs and shows, for each job, the data sets that were used as input or output. This information provides analytics that help forensically identify the points where a data set could have become corrupted.
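To make the idea of looking backwards concrete, here is a minimal Python sketch of a reverse walk over job history. It is not how IZBR implements the report; the JobRun record, its fields, and the integer timestamps are assumptions made purely for illustration. Starting from one data set, it collects every job that wrote it before the chosen point in time, then repeats the search for each of those jobs' inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    """Hypothetical job history record (not an IZBR structure)."""
    name: str
    ended: int            # illustrative timestamp, e.g. minutes since midnight
    inputs: frozenset     # data sets the job read
    outputs: frozenset    # data sets the job wrote


def reverse_cascade(dataset, as_of, runs):
    """Return the jobs that influenced `dataset` up to `as_of`, newest first."""
    seen = set()
    pending = [(dataset, as_of)]
    while pending:
        dsn, cutoff = pending.pop()
        for run in runs:
            if dsn in run.outputs and run.ended <= cutoff and run not in seen:
                seen.add(run)
                # Anything this job read may have carried the problem in.
                pending.extend((inp, run.ended) for inp in run.inputs)
    return sorted(seen, key=lambda r: r.ended, reverse=True)


runs = [
    JobRun("FTPIN", 50, frozenset(), frozenset({"PARTNER.FEED"})),
    JobRun("EXTRACT", 100, frozenset({"PARTNER.FEED"}), frozenset({"APP.STAGE"})),
    JobRun("UPDATE", 200, frozenset({"APP.STAGE"}), frozenset({"APP.MASTER"})),
    JobRun("REPORT", 300, frozenset({"APP.MASTER"}), frozenset({"APP.REPORT"})),
    JobRun("LATEUPD", 400, frozenset(), frozenset({"APP.MASTER"})),
]

suspects = [r.name for r in reverse_cascade("APP.MASTER", 350, runs)]
```

In this made-up history the walk reaches the FTP job two steps upstream, while the job that only read the data set and the job that ran after the chosen point in time are left out.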
In addition to the Reverse Cascade Report, IZBR provides a similar view that advances from the selected point in time (the Forward Cascade Report). This report is very useful for creating a recovery plan for a data set that is being restored. The Forward Cascade Report can be run from the point in time to which the data set is being restored, and it will show all the work, even if it involves multiple applications, that needs to be rerun to bring the data set to the desired state. The alternative would be to use the scheduler to rerun all jobs, which may run unnecessary work (jobs that had nothing to do with that data set) or miss work because the scheduler does not know the relationship of a data set to an application. If multiple applications are dependent on the data set, any secondary applications may be out of sync with the restored data set.
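The forward direction can be sketched the same way. Again, this is an illustrative Python model under stated assumptions, not IZBR's algorithm: the JobRun record and timestamps are hypothetical, and a job is treated as needing a rerun if it wrote the restored data set after the restore point or read anything derived from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    """Hypothetical job history record (not an IZBR structure)."""
    name: str
    started: int          # illustrative timestamp
    inputs: frozenset     # data sets the job read
    outputs: frozenset    # data sets the job wrote


def forward_cascade(dataset, restored_at, runs):
    """Return, in time order, the jobs to rerun after restoring `dataset`."""
    stale = {dataset}     # data sets whose current content depends on the restored one
    rerun = []
    for run in sorted(runs, key=lambda r: r.started):
        if run.started < restored_at:
            continue
        if run.inputs & stale or dataset in run.outputs:
            rerun.append(run)
            stale |= run.outputs   # downstream data sets are now stale as well
    return rerun


runs = [
    JobRun("OLDUPD", 50, frozenset(), frozenset({"APP.MASTER"})),
    JobRun("UPDATE", 120, frozenset({"APP.TXN"}), frozenset({"APP.MASTER"})),
    JobRun("PAYROLL", 130, frozenset({"HR.DATA"}), frozenset({"HR.PAY"})),
    JobRun("REPORT", 150, frozenset({"APP.MASTER"}), frozenset({"APP.REPORT"})),
    JobRun("BILLING", 200, frozenset({"APP.REPORT"}), frozenset({"BILL.OUT"})),
]

plan = [r.name for r in forward_cascade("APP.MASTER", 100, runs)]
```

The resulting plan skips the unrelated payroll job but includes the billing job from a second application, which is the behavior the paragraph above contrasts with rerunning everything from the scheduler.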
Supporting the IBM Z Cyber Vault for surgical recovery of data
Let’s look at the use case of application corruption to visualize how IBM Z Batch Resiliency forms a critical part of the IBM Z Cyber Vault solution. In this case, the surgical recovery of that application’s data is sufficient to recover the environment. In a future blog, we will discuss how IZBR can assist when a catastrophic recovery is needed.
In this use case, the assumption is that Safeguarded copies are produced once a day and that IBM Z Cyber Vault validates the system and application data on each copy. This document is not intended to explain or describe IBM Z Cyber Vault, but it is necessary to review the validation processes since they are referenced. This description comes from the IBM Redbooks publication Getting Started with IBM Z Cyber Vault. There are three types of data validation that are specific to each IBM Z environment:
- Type 1 validation – System data
Validate whether an LPAR can fully IPL from the restored volumes, checking out core parts of z/OS by enabling subsystems and logging into them.
- Type 2 validation – Data structures
Perform health checks, run scripts and tools to validate catalogs and other core parts of the system, check and validate Coupling Facility structures. Validate that key middleware, databases, and runtimes are operational. These data structure validations of CICS, MQ, Db2, IMS, and batch environments ensure the z/OS image can run applications, handle transactions, and process data. All these validations are needed to know that a system is fully operational.
- Type 3 validation – Application data
This validation is to ensure that application and user data stored in datasets, databases, or other subsystems is valid. This validation is the final step to ensure that a copy is not corrupted from malware, ransomware, or any other source of intentional or unintentional data corruption, and able to be trusted. These validations can be done by running numerous database queries, running batch programs and online transactions, and running other application tests to prove that data is available. It is the responsibility of the application, database, and technology teams to provide the appropriate tools and scripts to run these tests, which will be incorporated into the Cyber Vault automation framework to be executed.
In this use case, Type 3 validation of application data is utilized.
To begin, an alert is triggered when a non-database managed dataset fails application validation during nightly Cyber Vault processes. Security and recovery personnel will use IBM Z Cyber Vault to do forensics to analyze the failure and they will use IBM Z Batch Resiliency to recover the data set and develop a recovery plan.
"When did the corruption occur?"
Because the application data validated successfully in the previous day’s Safeguarded Copy, the corruption must have happened sometime in the last 24 hours. Once the specific application data affected is identified, identify all the jobs and users that updated that data set. Are there any unexpected jobs or users updating the data set? Perhaps there is an FTP job that brings data in from a business partner, or there are other applications in the cloud that provide data to the application. The IZBR Timeliner reports help to identify these types of jobs that may have introduced the corruption. The IZBR Timeliner Reverse Cascade report can be run to look at all jobs and users that updated this data set in the last 24 hours (or any time span) to identify the needed information quickly and easily. While IZBR cannot detect the corruption, it can itemize all the points in time when the data set was open for output. The cause and intent of the corruption are also outside the scope of IZBR, but by quickly identifying where the update could have occurred, resources can focus their analysis on these points to determine which of them resulted in the corruption.
"What application or applications are affected?"
Even if only one data set failed the validation process, it is important to know if there is more than one application that is dependent on this data set. This could be helpful if it is not readily apparent where the corruption was introduced. If, for example, two applications use the data set, the corruption could have come from either application’s workflows. There are two IZBR reports that help to address this question. The first is the CROSSDEP report, which identifies when multiple applications use a data set. The second is the IZBR Timeliner Reverse Cascade report, which looks backwards from a point in time, showing every job and user that has opened the data set. Without these reports, finding the information would depend on either application documentation or an analyst who knows the applications. Often this type of information is not readily available and is difficult to determine with certainty.
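The cross-dependency question amounts to grouping data set usage by application. The Python below is a minimal sketch of that grouping, assuming usage has already been collected as (application, data set) pairs; the function name, application names, and data set names are hypothetical and do not reflect the CROSSDEP report's actual layout.

```python
from collections import defaultdict


def cross_dependencies(usage):
    """Map each data set used by more than one application to those applications.

    `usage` is an iterable of (application, data set) pairs; this input
    shape is an assumption for illustration only.
    """
    apps_by_dataset = defaultdict(set)
    for application, dataset in usage:
        apps_by_dataset[dataset].add(application)
    return {
        dsn: sorted(apps)
        for dsn, apps in apps_by_dataset.items()
        if len(apps) > 1
    }


usage = [
    ("BILLING", "CUST.MASTER"),
    ("ORDERS", "CUST.MASTER"),
    ("ORDERS", "ORD.HISTORY"),
    ("BILLING", "CUST.MASTER"),   # repeated use by the same application
]

shared = cross_dependencies(usage)
```

Only the customer master appears in the result, because it is the only data set touched by two different applications.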
"What backups are safe and available (application, Safeguarded copies)?"
Once the cause of the corruption is determined and neutralized and it is deemed safe to restore and recover, it is important to know the best source to recover from. This is largely based on the cause and nature of the corruption. If the corruption is known not to be malicious and the regular application backups are reliable, then those backups may be the most obvious choice. However, if there is any doubt about the reliability of the application backups, then the Safeguarded Copy is the better choice.
For the sake of this use case, it has been determined to use the Safeguarded Copy to restore the data via a surgical recovery of the affected data set(s). Since the corruption in this case was isolated to just one non-database managed data set the option is to use IZBR to surgically recover the data set.
Within IZBR, the user can type in the name of the specific data set. IZBR looks in the Safeguarded Copies and the 3D Virtual Katalog, presenting the user with all the Safeguarded Copies that contain that data set. In addition, the volume or volumes where the data set resided at the time of the copy are displayed. Of critical importance is the state of the data set on the Safeguarded Copy. For the sake of simplicity, assume it was closed on the last Safeguarded Copy before the corruption. This Safeguarded Copy is selected from the list, and the JCL to surgically restore the data set is presented. The data set can be restored to a staging volume with a new name so that it can be verified after the restore to be sure it is what is expected. Once verified, it can be renamed and put into production, ready for forward recovery.
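The choice made in this step, the most recent copy taken before the corruption in which the data set was closed, can be expressed as a short Python sketch. The CopyEntry record, its fields, the volume serials, and the timestamps are hypothetical stand-ins for what the IZBR panel displays; this does not generate or model the restore JCL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyEntry:
    """Hypothetical listing row for one data set in one Safeguarded Copy."""
    copy_id: str
    taken_at: int          # illustrative timestamp (hours)
    volumes: tuple         # where the data set resided at copy time
    open_for_output: bool  # state of the data set when the copy was taken


def pick_restore_source(entries, corrupted_at):
    """Return the newest copy before `corrupted_at` with the data set closed, or None."""
    candidates = [
        e for e in entries
        if e.taken_at < corrupted_at and not e.open_for_output
    ]
    return max(candidates, key=lambda e: e.taken_at, default=None)


entries = [
    CopyEntry("SGC-MON", 24, ("VOL001",), False),
    CopyEntry("SGC-TUE", 48, ("VOL007",), False),
    CopyEntry("SGC-WED", 72, ("VOL007",), True),
    CopyEntry("SGC-THU", 96, ("VOL012",), False),
]

source = pick_restore_source(entries, corrupted_at=80)
```

With corruption placed after the third copy, that copy is skipped because the data set was open for output, and the second copy is chosen along with the volume the data set lived on at that time.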
In addition to surgical recovery of non-database managed data, IZBR can be used to recover any data set. This would include Db2 files, CICS VSAM files, or virtually any data set that is on the Safeguarded copy. As described above, it is as simple as providing a data set name or mask and getting the list of the data sets. From that list, you can put an "R" next to the data set and generate the job to surgically recover it.
Without IZBR, restoring a single data set would be a manual and time-consuming process. First, it depends on knowing on what volume the data set resided at the time of the copy. While this may be static in many cases, with SMS it is very possible that data sets could be on different volumes at different times. If the data set was on a different volume, it would take valuable time to research where it was at the time of the copy, or it would require recovery of unnecessary volumes in order to go hunting for the data. Secondly, to do a surgical recovery manually, an entire volume or set of volumes needs to be recovered and the data set moved from the recovery volume to the target volume after verification that it is the appropriate data set. This takes both time and resources to complete, which can be costly, especially if it must be done multiple times in order to find the volume where the data set was at the time of the copy.
"What work needs to be rerun to recover to current?"
Once the data set or data sets are recovered, the work that needs to be rerun to bring that data set in sync with the rest of the application needs to be identified. As in all steps of recovery, time is of the essence. It is important to get back to a current state as quickly as possible. One way to do this is to avoid running unnecessary work while, just as importantly, not missing critical work that must be completed. The IZBR Timeliner Forward Cascade report helps to identify the most accurate and efficient forward recovery plan. This report identifies all the work across all applications that is dependent on this data set. This helps to prevent running jobs from the scheduler that have no dependency on the data that was restored. Without access to the Timeliner Forward Cascade report, the scheduler would be used to run ALL dependent jobs from the point of the restore. First, this could cause unnecessary work that wastes resources and time. Secondly, the scheduler does not have a data set view of the applications. If there are other applications that depend on this data, they could be missed in the recovery process and cause more issues in the future.
Conclusion and next steps
In conclusion, the IBM Z Cyber Vault is a leap forward in providing a structure and foundation to ensure that many of the risks of ransomware, software and human errors or other forms of corruption can be mitigated. There are several hardware and software products necessary to allow this solution to provide the desired results.
IBM Z Batch Resiliency provides unique capabilities that further enhance your cyber resiliency strategy, ensuring that when the time comes to use the IBM Z Cyber Vault in a critical moment, you can recover your non-database managed data quickly and accurately. Without IZBR, this could prove to be difficult, time-consuming, and error-prone. When time is of the essence and the stress levels are already high, IZBR helps to minimize the time and the angst when it comes to recovering non-database managed data.
If you want to learn more about IBM Z Batch Resiliency and how it can assist in strengthening your cyber resiliency strategy, please reach out to us or your IBM representative for a discussion and demonstration.
There are more details about the IBM Z Cyber Vault solution and scenarios in the IBM Redbooks publication Getting Started with IBM Z Cyber Vault. Watch for another blog on this topic covering how IBM Z Batch Resiliency supports catastrophic recovery.