Planning for the expected - increasing application resiliency for planned outages

By Xiao Xia Mao posted Thu February 27, 2020 10:01 PM


Although an unplanned failure of the production site is highly unlikely, IT departments need to spend resources to ensure that their business-critical data and applications can be successfully recovered if one occurs. Minimizing data loss is the highest priority, but it can come at the expense of application availability. Typically, following an unplanned outage, the entire data center can be restarted at the disaster recovery (DR) site, but it can take several hours or longer before the applications are available.


What is sometimes overlooked is how to maintain application availability for a far more likely scenario: a planned outage for a maintenance activity. IT departments schedule maintenance windows in order to apply software fixes or perform application upgrades. Their goal is to minimize the number of maintenance windows required and to keep each window as short as possible. Despite these efforts, maintenance windows can still last for several hours and occur multiple times per year. Since recovering the data and applications on the DR site could take several hours, it makes little sense to use the DR site during planned maintenance activities, as the time it takes to switch to the DR site and back to the production site could be longer than the maintenance window itself. As a result, IT departments try to schedule these maintenance windows when the impact to their customers is smallest, usually on a weekend night.


What if there were a way to switch access to business-critical applications and their data from one site to another in a few minutes, rather than a few hours? With application unavailability during maintenance windows shrinking to several minutes, these windows could be scheduled more frequently, ensuring systems and applications are always running with the most up-to-date fixes. So how can this be accomplished? Such a reduction in site switch times can be achieved by using a software data replication product to keep the data sources used by the applications in sync across two sites, and IBM Multi-site Workload Lifeline to distribute connections for these applications.


IBM Multi-site Workload Lifeline, or Lifeline for short, provides the ability to perform a graceful switch of the applications and their data sources, called workloads by Lifeline, during planned outages. By using simple Lifeline commands, a workload can be easily migrated from one site to another, minimizing the downtime for planned events such as scheduled maintenance activities. So what makes Lifeline different from existing disaster recovery solutions? First, Lifeline is not an all-or-nothing solution. Rather than initially planning for, and providing system resources for, the recovery of all workloads in the production site, IT departments can focus on their most critical workload first and gradually roll out the solution for additional workloads as needed. A second differentiator is that Lifeline requires no changes to the applications or to the clients accessing the applications and data. Following a planned outage, no manual changes to the network topology are necessary before the workload can be accessed on the alternate site.


As mentioned earlier, a key component to ensure a quick switch of applications and data to the alternate site is software data replication. Depending on the data source being used by the application, a different software replication product would be used to keep the data source in sync across the sites. For example, for applications utilizing DB2, IBM InfoSphere Data Replication for DB2 would be used to keep DB2 data in sync. Lifeline ensures connections for a workload are distributed to only one site at a time, to make certain that updates to the data source are occurring on only one site at any point in time.
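
To make the "only one site at a time" rule concrete, here is a minimal Python sketch of a routing record that sends every new connection for a workload to a single active site. It is only an illustration of the idea: the class name `WorkloadRouting`, its fields, and the site names are hypothetical and do not correspond to any Lifeline interface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WorkloadRouting:
    """Hypothetical routing record for one workload (not a Lifeline structure)."""
    name: str
    sites: Tuple[str, str]
    # The one site allowed to receive new connections.
    # None means new connections are not distributed to either site.
    active_site: Optional[str]

    def route(self) -> str:
        """Return the single site that may receive a new connection."""
        if self.active_site is None:
            raise RuntimeError(
                f"workload {self.name!r}: new connections are not being distributed"
            )
        if self.active_site not in self.sites:
            raise ValueError(f"unknown site {self.active_site!r}")
        return self.active_site


# Every new connection lands on the same site, so updates to the
# replicated data source originate from only one site at a time.
routing = WorkloadRouting("payments", ("site1", "site2"), active_site="site1")
targets = {routing.route() for _ in range(100)}
print(targets)
```

Because there is a single `active_site` field rather than a per-site weight, the sketch cannot express a state in which both sites accept updates for the same workload.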


Lifeline enables the graceful switch of a workload from one site to the other by:

  • First, preventing new connections for the workload from being distributed to either site, while giving existing connections to the production site a chance to complete their work.
  • Next, resetting any connections on the production site that have not completed their work. This guarantees that no additional updates to the workload's data source can occur on this site.
  • Finally, allowing new connections for the workload to be distributed to the alternate site.
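
The three steps above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not Lifeline's implementation: the `Connection` and `Workload` classes, the `graceful_switch` function, and the notion of "drain ticks" as a stand-in for the grace period are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Connection:
    conn_id: int
    site: str
    work_remaining: int      # abstract units of work left
    state: str = "open"      # "open", "completed", or "reset"


@dataclass
class Workload:
    name: str
    active_site: Optional[str]   # None: new connections go to neither site
    connections: List[Connection] = field(default_factory=list)

    def connect(self, conn_id: int, work: int) -> Connection:
        """Accept a new connection on the currently active site."""
        if self.active_site is None:
            raise RuntimeError("new connections are not being distributed")
        conn = Connection(conn_id, self.active_site, work)
        self.connections.append(conn)
        return conn


def graceful_switch(workload: Workload, target_site: str, drain_ticks: int) -> List[int]:
    """Move a workload to target_site; return the ids of connections that were reset."""
    source = workload.active_site

    # Step 1: stop distributing new connections to either site, and give
    # existing connections on the source site a chance to finish.
    workload.active_site = None
    for _ in range(drain_ticks):
        for conn in workload.connections:
            if conn.site == source and conn.state == "open":
                conn.work_remaining -= 1
                if conn.work_remaining <= 0:
                    conn.state = "completed"

    # Step 2: reset whatever is still open on the source site, so no
    # further updates to the data source can originate there.
    reset_ids = []
    for conn in workload.connections:
        if conn.site == source and conn.state == "open":
            conn.state = "reset"
            reset_ids.append(conn.conn_id)

    # Step 3: allow new connections again, now on the alternate site.
    workload.active_site = target_site
    return reset_ids


workload = Workload("payments", active_site="site1")
workload.connect(1, work=1)   # finishes within the grace period
workload.connect(2, work=5)   # still busy when the grace period ends
print(graceful_switch(workload, "site2", drain_ticks=2))
print(workload.connect(3, work=1).site)
```

The ordering is the point of the sketch: the source site is fully quiesced before the alternate site accepts its first connection, so at no moment can both sites take updates for the same workload.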

Lifeline takes Z Resiliency even further by helping to deliver a continuous availability solution for both planned and unplanned outages. For more information, see the IBM Multi-site Workload Lifeline product page.