Co-authored by Claudia Zenter
Have you been asking yourself whether IBM Business Automation Workflow (BAW) on traditional IBM WebSphere (tWAS) allows for an active/active configuration across data centers (DCs) to address high availability (HA) and disaster recovery (DR) requirements? If so, this blog entry provides the answers and respective background on the subject.
While this is not a new question, it has recently attracted broader interest.
The official answer is:
An active/active configuration of IBM Business Automation Workflow with one IBM WebSphere Application Server cell spanning multiple data centers is strongly discouraged.
Such a deployment will eventually lead to situations that result from using tWAS outside of its design criteria and that cannot be resolved or addressed by IBM Product Support.
The remainder of this article will look in more detail at the technical reasons for this. It is important to note that the explanations apply only to traditional BAW installations, not to containerized BAW or the BAW and Workflow Process Service components of IBM Cloud Pak for Business Automation. The latter use IBM WebSphere Liberty and Kubernetes/OpenShift, not tWAS.
Introduction
BAW builds on top of WebSphere Application Server Network Deployment (tWAS ND, for short tWAS). It inherits its quality of service and support policy for HA/DR from tWAS. Therefore, answering the question requires looking at what tWAS supports.
Former IBM tWAS HA/DR lead architect Tom Alcott discussed in detail why a single WebSphere cell spanning multiple data centers is a really bad idea. This post covers the topic holistically with special focus on BAW. You can read Tom’s original comments in his developerWorks article (first published in 2006, last updated in 2014) attached to this post.
High Availability (HA) versus Disaster Recovery (DR)
To level set, let’s start by briefly defining the two core concepts: High Availability and Disaster Recovery.
High Availability (HA):
- Ensures that the system can continue to process work within one location after expected failures of one or a small number of components. To avoid network latency, one location is typically scoped to a single data center.
- The main goal is typically to limit the impact of unplanned events to short periods of time and to only a subset of the system's users.
Disaster Recovery (DR):
- Ensures that the system can be rebuilt and/or activated in a different location and can process work after an unexpected catastrophic failure at one location.
- There may or may not be significant downtime as part of a disaster recovery.
- Two key characteristics for DR are:
- Recovery Time Objective (RTO): Time until the service is restored
- Recovery Point Objective (RPO): Amount of data, typically measured as a span of time, that is lost once service has been restored
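To make the two metrics concrete, here is a minimal sketch in plain Python; all timestamps are made up purely for illustration:

```python
from datetime import datetime

# Hypothetical timeline of a disaster; all values are made up for illustration.
last_replicated = datetime(2024, 5, 1, 13, 45)  # last data safely copied off-site
disaster        = datetime(2024, 5, 1, 14, 0)   # primary DC is lost
restored        = datetime(2024, 5, 1, 18, 0)   # standby DC serves traffic again

rpo = disaster - last_replicated   # data written in this window is lost
rto = restored - disaster          # users see downtime for this long

print(f"RPO (data loss window): {rpo}")  # 0:15:00
print(f"RTO (downtime):         {rto}")  # 4:00:00
```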
For more details on both concepts, take a look at the articles What is high availability (HA)? and What is disaster recovery (DR)?
IBM WAS ND as the foundation for IBM Business Automation Workflow
Core tWAS ND concepts
When it comes to HA/DR, BAW inherits its characteristics from tWAS. Therefore, a quick recap of a few core tWAS concepts is required:
- Application Server: An application server runs a Java™ virtual machine that provides the runtime environment for the application's code.
- Cluster: A logical grouping of one or more functionally identical application servers that work together as a single system. A cluster provides ease of deployment, scalability, workload balancing, and failover redundancy, ensuring that mission-critical applications and resources remain continuously available to users.
- Node: A node is a logical group of one or more application servers on a physical computer. A node usually corresponds to a physical or virtualized computer system identified by a distinct IP address.
- Deployment Manager: The administrative process that provides a centralized management view of and control over all elements in a WebSphere Application Server distributed cell, including the management of clusters.
- Cell: The administrative domain that a Deployment Manager manages. A cell is a logical grouping of nodes that enables common administrative activities in a WebSphere Application Server distributed environment. A cell can have a single or multiple clusters.
- Node agent: A node agent manages all server processes on a node and communicates with the Deployment Manager to coordinate and synchronize the node's configuration.
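To illustrate how these concepts fit together, here is a minimal wsadmin (Jython) sketch that creates a cluster with two members in an existing cell. The cluster, node, and member names are hypothetical, and exact parameter strings can vary by tWAS version, so treat this as a sketch rather than a copy-paste recipe:

```python
# Run inside wsadmin, e.g.: wsadmin.sh -lang jython -f create_cluster.py
# Names (BawCluster, node01, ...) are made up for illustration.

# Create a new cluster in the cell managed by this Deployment Manager.
AdminTask.createCluster('[-clusterConfig [-clusterName BawCluster]]')

# Add one functionally identical application server (cluster member) per node.
AdminTask.createClusterMember(
    '[-clusterName BawCluster -memberConfig [-memberNode node01 -memberName member01]]')
AdminTask.createClusterMember(
    '[-clusterName BawCluster -memberConfig [-memberNode node02 -memberName member02]]')

# Persist the change; the node agents then synchronize their nodes' configuration.
AdminConfig.save()
```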
Achieving HA with tWAS ND
In tWAS, the cell is the central scope for all HA considerations. High availability is achieved by defining clusters with at least two cluster members as part of a tWAS cell. With its built-in load balancing and failover capabilities, tWAS distributes requests to the available cluster members. For example, in case one cluster member fails, tWAS will automatically route future traffic to the remaining cluster member(s).
When designing HA environments, it is important to keep the rule of “at least 3” in mind:
- With three or more cluster members, a smaller fraction of the capacity (1/n) becomes unavailable when one cluster member is lost (planned or unplanned). A cluster with only two members immediately loses half of its capacity.
- When one cluster member becomes unavailable in a cluster with three or more members, the remaining members can still provide some level of HA. In the same situation, a cluster with only two members can no longer provide any fault tolerance.
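A small back-of-the-envelope sketch, in plain Python and purely illustrative, makes the "at least 3" rule concrete:

```python
# Illustrative only: capacity and redundancy left after losing one of n
# identical cluster members.
def after_one_member_lost(n: int) -> None:
    remaining_capacity = (n - 1) / n     # fraction of capacity still available
    fault_tolerant = (n - 1) >= 2        # could we lose yet another member?
    print(f"{n} members: {remaining_capacity:.0%} capacity left, "
          f"still fault tolerant: {fault_tolerant}")

for n in (2, 3, 4):
    after_one_member_lost(n)
# 2 members: 50% capacity left, still fault tolerant: False
# 3 members: 67% capacity left, still fault tolerant: True
# 4 members: 75% capacity left, still fault tolerant: True
```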
Why would you consider spanning a WebSphere cell across data centers?
Below you find typical reasons (in bold typeface) why customers consider spanning a WebSphere cell across data centers, each followed by an evaluation:
- Goal: Increase scalability of your environment
If all you want is scalability, just add more systems, network bandwidth, and so on to a single DC. See below if you have concerns about “wasting” resources.
- Goal: Cover DR in addition to HA as our metro location DCs are close enough
In reality, this setup properly addresses neither HA nor DR requirements.
To reduce the risk of network problems, these installations typically rely on a pair of data centers located close to each other (that is why the approach is called "metro"), and connected by a redundant, high-capacity network infrastructure.
While the metro pair DC setup can help reduce the networking risk and ideally avoid the issues discussed below, it remains a Wide Area Network (WAN) type of topology. Based on their own risk profile, a customer may decide to accept the risk and use metro pair DCs in an active/active setup.
This proximity of the DCs reduces protection against natural disasters and other events that affect a broader geographic region. For this reason, this type of approach is considered an advanced high availability technique (with remaining risks compared to a single DC) and not a solution for disaster recovery.
- Goal: Exploit the public Cloud provider’s Availability Zones (AZs) and Regions
Cloud providers often claim that Availability Zones are close enough to have low latency connections but far enough apart to reduce the likelihood of local outages. Regions are meant for Disaster Recovery that can withstand regional and large geography disasters.
AZs seem to be usable for an active/active setup with one cell spanning two AZs. For the sake of this discussion, this is still considered spanning two DCs (similar to the metro pair DCs), with all the implications mentioned before. While under advertised conditions AZs seem to be fine, there are at least three questions to consider:
- What are the contractual SLAs that the cloud provider guarantees (not just in normal cases but also in exceptional cases)?
- Are the SLAs for exceptional cases good enough to assume a virtual single DC topology?
- What liability lies with the cloud provider when they miss their SLAs and the customer suffers from a “split brain” or other problem situation caused by this?
Depending on the answers, a customer may decide to go ahead with multiple AZs based on their own risk profile and the potential help and compensation received from the cloud provider in case of problems.
Regions would be suitable for true disaster recovery following a DR approach supported by BAW.
- Assumption: Active/active improves utilization of the data centers
In reality, an active/active topology at 40-50% utilization in each DC is equivalent to an active/passive topology at 80-90% utilization in the active DC.
Running active/active at greater than 50% utilization of the combined DC capacity can often result in a complete loss of service when a DC outage happens, because the surviving DC lacks sufficient capacity. The first sketch after this list illustrates the arithmetic.
- Assumption: A cell spanning data centers is required to achieve a brief failover time of, say, a small number of seconds or minutes
In reality, based on how the WAS ND high availability manager works, this stringent requirement cannot be met: failure detection and reconstruction of the WAS ND runtime state take far longer than a small number of seconds or minutes. The second sketch after this list gives a feel for the detection time alone.
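The utilization argument is simple arithmetic; here is a minimal sketch in plain Python with illustrative numbers only:

```python
# Illustrative only: what happens to an active/active DC pair when one DC fails.
capacity_per_dc    = 100    # arbitrary capacity units per data center
utilization_per_dc = 0.45   # 40-50% utilization in each active DC

load = 2 * capacity_per_dc * utilization_per_dc   # total work across both DCs
print(f"Total load: {load:.0f} units")            # 90 units

# After losing one DC, all of the load lands on the survivor:
surviving_utilization = load / capacity_per_dc
print(f"Surviving DC utilization: {surviving_utilization:.0%}")  # 90%

# Had each DC run above 50%, the survivor would need more than 100% of its
# capacity, i.e. a complete or partial loss of service.
```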
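And for the failover-time assumption, consider failure detection alone. The HA manager's heartbeat interval and miss count are governed by the core group custom properties IBM_CS_FD_PERIOD_SECS and IBM_CS_FD_CONSECUTIVE_MISSED; the default values used below are assumptions, so verify them against the documentation for your tWAS version:

```python
# Back-of-the-envelope estimate of heartbeat-based failure detection time in
# the WAS HA manager. The defaults below are assumptions; check your version.
heartbeat_period_secs = 30   # IBM_CS_FD_PERIOD_SECS (assumed default)
consecutive_missed    = 6    # IBM_CS_FD_CONSECUTIVE_MISSED (assumed default)

detection_secs = heartbeat_period_secs * consecutive_missed
print(f"Failure detection alone: ~{detection_secs} seconds")   # ~180 seconds

# Reconstruction of the HA manager's runtime state (singleton services,
# routing information, and so on) comes on top of this, so an end-to-end
# failover within a few seconds is not realistic.
```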
In summary, typical reasons for considering spanning a cell across DCs in active/active mode are either based on wrong assumptions or can better be achieved by other approaches. Both lead to the realization that active/active is not required.
Why is spanning a tWAS cell across data centers strongly discouraged?
Below, we summarize the three main reasons mentioned by Tom Alcott. A more detailed explanation can be found in his original article and is illustrated in a presentation, both attached.
- The most obvious reason is the speed and reliability of the network between the data centers:
- Within one data center a Local Area Network (LAN) is used, whereas two data centers require a Wide Area Network (WAN) to connect them.
- In many/most cases, the performance and reliability of a WAN is not as good as that of a LAN.
- There are environments where it is asserted that the WAN is sufficiently reliable and consistently provides LAN-like bandwidth. In such a case, the WAN appears to be the same as a LAN to applications (such as WebSphere Application Server).
- In practice, however, the assertion or presumption of a WAN that is as fast and/or as reliable as a LAN is hardly ever realized, nor is it verifiable.
- The second, less obvious reason is the risk of a “split brain” situation between DCs:
- Since you are going to align your clusters on core groups (the "HA domain" in WebSphere Application Server), if you suffer a network failure between, for example, two data centers, you run the risk of application and/or data inconsistency.
- Here's why: Core groups don't require quorum, so both halves can continue to run independently. This can lead to a “split brain” situation (the sketch after this list illustrates the difference a quorum check makes).
- And thirdly, increased WAN network latency can limit system performance even during normal operations:
- Due to how the WebSphere High Availability Manager works, a failure of a single JVM has minimal impact on the overall cell, whereas a widespread failure, such as a network outage, in many cases does affect the performance of all JVMs.
- In extreme cases, such as a cell spanning DCs where the network between DCs is broken, it is not uncommon for all application requests or processing to significantly slow down or even stop.
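To see why the missing quorum requirement matters, here is a minimal sketch, in plain Python and not WebSphere code, of the majority-quorum check that core groups do not perform; without it, both halves of a partitioned cell keep running:

```python
# Illustrative only: a majority-quorum check prevents split brain; tWAS core
# groups perform no such check.
def may_continue(visible_members: int, total_members: int) -> bool:
    """A partition may only keep running if it sees a strict majority."""
    return visible_members > total_members // 2

TOTAL = 8  # hypothetical cell members, split 4/4 across two DCs

# The network between the DCs fails; each half now sees only its own members.
for dc, visible in (("DC1", 4), ("DC2", 4)):
    print(dc, "may continue:", may_continue(visible, TOTAL))
# With a quorum check both halves print False, so at most one side would ever
# run: no split brain. Core groups skip this check, so both halves keep
# running independently and can create inconsistent application state or data.
```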
While the reasons above focus on tWAS alone, a BAW installation always involves a database management system as well. This adds another level of complexity: the database itself can be run in an active/active or active/passive mode, and the selected configuration also has implications for the availability of the BAW clusters running in the two DCs.
As Tom Alcott put it in his blog entry: “Practical experience with large number of customers has proven that it is very unlikely that a customer will succeed building a cross-center clustering solution that works under all circumstances. Inevitably, support cases in such a setup result in the collection of lots of trace and some tuning changes, but in the end don't resolve the issues that prompted the case. Those issues are only resolved when the deployment is changed to a cell aligned with the data center boundaries.”
Achieving DR using active/passive with two DCs
BAW documents the Stray Node approach as one proven and often-used DR approach. With this approach, a WebSphere cell does span two DCs, but only the nodes of one DC are active at any point in time. The nodes in the passive DC are only started in case of a disaster that takes down the previously active DC.
This active/passive setup avoids all of the problems identified with the active/active approach and allows for DCs to be physically separated to support true DR.
For more details about the Stray Node approach refer to the product documentation.
Summary
In summary, due to the technical background explained above, the official answer to the initial question is:
An active/active configuration of IBM Business Automation Workflow with one IBM WebSphere Application Server cell spanning multiple data centers is strongly discouraged.
An often-asked question is one of support for a cell that does span data centers:
IBM Product Support will accept support cases for deployments where a cell spans data centers. However, if Product Support determines that an issue is outside the design criteria for WebSphere Application Server Network Deployment (and, in turn, for IBM middleware that leverages it), the issue will be classified as "works as designed": no product fixes or changes will be delivered, leaving the customer solely responsible for problem resolution.
In summary, it is important to understand that:
A deployment spanning data centers with a single cell is not by default unsupported by IBM. Yet, such a deployment might lead to situations that result from using tWAS outside of its design criteria (having a single cell span data centers) and that cannot be resolved or addressed by IBM Product Support.
Therefore, such a deployment should not be implemented. Otherwise, customers end up on their own when resolving those problems.
Implementing HA is best done with multiple cluster members in one DC. For DR purposes, an active/passive configuration, for example the Stray Node approach, is the recommended pattern.
Acknowledgements
This post builds heavily on material from Tom Alcott and Chris Richardson. As this is not an academic publication, the authors refrained from proper citations and replicated a lot of the content without marking it.
Big thanks to Tom and Chris for their extensive coverage of the topic.
Resources