Over the last couple of decades organizations have invested in High Availability and Disaster Recovery solutions to protect their critical systems of record from both hardware failures and large scale disaster events. Synchronous and asynchronous replication and capabilities such as HyperSwap have removed the reliance on large scale restore from backups in these situations and provide faster restoration of service and more reliable protection. In many cases a complete enterprise restore from backup is not considered practical and is no longer performed as part of Disaster Recovery testing.
In more recent years there has been a massive increase in the threat from cyber-attacks against both individuals and organizations. These attacks have become more sophisticated: as well as stealing data, attackers now look to destroy it or to make it unavailable and hold it for ransom. The perpetrators of these attacks are not always from outside the organization. The threat from a privileged insider with knowledge of and access to systems is seen as a significant risk, especially for platforms such as IBM Z where the threat from ransomware and other malware infection is lower.
High Availability and Disaster Recovery capabilities do not provide any protection against these types of events and so something new is required. Solutions such as Pervasive Encryption on IBM Z can provide increased protection from data being stolen. However, encryption does not necessarily prevent the data from being destroyed or otherwise rendered unavailable. IBM has been developing solutions for these requirements, including extensions to GDPS, which will be covered below.
Regulators are also starting to provide guidance on protection from this type of event, both in terms of increasing security to minimize the probability of a successful attack and in terms of providing recovery capabilities in case these security mechanisms fail. Terms such as an “airgap” backup are being used, where backup systems are isolated from the production network to avoid the same attack compromising both production and backup data.
Protection Copies Concept
Securing offline backups of data on tape or e-vault object store systems can provide a copy of data that can be used for restore. The time to restore may mean that these are not practical as the only protection mechanism, especially for complex and large-scale systems. As a result, many organizations are looking to use instant copy technologies on primary storage systems as a first line of defense. Being able to restore one of these copies instantly to a separate recovery system also enables a wider range of use cases.
The picture below shows the protection copies concept used for Cyber Resilience. The source for these copies is either the production data or a replicated copy of this data. Protected copies are taken on a regular basis providing a number of recovery points. These are not accessed directly but can be either restored back to the production environment or copied to a recovery system where they can be used for a range of purposes.
Use Cases for Protection Copies
There is a wide range of events that could result in corruption or destruction of data, and it is unlikely that a single recovery scenario is practical for every eventuality. We might consider a range of different events and actions:
- Catastrophic events where the only option is to recover an entire system from a secure backup copy.
- Situations where forensic analysis is required to determine the cause and scope of a problem before deciding on a recovery action. In some cases, this analysis might suggest that it is most practical to fix the issue within the production environment rather than restoring from a backup.
- Surgical recovery of a subset of data from the secure backup copy might be required if the normal production backups are not usable and the problem was localized to only a subset of the data.
- In some situations, an attack or logical data problem may not be immediately obvious and might not normally be discovered for days, weeks or even months. Running corruption detection and data validation processes against a copy of data might be more practical than doing this in the live production environment and could provide earlier detection of a problem.
- A second line of defense might also be desired, in which case performing an offline backup from a consistent point-in-time copy of data can provide a greater retention period and increased isolation and security.
As well as malicious actions, application or operational errors can produce the same end result. If a solution also provides improved recovery capabilities for this type of event, it can provide additional justification for the investments required.
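The delayed-detection use case above amounts to a search over recovery points: validate the protection copies, newest first, and recover from the most recent one that passes. The following is a minimal Python sketch of that idea only. The `ProtectionCopy` record, the `latest_clean_copy` function, the validator and the timestamps are all hypothetical and are not part of GDPS or any IBM interface.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ProtectionCopy:
    """Hypothetical record of one protection copy (not a GDPS object)."""
    copy_id: str
    taken_at: datetime


def latest_clean_copy(
    copies: Iterable[ProtectionCopy],
    is_clean: Callable[[ProtectionCopy], bool],
) -> Optional[ProtectionCopy]:
    """Return the most recent copy that passes validation, or None."""
    for copy in sorted(copies, key=lambda c: c.taken_at, reverse=True):
        if is_clean(copy):
            return copy
    return None


# Illustrative data: three daily copies, with corruption assumed to have
# been introduced between the second and third copy.
copies = [
    ProtectionCopy("pc-1", datetime(2018, 3, 1, 6, 0)),
    ProtectionCopy("pc-2", datetime(2018, 3, 2, 6, 0)),
    ProtectionCopy("pc-3", datetime(2018, 3, 3, 6, 0)),
]
corruption_time = datetime(2018, 3, 2, 12, 0)


def taken_before_corruption(copy: ProtectionCopy) -> bool:
    # Stand-in validator: a real check would run corruption detection
    # against the copy on a recovery system, not compare timestamps.
    return copy.taken_at < corruption_time


recovery_point = latest_clean_copy(copies, taken_before_corruption)
print(recovery_point.copy_id)
```

In this sketch the newest copy fails validation, so the second copy is selected. If no copy passes, the function returns `None`, which corresponds to falling back to a second line of defense such as an offline backup.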
Virtual and Physical Isolation
Just as we have a range of different topologies for high availability and disaster recovery depending on the protection required, we can also consider different topologies for Cyber Resilience solutions. The first decision for many organizations is whether they create an environment that is physically isolated from production for their protection copies or whether virtual isolation on existing storage systems is considered sufficient.
For virtual isolation, the protection copies are created on one or more of the storage systems in the client's existing High Availability and Disaster Recovery topology. The example below shows synchronous replication being used for HA and DR with the protection copies being created on one of the existing storage systems.
For physical isolation, additional storage system(s) are used for the protection copies. These storage systems are typically not on the same SAN or IP network as the production environment and have restricted access, perhaps even with different administrators, to provide separation of duties. The example below shows such an environment.
IBM delivered a first set of solutions for Cyber Resilience on the IBM Z platform earlier this year with GDPS 4.1. GDPS supports both the virtual and physical isolation topologies shown in the examples above and exploits the Cascading FlashCopy capabilities of the DS8880 to provide up to 10 protection copies. Further enhancements are planned for delivery later in 2018 via SPE and, beyond this, in future releases of GDPS.
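With a fixed number of protection copies, the capture interval determines how far back the oldest recovery point reaches. The Python sketch below illustrates that trade-off using the limit of 10 copies quoted above. It assumes the oldest copy is withdrawn whenever a new one is taken at the limit. That rotation policy, the `ProtectionCopySet` class and the 6-hour interval are illustrative assumptions and do not describe how GDPS manages its copies.

```python
from collections import deque
from datetime import datetime, timedelta

MAX_COPIES = 10  # limit quoted in the text above


class ProtectionCopySet:
    """Hypothetical rolling set of recovery points (not a GDPS interface)."""

    def __init__(self, max_copies: int = MAX_COPIES) -> None:
        # A bounded deque drops the oldest entry once the limit is reached.
        self._points = deque(maxlen=max_copies)

    def capture(self, taken_at: datetime) -> None:
        """Record a new recovery point taken at the given time."""
        self._points.append(taken_at)

    def recovery_points(self) -> list:
        """Return the retained recovery points, oldest first."""
        return list(self._points)

    def window(self) -> timedelta:
        """Time span between the oldest and newest retained points."""
        if not self._points:
            return timedelta(0)
        return self._points[-1] - self._points[0]


copy_set = ProtectionCopySet()
start = datetime(2018, 1, 1)
interval = timedelta(hours=6)  # assumed capture interval

# Take 14 captures; only the most recent 10 are retained.
for i in range(14):
    copy_set.capture(start + i * interval)

print(len(copy_set.recovery_points()))
print(copy_set.window())
```

Under these assumptions, 10 retained copies at a 6-hour interval span nine intervals, or 54 hours. A longer interval extends the window at the cost of a coarser recovery point, which matters when a problem may go undetected for days or weeks.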
Compared to High Availability and Disaster Recovery we are still in the early days of Cyber Resilience solutions. This post aims to provide an overview of the requirements and design considerations focusing on the protection of Enterprise Data and showing examples of the GDPS solutions for IBM Z.
It is important to remember that the security aspects of Cyber Resilience are as important as the data recovery capabilities. This applies both to the production environment and to the recovery solution. Implementing role elevation and multi-factor authentication for privileged users and improving security audit capabilities for storage appliances are a few practical steps to tighten the security of production environments.
In many cases it may be most practical to take an incremental approach to implementing a Cyber Resilience solution. It may be possible to leverage existing investments in High Availability and Disaster Recovery to provide some increased protection and to enable staff to build experience.