
The value add of IBM Copy Services Manager for SVC / Storwize / Spectrum Virtualize and FlashSystem replication management

By Thomas Luther posted Thu April 02, 2020 08:34 AM


IBM Copy Services Manager (CSM), the follow-on product to TPC for Replication, became generally available on 11 December 2015. CSM is available as a stand-alone installation for IBM Z z/OS as well as for various distributed server platforms (Windows, Linux, AIX). Regardless of the server platform on which CSM runs, it can manage storage replication for all supported storage products. Please see the following links for the latest product and support information:

Over its various releases, CSM has added support not only for new high-end storage replication features (such as DS8000 Multi-Target PPRC), but also for new features of SVC, the IBM Storwize family, the IBM Spectrum Virtualize family, and the IBM FlashSystem family. CSM provides a number of benefits for replication management on those storage products, which I want to discuss in this article.

CSM benefits for SVC/Storwize/Spectrum Virtualize and FlashSystem replication users at a glance:

  • Continuous progress monitoring and alerting based on Session state changes
  • Central place for replication management across multiple systems including site awareness
  • Volume Protection to prevent using dedicated volumes in any kind of copy relations
  • Automatic restart of stopped Global Mirror (GM) relations caused by 1920/1720 events
  • Simplified management of GM with Change volumes
  • Export historical Recovery Point Objective (RPO) data for GM with Change Volume sessions
  • One time Session definition for replication configuration
  • One step actions to manage and modify Sessions
  • Dynamic Pictures and Warning prompts
  • Integrated Practice Volumes for Metro Mirror and Global Mirror Sessions
  • Replication Management flexibility is maintained for special user scenarios to prevent unnecessary full copies
  • Automation of Session management tasks
  • Additional protection on operational authorization across and within SVC/Storwize/Spectrum Virtualize and FlashSystem systems

Let's take a closer look at each of these benefits to explain in more detail the added value that CSM can provide.

Note: For simplicity, we will only use the term Spectrum Virtualize in the following discussion, but the benefits also apply to SVC, Storwize, and FlashSystem products.

Continuous progress Monitoring and Alerting

Once a Spectrum Virtualize Session is started in CSM, CSM continuously reports the actual replication progress by regularly polling the progress details from the hardware. Based on the progress change in each interval, CSM also calculates and adjusts the estimated completion time for the synchronization. This information is essential for an initial synchronization with a large amount of data to copy, since it allows IT administrators to define checkpoints and schedules for when DR capability can and should be achieved. The CSM GUI, whose dynamic refresh can be set to 5-second intervals, allows real-time monitoring of the ongoing replication. If the Session state changes because the replication has completed or because it stopped due to an error, CSM sends automatic SNMP and/or e-mail alerts that can be used to notify operational staff and trigger appropriate action plans.
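To make the idea of an interval-based completion estimate more concrete, here is a minimal Python sketch. It is not CSM's actual algorithm, which is not described here; it simply extrapolates from the progress made in the most recent polling interval. The names ProgressSample and estimate_seconds_remaining are made up for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressSample:
    t_seconds: float     # time the sample was taken, in seconds since the start
    percent_done: float  # copy progress reported by the system, 0..100


def estimate_seconds_remaining(prev: ProgressSample,
                               curr: ProgressSample) -> Optional[float]:
    """Extrapolate the remaining time from the rate seen in the last interval.

    Returns None when no progress was made in the interval, because no
    meaningful estimate exists in that case.
    """
    dt = curr.t_seconds - prev.t_seconds
    dp = curr.percent_done - prev.percent_done
    if dt <= 0 or dp <= 0:
        return None
    rate = dp / dt  # percent per second
    return (100.0 - curr.percent_done) / rate


# Hypothetical samples taken at 60-second polling intervals.
samples = [ProgressSample(0, 10.0), ProgressSample(60, 12.0), ProgressSample(120, 16.0)]
for prev, curr in zip(samples, samples[1:]):
    eta = estimate_seconds_remaining(prev, curr)
    print(f"{curr.percent_done:.0f}% done, about {eta:.0f} s remaining")
```

Because the estimate is recomputed in every interval, it shrinks when the copy rate improves and grows when the rate drops, which is the behavior an administrator sees as an adjusting completion time.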


Central place for replication management across multiple systems including site awareness

In CSM you can define multiple Spectrum Virtualize systems and assign a site label to each of them. The site label can have various scopes, such as a specific system itself, a room, a building, or a data center location. That means each system can have its own site label, or you can combine multiple systems under the same site location label. This site location can be used in session configurations as a filter to prevent incorrect replication configurations. The assigned labels are shown in the session details and provide a quick visualization of the replication direction and the location of your active production data.
Since CSM can manage multiple Spectrum Virtualize relationships in parallel, it provides a central replication management console across multiple system environments, which the native system GUI or CLI cannot provide.

CSM offers the following unique improvements for central replication management:

  • Usage of CSM Session names for the Spectrum Virtualize consistency group naming:
    This allows direct mapping between CSM Sessions and the consistency group (CG) names created on the hardware. CSM ensures that only valid CG names are created and avoids duplicates. The hardware CG name is now also shown in the CSM Session details, which makes it easier for operators to identify the corresponding consistency group on the systems.
  • Improved recognition of external commands against managed hardware CGs, such as Stop, Start, Failover and Switch actions:
    CSM now recognizes these CG changes properly and reflects the actual state and direction in the corresponding CSM session, so that subsequent commands issued against the CSM Session initiate the proper actions on the hardware CG and its relationships. This support allows better coexistence with external High Availability tools (like VMware SRA management), which manage the hardware CG natively for failover or switch activities.

Volume Protection

In CSM you can protect dedicated volumes that should never be used in any replication definition, for example stand-alone host volumes. The Volume Protection feature prevents such volumes from becoming targets in a Session definition, which would overwrite valid host data upon Session activation.

Automatic restart of stopped Global Mirror relations caused by 1920/1720 events (CSM feature only)

Many Spectrum Virtualize Global Mirror customers have already experienced 1920 or 1720 events. 1920 events are typically triggered when there is insufficient replication bandwidth between the clusters to write primary changes to the secondary device fast enough. 1720 events may occur when a synchronized relationship loses synchronization, e.g. due to intermittent or permanent link loss or due to out-of-space conditions on any of the replicated volumes.
For both types of events the suspend reason can be temporary, and a subsequent restart might resynchronize the session successfully and so maintain DR capability. Some error types, however, are of a more permanent nature. A Perl script toolkit already exists in the Storwize developer community, but it requires some customization for proper implementation in a customer environment.
The basic 1920/1720 restart capability introduced with CSM attempts automated restarts. Once enabled (possible for each Spectrum Virtualize GM session type), CSM will try a certain number of restarts within a 30-minute window. These defaults can be changed per session in a properties file. The big risk of a basic auto-restart capability is that you may overwrite consistent target data and won't have a consistent, recoverable set of data for DR until the Session is synchronized again.
Keep in mind that automatic restarts might be the wrong thing to do in real disaster situations, since your session may never be able to fully resynchronize. Because of this risk, CSM warns you that enabling this feature might cause data loss for the session.
Therefore the CSM product team is working on further enhancements to this auto-restart capability to apply best practices and a more sophisticated decision process for when a restart makes sense and should be tried. CSM has already added the following enhancements to the auto-restart capability:

  • Only auto-restart a session if all volumes are online (otherwise the restart would fail on the hardware anyway).
  • Automatically adapt the default retry times to the actual GM timeout settings on the cluster. This is based on the actual GM_Link_Tolerance setting of the cluster with the primaries, since this determines the minimum delay before a subsequent suspend event. The dynamically adapted number of retries can still be adjusted manually per session via a properties file.
  • Automatically switch a Spectrum Virtualize GM Session with Change Volumes into cycling mode to better cope with temporary bandwidth constraints.
  • Allow definition of an auto-restart delay time per session. This allows better adaptation to the specifics of a customer environment and may allow the environment to settle a bit before the auto-restart is performed. If auto-restart is used for multiple sessions, the delay can also be used to stagger the restart traffic across all sessions.
Future CSM releases also plan to support the Spectrum Virtualize Consistency Protection feature, which requires Change Volumes defined on secondary sites to maintain the latest consistent data during the resynchronization process.
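The restart rules described above (a limited number of retries inside a time window, a per-session delay, and no restart while volumes are offline) can be sketched in a few lines of Python. This is only an illustration of the decision logic, not CSM code. The class name RestartPolicy, its fields, and the value of max_restarts are assumptions made for this example; only the 30-minute window comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RestartPolicy:
    max_restarts: int = 3            # hypothetical value, not a CSM default
    window_seconds: float = 30 * 60  # the 30-minute window mentioned above
    delay_seconds: float = 0.0       # optional per-session restart delay
    history: List[float] = field(default_factory=list)  # times of past restarts

    def next_restart_time(self, now: float, all_volumes_online: bool) -> Optional[float]:
        """Return the time at which a restart may be issued, or None to skip it."""
        if not all_volumes_online:
            return None  # the restart would fail on the hardware anyway
        # Keep only the restarts that still fall inside the sliding window.
        self.history = [t for t in self.history if now - t < self.window_seconds]
        if len(self.history) >= self.max_restarts:
            return None  # retry budget for this window is used up
        restart_at = now + self.delay_seconds
        self.history.append(restart_at)
        return restart_at


policy = RestartPolicy(max_restarts=2, delay_seconds=120)
for event_time in (0, 600, 900):
    print(event_time, policy.next_restart_time(event_time, all_volumes_online=True))
```

Giving each session a different delay_seconds value is one simple way to stagger the resynchronization traffic when several sessions suspend at the same time.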

Simplified management of Global Mirror with Change Volumes

With CSM there is no need to go to both systems (the master and the auxiliary) in order to associate the change volumes at each site. CSM communicates with both systems, and when you define a GM with Change Volumes session in CSM, you simply select the proper volumes for each role, as is done for other session types. CSM automatically communicates with both systems to assign the change volumes as required when an action is performed against the Session. A GM with Change Volumes session can run with or without active change volumes. You can easily switch the mode in the Session properties when the Session is in a state that allows a mode switch.
GM with Change Volumes Sessions also provide enhanced reporting, which includes displaying the RPO of the Session. CSM also lets you set warning and severe RPO thresholds, which trigger an SNMP and/or e-mail alert when the session exceeds either threshold.
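The two-level threshold idea can be illustrated with a short Python sketch. The function name classify_rpo and the threshold values are assumptions chosen for this example; how CSM evaluates its thresholds internally is not described here.

```python
def classify_rpo(rpo_seconds: float, warning_seconds: float, severe_seconds: float) -> str:
    """Map a measured RPO to an alert level using two thresholds."""
    if warning_seconds > severe_seconds:
        raise ValueError("warning threshold must not exceed severe threshold")
    if rpo_seconds > severe_seconds:
        return "severe"
    if rpo_seconds > warning_seconds:
        return "warning"
    return "normal"


# Hypothetical thresholds: warn above 5 minutes, severe above 15 minutes.
for rpo in (45, 400, 1200):
    print(rpo, classify_rpo(rpo, warning_seconds=300, severe_seconds=900))
```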

Export historical Recovery Point Objective (RPO) data for GM with Change Volumes

CSM has always shown real-time RPO data on the Session details panel for GM with Change Volumes Session types. The RPO is calculated every 30 seconds and helps customers understand their current data exposure. These RPO calculations are stored over time so that they can be used for historical RPO reporting. You can use the CSM GUI (Export -> Export Global Mirror Data) or the CLI (exportgmdata) to export the historical data to a CSV file, which can then be loaded into a spreadsheet or other charting tool in order to chart the RPO over time. The following picture illustrates the charting (right part) of the exported CSV file (left part) for a GM with Change Volumes Session in Excel:

Future CSM enhancements are planned to provide charting capabilities directly within the CSM GUI for all kinds of historical RPO data collected by CSM.
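As a minimal sketch of what can be done with an exported file outside a spreadsheet, the following Python snippet summarizes RPO values from CSV text. Please note that the column names (timestamp, rpo_seconds) and the sample rows are invented for this illustration; the real export uses its own column layout, so the rpo_column argument would have to be adapted to the actual file.

```python
import csv
import io
from statistics import mean

# Invented sample data; the real export has its own column layout.
SAMPLE_CSV = """timestamp,rpo_seconds
2020-04-01T10:00:00,35
2020-04-01T10:00:30,42
2020-04-01T10:01:00,310
2020-04-01T10:01:30,58
"""


def summarize_rpo(csv_text: str, rpo_column: str = "rpo_seconds",
                  threshold_seconds: float = 300.0) -> dict:
    """Return sample count, maximum, mean, and number of samples above a threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[rpo_column]) for row in reader]
    if not values:
        raise ValueError("no RPO samples found")
    return {
        "samples": len(values),
        "max": max(values),
        "mean": mean(values),
        "above_threshold": sum(v > threshold_seconds for v in values),
    }


print(summarize_rpo(SAMPLE_CSV))
```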

One time Session definitions

Another benefit for replication management is that you can pre-define and store many replication Sessions in the persistent CSM repository. This is useful for replication or FlashCopy mappings that are only needed once in a while in your environment. Instead of redefining the mappings every time you need them, a simple Start action against the pre-defined CSM Session performs all necessary tasks for you. A single Terminate command cleans up all mappings on the hardware again when they are no longer needed, without losing the logical volume dependencies that are defined as part of the CSM session.

One step actions to manage and modify Sessions

For consistent management of FlashCopy and replication mappings on Spectrum Virtualize, you usually need multiple execution steps to create, activate, or modify the logical consistency group containing all relevant mappings. Especially when modifying consistency groups, users might need to perform tasks in a different order depending on whether the CG is active or inactive. With CSM, users no longer need to worry about the proper sequence of steps for CG creation, modification, or management. Once the CSM Session is defined with all required mappings, a single step will Start, Suspend, Recover, or Modify the Session. CSM performs the necessary hardware tasks for mappings and CG management in the background in the proper sequence.

Dynamic Pictures and Warning prompts

Management through the CSM GUI provides visual representation of the Session, especially on the Session Details panel. The user can easily follow the replication direction, including the ability to set site labels as discussed earlier, allowing for easier management and understanding of the replication environment. The actual states of the volumes and the relationships are reflected all the time with the GUI defined refresh interval.
Whenever an action command is executed against the Session, the warning prompt visualizes the resulting session and replication states and describes what will actually happen. This allows the user to review and confirm the proposed action and to make sure it will really do what was intended.

Integrated Practice Volumes for Metro Mirror and Global Mirror Sessions

Having additional volumes on the target side for Disaster Recovery (DR) tests has always been good practice. However, there are two different concepts for how these practice volumes are integrated into the general failover/failback process during tests and real DR scenarios. The most common approach is to have a dedicated set of FlashCopy (FC) target volumes at the DR site solely for the purpose of DR testing. When a DR test is conducted, the replication targets are made consistent and the FlashCopy provides the data on the FC targets, where it is accessed by hosts for application DR tests. The normal replication can be restarted immediately after the FlashCopy has been initialized. CSM supports this concept by using two different Sessions, one for the replication part between the sites and one for the FC part at the target site.
The big disadvantage of this concept is that recovery in real DR situations is NOT done from the FC targets, but from the MM or GM targets, since Failback should be used to resynchronize both sites again at a later point. That means your tests did not really cover the Failover/Failback scenarios required for the real DR case. Just consider that your host mapping must change from the FC targets to the MM/GM targets between tests and real recovery scenarios. That may require zone or host mapping changes on the hardware as well as operating system specifics to recognize the different (but logically identical) volumes as valid volumes.
The practice volumes integrated into the CSM Practice Session types follow the concept "Practice how to recover and recover as you practiced". This means the MM or GM replication targets are usually not accessed by hosts at any time and are just used in the background to maintain replication, while hosts access their designated host volumes at the target site. Whenever consistent data is required at the target site, for a test or for a real recovery, CSM flashes it from the consistent MM or GM replication targets to the designated host volumes. Of course, CSM executes all necessary steps to make the replication targets consistent prior to the FlashCopy, yet it is as easy as running the single Flash command in CSM. So no matter whether you just test or go through a real recovery, you don't have to consider different steps at the target site to access your data and start applications, because your zoning and host mapping do not need to be switched. The only disadvantage of this concept is that a full copy is required for Spectrum Virtualize relationships when the direction of the Session must be reversed, because the mappings are different for each direction and no simple failback is possible.
The good thing: CSM leaves it to the user to decide which practice concept to implement in a given environment, and for the latter, more complex but also more valuable concept, it provides full integration via dedicated Practice Session types.

Replication Management flexibility is maintained

Some user scenarios require special consideration to prevent unnecessary full copies. One might think that such scenarios cannot be performed when CSM is used for replication management, but CSM tries to maintain the greatest possible flexibility in that regard. For instance, one special scenario is to move a replication mapping from one CG into another. While CSM cannot move individual Copy Sets directly from one Session into another, it does provide the Soft Removal capability. This lets users remove active Copy Sets from a session without CSM deleting the actual mapping on the hardware. The mapping is only removed from the CG definition, and if the CG is then empty (e.g. because all Copy Sets have been removed), CSM also cleans up the CG itself on the hardware.
CSM is capable of assimilating active replication mappings when starting the Copy Set in the Session, which means CSM does not require another full copy for a successfully assimilated mapping. To be assimilated successfully, the mapping must be in the proper copying state and must not be part of any consistency group definition, because CSM creates and controls its own consistency group definitions based on the Session configuration. The Soft Removal capability automatically leaves the hardware mapping in a condition that can be directly assimilated again in the same or another CSM Session with an appropriate copy type.
The assimilation capabilities also make it possible to bring any existing relationship under CSM Session management. Therefore, if replication needs to be established without an initial copy (e.g. because the volumes are already synchronized or have been newly created without any host mapping yet), a user can create this special relationship in the Spectrum Virtualize GUI or CLI and later add the existing relationship to a CSM Session, so that it is added to the hardware CG managed by the CSM session.
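The two assimilation preconditions named above (a suitable copying state and no consistency group membership) can be expressed as a small eligibility check. This Python sketch only models the rule as stated in this article. The Relationship class, the can_assimilate function, and the state labels in ACCEPTED are placeholders for illustration; which states CSM really accepts depends on the session and copy type.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Relationship:
    name: str
    state: str                               # state label as reported by the system
    consistency_group: Optional[str] = None  # None for a stand-alone relationship


def can_assimilate(rel: Relationship,
                   accepted_states: FrozenSet[str]) -> Tuple[bool, str]:
    """Check the two preconditions and return (eligible, reason)."""
    if rel.consistency_group is not None:
        return False, f"{rel.name} still belongs to consistency group {rel.consistency_group}"
    if rel.state not in accepted_states:
        return False, f"{rel.name} is in state {rel.state}, which is not accepted"
    return True, f"{rel.name} can be taken over without a full copy"


# Placeholder state labels, chosen only for this example.
ACCEPTED = frozenset({"consistent_synchronized", "consistent_copying"})

for rel in (Relationship("rel_a", "consistent_synchronized"),
            Relationship("rel_b", "consistent_synchronized", "cg_old"),
            Relationship("rel_c", "idling")):
    print(can_assimilate(rel, ACCEPTED))
```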

Automation of Session management tasks

Starting with CSM 6.2.1, automated tasks can be defined to perform scheduled Flash commands in Sessions. Subsequent CSM releases introduced additional automation task capabilities. By now, all Session action commands can be used within such a task, and a single task can operate on multiple sessions. Additionally, action types can be defined that wait for a certain state, check the recoverability of a role pair, or wait until a certain role pair progress is reached. This allows great flexibility for built-in automated, scheduled session management without the need for external scheduling, scripting, or automation tools.
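To illustrate what a "wait for a certain state" step does conceptually, here is a generic polling helper in Python. It is not part of CSM and does not use any CSM interface; the function wait_for_state, its parameters, and the state labels in the usage lines are assumptions for this sketch. In a real script, get_state would be replaced by whatever mechanism is used to query the session.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str], wanted: str,
                   timeout_seconds: float, poll_seconds: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state() until it returns `wanted` or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Simulated state sequence; the labels are illustrative only.
states = iter(["Preparing", "Preparing", "Prepared"])
reached = wait_for_state(lambda: next(states), "Prepared",
                         timeout_seconds=5, sleep=lambda s: None)
print("state reached:", reached)
```

The sleep and clock parameters exist only so that the helper can be exercised without real waiting, as in the simulated usage above.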

Additional protection on operational authorization across and within Spectrum Virtualize systems

CSM provides its own authorization layer to control and monitor storage replication. Furthermore, the role concept of CSM makes it possible to restrict operational CSM users to dedicated Sessions only. When such a Session management permission is combined with the volume protection feature, one can effectively configure multi-tenant capabilities and restrict parts of the replication environment to dedicated operational staff, which would otherwise not be possible.
Spectrum Virtualize and CSM both support local user authentication as well as LDAP user authentication. While you have to decide on one authentication mechanism on the hardware, you can configure a further, optional mechanism on the CSM layer. This could allow operational staff to manage replication with their federated LDAP ID while having no direct access to the Spectrum Virtualize system itself.
Starting with CSM 6.2.5, you can even configure Dual Control on the CSM server. When this is activated, no single CSM user can perform an action without approval by a second CSM user with the appropriate user role. This provides maximum protection for replication management. This design helps to prevent malicious attacks against the CSM server and it provides added safety for any commands that are issued against the CSM server.
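The principle behind Dual Control, that a request and its approval must come from two different users, can be sketched as follows. This Python example is a simplified model and not CSM's implementation. The names PendingAction, approve, and execute are made up for illustration, and the check that the approver holds an appropriate user role is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAction:
    session: str
    command: str
    requested_by: str
    approved_by: Optional[str] = None


def approve(action: PendingAction, approver: str) -> None:
    """Record an approval; the requester may not approve their own request."""
    if approver == action.requested_by:
        raise PermissionError("the requester cannot approve their own action")
    action.approved_by = approver


def execute(action: PendingAction) -> str:
    """Run the action only after a second user has approved it."""
    if action.approved_by is None:
        raise PermissionError("action has not been approved by a second user")
    return f"{action.command} issued against {action.session}"


# Hypothetical session and user names.
request = PendingAction(session="PayrollGM", command="Suspend", requested_by="alice")
approve(request, "bob")
print(execute(request))
```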