The value add of Copy Services Manager for SVC / Storwize and FlashSystem replication management

By Archive User posted Thu December 10, 2015 07:59 AM

  

Originally posted by: Thomas Luther


IBM announced IBM Copy Services Manager (CSM) 6.1 as the follow-on product to TPC for Replication, generally available on 11 Dec. 2015. For z Systems there is a separate offering; for distributed platforms the CSM license is bundled with IBM Spectrum Control V5.2.8 and Virtual Storage Center V5.2.8 or later. Additionally, with the release of CSM 6.1.1 on 24 Mar. 2016, there is a new standalone offering for CSM installation on distributed platforms. Following are the announcement links

IBM Copy Services Manager is available as a stand-alone installation and download package under any of the above licensed programs. You can find the product documentation here:

With the GA of CSM there is support not only for new DS8000 features (such as Multi Target PPRC with Metro Mirror and Global Mirror) but also for new SVC/Storwize and FlashSystem replication features. On top of that, CSM supports the latest members of the IBM Storwize family, such as the V5000, as well as the FlashSystem V9000 and V840.
As you can see in my former blog post about CSM special bid releases, there is a new auto-restart capability for SVC/Storwize and FlashSystem Global Mirror type sessions when they hit 1920 or 1720 events. Restart automation has been requested in the field for quite a while, because the cause of those events might be temporary, and stopped replication can quickly violate existing service level agreements and Recovery Point Objectives if relationship states are not monitored closely.
However, the auto-restart feature is not the only value add for SVC/Storwize and FlashSystem replication customers. CSM (like its predecessor TPC for Replication) provides many other benefits that I want to discuss in this blog.

Note:
Please also read the great articles on the CSM 6.1 and CSM 6.1.1 releases, which explain the newest product features in more detail. The latest 6.1.1 features related to SVC/Storwize and FlashSystem benefits have been incorporated in this article.


CSM and TPC-R benefits at a glance for SVC/Storwize and FlashSystem* replication users:

  • Continuous Progress Monitoring and alerting based on Session state changes
  • Central place for replication management across multiple systems including site awareness
  • Volume Protection to prevent using dedicated volumes in any kind of copy relations
  • Automatic restart of stopped Global Mirror relations caused by 1920/1720 events (CSM feature only)
  • Simplified management of GM with Change volumes
  • Export historical RPO data for GM with Change Volume sessions (CSM feature only)
  • One time Session definition for replication configuration
  • One step actions to manage and modify Sessions
  • Dynamic Pictures and Warning prompts
  • Integrated Practice Volumes for Metro Mirror and Global Mirror Sessions
  • Replication Management flexibility is maintained for special user scenarios to prevent unnecessary full copies
  • Additional protection on operational authorization across and within SVC/Storwize and FlashSystem systems

*Note:
Since the IBM FlashSystem support was only added in CSM, the benefits above apply to FlashSystem only with Copy Services Manager (not with TPC-R).


Let's take a closer look at each of these benefits to explain in more detail the value add that CSM can provide.

Continuous progress Monitoring and Alerting

Once an SVC/Storwize or FlashSystem Session is started in CSM, CSM begins to continuously report the actual replication progress by regularly polling the progress details from the hardware. Based on the progress changes in each interval, CSM also calculates and adjusts the estimated completion time for the synchronization. This is essential information for an initial synchronization with a large amount of data to copy, since it allows IT administrators to define checkpoints and schedules for when DR capability can and should be achieved. The CSM GUI, whose dynamic updates can be set to 5 second intervals, allows real-time monitoring of the ongoing replication. If the Session state changes because the replication has completed or because it stopped due to an error, CSM sends automatic SNMP alerts that can be used to notify operational staff and trigger appropriate action plans.

image
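The completion estimate described above can be illustrated with a small sketch. It is not CSM's actual algorithm, which is not documented in this post; it simply extrapolates linearly from two consecutive progress samples. The function name and the percent-based progress values are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def estimate_completion(prev_time: datetime, prev_pct: float,
                        now: datetime, now_pct: float) -> Optional[datetime]:
    """Extrapolate a completion time from two consecutive progress samples."""
    if now_pct >= 100.0:
        return now  # already fully synchronized
    elapsed = (now - prev_time).total_seconds()
    gained = now_pct - prev_pct
    if elapsed <= 0 or gained <= 0:
        return None  # no progress in this interval, so no rate to extrapolate
    # Remaining time = remaining percent / (percent gained per second).
    remaining_seconds = (100.0 - now_pct) * elapsed / gained
    return now + timedelta(seconds=remaining_seconds)


t0 = datetime(2015, 12, 10, 8, 0, 0)
t1 = t0 + timedelta(seconds=60)
# Progress went from 40% to 50% within 60 seconds.
print(estimate_completion(t0, 40.0, t1, 50.0))
```

Because the estimate is recomputed on every polling interval, it moves as the copy rate changes, which matches the "calculates and adjusts" behavior described in the text.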

 

Central place for replication management across multiple systems including site awareness

In CSM you can define multiple SVC/Storwize and FlashSystem systems and assign a site label to each of them. The site label can have various scopes, such as a specific system, a room, a building or a DC location. That means each system can have its own site label, or you can combine multiple systems under the same site location label. This site location can be adopted in session configurations, where it can be used as a filter to prevent wrong replication configurations. The assigned labels are shown in the session details and provide a quick visualization of the replication direction and the location of your active production data.
Since CSM can manage multiple SVC/Storwize and FlashSystem relationships in parallel, it provides a central replication management console across multiple system environments which the native system GUI or CLI cannot provide.

CSM offers the following unique improvements for central replication management, which were added in CSM 6.1.1:

  • Usage of CSM Session names for the SVC/Storwize/FlashSystem consistency group naming.
    This allows direct mapping between CSM Sessions and the consistency groups created on the hardware. CSM ensures that only valid CG names are created and avoids duplicates. The hardware consistency group name is now also shown in the CSM Session details, which makes it easier for operators to identify the corresponding consistency group on the systems.
  • Improved recognition of external commands against managed hardware CGs, such as Stop, Start, Failover and Switch actions.
    CSM now recognizes these CG changes properly and reflects the actual state and direction in the corresponding CSM session, so that subsequent commands issued against the CSM Session initiate the proper actions on the HW CG and relations. This support allows better co-existence with external HA tools (like VMware SRA management), which manage the hardware CG natively for failover or switch activities.
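To make the "valid CG names without duplicates" idea concrete, here is a minimal sketch of how a session name could be turned into a hardware consistency group name. The character set, the 63 character limit, the "no leading digit" rule and the function name are all assumptions for illustration; the real naming rules are defined by the storage system and the real logic is internal to CSM.

```python
import re

MAX_NAME_LEN = 63  # assumed limit; check the naming rules of your system


def to_cg_name(session_name: str, existing: set) -> str:
    """Derive a unique consistency group name from a session name."""
    # Replace every character outside an assumed safe set with an underscore.
    base = re.sub(r"[^A-Za-z0-9_]", "_", session_name)
    # Assumption: names must not be empty and must not start with a digit.
    if not base or base[0].isdigit():
        base = "s_" + base
    base = base[:MAX_NAME_LEN]
    candidate = base
    suffix = 1
    # Append a numeric suffix until the name no longer collides.
    while candidate in existing:
        tail = f"_{suffix}"
        candidate = base[:MAX_NAME_LEN - len(tail)] + tail
        suffix += 1
    return candidate


print(to_cg_name("Prod DB/GM", {"Prod_DB_GM"}))
```

The point of the sketch is only the direct, predictable mapping: an operator who knows the session name can find the matching consistency group on the system.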

Volume Protection

In CSM you can protect dedicated volumes that should never be used in any replication definition, for example stand-alone host volumes that are in use. The Volume Protection feature prevents such volumes from becoming targets of a Session definition, which would overwrite valid host data upon Session activation.
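Conceptually, Volume Protection is a membership check that runs before a copy set is accepted into a Session. The sketch below shows that idea only; the function name, the tuple layout and the volume names are hypothetical and do not reflect CSM internals.

```python
def protected_volumes_in_use(copy_sets, protected):
    """Return protected volumes that appear in any role of the given copy sets.

    copy_sets: iterable of (source_volume, target_volume) name pairs.
    protected: set of volume names that must never be part of a copy relation.
    """
    hits = set()
    for source, target in copy_sets:
        for volume in (source, target):
            if volume in protected:
                hits.add(volume)
    return sorted(hits)


# Hypothetical volume names, for illustration only.
requested = [("vol_a", "vol_b"), ("vol_c", "host_vol_1")]
violations = protected_volumes_in_use(requested, {"host_vol_1"})
if violations:
    print("Rejecting session definition, protected volumes:", violations)
```

Checking both roles follows the description of protected volumes as ones that should never be used in any replication definition, with the target role being the dangerous case because it would be overwritten.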

Automatic restart of stopped Global Mirror relations caused by 1920/1720 events (CSM feature only)

Many SVC/Storwize and FlashSystem Global Mirror customers have already suffered from 1920 or 1720 events. 1920 events are basically triggered when there is insufficient replication bandwidth between the clusters to write primary changes to the secondary device fast enough. 1720 events may occur when a synchronized relationship loses synchronization, e.g. due to intermittent or permanent link loss or due to out-of-space conditions on any of the replicated volumes.
The suspend reasons can be temporary for both types of events, in which case a subsequent restart might successfully resynchronize the session and maintain DR capability. Some error types might be of a more permanent nature. A Perl script toolset already exists in the Storwize developer community, but it requires some customization for proper implementation in a customer environment.
The basic 1920/1720 restart capability that was introduced with CSM special bid drop 2 now attempts automated restarts. Once enabled (possible for each SVC/Storwize and FlashSystem GM session type), it attempts a certain number of restarts within a 30 minute window. These defaults can be changed per session in a properties file. The big risk of a basic auto-restart capability is that you may overwrite consistent target data and won't have a consistent, recoverable set of data for DR until the Session is synchronized again.
Keep in mind that automatic restarts might be the wrong thing to do in real disaster situations, since your session may never be able to fully resynchronize. Because of the auto-restart risk, CSM warns you that enabling this feature might cause data loss for the session:

image

Therefore the CSM product team is working on further enhancements to this auto-restart capability that apply best practices and a more sophisticated decision process for when a restart makes sense and should be tried. The CSM GA release adds the following enhancements to the auto-restart capability:

  • Only auto-restart a session if all volumes are online (otherwise the restart would fail anyway on the hardware)
  • Automatically adapt default retry times to the actual GM timeout settings on the cluster. This is based on the actual GM_Link_Tolerance setting of the cluster with the primaries, since this determines the minimum delay of a subsequent suspend event. The dynamically adapted number of retries can still be adjusted manually per session via a properties file.
  • Automatically switch an SVC/Storwize or FlashSystem GM Session with Change Volumes into cycling mode to better cope with temporary bandwidth constraints.

CSM 6.1.1 added the following enhancement to the auto-restart capability:

  • Allow definition of an auto-restart delay time per session. This allows better adaptation to the specifics of a customer environment and may give the environment time to settle before the auto-restart is performed. If auto-restart is used for multiple sessions, the delay can also be used to stagger the restart traffic across all sessions.
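The restart rules described so far (a limited number of restarts within a 30 minute window, no restart unless all volumes are online, and an optional per-session delay) can be sketched as a small policy object. This is an illustration under stated assumptions, not CSM's implementation: the class, its field names and the default of three restarts are hypothetical, and the real settings live in a CSM properties file whose keys are not shown in this post.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RestartPolicy:
    max_restarts: int = 3             # hypothetical default, not CSM's value
    window_seconds: float = 30 * 60   # the 30 minute window from the text
    delay_seconds: float = 0.0        # per-session delay before a restart
    history: List[float] = field(default_factory=list)

    def next_restart_time(self, suspended_at: float,
                          all_volumes_online: bool) -> Optional[float]:
        """Return when a restart may be issued, or None if it must not be."""
        if not all_volumes_online:
            return None  # the restart would fail on the hardware anyway
        restart_at = suspended_at + self.delay_seconds
        # Forget restarts that have dropped out of the sliding window.
        self.history = [t for t in self.history
                        if restart_at - t < self.window_seconds]
        if len(self.history) >= self.max_restarts:
            return None  # too many restarts recently; leave it to an operator
        self.history.append(restart_at)
        return restart_at


policy = RestartPolicy(max_restarts=2, delay_seconds=60)
for event_time in (0, 100, 200, 2000):
    print(event_time, policy.next_restart_time(event_time, all_volumes_online=True))
```

Giving each session a different `delay_seconds` value is one way to picture the staggering of restart traffic mentioned above.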

Future CSM releases might add further enhancements to the auto-restart capability:

  • Automatically take a golden snapshot of consistent targets prior to trying a restart. This must be handled in a different way for each GM Session type, and it involves automatic creation of temporary snapshot targets on the HW if the session itself does not provide practice volumes on the target site. Those golden snapshots would be used for recovery if the session no longer reaches the Prepared state, and they would be deleted automatically after the Session returns to the Prepared state.
  • Additional enhancements for an automatic, schedule-based switch back to normal GM mode when the session is running in cycling mode but the preferred mode is normal GM.


Simplified management of Global Mirror with Change volumes

With CSM there is no need to go to both systems (the master and the auxiliary) in order to associate the change volumes on each site. CSM communicates with both systems, and when you define a GM with Change Volumes session in CSM, you just select the proper volumes for each role as is done for other session types. CSM automatically communicates with both systems to assign the change volumes as required when an action is performed against the Session. A GM with Change Volumes session can run with or without active change volumes. You can easily switch the mode in the Session properties when the session is in a state that allows a mode switch.
GM with Change Volumes sessions also provide enhanced reporting, which includes displaying the RPO of the Session. CSM also provides the ability to set warning and severe RPO thresholds, which trigger an SNMP alert when the session exceeds either of the thresholds.

image

image
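The two-level RPO threshold described above boils down to a simple classification, with an alert raised when the level changes. The sketch below illustrates that logic only; the function names, level names and second-based values are assumptions, and it prints a message instead of sending a real SNMP trap.

```python
def classify_rpo(rpo_seconds, warning_seconds, severe_seconds):
    """Map the current RPO onto a level using two thresholds."""
    if severe_seconds < warning_seconds:
        raise ValueError("severe threshold must not be below warning threshold")
    if rpo_seconds > severe_seconds:
        return "severe"
    if rpo_seconds > warning_seconds:
        return "warning"
    return "normal"


def level_changes(rpo_samples, warning_seconds, severe_seconds):
    """Yield (sample, level) whenever the level differs from the previous one."""
    previous = "normal"
    for sample in rpo_samples:
        level = classify_rpo(sample, warning_seconds, severe_seconds)
        if level != previous:
            yield sample, level
            previous = level


# Illustrative RPO values in seconds; each yielded item would be one alert.
for sample, level in level_changes([60, 400, 450, 950, 100], 300, 900):
    print(f"RPO {sample}s -> {level}")
```

Alerting on level changes rather than on every sample keeps a session that stays above a threshold from flooding the operations team.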

 

Export historical RPO data for GM with Change Volume sessions (CSM feature only)

IBM TPC-R and CSM have always shown a real-time Recovery Point Objective (RPO) on the session details panel for GM with Change Volumes session types. This is calculated every 30 seconds and helps customers understand their current data exposure. Starting with the CSM 6.1.1 release, those RPO calculations are stored over time, so that they can be used for historical reporting on the RPO for a duration in the life of the session. Through the GUI link (Export -> Export Global Mirror Data) or the CLI (exportgmdata), the historical data can be exported to a CSV file, which can then be loaded into a spreadsheet or other charting tool in order to chart the RPO over time.

The following picture illustrates the charting (right part) of the exported CSV file (left part) for a GM with Change Volumes Session in Excel:

image
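If you prefer a script over a spreadsheet, an exported CSV file can also be summarized in a few lines of Python. The column names `timestamp` and `rpo_seconds` and the sample rows below are invented purely to make the sketch runnable; the actual layout of the file produced by exportgmdata is not shown in this post, so adjust the names to the real header.

```python
import csv
import io
from statistics import mean

# Invented sample rows, standing in for the contents of an exported file.
SAMPLE = """timestamp,rpo_seconds
2016-03-24 10:00:00,120
2016-03-24 10:00:30,150
2016-03-24 10:01:00,420
"""


def summarize_rpo(csv_text, threshold_seconds, column="rpo_seconds"):
    """Summarize RPO samples read from CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    if not values:
        raise ValueError("no RPO samples found")
    return {
        "samples": len(values),
        "max": max(values),
        "mean": mean(values),
        # Number of samples that exceeded the given RPO target.
        "above_threshold": sum(1 for v in values if v > threshold_seconds),
    }


print(summarize_rpo(SAMPLE, threshold_seconds=300))
```

For a real export, read the file with `open(path, newline="")` and pass its contents to the function.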

Future enhancements are planned to provide charting capabilities directly within the CSM GUI for all kinds of historical RPO data collected by CSM.

One time Session definitions

Another benefit for replication management is that you can pre-define and store many replication Sessions in the persistent CSM repository. This is useful for replication or FlashCopy mappings that are only needed once in a while in your environment. Instead of redefining the mappings every time you need them, a simple Start action on the pre-defined CSM Session does all the necessary tasks for you. A single Terminate command cleans up all mappings on the hardware again when they are no longer needed, without losing the logical volume dependencies that are defined as part of the CSM session.

One step actions to manage and modify Sessions

For consistent management of FlashCopy and replication mappings on SVC/Storwize and FlashSystem, you usually have multiple execution steps to create, activate or modify the logical consistency group containing all relevant mappings. Especially when modifying consistency groups, users might need to perform tasks in a different order depending on whether the CG is active or inactive. With CSM, users no longer need to care about the proper sequence of steps for CG creation, modification or management. Once the CSM Session is defined with all required mappings, a single step will Start, Suspend, Recover or Modify the Session. The necessary tasks on the hardware for mappings and CG management are performed by CSM in the background in the proper sequence.

Dynamic Pictures and Warning prompts

Management through the CSM GUI provides a visual representation of the Session, especially on the Session Details panel. The user can easily follow the replication direction, including the site labels discussed earlier, which makes the replication environment easier to manage and understand. The actual states of the volumes and the relationship are kept up to date at the refresh interval defined in the GUI.
Whenever an action command is executed against the Session, the warning prompt visualizes the resulting session and replication states and gives a description of what will actually happen. This allows the user to review and confirm the proposed action and ensures that it really does what was intended.

image

 

Integrated Practice Volumes for Metro Mirror and Global Mirror Sessions

Having additional volumes on the target side for DR tests has always been a good practice. However, there are two concepts for how these practice volumes are integrated into the general failover/failback process during tests and real DR scenarios. The most common integration is to have a set of separate FlashCopy target volumes at the DR site. For DR testing, the replication targets are made consistent and FlashCopy provides the data on the FC targets, where it is accessed by hosts for application tests. Normal replication can be restarted immediately after the FlashCopy has been initialized. CSM supports this concept by using two different sessions, one for the replication part between the sites and one for the FC part on the target site.
The big disadvantage of this concept is that recovery in real DR situations is NOT done from the FC targets but from the MM or GM targets, since Failback should be used to resynchronize both sites at a later point. That means your tests did not really cover the Failover/Failback scenarios required for the real DR case. Just consider that your host mapping must change from the FC targets to the MM/GM targets between tests and real recovery scenarios. That may require zone or host mapping changes on the HW as well as operating system specific steps to recognize the different (but logically identical) volumes as valid volumes.
The practice volumes integrated into CSM practice type sessions follow the concept "practice how to recover and recover as you practiced". This means the MM or GM targets are usually not accessed by hosts at any time and are just used in the background to maintain replication, while hosts access their designated host volumes on the target site. Whenever consistent data is required at the target site, for a test or for a real recovery, CSM flashes it from the consistent MM or GM targets to the designated host volumes. Of course, CSM executes all necessary steps to make the targets consistent prior to the FlashCopy, but for the user it is as easy as running the single FLASH command in CSM. So no matter whether you are testing or going through a real recovery, you don't have to consider different steps at the target site to access your data and start applications, because your zoning and host mapping do not need to be switched. The only disadvantage of this concept is that a full copy is required for SVC/Storwize and FlashSystem relationships when the direction of the Session must be reversed, because the mappings are different for each direction and no simple failback is possible.
The good thing: CSM leaves the decision about which practice concept to implement to the user, and for the latter, more complex but also more valuable concept, it provides full integration via dedicated session types.

Replication Management flexibility is maintained

Some user scenarios require special consideration to prevent unnecessary full copies. One might think that such scenarios cannot be performed when CSM is used for replication management, but CSM tries to maintain the greatest possible flexibility in that regard. For instance, one special scenario is to move a replication mapping from one CG into another. While CSM cannot move individual Copy Sets directly from one Session into another, it does provide the Soft Removal capability. This lets users remove active Copy Sets from the session without CSM deleting the actual mapping on the hardware. The mapping is only removed from the CG definition, and if the CG is then empty (e.g. because all Copy Sets were removed), CSM also cleans up the CG itself on the hardware.
CSM is capable of assimilating active replication mappings when starting the Copy Set in the Session, which means CSM does not require another full copy for a successfully assimilated mapping. To be assimilated successfully, the mapping must be in the proper copying state and must not be part of any consistency group definition, because CSM creates and controls its own consistency group definitions based on the Session configuration. The Soft Removal capability automatically leaves the hardware mapping in a condition that can be directly assimilated again in the same or another CSM Session with an appropriate copy type.
The assimilation capabilities also allow any existing relationship to be taken into CSM Session management. Therefore, if there is a need to establish replication without an initial copy (e.g. because the volumes are already synchronized or have been newly created without any host mapping yet), a user can create this special relationship in the SVC/Storwize and FlashSystem GUI or CLI and later add the existing relationship to a CSM Session.
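The two assimilation conditions named above (proper copying state, no consistency group membership) can be written down as a simple check. This is only a sketch: the `Relationship` class is hypothetical, and the set of accepted states is an assumption, since the post does not list which relationship states CSM treats as "proper".

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: which hardware relationship states count as assimilable.
ASSIMILABLE_STATES = frozenset({"consistent_synchronized", "inconsistent_copying"})


@dataclass
class Relationship:
    name: str
    state: str
    consistency_group: Optional[str] = None  # None means stand-alone


def can_assimilate(rel: Relationship, allowed_states=ASSIMILABLE_STATES) -> bool:
    """True if an existing hardware relationship could join a session as-is."""
    in_proper_state = rel.state in allowed_states
    is_standalone = rel.consistency_group is None
    return in_proper_state and is_standalone


# Hypothetical relationships, for illustration only.
soft_removed = Relationship("rel_db01", "consistent_synchronized")
still_grouped = Relationship("rel_db02", "consistent_synchronized", "cg_other")
print(can_assimilate(soft_removed), can_assimilate(still_grouped))
```

A relationship left behind by Soft Removal corresponds to the first case: it keeps copying and no longer belongs to any consistency group, so it can be picked up again without a full copy.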

Additional protection on operational authorization across and within SVC/Storwize and FlashSystem systems

CSM provides its own authorization layer to control and monitor storage replication. Furthermore, the role concept of CSM allows operational CSM users to be restricted to dedicated Sessions only. When such a Session management permission is combined with the Volume Protection feature, one can configure what amounts to multi-tenant capabilities and restrict parts of the replication environment to dedicated operational staff, which otherwise would not be possible.
SVC/Storwize/FlashSystem and CSM both support local user authentication as well as LDAP user authentication. While you have to decide on one authentication mechanism on the hardware, you can configure a further, optional mechanism on the CSM layer. This could allow operational staff to manage replication with their federated LDAP ID while having no direct access to the SVC/Storwize or FlashSystem system itself.

 

Comments

Tue November 05, 2019 03:32 AM

Originally posted by: Thomas Luther


CSM is designed for replication management, not specifically for FC or Snapshot management or retention. However, it supports FC Sessions for the supported storage products. The CSM Scheduled Task feature may be used to automatically create/refresh an FC/Snapshot. Another task could be defined to regularly delete a Snapshot. See https://www.ibm.com/support/knowledgecenter/SSESK4_6.2.6/com.ibm.storage.csm.help.doc/csm_t_creating_scheduled_tasks.html But this is not real Snapshot versioning, if you are referring to such a capability. Note that CSM does not integrate into host platforms or applications to quiesce applications before a snapshot is taken. External automation would be necessary if that were required.

Mon November 04, 2019 10:53 AM

Originally posted by: Ivan Efremovski


Hi, does Copy Services Manager provide automatic Snapshot management (creation, deletion etc.) for Storwize systems?

Mon October 10, 2016 03:11 AM

Originally posted by: Werner_Bauer


Hi Thomas, Excellent block contribution which is very helpful and with very good technical background information. This outstanding quality is hard to find in these days ... Cheerio Werner