Using Performance Data to See Network Problems

By David Green posted Wed June 07, 2023 12:03 PM

I frequently work with clients whose problem is performance. Either an entire system or an application is slow enough that users are affected. Another frequent type of performance problem involves storage-side replication. In these cases replication cannot keep up with the production workload and RPOs are not being met. Sometimes replication performance also hurts production performance. Replication is most commonly done between sites, though I have worked a few cases with same-site (or campus) replication.
Whether you are using IBM DS8000 PPRC/Global Mirror, IBM SVC/FlashSystem Global Mirror (GM), or Global Mirror with Change Volumes (GMCV), you expect the replicated data to be current up to a certain point in time behind the production data. This is your Recovery Point Objective (RPO): how current the replicated data needs to be. For data that doesn't change often, an RPO of 30 minutes or an hour might be enough. For data that changes frequently, an RPO of a few minutes might be required. For weekly reporting, your RPO could be a few days or a week. Using a personal example, my banking software data file has an RPO of 1 month, because I reconcile my accounts monthly. It's not difficult to update the transactions again.
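To make the RPO idea concrete, here is a minimal Python sketch that checks whether a replica is within its RPO. The function name rpo_met and the timestamps are hypothetical illustrations, not part of any IBM tool; in practice the time of the last consistent copy would have to come from your storage system or monitoring product.

```python
from datetime import datetime, timedelta


def rpo_met(last_consistent_copy: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the replica is no older than the RPO allows."""
    return now - last_consistent_copy <= rpo


# Hypothetical example: the replica is 12 minutes behind production.
now = datetime(2023, 6, 7, 12, 0)
last_copy = now - timedelta(minutes=12)

print(rpo_met(last_copy, now, timedelta(minutes=30)))  # checked against a 30-minute RPO
print(rpo_met(last_copy, now, timedelta(minutes=5)))   # checked against a 5-minute RPO
```

With these made-up numbers, a 12-minute lag satisfies the 30-minute RPO but misses the 5-minute one.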
On IBM SVC or IBM FlashSystem, both GM and GMCV are asynchronous replication, meaning the production data is replicated at some point after the host's write has completed. A technology like Metro Mirror (MM) is synchronous: the data is replicated as it is written, and good status for the write is not returned to the host until the data has been successfully replicated. While MM provides an always-current RPO, it does so at the risk that a performance problem on the network affects production. Since most replication is site-to-site, across links that span at least some distance, this risk increases with Metro Mirror.
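The difference can be sketched with a deliberately simplified latency model in Python. The function names and all of the numbers are hypothetical, and the model assumes the local write, the link round trip, and the remote write simply add up with no overlap, which a real storage system may not do.

```python
def async_host_write_ms(local_write_ms: float) -> float:
    """Asynchronous (GM/GMCV-style): the host waits only for the local write."""
    return local_write_ms


def sync_host_write_ms(local_write_ms: float, link_round_trip_ms: float,
                       remote_write_ms: float) -> float:
    """Synchronous (MM-style): the host also waits for the remote copy.

    Simplifying assumption: the local write, the link round trip, and the
    remote write are added together with no overlap.
    """
    return local_write_ms + link_round_trip_ms + remote_write_ms


# Hypothetical numbers: the same link on a quiet day and with 5 ms of extra latency.
for round_trip_ms in (1.0, 6.0):
    sync_ms = sync_host_write_ms(0.5, round_trip_ms, 0.5)
    async_ms = async_host_write_ms(0.5)
    print(f"round trip {round_trip_ms} ms: sync {sync_ms} ms, async {async_ms} ms")
```

In this model any extra latency on the link shows up directly in the host's write time under synchronous replication, while the asynchronous write time is unchanged.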
For fibre-channel networks the distance links can be either fibre-channel native protocol running on DWDM, ONS, or some other underlying physical topology that is transparent to the switches, or fibre-channel over IP (FCIP). FCIP uses TCP/IP networks to transmit the fibre-channel frames. The scenarios we will talk about in this blog post can happen on native FC networks but are much more common on FCIP. This is because fibre-channel is a lossless protocol that expects a lossless network. This is a fancy way of saying the protocol assumes the network transmission medium will not lose data during transmission. The data error checking and retransmission is done by the end devices. TCP/IP is the opposite. It assumes a lossy network, so there is a lot more overhead built into the protocol and the network itself.
Frequently when a performance problem manifests on FCIP infrastructure, it is not obvious on the fibre-channel routers that are performing the FCIP function. You can look for clues such as the switches logging tunnel drops or other error messages. You can also do things like look at the IP statistics. However, those might not point to anything that clearly shows where the problem is occurring.
In the diagrams below, we will look at performance data collected from some storage systems to illustrate how we can show the problem is in the network. All of the charts below were captured from IBM Spectrum Virtualize (SVC or FlashSystem) storage systems. They all show the Port to Remote Node Send Response Time, which is a measure of how long it is taking the remote cluster to respond to replication commands, and the Port to Remote Node Send Data Rate. They show 3 slightly different manifestations of the same scenario. Also, all of the data rates shown are well below the capacity of the networks, so none of these is a case of an overworked network.
An important note: This blog post assumes you have already looked at the response times for the partner cluster and ruled it out as the source of the problem. Before verifying whether the network is the issue, you have to look at the partner cluster to see what its port to remote node receive response times are. If they are elevated, then you would need to look at the remote cluster first as the potential source of the problem.
In this first example, the solid lines are the response time metric. The dashed lines are the data rates. Note that for the first third of the chart, the data rates are low. However, you can see that the response time is variable. You can also see that the peaks in response time correspond to peaks in the data rates. The response time should ideally be a flat, or nearly flat, line most of the time. The peaks in response time indicate variable latency on a regular basis. Since the peaks correspond to workload (increased data rate), we can see this is workload related. We know that this network has more capacity than the workload requires. The conclusion then is that there are problems on the underlying LAN/WAN that the FCIP tunnels are running on.
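If you export the two metrics rather than compare the charts by eye, one way to quantify "the peaks line up" is a correlation coefficient. The sketch below computes a plain Pearson correlation in Python; the sample values are invented for illustration and are not taken from the charts in this post.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical samples taken at the same intervals.
send_data_rate_mb_s = [20, 25, 22, 180, 30, 210, 24, 195]
send_response_ms = [1.1, 1.2, 1.0, 4.8, 1.3, 5.5, 1.1, 5.0]

r = pearson(send_data_rate_mb_s, send_response_ms)
print(f"correlation between data rate and response time: {r:.2f}")
```

A value close to 1 means response time rises and falls with the data rate. On a link that is well under capacity, that is the pattern described above.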
In the second example, the scenario is a bit different and is less obvious. The dashed lines are the response times, and the solid lines with the higher peaks are the data rates. We can see the data rates mostly under 100 MB/sec, with peaks over 400 MB/sec. This translates to peaks of about 3 Gbps (400 MB/sec x 8 bits per byte = 3.2 Gbps). The underlying network was rated at 10 Gbps, so this workload is well within the specifications. The response time looks a little better, in that it is closer to flat; however, it should not vary this much with workload. Response times are increasing by 3-5ms each time workload goes up.
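The unit conversion behind that comparison is easy to get wrong, so here is a small Python sketch of it. The function names are hypothetical, and the calculation uses decimal units and ignores FCIP and TCP/IP protocol overhead.

```python
def mb_per_sec_to_gbps(mb_per_sec: float) -> float:
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_per_sec * 8 / 1000


def link_utilization(mb_per_sec: float, link_gbps: float) -> float:
    """Fraction of the rated link speed used by a given data rate."""
    return mb_per_sec_to_gbps(mb_per_sec) / link_gbps


peak_gbps = mb_per_sec_to_gbps(400)        # peak data rate from the example
utilization = link_utilization(400, 10)    # against the 10 Gbps link rating
print(f"{peak_gbps:.1f} Gbps, {utilization:.0%} of the link")
```

Even at its peaks, the workload uses roughly a third of the rated bandwidth, which is why an overloaded link can be ruled out.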

The last example is the clearest example of this scenario. The dashed lines represent the response times. The solid line with the much larger peaks is the workload. You can see the highly variable response time. This is a solid indication of a problem somewhere on the underlying network. A 5ms variance in latency doesn't sound like much, but the latency should not be this variable on a regular basis.
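To put a number on "variable", you could summarize exported response-time samples by their spread. The Python sketch below is one simple way to do that; the function name latency_spread and both sample series are hypothetical and are not data from the charts.

```python
from statistics import mean, pstdev


def latency_spread(samples_ms):
    """Summarize how much a series of response-time samples varies."""
    return {
        "mean_ms": mean(samples_ms),
        "peak_to_peak_ms": max(samples_ms) - min(samples_ms),
        "stdev_ms": pstdev(samples_ms),
    }


# Hypothetical samples (ms): a steady link and one with regular spikes.
steady = [2.0, 2.1, 2.0, 2.2, 2.1, 2.0]
spiky = [2.0, 7.0, 2.1, 6.8, 2.0, 7.0]

for name, samples in (("steady", steady), ("spiky", spiky)):
    summary = latency_spread(samples)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

A healthy link should look like the steady series, with a small peak-to-peak range. A range of several milliseconds that recurs with workload is the kind of evidence worth taking to the network team.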

Hopefully this shows you how you can use performance data to help identify the source of some replication problems. While you would still need to do troubleshooting on the network itself, this should at least help you determine where the problem is and give you something to show your network team to confirm the source of a problem. All of the above charts came from IBM Storage Insights. I strongly recommend using Storage Insights to help manage your storage and fabric. You can find out more about Storage Insights here:

Getting Started with IBM Storage Insights