Primary Storage


Path loss in an NPIV fully enabled environment

  • 1.  Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Mon November 28, 2022 12:43 PM
    A few weeks back, we upgraded an SVC, and some ESX servers experienced path loss. NPIV is fully enabled on that SVC, so we know they are using the virtual ports. Unfortunately, we didn't realize that this shouldn't have happened until it was too late to collect snaps, etc.

    Has anyone else experienced this? Any thoughts on where to look?

    ------------------------------
    Jonathan Fosburgh, MS, CAPM
    Principal Application System Analyst
    The University of Texas MD Anderson Cancer Center
    Houston TX
    713-745-9346
    ------------------------------


  • 2.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Tue November 29, 2022 10:05 AM
    First of all, you need to ensure that equivalent ports are on the same fabric and in the same zone.
    For example:
    If your host is zoned with SVC_Node1_P1 on Fab1, then SVC_Node2_P1 must be connected to the same switch (or fabric), and the same host port must be zoned with SVC_Node2_P1, because during NPIV failover N1_P1's WWPN transparently fails over to N2_P1. If the host has access to N2_P1 on the same switch, it continues operating. IBM Storage Insights has an alert for this (but only for mis-cabling, not for a wrong zone config).
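    As a quick sanity check from the storage side, the SVC/Spectrum Virtualize CLI can list the FC target ports so you can confirm where the NPIV virtual ports are logged in (a rough sketch; the exact output columns vary by code level):

        # On the SVC CLI: list FC target ports, including the NPIV virtual ports
        lstargetportfc
        # Look at the entries with virtualized=yes / host_io_permitted=yes and confirm
        # that each node's equivalent virtual port sits on the same fabric as its partner.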

    The real question is: what is your ESX multipath algorithm? MRU? Round Robin?
    With Round Robin, ESX should continue its I/O on the next available path (with or without NPIV). But if ESX uses MRU for its datastore devices, it fails because of the multipath algorithm: MRU (most recently used) uses only one path (the latest active path) and wants to keep using that single path forever.
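    To see which policy each device is actually using on the host, something like this on the ESXi CLI works (sketch only):

        # On the ESXi host: show the claiming SATP and Path Selection Policy per device
        esxcli storage nmp device list
        # Each device entry includes a "Path Selection Policy:" line; VMW_PSP_MRU vs.
        # VMW_PSP_RR (Round Robin) is what to look for.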




    ------------------------------
    Nezih Boyacioglu
    ------------------------------



  • 3.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Tue November 29, 2022 10:59 AM
    Edited by Randy Frye Tue November 29, 2022 10:59 AM
    Jonathan,
      Two things come to my mind for this, as I've experienced this same issue and had no obvious 'misconfigurations' like using MRU multipathing or having equivalent NPIV ports on different fabrics...

    1. What were the source and target code versions for your upgrade?  Storwize code introduced "Improved failover times" in 8.3.1 (I believe), which reduced the amount of time it takes for NPIV virtual ports to fail over from the owning node to the partner node - especially relevant for code updates.  8.3.1.0 also included a fix for a bug that could result in stuck SCSI-2 reservations for ESX hosts (search https://www.ibm.com/support/pages/ibm-spectrum-virtualize-apars for HU01894).
      And keep in mind, because of the amount of time it takes for NPIV to fail over a port (I've seen as long as 18 seconds during pre-8.3 code updates), path failures in ESX should be EXPECTED.  They should not be a significant concern, as long as you're NOT also seeing VMware report loss of access to the datastore volume.

    2. What setting do your ESX hosts use for the selection policy in your Round Robin multipathing?  By default, VMware will send 1000 I/Os down the currently active path before switching to the next path.  Given that it can take some time for ESX to 'see' offline paths, this can result in many hundreds of I/Os failing to be sent.  IBM (and VMware) recommend changing this selection policy to ONE I/O, per the following VMware link:  https://kb.vmware.com/s/article/2069356
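      For reference, the per-device change and the claim rule from that KB look roughly like the following (a sketch; the device ID is a placeholder, and the IBM/2145 vendor/model values are my assumption for SVC - check the KB and IBM's host attachment docs for the exact rule, which only applies to newly claimed devices):

        # Switch an existing Round Robin device to path-switch after every 1 I/O
        esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=naa.xxxxxxxxxxxxxxxx
        # Add a claim rule so newly discovered IBM 2145 (SVC) devices default to RR with iops=1
        esxcli storage nmp satp rule add -s "VMW_SATP_ALUA" -V "IBM" -M "2145" -P "VMW_PSP_RR" -O "iops=1"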

    ------------------------------
    Randy Frye
    Senior Storage Administrator
    D&H Distributing
    Harrisburg PA
    7173647948
    ------------------------------



  • 4.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Tue November 29, 2022 11:05 AM
    No warnings about ports not being on equivalent fabrics.

    We provided our ESX admins with the IBM recommended settings: PSP = Round Robin and IOPS = 1.

    We are waiting on them to verify.

    ------------------------------
    Jonathan Fosburgh , MS, CAPM
    Principal Application System Analyst
    The University of Texas MD Anderson Cancer Center
    Houston TX
    713-745-9346
    ------------------------------



  • 5.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Tue November 29, 2022 11:14 AM
    8.2.1.11 to 8.3.1.6.

    We are awaiting confirmation that they are using the recommended settings of PSP = Round Robin and IOPS = 1.

    ------------------------------
    Jonathan Fosburgh, MS, CAPM
    Principal Application System Analyst
    The University of Texas MD Anderson Cancer Center
    Houston TX
    713-745-9346
    ------------------------------



  • 6.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Tue November 29, 2022 01:37 PM
    The issues that we've had relate to situations where the ESX host has lost some of its paths to a LUN, even though the host HBA ports are online and have some LUNs being served on them. From the storage side, all is well, but instead of seeing 4 paths (for example), the host might see fewer than that. If it does and everything is rebooted, you might end up in a bad situation. The workaround, in that case, is to reboot the ESX host.

    This may or may not be related to your issue, but it is one issue that we've seen, and it is included in our pre-upgrade planning. We get the path info for all LUNs going to the ESX hosts and count it up per LUN. In your case that would need to be done by that team (if you were going to do it), but I know for a fact that we've avoided certain issues on hosts before upgrades, because the issue was identified and the host rebooted prior to the upgrade starting.
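    For what it's worth, a quick way to get that per-LUN path count on an ESXi host is something along these lines (a rough sketch, not a polished script):

        # On the ESXi shell: each path entry in the output carries a "Device:" line,
        # so counting those gives the number of paths per LUN
        esxcli storage core path list | grep "Device:" | sort | uniq -c
        # Any LUN showing fewer paths than expected is a candidate for a rescan or a
        # host reboot before the upgrade starts.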


    ------------------------------
    Robert Mayotte
    ------------------------------



  • 7.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Tue November 29, 2022 02:03 PM
    Pre-flight checks like this have long been part of our standard procedures.

    ------------------------------
    Jonathan Fosburgh, MS, CAPM
    Principal Application System Analyst
    The University of Texas MD Anderson Cancer Center
    Houston TX
    713-745-9346
    ------------------------------



  • 8.  RE: Path loss in an NPIV fully enabled environment

    User Group Leader
    Posted Tue November 29, 2022 07:31 PM
    In the case of an NVMe-oF connection, VMware says the High Performance Plug-in (HPP) should be used, and I believe it's the default starting from vSphere 7.0 Update 2:
    • https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-F7B60A5A-D077-4E37-8CA7-8CB912173D24.html
    Have you checked this on your ESXi servers?
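    In case it helps, the devices claimed by HPP can be listed on the host with something like this (a sketch; available on recent ESXi releases):

        # On ESXi 7.x: list devices claimed by the High Performance Plug-in
        esxcli storage hpp device list
        # Devices still claimed by the classic NMP show up under:
        esxcli storage nmp device list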

    ------------------------------
    Keigo Matsubara, Storage Solution CTS, IBM Japan
    ------------------------------



  • 9.  RE: Path loss in an NPIV fully enabled environment

    IBM Champion
    Posted Wed November 30, 2022 06:29 AM
    This is standard Fibre Channel.

    ------------------------------
    Jonathan Fosburgh, MS, CAPM
    Principal Application System Analyst
    The University of Texas MD Anderson Cancer Center
    Houston TX
    713-745-9346
    ------------------------------



  • 10.  RE: Path loss in an NPIV fully enabled environment

    Posted Wed November 30, 2022 06:03 AM
    In support, we regularly see this kind of error being reported by customers.
    There are situations where certain host OSes encountered access loss upon, for example, node reboots during a code upgrade, while other types of hosts attached to the same SpecV system hardly noticed an SVC node going offline.
    Without any blaming in mind, I dare say VMware ESX hosts appear to be affected more often than others. This may, however, be due to the wide distribution of ESX in the field.
    There is one general recommendation related to VMware ESX hosts that I tend to share with customers when it comes to any kind of access-loss problem:
    At all times, make sure the host is operating with the latest and greatest set of VMware updates, especially HBA driver updates, provided either as a VMware update package or by the HBA vendor.
    Last but not least, I have worked with customers in similar situations that eventually resulted in software fixes built by VMware to address specific errors.
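    As a rough sketch of what I mean by checking the driver levels (the driver names below, lpfc and qlnativefc, are just examples - they depend on the HBA vendor):

        # On the ESXi host: see which driver each HBA is using
        esxcli storage core adapter list
        # Then compare the installed driver VIB versions against the latest from VMware or the vendor
        esxcli software vib list | grep -i -E "lpfc|qlnativefc"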

    ------------------------------
    Christian Schroeder
    IBM SpecV Storage Support with Passion
    ------------------------------