PowerVM

VIOS 3.1.2 Shared Storage Pools (SSP) Enhancements

By Rob Gjertsen posted Thu November 19, 2020 03:22 PM

  

Enhancements for Shared Storage Pools (SSP)

The VIOS SSP 3.1.2 release primarily focuses on resiliency improvements for the storage pool. This includes several network resiliency enhancements for configurations with multiple network interfaces that build on the work from release 3.1.1: automatic failback to the primary network interface for the storage pool and improved viosbr restore with multiple network interfaces. The storage pool now provides locality awareness of VIOS pairs within a server; this VIOS redundancy awareness allows for smarter decision making in choosing the manager node, reducing the potential for SSP issues. There are also improvements to the disk challenge mechanism, which provides additional protection against multiple meta-data managers, along with some other general resiliency items in the release.

Shared Storage Pool Background

One aspect of PowerVM is known as VIOS SSP, which stands for VIOS Shared Storage Pools.

VIOS SSP allows a group of VIOS nodes to form a cluster and provision virtual storage to client LPARs.  The VIOS nodes in the cluster all have access to the same underlying physical disks, which are grouped into a single pool of storage.  A virtual disk or logical unit (LU) can be carved out of that storage pool and mapped to a client LPAR as a virtual SCSI (vSCSI) device.  An LU may be thin or thick provisioned: thin-provisioned LUs do not reserve blocks until they are written to, while thick-provisioned LUs reserve their storage when the LU is created.
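As a quick illustration, an LU can be created in the pool and mapped to a client's vhost adapter with the mkbdsp command. This is only a sketch: the cluster, pool, LU, and adapter names are placeholders, and option details can vary by VIOS level, so check the mkbdsp man page on your system.

# Placeholder names: clusterA, poolA, lu_client1, lu_client2, vhost0, vhost1
# Thin-provisioned 20 GB LU created in the pool and mapped to the client adapter
$ mkbdsp -clustername clusterA -sp poolA 20G -bd lu_client1 -vadapter vhost0
# Adding -thick reserves all of the blocks at creation time instead
$ mkbdsp -clustername clusterA -sp poolA 20G -bd lu_client2 -vadapter vhost1 -thick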

Once an LU has been created in the pool, snapshots or clones of that LU can be created.  The number of snapshots and clones created is limited only by the amount of available storage in the pool, and creating these objects happens nearly instantly.  Snapshots are used for rolling back to previous points in time.  Clones are used for provisioning new space efficient copies of an LU.  These clones can be managed by PowerVC capture and deploy image management operations.
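For example, a snapshot of an LU can be taken and later rolled back with the snapshot command. Again, this is a sketch with placeholder names, and the exact options should be confirmed against the snapshot man page for your VIOS level.

# Placeholder names: snap1, clusterA, poolA, lu_client1
# Take a snapshot of the LU
$ snapshot -create snap1 -clustername clusterA -spname poolA -lu lu_client1
# Roll the LU back to that point in time later if needed
$ snapshot -rollback snap1 -clustername clusterA -spname poolA -lu lu_client1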

These features allow rapid deployment of new client LPARs in a cloud computing environment.  The storage pooling model of VIOS SSP simplifies administration of large amounts of storage.  The clustering aspect of VIOS SSP provides fault tolerance between VIOS multi-pathing pairs, and simplifies verification that other nodes can see the storage and are eligible for LPAR mobility operations. 


Storage Pool Automatic Network Failback

This work extends the functionality that allows multiple network interfaces to be used directly by the storage pool, improving overall network resiliency. The storage pool has a primary network interface on each cluster node, whose IP address is associated with the host name look-up, and secondary (redundant) network interfaces can be added or removed via the cluster command. An active/passive model is used with the multiple network interfaces, so only one interface is utilized at a time for general storage pool communication. The currently active interface is used until the network lease for the connection is in danger, which occurs when the network lease or heartbeat cannot be renewed in a timely manner; the remaining network interfaces stay in stand-by mode, and their only communication is to maintain a network lease. When an issue occurs on the primary network interface, the storage pool switches to a secondary interface, which then becomes the active interface for pool communication.

Previously, the storage pool did not automatically switch back to the primary interface once it was healthy again; manual intervention was required to force this. Now the storage pool automatically fails back to the primary network interface once the network is available again on that interface. This is especially desirable when networks of different speeds are used for redundancy: the user can designate the fastest network interface as primary and associate the slower networks with secondary interfaces, e.g., a 10 Gb primary network and a 1 Gb secondary network.

The error log has entries for when the storage pool switches to a different network interface and also for when a network connection is established or lost. These network events use the following error log labels: POOL_NODE_NETWORK, POOL_ESTABLISHED_CO, POOL_LOST_CONNECTIO. This gives the user an easy way to monitor the storage pool's network activity on a cluster node.
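For instance, the error log on a VIO server can be filtered by those labels from the root shell with standard errpt options; the first command below gives a summary of all three event types, and the second shows detailed entries for interface switch events only.

# errpt -J POOL_NODE_NETWORK,POOL_ESTABLISHED_CO,POOL_LOST_CONNECTIO
# errpt -a -J POOL_NODE_NETWORK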

The connection details for a storage pool cluster node can be viewed with the VIO server root command "pooladm dump cnxs". This information is typically used for obtaining details during problem troubleshooting, but it is also useful for showing a network transition in action, with the storage pool going from the primary to the secondary interface and vice-versa. Note that the example also uses a pool communication disk for an additional level of redundancy in case both networks are impacted.

Network impact on primary interface and resulting failover to secondary interface:


# pooladm dump cnxs
ClusterName=sfstore MyNodeName=vss7-c57.aus.stglabs.ibm.com:

CONNECTIONS WITH SERVERS:
...
Server: name=vss7-c58.aus.stglabs.ibm.com nodeId=0xE1413D1204BA11EB:8004AE25C8FC3904 lease state=VALID
lease phase=FIRST ip=9.3.148.120 (BKUP-LEADER)
Net [0]: Lease state=EXPIRED phase=FINAL cnx=NOT_AVAILABLE IPAddr=9.3.148.120 (PrimaryIP)
Lease renew=104 LRSent=114 LRAcked=101
MsgSent=3 MsgRcvd=210 avgRespTime: 0 sec 0 nsec
avgDeliverTime: 0 sec 0 nsec maxDeliverTime: 0 sec 0 nsec
ConnSwitchMsg sent=0 ackd=0 markDlvrd=0 lastAckedTxnIdByServer=0 CSServerNo=0
cnxCheckPending=0
Net [1]: Lease state=VALID phase=FIRST cnx=ACTIVE IPAddr=10.10.201.58
Lease renew=297 LRSent=294 LRAcked=294
MsgSent=3 MsgRcvd=596 avgRespTime: 0 sec 0 nsec
avgDeliverTime: 0 sec 0 nsec maxDeliverTime: 0 sec 0 nsec
ConnSwitchMsg sent=0 ackd=0 markDlvrd=0 lastAckedTxnIdByServer=0 CSServerNo=0
cnxCheckPending=0
DiskCom [2]: Lease state=VALID phase=FIRST cnx=STANDBY
Lease renew=293 LRSent=293 LRAcked=293
MsgSent=0 MsgRcvd=588 avgRespTime: 0 sec 0 nsec
avgDeliverTime: 0 sec 0 nsec maxDeliverTime: 0 sec 0 nsec
ConnSwitchMsg sent=0 ackd=0 markDlvrd=0 lastAckedTxnIdByServer=0 CSServerNo=0
cnxCheckPending=0

CONNECTIONS WITH CLIENTS:
...
Client: name=vss7-c58.aus.stglabs.ibm.com nodeId=0xE1413D1204BA11EB:8004AE25C8FC3904 state=IDENTIFIED
Id=0x062F1B1E:0000000
DiskCom[0]: state=IDENTIFIED cnx=STANDBY
Net [1]: state=IDENTIFIED cnx=ACTIVE IPAddr=10.10.201.58

Network issue is resolved on primary network and resulting failback to primary interface:


# pooladm dump cnxs
ClusterName=sfstore MyNodeName=vss7-c57.aus.stglabs.ibm.com:

CONNECTIONS WITH SERVERS:

Server: name=vss7-c58.aus.stglabs.ibm.com nodeId=0xE1413D1204BA11EB:8004AE25C8FC3904 lease state=VALID
lease phase=FIRST ip=9.3.148.120 (BKUP-LEADER)
Net [0]: Lease state=VALID phase=FIRST cnx=ACTIVE IPAddr=9.3.148.120 (PrimaryIP)
Lease renew=107 LRSent=116 LRAcked=103
MsgSent=4 MsgRcvd=218 avgRespTime: 0 sec 0 nsec
avgDeliverTime: 0 sec 0 nsec maxDeliverTime: 0 sec 0 nsec
ConnSwitchMsg sent=0 ackd=0 markDlvrd=0 lastAckedTxnIdByServer=0 CSServerNo=0
cnxCheckPending=0
Net [1]: Lease state=VALID phase=FIRST cnx=STANDBY IPAddr=10.10.201.58
Lease renew=329 LRSent=326 LRAcked=326
MsgSent=3 MsgRcvd=660 avgRespTime: 0 sec 0 nsec
avgDeliverTime: 0 sec 0 nsec maxDeliverTime: 0 sec 0 nsec
ConnSwitchMsg sent=0 ackd=0 markDlvrd=0 lastAckedTxnIdByServer=0 CSServerNo=0
cnxCheckPending=0
DiskCom [2]: Lease state=VALID phase=FIRST cnx=STANDBY
Lease renew=324 LRSent=324 LRAcked=324
MsgSent=0 MsgRcvd=650 avgRespTime: 0 sec 0 nsec
avgDeliverTime: 0 sec 0 nsec maxDeliverTime: 0 sec 0 nsec
ConnSwitchMsg sent=0 ackd=0 markDlvrd=0 lastAckedTxnIdByServer=0 CSServerNo=0
cnxCheckPending=0

CONNECTIONS WITH CLIENTS:

Client: name=vss7-c58.aus.stglabs.ibm.com nodeId=0xE1413D1204BA11EB:8004AE25C8FC3904 state=IDENTIFIED
Id=0x062F1B1E:00000001
Net [0]: state=IDENTIFIED cnx=ACTIVE IPAddr=9.3.148.120
DiskCom[1]: state=IDENTIFIED cnx=STANDBY
Net [2]: state=IDENTIFIED cnx=STANDBY IPAddr=10.10.201.58

VIOSBR with Multiple Network Interfaces

The storage pool can utilize multiple network interfaces; additional network interfaces are added and removed on the VIOS CLI via the "cluster" command with the "-addips" and "-rmips" options. Previously, with viosbr backup and restore, the secondary network interfaces were not automatically restored (only the primary interface was restored) and the user had to manually add them back afterwards.
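As a sketch of that usage: the node name and IP address below are placeholders, and the option names other than -addips and -rmips are assumptions for illustration, so verify the exact syntax with the cluster man page on your VIOS level.

# vios1.example.com and 10.10.201.57 are placeholders; -clustername, -hostname,
# and -ips are assumed option names for illustration
$ cluster -addips -clustername clusterA -hostname vios1.example.com -ips 10.10.201.57
$ cluster -rmips -clustername clusterA -hostname vios1.example.com -ips 10.10.201.57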

viosbr will now restore the secondary network interfaces. However, an additional step is required to ensure the storage pool recognizes the secondary interfaces: the user needs to restart one VIO server in the cluster, such as with the "clstartstop" stop/start sequence shown below, after which the storage pool will recognize the restored secondary interfaces.  Avoiding this additional step is something we are investigating for the future.
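A minimal sketch of that flow follows, assuming a cluster-level configuration backup taken earlier with viosbr; the cluster name, host name, and file name are placeholders, and the appropriate viosbr restore options depend on the scenario, so consult the viosbr man page.

# Placeholder names: clusterA, vios1.example.com, clusterA_backup
# Cluster-level configuration backup taken beforehand
$ viosbr -backup -clustername clusterA -file clusterA_backup
# ... viosbr -restore of the configuration as appropriate for the scenario ...
# Restart one VIO server so the pool picks up the restored secondary interfaces
$ clstartstop -stop -n clusterA -m vios1.example.com
$ clstartstop -start -n clusterA -m vios1.example.com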

Storage Pool Locality Awareness

This item introduces locality awareness of VIOS redundancy in the storage pool in order to make more intelligent decisions by factoring in the cluster topology. Specifically, the selection of a new meta-data (MFS) manager considers which nodes are on the same frame or CEC. If at all possible, the next MFS manager is selected from a different CEC or frame than that of the current MFS manager. The transition of the MFS manager can be a more stressful operation, so this heuristic helps reduce the probability of a dual VIOS pair failure that would impact the LPAR clients of those two VIO servers. This information may also be used in other scenarios in the future.

Consider the scenario where the original MFS manager crashes and then the new MFS manager encounters a problem that forces the pool to be taken offline on that node: if the second MFS manager candidate is part of the same VIO server pair as the original, its clients will now be impacted. By choosing an MFS manager on another frame, this problem can be avoided even with a double server failure.

For example, consider a 4-node cluster spread over 2 frames: when the original MFS manager node fails, the next MFS manager node is selected from the other frame.
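To see which frame or CEC each cluster node resides on, the per-node machine type, model, and serial reported in the verbose cluster status output can be checked. This is a sketch with a placeholder cluster name, and the fields shown vary by VIOS level.

# clusterA is a placeholder; the per-node machine type-model-serial (MTM) field
# in the verbose output identifies the frame/CEC hosting each VIO server
$ cluster -status -clustername clusterA -verbose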

Storage Pool Disk Challenge Improvements

The storage pool disk challenge is a secondary protection method to avoid having multiple MFS managers, which can occur in a split-brain scenario. This method employs time boxing to ensure the current MFS manager is able to read a special challenge disk sector and respond to challenges from other nodes attempting to become the MFS manager. Several improvements have been introduced to further minimize unnecessary MFS manager transitions and delays in manager transition, and to improve general robustness. These changes include:

  • Increasing the challenge I/O timeout interval (from 5 to 10 seconds) to allow more tolerance for SAN or disk delays.
  • Improving the handling of challenge disk failures that can delay manager transition:
    • Quick replacement of the challenge disk when a challenge thread is stuck on I/O (i.e., not waiting on the full I/O failure timeout).
    • Ability to swap a challenge disk out of the replica set before the MFS manager resigns due to a disk challenge timeout.
  • Strictly enforcing the time box for stopping the MFS manager on disk challenge failure or timeout when conditions are favorable for a split-brain scenario (and asserting the VIO server if necessary when the MFS manager takes too long to stop).

General Resiliency

Various other improvements in resiliency include:

  • Improving FFDC for the storage pool replica set. The replica set is utilized for redundancy of the highest-level on-disk meta-data and also to determine when a pool may be safely started. This change adds periodic checks of the replica set disks to validate the highest-level on-disk data (disk root and meta-root validation). If an issue is detected, the storage pool is stopped immediately to ensure the on-disk data is not modified further, which helps with understanding how an inconsistency occurred and also improves the chance of recovery. Previously, this type of issue might not have been detected until an MFS manager change occurred, with the actual problem potentially having happened quite some time before.
  • Some pool start / stop improvements
    • Faster pool start with parallel read of meta-roots from system tier disks.
    • Enhancements for pool critical events used with the start and stop of the MFS manager. Critical event handling ensures that these operations are performed in a timely manner to avoid a hang / deadlock situation.
    • Using an adaptive delay for selecting MFS managers when MFS manager failures occur over all cluster nodes (versus a longer fixed delay).
  • Additional hardening of the VIOS DBN election.

Concluding Remarks

The enhancements to Shared Storage Pools in PowerVM 3.1.2 have focused primarily on improving the resiliency of the product, based on issues encountered in the field and on customer feedback. PowerVM will continue to enhance the resiliency and feature set provided by SSP in future releases. Please contact the author if you have any questions about the new features, about SSP in general, or with any other feedback on the product.

Contacting the PowerVM Team

Have questions for the PowerVM team or want to learn more?  Follow our discussion group on LinkedIn (IBM PowerVM) or the IBM Community Discussions.
