Enhancements for Shared Storage Pools
PowerVM continues to enhance Shared Storage Pools (SSP), PowerVM's cloud storage. SSP simplifies cloud management and improves storage efficiency. PowerVM 2.2.5 includes the following SSP enhancements:
Background on Shared Storage Pools
One aspect of PowerVM is known as VIOS SSP, which stands for VIOS Shared Storage Pools.
VIOS SSP allows a group of VIOS nodes to form a cluster and provision virtual storage to client LPARs. The VIOS nodes in the cluster all have access to the same underlying physical disks, which are grouped into a single pool of storage. A virtual disk, or logical unit (LU), can be carved out of that storage pool and mapped to a client LPAR as a virtual SCSI device. An LU may be thin or thick provisioned: a thin-provisioned LU does not reserve blocks until they are written to, while a thick-provisioned LU reserves all of its storage when it is created.
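The thin versus thick distinction can be illustrated outside of SSP with sparse files, which likewise allocate blocks lazily on write. The following is a sketch only (the file names and sizes are arbitrary, and this is an analogy, not the SSP implementation):

```shell
# Sketch: a sparse file mimics a thin-provisioned LU (blocks allocated
# only when written), while a fully written file mimics a thick one.
workdir=$(mktemp -d)

# Thin-style: 1 GB apparent size, almost no blocks actually allocated.
dd if=/dev/zero of="$workdir/thin.img" bs=1 count=0 seek=1G 2>/dev/null

# Thick-style: every block written up front (16 MB here for speed).
dd if=/dev/zero of="$workdir/thick.img" bs=1M count=16 2>/dev/null

# The first column of ls -ls shows allocated blocks vs. apparent size.
ls -ls "$workdir"
```

Running `ls -ls` shows the thin image with a large apparent size but near-zero allocated blocks, which is exactly the behavior a thin-provisioned LU exhibits in the pool.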
Once an LU has been created in the pool, snapshots or clones of that LU can be created. The number of snapshots and clones created is limited only by the amount of available storage in the pool, and creating these objects happens nearly instantly. Snapshots are used for rolling back to previous points in time. Clones are used for provisioning new space efficient copies of an LU. These clones are managed by PowerVC capture and deploy image management operations.
These features allow rapid deployment of new client LPARs in a Cloud Computing environment. The storage pooling model of VIOS SSP simplifies administration of large amounts of storage. The clustering aspect of VIOS SSP provides fault tolerance between VIOS multipathing pairs, and simplifies verification that other nodes can see the storage and are eligible for LPAR mobility operations.
Additional background information on VIOS SSP can be obtained from the IBM Knowledge Center or from IBM Redbooks publications.
Cluster-wide Automatic Snapshot
Debug data collection across the cluster is often error prone and inconvenient. The following issues may appear during data collection:
- Admins may collect snaps from some nodes but not others. The node on which the problem manifests is not necessarily the key node from which debug information should be collected.
- Admins may not know they need to collect a snap until it is too late and logs have wrapped.
- Admins recovering from an outage will likely prioritize recovery over collecting data from the failure, increasing the likelihood of log wrap or complete loss of debug information (for example, rebooting before taking a snap).
The solution to these issues is to automate the debug collection process with a cluster-wide snapshot that is triggered when a "major" problem or unexpected issue occurs on a cluster node. A snapshot is collected for each node in the cluster and then aggregated in a convenient manner. An example flow of this cluster operation is captured in the following diagram:
Some specifics on the cluster-wide snapshot include:
- Each of the SSP components (CAA, RSCT, VIOS, pool) can trigger the cluster snapshot when a major problem occurs at that component level.
- The current types of problems triggering a cluster snapshot are:
  - Network outages
  - Pool full condition
  - Other pool outages (for example, inability to write metadata for a period of time)
  - Pool start failure
  - Cluster operation failures (cluster create, add/remove node)
  - LU operation failures (LU remove, LU move)
  - Tier create failure
  - Backup/Restore failure
  - Long running command failures (for example, remove PV or replace PV)
  - Election failure for the DBN node
  - Inability to appoint an MFS manager in the cluster, or resignation of the MFS manager
- Spam filtering is incorporated to avoid redundant snap requests. Only one cluster snapshot occurs at a time, so the same event registered on several nodes generates only a single snapshot.
- The various SSP components also guard against repeated snap requests from an ongoing problem condition.
- For any nodes that are unreachable due to network and disk isolation, the snap is automatically delayed until network and disk access is restored.
- Once snaps have been taken on each node, they are transferred to the initiator node and stored in a single compressed tar file.
- The cluster snapshot file, csnap, is in tar.gz format and is stored in /home/ios/logs/ssp_ffdc on the initiator node.
- A cleanup policy automatically deletes older csnap files based on their age and count. A maximum of 10 files are retained.
User-initiated snaps can utilize the same cluster-wide framework via the clffdc command.
clffdc Command
Administrators can use the clffdc command to manually trigger snap collection for the various components. The syntax for this command is:
clffdc -c component [-l localCorrelator] [-p priority] [-v verbosity] [-f file]
[-n lineNumber] [-g correlator] [-s]
In regard to the various options:
- The specified component can be VIOS, CAA, RSCT, pool, or FULL. The FULL option produces "snap -a" on each node instead of a reduced snapshot containing only SSP-specific information.
- The priority indicates the severity of the failure: 1 (high), 2 (medium), or 3 (low).
- A unique correlator value is used to associate the individual node snapshots belonging to a common cluster snapshot.
The csnap file has the format: csnap_date_time_by_component_priority_correlator.tar.gz.
For example, a cluster snapshot generated by CAA with medium priority and correlator value 4 is named csnap_20161023_103735_by_caa_Med_c4.tar.gz.
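Because the fields are underscore-delimited, they can be recovered from a csnap file name with ordinary shell parameter expansion. A small sketch using the sample name above (the variable names are arbitrary):

```shell
# Parse component, priority, and correlator from a csnap file name of the
# form csnap_date_time_by_component_priority_correlator.tar.gz.
name="csnap_20161023_103735_by_caa_Med_c4.tar.gz"

base=${name%.tar.gz}                  # strip the extension
rest=${base#csnap_}                   # 20161023_103735_by_caa_Med_c4
date=${rest%%_*};  rest=${rest#*_}    # date  = 20161023
time=${rest%%_*};  rest=${rest#*_}    # time  = 103735
rest=${rest#by_}                      # caa_Med_c4
component=${rest%%_*}; rest=${rest#*_}
priority=${rest%%_*}
correlator=${rest#*_}

echo "component=$component priority=$priority correlator=$correlator"
```

For the sample name this prints `component=caa priority=Med correlator=c4`, which makes it easy to sort or group collected csnap files by originating component.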
Improved Network Outage Handling
Background
Network outage handling has primarily focused on the symmetric type of network isolation, where disjoint islands of nodes cannot communicate with each other. An island of nodes can communicate with every other node in its island over the network, but not with nodes from a different island.
If a node is isolated from the rest of the cluster, the leader will expel this node to allow forward progress in the cluster.
An asymmetric network outage is where network links are lost, but a node is not fully isolated. Handling for this situation was not previously optimized to minimize expels.
Improved Handling for Asymmetric Network Outages
Question: What is a network asymmetry?
- The VIOS nodes in the cluster play various roles:
  - Every node is a client of the storage pool.
  - There is also one leader, one MFS manager (server), and one DBN node per cluster at any given time.
  - Network asymmetry occurs when a client can communicate with the leader but not with the server.
The following diagram shows a network asymmetry in regard to the MFS manager or server and various nodes in the cluster:
Question: How can we improve on the current handling and minimize the number of expels?
The existing algorithm for handling network asymmetry makes a local decision on the client node where the client forces itself to be expelled if it can't maintain connectivity with the server. The server is given priority over the client, which is not the best decision if several clients can communicate fine with each other and the problem lies with the server.
- A better algorithm for handling this condition makes a global decision from the perspective of the leader node:
  - If multiple clients are complaining about an unhealthy server, the leader can expel the server to minimize cluster impact.
  - The various nodes in the cluster report to the leader node any network issues they experience with a server.
  - The leader can then make an informed decision on whether the server should be expelled.
  - Symmetric handling is given priority by ensuring that this more common handling kicks in first. Asymmetric handling starts after a longer time period (2 additional lease intervals after lease expiry with the server), if required, so that the two types of handling do not conflict with each other.
Network Health FFDC
When a node is expelled at the pool level due to an apparent loss of network, there is always the question of whether this event was indeed due to a network outage. Generally this question cannot be answered post mortem because the system is no longer in the same state as when the problem occurred. The potential network issue can fall into several categories:
- A network problem that is specific to an individual SSP connection or set of connections (symmetric or asymmetric network outage).
- A network problem that is specific to all connections on a particular node (total network isolation).
- The network itself may be healthy, but the SSP threads may be unresponsive due to CPU starvation, or similar starvation may occur in the lower-level network layer handling.
- A software bug with network handling among various layers.
The solution to this problem is to capture more network health state when an expel occurs:
- Capture more internal statistics on the SSP connections during runtime (client and server information).
- Capture ping results between nodes.
- Capture lparstat to check for thread starvation.
- Allow additional network statistics (for example, tcpdump) to be captured easily in the future, if desired:
  - A configurable script is invoked at /opt/pool/dump.netstat.
  - Event output is logged in /var/adm/pool/netstat.log.
The network health capture is performed automatically at the time of expel on the leader node and the expelled node, but it can also be explicitly invoked via pooladm (after performing oem_setup_env on the VIOS):
pooladm dump netstat [-reset]
Example of Data Capture in netstat.log
The following is an example of the data collection logfile, netstat.log, for network health on the leader node of the cluster after an expel event:
########
# DATE #
########
Fri Oct 21 10:28:15 CDT 2016
#########
# NODES #
#########
Expelled:
vss7-c58.aus.stglabs.ibm.com
#######
# MSG #
#######
Client stats with server: #0 vss7-c57.aus.stglabs.ibm.com
numMsgSent: 196
numMsgRcvd: 373
avgRespTime: 0 sec 0 nsec
maxRespTime: 0 sec 0 nsec
Client stats with server: #1 vss7-c58.aus.stglabs.ibm.com
numMsgSent: 10
numMsgRcvd: 19
avgRespTime: 0 sec 0 nsec
maxRespTime: 0 sec 0 nsec
Client stats with server: #2 vss7-c59.aus.stglabs.ibm.com
numMsgSent: 8
numMsgRcvd: 183
avgRespTime: 0 sec 0 nsec
maxRespTime: 0 sec 0 nsec
Client aggregate stats:
numMsgSent: 214
numMsgRcvd: 575
avgRespTime: 0 sec 0 nsec
maxRespTime: 0 sec 0 nsec
Server stats:
numMsgSent: 1339
numMsgRcvd: 1611
avgRespTime: 0 sec 798475 nsec
maxRespTime: 0 sec 287679650 nsec
########
# PING #
########
PING vss7-c58.aus.stglabs.ibm.com: (9.3.148.120): 56 data bytes
--- vss7-c58.aus.stglabs.ibm.com ping statistics ---
10 packets transmitted, 0 packets received, 100% packet loss
########
# LPAR #
########
System configuration: type=Shared mode=Capped smt=4 lcpu=4 mem=3072MB psize=64 ent=1.00
%user %sys %wait %idle physc %entc lbusy vcsw phint
----- ----- ------ ------ ----- ----- ------ ----- -----
0.0 0.0 0.2 99.8 0.00 0.0 0.2 15823321 1608
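When triaging an expel, the PING section of netstat.log is usually the quickest confirmation of an outage, and it can be pulled out with a one-line grep. The sketch below runs against an inlined sample matching the format above; on a real system the input would be /var/adm/pool/netstat.log:

```shell
# Sketch: extract the ping packet-loss summary from a netstat.log-style
# capture to quickly confirm whether a node was unreachable.
cat > sample_netstat.log <<'EOF'
########
# PING #
########
PING vss7-c58.aus.stglabs.ibm.com: (9.3.148.120): 56 data bytes
--- vss7-c58.aus.stglabs.ibm.com ping statistics ---
10 packets transmitted, 0 packets received, 100% packet loss
EOF

loss=$(grep 'packet loss' sample_netstat.log)
echo "$loss"
case $loss in
    *"100% packet loss"*) echo "node unreachable over the network" ;;
esac
```

Here the 100% packet loss line confirms that the expelled node really was unreachable, distinguishing a genuine network outage from the starvation and software-bug categories listed earlier.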
Network Lease by Clock Tick
A recurring issue has been an unexpected loss of network lease for a cluster node when the system administrator changes the system time:
- The network lease with the leader was based on local time of day.
- The administrator had to stop SSP on a node prior to updating its system time.
- Otherwise, if the time was moved forward far enough, the network lease with the leader expired and the node was expelled; this is more problematic when performed on several nodes at once.
- Starting NTP (Network Time Protocol) with clocks out of sync could also trigger this.
The solution to this problem is basing the network lease on clock ticks since boot time.
The use of NTP for synchronizing cluster node clocks is still recommended to assist with easier cluster log analysis.
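The difference between the two lease schemes can be sketched with simple arithmetic. This is purely illustrative, not VIOS code; the 30-second lease length and the simulated one-hour clock jump are assumptions:

```shell
# Sketch: contrast a wall-clock lease with a ticks-since-boot lease when
# the administrator jumps the system time forward by one hour.
LEASE=30                          # assumed lease length in seconds

# Wall-clock lease: expiry is an absolute time-of-day value, so a clock
# jump instantly "ages" the lease past its expiry.
now=$(date +%s)
wall_expiry=$((now + LEASE))
now=$((now + 3600))               # simulate the admin moving the clock
[ "$now" -gt "$wall_expiry" ] && echo "wall-clock lease: expired, node expelled"

# Tick-based lease: only real elapsed time advances the counter, so the
# same time-of-day change has no effect on lease validity.
elapsed_ticks=0                   # ticks since the last lease renewal
[ "$elapsed_ticks" -le "$LEASE" ] && echo "tick-based lease: still valid"
```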
Auto Log Analysis
Auto log analysis is a feature at the storage pool level that helps address the difficulties of diagnosing pool problems in the cluster. It is motivated by several factors:
- Analysis of debug data from all nodes in the cluster can be very time consuming.
- Even a high level analysis becomes unwieldy with larger clusters (16-24 nodes) and will eventually be impractical.
The solution to the increasing complexity in analyzing the storage pool is to provide a command line utility for pool analysis that provides a summary of important cluster state and changes with respect to the storage pool.
The auto log analysis:
- Utilizes the cluster wide auto snap framework that provides a consistent directory hierarchy.
- Detects and reports common problem signatures.
- Assists in quickly determining the problem node(s) that should be focused on.
The current summary information provided by this tool includes:
- A list of nodes in the cluster and current status.
- History of MFS managers.
- Expel history.
- Tiers in pool and details; disks in pool and usage.
- Recent command failures.
- In progress commands.
Command Options
Auto log analysis for the storage pool is invoked with the pooladm command. If the analysis is not performed on the cluster itself, the pooladm command must be copied over to the system performing the analysis. The options include: analysis of the cluster-wide snapshot, analysis of a single node (based on logs from a live system), and the ability to unpack the cluster-wide snapshot for analysis.
Analysis of cluster-wide snapshot
# pooladm analyze snap
snap <csnapPath> { [ -all ] | [ -nodeList | -mfsHistory [<maxEntries>] |
-expelHistory [<maxEntries>] | -tierList | -diskList [-v] |
-cmdFailures [<maxEntries>] | -cmdInProgress ] }
where:
<csnapPath> The absolute path to the unpacked cluster wide snap
<maxEntries> The max number of entries to display
Unpack of cluster-wide snapshot
# pooladm analyze unpack
unpack <csnapPath>
where:
<csnapPath> The absolute path to the packed cluster wide snap
Analysis of live system from single node logs
# pooladm analyze live
live [ -d <snapPath> ]
{ [ -all ] | [ -nodeList | -mfsHistory [<maxEntries>] |
-expelHistory [<maxEntries>] | -tierList | -diskList [-v] |
-cmdFailures [<maxEntries>] | -cmdInProgress ] }
where:
<snapPath> The absolute path to the snap for a single node
If not specified /var/adm/pool/pool.snap.system
is used.
<maxEntries> The max number of entries to display
Example Command Use
# pooladm analyze unpack /tmp/csnap_20161023_103735_by_caa_Med_c4.tar.gz
# ls /tmp/csnap
vss7-c57 vss7-c58 vss7-c59 vss7-c60
# pooladm analyze snap /tmp/csnap -all
=== Begin Node List ===
Node Name IP Address Status Leader MFS
vss7-c59.aus.stglabs.ibm.com 9.3.148.121 online Yes Yes
vss7-c60.aus.stglabs.ibm.com 9.3.148.127 online No No
vss7-c58.aus.stglabs.ibm.com 9.3.148.120 online No No
vss7-c57.aus.stglabs.ibm.com 9.3.148.119 online No No
=== End Node List ===
=== Begin MFS History ===
Node Name Timestamp Event
vss7-c59 Sun Oct 23 09:56:39 2016 Elected
vss7-c57 Sun Oct 23 09:56:33 2016 Expelled
vss7-c57 Sun Oct 23 09:47:42 2016 Elected
=== End MFS History ===
=== Begin Expel History ===
Node Name Timestamp Reason
vss7-c58.aus.stglabs.ibm.com Sun Oct 23 09:56:14 2016 leader majority, so nodes on network watch list are thrown out
# pooladm analyze snap /tmp/csnap -tierList -diskList
=== Begin Tier List ===
TierName Capacity Freespace EC NStale NECCR
================================================================================
tier1 9984 MB 9983 MB NONE 0 0
SYSTEM 4992 MB 4889 MB MIRROR2 0 0
=== End Tier List ===
=== Begin Disk List ===
Pool: /pool1
Node: vss7-c59
Disk Tier FG
=========================================
/dev/hdisk4 tier1 fg1
/dev/hdisk5 tier1 fg2
/dev/hdisk2 SYSTEM fg1
/dev/hdisk3 SYSTEM fg2
=== End Disk List ===
VIOSBR Automatic Backup
The viosbr command now has the ability to automatically take a backup of the VIOS and SSP configuration whenever there are any configuration changes.
- This is performed via a cron job that is triggered every hour and enabled on the system by default.
- Administrators can stop or start the feature and check the status.
The command and options are:
viosbr -autobackup {start | stop | status} [ -type {cluster | node} ]
In regard to the options:
- The start option starts the autobackup feature.
- The stop option stops the autobackup feature.
- The status option checks whether the autobackup file is up to date.
If SSP is configured, the cluster level backup file is present only in the default path of the database node.
The save option is used to save the backup file to the default path on the other nodes of the cluster:
viosbr -autobackup save
Only the latest copy of the backup file is stored in the default path /home/padmin/cfgbackups.
Here is an example of the backup files:
$ ls -l /home/padmin/cfgbackups
-rw-r--r-- 1 root system 5464 Oct 10 03:00 autoviosbr_jaguar9.tar.gz
-rw-r--r-- 1 root system 5464 Oct 10 03:00 autoviosbr_SSP.mycluster.tar.gz
- autoviosbr_jaguar9.tar.gz file contains the VIOS backup data.
- autoviosbr_SSP.mycluster.tar.gz file contains the SSP cluster level backup data.
VIOSBR for Disaster Recovery
Overview
The viosbr command is enhanced with a new disaster recovery option, 'viosbr -dr', that restores the SSP cluster on a secondary setup with mirrored storage and a different set of hosts. The secondary setup can be a local site or a remote disaster site; the prerequisite is mirrored storage across sites. A backup of the cluster configuration at the primary site is first taken with the viosbr command, and upon primary site failure, viosbr is invoked to restore the cluster configuration at the secondary site with a new set of VIO servers and the mirrored storage. Note that this is a manual disaster recovery process controlled by the administrator.
An overview of the disaster recovery process is at:
PowerVM Disaster Recovery (DR) with VIOS and SSP
Primary Site
The following steps are performed on the primary site for disaster recovery handling with the viosbr command:
- Enable storage level mirroring of all the disks (Storage1 and Storage2).
- Take the backup of the primary site cluster configurations with the viosbr command.
The following diagram shows an example configuration of the primary site with storage mirroring to secondary site.
Secondary Site
As one step in cluster restoration, the cluster is created on the secondary site by providing the following input for the viosbr command.
- Primary site backup file.
- New host name list for the SSP cluster definition.
- Disk list for the mirrored disks from Storage2.
Additional steps are required for restoring client LPARs and mappings from the primary site.
The following diagram shows the secondary site SSP configuration.
Command Usage
Here is an example invocation of the viosbr command for disaster recovery restoration on the secondary site with sample input files:
viosbr -dr -clustername mycluster -file systemA.mycluster.tar.gz -type cluster -typeInputs hostnames_file:/home/padmin/nodelist,pooldisks_file:/home/padmin/disklist -repopvs hdisk#
$ cat /home/padmin/nodelist
DRVIOS1
$ cat /home/padmin/disklist
hdisk1
hdisk3
hdisk2
Concluding Remarks
The enhancements of PowerVM 2.2.5 with Shared Storage Pools have focused primarily on improved resiliency of the product based on issues encountered in the field and from customer feedback. PowerVM will continue to enhance the resiliency and feature set provided by SSP in future releases.
Contacting the PowerVM Team
Have questions for the PowerVM team or want to learn more? Follow our discussion group on LinkedIn (IBM PowerVM) or visit the IBM Community Discussions.