PowerVM

Live Partition Mobility (LPM) Performance Tips and Results

By Pete Heyrman posted Wed June 17, 2020 08:57 AM

  
Live Partition Mobility (LPM) allows active partitions to be migrated from one Power server to another without any downtime.  LPM enables 24x7 operations by letting you move workloads ahead of planned outages such as hardware repairs or firmware updates.  One issue customers face is the limited amount of time available to perform these updates.  The following sections provide performance hints to accelerate workload movement, along with some LPM results to help you better plan your LPM operations.

Network Performance
With Live Partition Mobility, the majority of the time to complete an LPM operation is spent transferring the contents of the partition's memory from the source system to the target system, so the faster the network connection, the faster the LPM operation.  LPM is supported on 1Gb networks, but 10Gb and faster network connections are recommended for the best performance.  Starting with PowerVM 2.2.4, the VIOS supports link aggregation, so if you do not have 10Gb network infrastructure in your environment, you can use link aggregation to provide additional bandwidth.  It also helps to have dedicated network connections for your LPM operations.  This ensures that the planned capacity is available for LPM, and it protects a shared connection used for other business operations from becoming saturated when LPM operations start.  If you use a separate connection, specify it with the source_msp_ipaddr and dest_msp_ipaddr parameters on the migrlpar command.
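As an illustration, the dedicated LPM network can be selected on the HMC command line roughly as follows.  The system and partition names and the IP addresses are placeholders, and the exact migrlpar syntax can vary by HMC level, so check the migrlpar man page on your HMC:

```shell
# Placeholder names/addresses -- substitute your own.
# -o m requests a migration, -m/-t name the source and target managed
# systems, and the -i string routes LPM traffic over the dedicated
# mover service partition (MSP) addresses.
migrlpar -o m -m source_sys -t target_sys -p my_lpar \
    -i "source_msp_ipaddr=192.0.2.10,dest_msp_ipaddr=192.0.2.20"
```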


 

For the network connection, using jumbo frames, large send offload, and large receive offload (LRO) provides better network performance for LPM than the typical default 1500-byte packet size.  LPM is similar in nature to a file transfer protocol (FTP) workload, where the goal is to move as much data as possible, as fast as possible, from one server to another.  Jumbo frames and LRO reduce CPU consumption on the server by sending more data per request, and they help saturate the network connection because more data is transmitted in a single network request.  To use these features, all network interfaces and devices in the path must support them, which may require a separate infrastructure just for LPM.  It is worth mentioning that jumbo frames become more important as network speed increases: performance testing showed that on a 10Gb network jumbo frames provide at most a 10% improvement, but on a 40Gb network the improvement is closer to 40%.
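As a sketch, these options are typically enabled on the VIOS with chdev against the shared Ethernet adapter.  The device name ent5 is a placeholder, and attribute names can vary by adapter type, so verify them first:

```shell
# ent5 is a placeholder for the shared Ethernet adapter on the VIOS.
# Attribute names vary by adapter; verify with: lsdev -dev ent5 -attr
chdev -dev ent5 -attr jumbo_frames=yes    # 9000-byte frames end to end
chdev -dev ent5 -attr large_receive=yes   # large receive offload (LRO)
chdev -dev ent5 -attr largesend=1         # large send offload
```

Remember that jumbo frames only help if every interface and switch port in the LPM path is configured for the larger MTU.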

VIOS Performance
Active LPM operations consume additional CPU resources to manage the data movement from the source system to the target system, so you need sufficient resources in the Virtual I/O Servers (VIOS) when using 10Gb network connections.  On a POWER8 server, the recommendation is to assign 2 additional physical CPUs to a VIOS that uses dedicated processors, or 2 additional virtual CPUs to a VIOS in an uncapped shared-processor configuration.  This ensures there are sufficient resources for optimal LPM performance.  Recommendations for a VIOS using other networks or other Power processor models can be found at the IBM Knowledge Center (VIOS LPM Performance Recommendations).
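For example, CPU can be added to a running VIOS dynamically from the HMC command line; the names below are placeholders:

```shell
# Placeholders: managed_sys is the server, vios1 the VIOS partition name.
# For a dedicated-processor VIOS this adds 2 physical CPUs; for a
# shared-processor VIOS, --procs adds virtual processors instead
# (use --procunits to adjust entitled capacity).
chhwres -r proc -m managed_sys -o a -p vios1 --procs 2
```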


Starting with VIOS version 2.2.2.0, you can configure the resource level the VIOS uses for LPM operations; additional resources may improve LPM performance.  Concurrency level 5 provides the fewest resources for the LPM operation: at level 5, less memory is set aside for LPM buffers and other resources.  Concurrency level 3 is the default level and should provide sufficient resources to drive a 10Gb network connection.

Starting with VIOS version 2.2.4.0 and server firmware level FW840, additional resources are controlled by the concurrency level setting.  Concurrency level 4 is the default level with VIOS 2.2.4 and should provide sufficient resources to saturate a 10Gb network connection.  Concurrency levels 3 through 1 are meant to drive network speeds greater than 10Gb.  These levels not only allocate larger buffers but also provide additional CPU threads to drive additional bandwidth.  More information about the concurrency level attribute can be found at the IBM Knowledge Center (Concurrency Level Attribute).
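The concurrency level is an attribute of the vioslpm0 pseudo-device on the VIOS, and it can be inspected and changed from the VIOS command line roughly as follows (confirm the attribute name and valid values on your VIOS level):

```shell
# Show the current LPM tunables on the VIOS, including concurrency_lvl.
lsdev -dev vioslpm0 -attr
# Lower the level (more buffers and CPU threads per operation)
# to drive a network faster than 10Gb.
chdev -dev vioslpm0 -attr concurrency_lvl=2
```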

Another aspect of LPM performance is the time it takes to validate the storage connections.  As one would expect, the more storage adapters assigned to the VIOS, the longer it will take the VIOS to validate the storage connections.  Also, with VIOS version 2.2.4.0 there are two levels of validation for NPIV attached storage.  The default is to perform validation only at the port level.  This ensures that the storage device is accessible to the target VIOS for LPM operations.  The VIOS also offers validation for NPIV attached storage at the port and disk level.  Disk level validation ensures that all assigned disks (LUNs) are accessible to the target VIOS.  As would be expected, disk level validation takes more time to perform than just performing port level validation.  If you have previously used disk level validation and there have been no changes to your storage configuration, the time to perform the LPM operation can be reduced with port level validation.  You can find more information about VIOS NPIV validation at the IBM Knowledge Center (NPIV LUN or disk level validation).
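The validation depth for NPIV attached storage is also controlled through vioslpm0 attributes.  The attribute names below (src_lun_val, dest_lun_val) are our assumption based on the referenced Knowledge Center page, so verify them with lsdev before changing anything:

```shell
# Assumed attribute names -- confirm with: lsdev -dev vioslpm0 -attr
# Switch NPIV validation from disk (LUN) level to port level to shorten
# validation time when the storage configuration has not changed.
chdev -dev vioslpm0 -attr src_lun_val=off
chdev -dev vioslpm0 -attr dest_lun_val=off
```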

Operating System Performance
There aren’t any specific operating system settings that result in better LPM performance, but there are techniques that can reduce the amount of data that must be transferred from the source system to the target system.  When a partition is inactive (powered off), none of the partition’s memory contents need to be transferred.  If you are doing a server evacuation, consider shutting down non-essential partitions beforehand to reduce the overall LPM time.  Once the inactive partitions have been transferred to the target system, re-activate (boot) them from the management console.
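For instance, non-essential partitions can be shut down before the evacuation and reactivated afterwards from the HMC command line; the names below are placeholders:

```shell
# Placeholders: source_sys/target_sys are managed systems, batch_lpar is a
# non-essential partition.  Issue a graceful OS shutdown before evacuation:
chsysstate -m source_sys -r lpar -o osshutdown -n batch_lpar
# After the inactive migration completes, boot it on the target system:
chsysstate -m target_sys -r lpar -o on -n batch_lpar -f default_profile
```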


The amount of data that must be transmitted for an LPM operation varies with the activity level of the partition.  LPM works in two phases: in the first, speculative phase, the partition is still running on the source system while its memory contents are copied to the target system.  After this first copy completes, the partition starts running on the target system.  The LPM operation tracks all the data that is stale on the target system (modified after the initial copy), and this data must be transferred a second time from the source system to the target system.  The busier the partition, the more data is likely to have changed and to need retransmission.  If you can schedule LPM operations during periods of lower activity, the amount of changed data is reduced, which results in a faster overall migration.
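A back-of-the-envelope model of this two-pass transfer, using assumed numbers (a 48GB partition, roughly 1GB/s of effective throughput on a 10Gb link after protocol overhead, and 20% of memory dirtied during the first pass):

```shell
# All numbers are illustrative assumptions, not measurements.
MEM_GB=48          # partition memory to copy
EFF_GB_PER_SEC=1   # ~1GB/s effective on a 10Gb link after overhead
DIRTY_PCT=20       # share of memory modified while the first copy runs

FIRST_PASS=$((MEM_GB / EFF_GB_PER_SEC))                     # full copy
SECOND_PASS=$((MEM_GB * DIRTY_PCT / 100 / EFF_GB_PER_SEC))  # stale pages
TOTAL=$((FIRST_PASS + SECOND_PASS))
echo "Estimated memory transfer: ${TOTAL}s (${FIRST_PASS}s + ${SECOND_PASS}s)"
```

The busier the partition, the larger the dirty fraction becomes, which is why migrating during low activity shortens the second pass.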

LPM Performance Results
Performance is a crucial part of the overall LPM solution because customers need to meet specific business requirements for how long it takes to evacuate a server or migrate an individual partition.  As part of LPM development, IBM maintains various LPM environments to measure LPM performance.  We would like to share some performance results from one of these environments to help you estimate your LPM times.  The scenario is a pair of 9117-MMD servers at FW780, HMC level 830 SP1, and VIOS level 2.2.3.3.  There are two sets of dual VIOSes (four VIOSes in total).  The VIOSes use dedicated 10Gb Ethernet connections and 8Gb Fibre Channel ports.  The logical partitions (virtual machines) are a mixture of AIX 6.1 and AIX 7.1 images.  The rootvg device for the partitions is configured using virtual SCSI, and the application storage is configured with 12 virtual Fibre Channel connections.  The memory assigned to the partitions ranges from 8GB to 96GB, with an average in the 8-16GB range.  The logical partitions are configured with various numbers of virtual processors (4-32) and various entitled capacities (1.0-4.75 CPUs).  While the partitions are being migrated, we run a mixture of DB2 and WebSphere applications.  These OLTP-like transactions add up to over a billion transactions in a given day.  Having the applications active during LPM increases the overall migration times.

We used both VIOS pairs to perform the LPM operations.  Each VIOS pair can handle 8 concurrent LPM operations, so with two pairs of VIOSes we were able to perform 16 concurrent LPM operations.  The total time to move these 16 partitions from one server to the other was 25 minutes, with individual migrations taking between 21 and 24 minutes.

Migrating the partitions one after another would have taken slightly longer overall than concurrent migrations, because some processing time in the setup and finishing steps of the migration process can be overlapped when migrations run concurrently.  On the other hand, with concurrent migrations all partitions share the same network configuration, so the time each partition spends in the migrating state is longer: there are 8 simultaneous migrations over the same 10Gb network connection, and a single migration is able to saturate a 10Gb connection on its own.  If the time an individual partition spends in the migrating state is a concern, you can minimize it by performing the migrations sequentially, at the cost of a slightly longer overall migration time.

The results presented were obtained with typical software levels and typical hardware in use today, but your results will vary based on network traffic, partition sizes, software levels, and so on.  We are currently doing performance testing with the latest hardware and software, large partitions (>1.5TB), and high-speed connections (40Gb).  Initial testing shows better than a 3x improvement over previous 10Gb network results.

In conclusion, to get the best performance with LPM:
- Use 10Gb or better dedicated network connections
- Configure network options like LRO and Jumbo frames
- Ensure sufficient resources are configured for the VIOS to optimize LPM performance

Contacting the PowerVM Team
Have questions for the PowerVM team or want to learn more?  Follow our discussion group on LinkedIn (IBM PowerVM) or IBM Community Discussions.
