How Live Partition Mobility is Tested

By THIRUKUMARAN VASANTHA THANANJAYAN posted Wed June 17, 2020 01:27 PM

  
Introduction
IBM has multiple test organizations, each with its own mission.  The System Assurance organization has a Software Test team, a Hardware Test team, and a Storage Test team.  The Software Test mission includes verification of Operating Systems, VIOS, and Management Consoles.  The Hardware System Test mission includes IO adapters, Firmware, Hypervisor, and Serviceability.  Besides the System Assurance team, there are development Functional Test teams within AIX, IBM i, Linux, VIOS, Firmware, PowerVM Hypervisor, and Management Consoles, each with its own test mission.  Live Partition Mobility (LPM) is a critical PowerVM function; customers increasingly rely on it to avoid downtime during Hardware Maintenance, Firmware Upgrades, etc.  What makes LPM unique is that it spans the Hypervisor, Operating Systems, Firmware, HMC, VIOS, and Storage, and therefore all IBM Test teams are involved in testing LPM.
 

LPM Testing Coverage Consideration
LPM provides the customer with a sizeable number of options, and the LPM Test Procedures cycle through combinations of the options listed in the graphic below.
[Figure: PowerVM LPM Test Environment]

Our LPM test environment (a.k.a. LPM Zone) spreads across three different sites (Austin, Poughkeepsie and Guadalajara) encompassing 50 systems that range from POWER6 to the latest generation systems.

LPM Testing Procedure
Good Path Testing
Systems are set up to cover all the attributes shown in the graphic above. Within each of the components tested we also test variations, for example: LPM operations between different VIOS levels, and LPM operations with different settings within a VIOS level (concurrency levels, security profiles, etc.).  Combinations of Firmware levels are included.  Testing LPM for a new Firmware release translates to more than 100,000 migrations.
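
To give a feel for how quickly the combinations grow, here is a minimal Python sketch that enumerates a small test matrix. The attribute values, variable names, and the `migration_matrix` helper are all hypothetical placeholders for illustration; they are not the actual matrix or tooling used by the test teams.

```python
from itertools import product

# Hypothetical sample values; the real attributes and levels are
# defined by the test teams and are not reproduced here.
vios_levels = ["VIOS level A", "VIOS level B"]
firmware_levels = ["FW release X", "FW release Y"]
concurrency_levels = [1, 2, 3]
security_profiles = ["none", "profile-1"]


def migration_matrix():
    """Return every combination of
    (source VIOS, target VIOS, source FW, target FW, concurrency, profile)."""
    return list(
        product(
            vios_levels,         # VIOS level on the source system
            vios_levels,         # VIOS level on the target system
            firmware_levels,     # Firmware level on the source system
            firmware_levels,     # Firmware level on the target system
            concurrency_levels,  # concurrency setting under test
            security_profiles,   # security profile under test
        )
    )


matrix = migration_matrix()
print(len(matrix))  # 2 * 2 * 2 * 2 * 3 * 2 = 96 combinations
```

Even with two or three sample values per attribute the matrix has 96 entries, which suggests why a full release test runs into the hundreds of thousands of migrations.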

Bad Path Testing
This is where we inject errors or test failure scenarios. The objective of Bad Path Testing is to make sure that a failed LPM operation can be recovered without any issues.
Scenarios include (but are not limited to):
  • Stop LPM migration when in progress
  • VIOS Network cable pull (Source / Target)
  • VIOS FC cable pull (Source / Target)
  • VIOS Reboot (in Dual VIOS) (Source / Target)
  • MSP Failover verification
  • Inject error in IO Adapter during LPM
  • Migrate with invalid NPIV mapping (Single port or no Target mapping)
  • Migrate an Operating System that is unsupported on the target
  • Migration from Dual VIOS system to Single VIOS system and vice versa
  • vNIC – cable pull to trigger Failover
  • vNIC – cable reconnect to trigger Fallback
  • Cable pull in Link Aggregation
  • Memory error injection during Migration
  • vLAN Bridge / Redundant VIOS / MPIO / Redundant vNIC Backing Device / Redundant MSP overrides
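
The pattern behind these scenarios (inject a fault, confirm the migration fails cleanly, confirm the partition is still usable, then retry) can be sketched as a toy simulation. Everything below is hypothetical: the `Partition` class, the `migrate` function, and the system names are stand-ins for illustration, not an IBM API or the actual test harness.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    host: str              # system the partition currently runs on
    state: str = "Running"


class MigrationError(Exception):
    """Raised by the toy migrate() when an injected fault aborts the move."""


def migrate(partition, target, fault=None):
    """Toy migration: rolls back to the source if a fault is injected."""
    source = partition.host
    partition.state = "Migrating"
    try:
        if fault is not None:
            raise MigrationError(f"injected fault: {fault}")
        partition.host = target
    except MigrationError:
        partition.host = source    # roll back to the source system
        raise
    finally:
        partition.state = "Running"


def bad_path_case(fault):
    """Inject one fault, then check recovery and a clean retry."""
    lpar = Partition("lpar1", host="sysA")
    try:
        migrate(lpar, "sysB", fault=fault)
        failed = False
    except MigrationError:
        failed = True
    # After the failure the partition must still run on the source.
    recovered = lpar.host == "sysA" and lpar.state == "Running"
    migrate(lpar, "sysB")          # retry without the fault
    return failed and recovered and lpar.host == "sysB"


faults = ["VIOS network cable pull", "VIOS FC cable pull", "VIOS reboot"]
results = {fault: bad_path_case(fault) for fault in faults}
print(results)
```

The point of the sketch is the shape of the check: a bad path case passes only if the failure is detected, the partition is left running on the source, and a subsequent migration succeeds.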
 
Stress Testing
LPM is run continuously between systems in back-to-back loops, with other testing periodically run in between.

Sample scenarios:
  • 5 days of continuous LPM loops
  • DLPAR in between LPM loops
  • DPO (Affinity Optimization) in between LPM loops
  • Concurrent Code Update / Reject in between LPMs
  • Hibernate – Migrate – Resume in loop (Note: Hibernate will not be supported on POWER9)
  • Migrate Large Memory Partition (20TB)
  • Migrate – Remote Restart in loop
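
A stress loop of this kind can be pictured as a schedule of back-to-back migrations with other operations interleaved. The sketch below only builds such a schedule as a list of strings; the `stress_schedule` function, the system names, and the interval are hypothetical and are not taken from the actual test procedures.

```python
from itertools import cycle


def stress_schedule(iterations, interleaved_ops, every=2):
    """Build an ordered list of steps: migrations bounce between two
    systems, and one interleaved operation runs after every `every`
    migrations."""
    steps = []
    ops = cycle(interleaved_ops)
    source, target = "sysA", "sysB"   # hypothetical system names
    for i in range(1, iterations + 1):
        steps.append(f"migrate {source} -> {target}")
        source, target = target, source   # next loop goes back the other way
        if i % every == 0:
            steps.append(next(ops))
    return steps


schedule = stress_schedule(4, ["DLPAR", "DPO"])
for step in schedule:
    print(step)
```

For four migrations this prints two round trips between the systems, with a DLPAR step after the first round trip and a DPO step after the second.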

Follow-ups from LPM Testing & Field Experience

Our goal is to uncover and address all LPM issues to avoid customer impact. Since the last P8 Firmware release, we have been able to gauge the LPM success rate experienced by our customers by using statistics collected by Call Home. This graph shows how the LPM improvements incorporated in recent Firmware levels had a positive effect on the LPM success rate.  Success and failure rate calculations include both LPM Validations and actual migrations.

[Figure: PowerVM LPM Testing Watson]
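
As a worked example of a success rate that counts both validations and actual migrations, here is a minimal sketch. The formula is an assumption about how such a rate could be computed (successes divided by all attempts), and the counts are made up for illustration; they are not Call Home data.

```python
def success_rate(validations_ok, validations_failed,
                 migrations_ok, migrations_failed):
    """Fraction of successful operations, counting both LPM validations
    and actual migrations (assumed formula, for illustration only)."""
    total = (validations_ok + validations_failed
             + migrations_ok + migrations_failed)
    if total == 0:
        raise ValueError("no operations recorded")
    return (validations_ok + migrations_ok) / total


# Made-up counts, not real field statistics.
rate = success_rate(validations_ok=480, validations_failed=20,
                    migrations_ok=470, migrations_failed=30)
print(f"{rate:.1%}")  # 95.0%
```
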
With the breadth of testing that LPM entails, we are constantly testing LPM within IBM on existing and planned future releases of System Firmware, VIOS, HMC, PowerVC, and NovaLink.  In rare cases an LPM issue escapes to the field.  When a new LPM issue is encountered (in the field or within IBM), the fix is verified against the originally failing configuration. Thereafter our development team does a thorough analysis to determine when the problem was introduced.  If it is determined that the problem can occur in a publicly available release, the fix is included in the next scheduled service pack for each active impacted release.  In short, service packs include fixes for issues uncovered in our testing of follow-on releases as well as fixes for any field-reported issues.
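
The rule for deciding which releases receive a fix can be sketched as a simple filter. The `Release` fields, the release names, and the assumption that a problem affects every release from the one where it was introduced onward are all hypothetical simplifications for illustration, not a description of IBM's internal tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    name: str
    order: int      # position in the release sequence
    public: bool    # publicly available
    active: bool    # still receiving service packs


def releases_needing_fix(releases, introduced_in):
    """Return the names of public, active releases at or after the one
    where the problem was introduced (assumed to affect all later ones)."""
    first = next(r.order for r in releases if r.name == introduced_in)
    return [r.name for r in releases
            if r.order >= first and r.public and r.active]


# Hypothetical release history.
history = [
    Release("R1", 1, public=True, active=False),   # out of service
    Release("R2", 2, public=True, active=True),
    Release("R3", 3, public=True, active=True),
    Release("R4", 4, public=False, active=True),   # still in development
]
impacted = releases_needing_fix(history, introduced_in="R2")
print(impacted)  # ['R2', 'R3']
```

In this example R4 is not yet public, so the fix would simply be part of that release rather than delivered through a service pack.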

Conclusion
LPM is one of the most important virtualization features of PowerVM. As a result, we dedicate significant effort and resources to ensure that LPM works as expected and recovers successfully in case of failures. Keeping your Power Systems IT environment current on the latest recommended Service Packs, APARs, PTFs, etc. (for Firmware, VIOS, HMC, OSes...) will maximize LPM success rates.

Contacting the PowerVM Team
Have questions for the PowerVM team or want to learn more?  Follow our discussion group on LinkedIn (IBM PowerVM) or on IBM Community Discussions.


#powervmblog
#powervm
#powervmlpm