
Mechanism to minimize downtime on planned and unplanned outages

By HARIGANESH MURALIDHARAN posted Mon June 08, 2020 06:03 AM

  

In a data center, system outages can be planned or unplanned. A planned outage might be needed to upgrade the system to a new firmware release, to accommodate a scheduled power outage, or to perform hardware maintenance. An unplanned outage can be caused by a critical hardware failure. Specific to PowerVM, there are also cases where a Virtual IO Server (VIOS) partition in a non-redundant environment needs to be updated. In all of these cases, you would ideally like to avoid or minimize the downtime of your applications. PowerVM provides multiple options for this: Live Partition Mobility (LPM) and Simplified Remote Restart (SRR), either or both of which can be configured during partition creation or at runtime. This blog details the scenarios each one covers.

Planned Outages:

Live Partition Mobility (LPM) can be used in most cases for planned outages, provided you have spare capacity on another system to host the partitions. Let's say you wish to upgrade your system to a new firmware level. You set up a time window to perform the upgrade, but would not like your existing partitions to be shut down. In this case, you can use LPM to move the partitions to another system, upgrade the firmware level, restart the VIOS partitions on the original system, and then migrate the partitions back to the original system. The partitions and their applications keep running throughout, so there is no downtime in this case.
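The evacuate-and-return sequence described above can be sketched as a small command planner. This is a minimal illustration, not an IBM tool: it only builds HMC migrlpar command strings (-o v to validate, -o m to migrate) and does not run anything. The system and partition names are placeholders, the function names are hypothetical, and the exact options should be checked against the migrlpar man page for your HMC level.

```python
from typing import List


def migrlpar_cmd(operation: str, source: str, target: str, lpar: str) -> str:
    """Build one HMC migrlpar command string: 'v' validates, 'm' migrates."""
    if operation not in ("v", "m"):
        raise ValueError("operation must be 'v' (validate) or 'm' (migrate)")
    return f"migrlpar -o {operation} -m {source} -t {target} -p {lpar}"


def evacuation_plan(source: str, target: str, lpars: List[str]) -> List[str]:
    """Validate, then migrate, each partition from source to target."""
    plan = []
    for lpar in lpars:
        plan.append(migrlpar_cmd("v", source, target, lpar))
        plan.append(migrlpar_cmd("m", source, target, lpar))
    return plan


def return_plan(source: str, target: str, lpars: List[str]) -> List[str]:
    """After maintenance, move the partitions back to the original system."""
    return evacuation_plan(target, source, lpars)


if __name__ == "__main__":
    # Placeholder names (assumptions); substitute your own systems and partitions.
    for cmd in evacuation_plan("SystemA", "SystemB", ["lpar1", "lpar2"]):
        print(cmd)
    for cmd in return_plan("SystemA", "SystemB", ["lpar1", "lpar2"]):
        print(cmd)
```

Running the validation step for every partition before starting any migration is a reasonable variation, since it surfaces configuration problems before the maintenance window is committed.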

 

Let’s take another scenario where you need to replace hardware that cannot be replaced concurrently. This requires shutting down the partitions and powering off the system. LPM helps in this case as well: you can migrate the partitions to another system, perform the necessary repair actions, and then bring the partitions back to the original system using LPM again.

See the blog "Where do I find LPM Documentation" to get started with LPM and to find details on best practices.

 

You can also use LPM for workload balancing, to keep the partitions performing at optimal levels. If you are using PowerVC, the Dynamic Resource Optimizer feature lets you set preferences so that PowerVC migrates partitions automatically based on server CPU utilization.
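To make the idea of utilization-driven migration concrete, here is a toy sketch of a threshold policy. It is not how the Dynamic Resource Optimizer is implemented and it does not call any PowerVC API; the function name, host names, and the 0.80 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def suggest_migrations(
    host_cpu: Dict[str, float], threshold: float
) -> List[Tuple[str, str]]:
    """Pair each host above the CPU threshold with the least-utilized host.

    host_cpu maps a host name to its CPU utilization as a fraction (0.0 to 1.0).
    Returns (busy_host, destination_host) pairs; empty if no host has headroom.
    """
    busy = sorted(h for h, u in host_cpu.items() if u > threshold)
    candidates = sorted((u, h) for h, u in host_cpu.items() if u <= threshold)
    if not candidates:
        return []
    coolest = candidates[0][1]
    return [(host, coolest) for host in busy]


if __name__ == "__main__":
    # Hypothetical utilization figures, for illustration only.
    usage = {"hostA": 0.92, "hostB": 0.35, "hostC": 0.60}
    print(suggest_migrations(usage, threshold=0.80))
```

A real policy would also have to decide which partition to move and confirm that the destination has enough spare capacity, which this sketch leaves out.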

If you are not using a dual Virtual IO Server (VIOS) configuration and the VIOS needs to be shut down because of a hardware or software failure, you can use LPM to migrate all partitions, or just the critical ones, to another server and bring them back once the VIOS is up on the original server.

 

If you are using a dual VIOS environment and one of the VIOS partitions has failed due to a hardware or software failure, you can still use LPM to migrate the partitions. Useful information on LPM with an inactive source storage VIOS can be found in the LPM Enhancements in PowerVM 2.2.4 blog and the HMC migrlpar man page.

 

To reduce downtime during an outage, follow the best practices and enable multiple VIOS partitions to act as Mover Service Partitions (MSPs), which increases the resiliency and performance of LPM operations. For more details, refer to the blog on LPM Improvements in PowerVM 2.2.5.

 

Note: The PowerVM LPM/SRR Automation tool (YouTube Tutorial) is ideally suited for evacuating a system for planned maintenance and then restoring the partitions to the original system once the maintenance actions are complete.

Unplanned Outages:

Let’s now discuss unplanned outages, in which the hypervisor goes down or the system itself goes down completely. In such scenarios, you can restart the partitions on another system, provided the setup and enablement were done before the crash. PowerVM offers the Simplified Remote Restart (SRR) feature for this purpose. Bringing the failed server back up can take a long time, so remote restart is typically a faster way to re-provision the partitions than restarting the crashed server and then its partition(s). The partitions can instead be restarted automatically on another server, and once the crashed system returns to an operating state, LPM can be used to return the partitions to the original system.

SRR can be enabled easily, either at the time of partition creation or at runtime. The configuration requirements for SRR are very similar to those for LPM. Simplified Remote Restart was introduced on POWER8 systems with system firmware level FW 820, HMC V8 R8.2.0, and VIOS 2.2.3.4 or later. If you are using PowerVC, Remote Restart is supported from PowerVC 1.2.3 onwards.
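The enable, restart, and clean-up steps of an SRR flow can be sketched in the same style as an HMC command planner. This is an illustrative sketch, not a definitive procedure: it only builds command strings and never executes them. It assumes the HMC chsyscfg attribute simplified_remote_restart_capable and the rrstartlpar operations validate, restart, and cleanup; the function names and the system and partition names are hypothetical, and the exact syntax should be confirmed against the man pages for your HMC level.

```python
from typing import List


def enable_srr_cmd(system: str, lpar: str) -> str:
    """Build the command that marks a partition as SRR capable."""
    return (
        f"chsyscfg -r lpar -m {system} "
        f'-i "name={lpar},simplified_remote_restart_capable=1"'
    )


def remote_restart_plan(failed: str, target: str, lpars: List[str]) -> List[str]:
    """Validate, then restart, each partition of the failed system on the target."""
    plan = []
    for lpar in lpars:
        plan.append(f"rrstartlpar -o validate -m {failed} -t {target} -p {lpar}")
        plan.append(f"rrstartlpar -o restart -m {failed} -t {target} -p {lpar}")
    return plan


def cleanup_plan(failed: str, lpars: List[str]) -> List[str]:
    """Once the failed system is back, clean up the stale partitions on it."""
    return [f"rrstartlpar -o cleanup -m {failed} -p {lpar}" for lpar in lpars]


if __name__ == "__main__":
    # Placeholder names (assumptions); substitute your own systems and partitions.
    print(enable_srr_cmd("SystemA", "lpar1"))
    for cmd in remote_restart_plan("SystemA", "SystemB", ["lpar1"]):
        print(cmd)
    for cmd in cleanup_plan("SystemA", ["lpar1"]):
        print(cmd)
```

After clean-up, the partitions can be moved back to the original system with LPM once it is operating again.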

 

SRR for situations where the complete system has an outage (i.e., the system state appears as “No Connection” in the HMC) is supported from HMC V8 R8.5.0 onwards. For more details on HMC Remote Restart, refer to the Simplified Remote Restart White Paper and the SRR Enhancements and What’s new in HMC V8 R8.6.0 blogs.

 

PowerVC also supports Automated Remote Restart, where you can set a preference to restart partitions automatically on a server failure. Support for this feature was introduced in PowerVC 1.3.2.

 

PowerVC 1.3.1 introduced support for Remote Restart with NovaLink Managed Systems. Remote Restart support for complete system outage was added in PowerVC 1.3.1.2.


For more details, refer to the blogs on PowerVC Automated Remote Restart, PowerVC Remote Restart Deep Dive, and PowerVC Remote Restart with NovaLink.

 

In summary, PowerVM provides features to minimize the impact of planned and unplanned outages. Depending on your business needs, you can leverage any or all of them.

Contacting the PowerVM Team

Have questions for the PowerVM team or want to learn more? Follow our discussion group on LinkedIn (IBM PowerVM) or on IBM Community Discussions.


