
PowerVM Dynamic Platform Optimizer

By Pete Heyrman posted Tue June 16, 2020 01:09 PM


Most multi-socket servers experience Non-Uniform Memory Access (NUMA) effects that can affect performance. The PowerVM hypervisor minimizes these NUMA effects as much as possible when partitions are created. Over time, with the use of functions like Dynamic LPAR (DLPAR), the assignment of resources may no longer be optimal. The Dynamic Platform Optimizer (DPO) is a function provided as part of PowerVM that dynamically optimizes the resources assigned to active and inactive partitions to minimize these NUMA effects.

Overview of Dynamic Platform Optimizer

When a partition is created or deployed, the PowerVM hypervisor chooses the physical resources to assign to the partition. The hypervisor knows the locations of the DIMMs, processor sockets, I/O devices and so on, and uses this location information to optimize the resources assigned to partitions. For example, if a partition is defined with processing resources and memory that can fit within a single processor chip, the hypervisor will attempt to allocate the resources for the partition from a single chip. Some partitions might need more processors or memory than the capacity of a single chip, so the hypervisor will attempt to contain the partition within a Dual-Chip Module (DCM) or a CEC drawer. Very large partitions might need to have resources allocated across multiple drawers. In addition to managing processors and memory, the PowerVM hypervisor, when running on POWER8 hardware, considers I/O devices when choosing the appropriate resources. For example, if you create a VIOS partition that owns a communications adapter and a Fibre Channel adapter, and these adapters are connected through a single processor chip, the hypervisor will attempt to assign the cores and memory from the chip where the I/O devices are connected. These are simple examples, but many customers have hundreds of partitions, many I/O devices, terabytes of memory and up to 32 processor chips to manage. Fortunately, the PowerVM hypervisor manages all the assignments of physical resources for a server.

Dynamic LPAR requests can change the assignment of resources for an active partition. Whenever it assigns resources to a partition, the hypervisor considers the resources the partition already has and attempts to use free resources from chips already in use by the partition. For example, if your server has only 2 chips with 4 cores in each chip, a 3-core partition is created, and a request is then made to add a core to the existing partition, the hypervisor would assign the last core from the current chip to the partition (assuming it wasn't already assigned to another partition). If there are no free resources available on the chips currently in use by the partition, the hypervisor is forced to use resources from another chip, another DCM or another drawer. When using these resources, there may be additional latency and bandwidth effects that affect the performance of the partition. For example, if memory is being added to a partition and that memory is allocated on a different drawer than the processors, every reference to the memory has to go through a drawer-to-drawer link in the server. Caches in the processors mitigate these effects, but sometimes these effects can lead to less than ideal performance.
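The placement preference described above can be illustrated with a small Python sketch. This is not the hypervisor's algorithm, only a toy model of the 2-chip, 4-core example; the function name, the dictionaries keyed by chip number and the tie-breaking rules are all assumptions made for illustration.

```python
from typing import Dict, Optional


def pick_chip_for_new_core(free_cores: Dict[int, int],
                           partition_cores: Dict[int, int]) -> Optional[int]:
    """Pick the chip that supplies one more core for a partition.

    free_cores:      chip number -> unassigned cores on that chip
    partition_cores: chip number -> cores the partition already owns there
    Returns a chip number, or None if the server has no free core.
    (Hypothetical model; the real hypervisor considers far more.)
    """
    # First choice: a chip the partition already uses that still has a free core.
    local = [chip for chip, owned in partition_cores.items()
             if owned > 0 and free_cores.get(chip, 0) > 0]
    if local:
        # Assumed tie-break: the chip where the partition owns the most cores.
        return max(local, key=lambda chip: partition_cores[chip])

    # Otherwise the core must come from another chip (a remote allocation).
    remote = [chip for chip, free in free_cores.items() if free > 0]
    return min(remote) if remote else None


# The example from the text: a 3-core partition on chip 0 of a 2 x 4-core server.
print(pick_chip_for_new_core({0: 1, 1: 4}, {0: 3}))  # chip 0 still has a free core
# If chip 0 is full, the extra core has to come from chip 1.
print(pick_chip_for_new_core({0: 0, 1: 4}, {0: 4}))
```

The second call shows the situation DPO is meant to repair later: the partition ends up with cores on two chips even though it began on one.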

Correcting these cross-chip and cross-drawer allocations can be handled by the Dynamic Platform Optimizer (DPO). The optimization is composed of two steps: a planning step followed by the actual optimization step. During the planning step, the hypervisor develops a plan of which physical resources should be assigned to each partition. In the optimization step, the hypervisor switches the assignment of physical resources. Processor reassignment can happen very quickly because the hypervisor already virtualizes the register contents of partitions. Memory reassignment involves copying the memory contents from one DIMM to another DIMM. The hypervisor uses Power hardware assists to ensure a coherent image of the data is maintained for the active partition. This copying can take a while to complete, depending on how much the new plan differs from the current system configuration.

When should DPO be performed
The Dynamic Platform Optimizer should be used when there is a performance concern that can be traced to NUMA effects. One way to determine whether you are being affected is to collect operating system performance reports and look for remote memory accesses. The more remote memory accesses that are occurring, the more likely a re-optimization will improve application performance.

Another way to determine how resources are assigned to partitions is the lsmemopt HMC Command Line Interface (CLI) command. The lsmemopt command uses a 100-point scoring system where 100 is the ideal assignment of resources and zero is the worst. Within lsmemopt there are two levels of scoring (system-wide versus individual partition scoring) and two scoring techniques (current score of the server or partitions versus predicted score if you were to perform DPO).

Partition-based scores ( lsmemopt -r lpar )
When lsmemopt calculates an individual partition score, the partition is scored based on how close it is to the ideal assignment of resources. For example, if you have a partition whose processor and memory resources could be contained within an individual chip, and the partition is actually consuming resources from only one chip, the score will be close to 100. If, on the other hand, the resources for this partition were spread across two different chips instead of being assigned from a single chip, the score will be close to 50. The worst possible assignment of resources would be a situation where the processors come from one chip and the memory from a different chip; in this situation the partition score would be zero.

System-wide score ( lsmemopt -r sys )
The hypervisor uses a weighted average of all of the partitions on the server when reporting a single system-wide score.  The more processors and memory that are assigned to a partition, the greater the partition’s score contributes to the overall system-wide score.  Depending on the configuration of the hardware and the configuration of the partitions, it may not be possible to ever achieve an overall score of 100.
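To see why a large partition dominates the system-wide score, consider the following Python sketch of a weighted average. The exact weighting the hypervisor uses is not documented here; weighting each partition by its core count plus its memory in GB is purely an assumption for illustration, as are the function name and the tuple layout.

```python
from typing import List, Tuple


def system_wide_score(partitions: List[Tuple[float, float, float]]) -> float:
    """Weighted average of partition scores.

    Each entry is (score, cores, memory_gb). The weight cores + memory_gb
    is a hypothetical choice; the real hypervisor weighting may differ.
    """
    total_weight = sum(cores + mem for _, cores, mem in partitions)
    if total_weight == 0:
        raise ValueError("no partition resources to weight by")
    weighted_sum = sum(score * (cores + mem) for score, cores, mem in partitions)
    return weighted_sum / total_weight


# A large, poorly placed partition and a small, well placed one.
example = [
    (50.0, 16, 240),   # score 50, 16 cores, 240 GB
    (100.0, 2, 14),    # score 100, 2 cores, 14 GB
]
print(round(system_wide_score(example), 1))  # 52.9, far below the plain mean of 75
```

Under this assumed weighting, improving the placement of the large partition moves the system-wide score far more than anything done to the small one.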

Current score ( lsmemopt -o currscore )
The current score option reports the score based on the current resources assigned to the individual partitions. As mentioned, this can be reported on an individual partition basis or as a single system-wide score. The following is sample output from a currscore request:

[Image: PowerVM DPO currscore sample output]

Calculated score ( lsmemopt -o calcscore )
The calculated score option estimates what the score would be if a DPO operation were initiated. This option performs the first step in the DPO process, where an overall plan is developed of which resources would be assigned to which partition. All the partitions are then scored with the assumption that all these resources can be reassigned. This allows for a prediction of whether initiating a DPO operation will actually improve the resources assigned to the various partitions and the system overall. Individual partition scores or a single system-wide score can be reported with the calcscore option. The following is sample output from a calcscore request:

[Image: PowerVM DPO calcscore sample output]
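One way to use the two kinds of score together is to compare them per partition and look at the predicted gain. The Python sketch below assumes the scores have already been collected into dictionaries keyed by partition name; the partition names, the function name and the 10-point threshold are hypothetical and not part of any HMC interface.

```python
from typing import Dict, List


def partitions_worth_optimizing(current: Dict[str, int],
                                predicted: Dict[str, int],
                                min_gain: int = 10) -> List[str]:
    """Return partitions whose predicted score beats the current score
    by at least min_gain points, largest gain first.

    min_gain is an arbitrary illustrative threshold, not an IBM guideline.
    """
    gains = {name: predicted[name] - current[name]
             for name in current if name in predicted}
    candidates = [name for name, gain in gains.items() if gain >= min_gain]
    return sorted(candidates, key=lambda name: gains[name], reverse=True)


# Hypothetical scores for three partitions.
curr = {"db01": 50, "app01": 90, "vios1": 100}
pred = {"db01": 100, "app01": 95, "vios1": 100}
print(partitions_worth_optimizing(curr, pred))  # ['db01']
```

A large predicted gain only says that the resource assignment would improve; how much application performance improves still depends on the workload.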

Using the score
The score is a representation of how close a partition or the server is to the optimal assignment of resources. It can be used as a guide when deciding whether to run DPO, but it cannot be used as a measurement of how much performance will improve if you optimize the resource assignment with DPO. Some applications have a small memory footprint, such that the caches in the processors are able to hold a large amount of the data referenced by the application. Applications like this can run well even if the resources are spread across multiple chips. On the other hand, some applications that reference large amounts of data can be very sensitive to the resources assigned to the partition. For these applications, remote references to memory can noticeably affect performance. The best use of the lsmemopt scoring is to track the scoring and performance over time. If you notice that the performance of a partition or the server has degraded over time, compare the previous score to the current score to determine whether a change in resource assignment could be affecting performance.

Understanding the parameters for the HMC optmem and lsmemopt commands
The optmem HMC CLI command initiates a DPO operation on the server. The lsmemopt and optmem commands have many common parameters that allow you to select how much of the server is optimized. By default, if you do not specify parameters to subset the request, all of the resources in the server and all of the partitions in the server will participate in the optimized solution.

Partition exclusion parameters (-x and --xid)
The -x and --xid parameters allow partitions to be excluded from the optimized solution. If a partition is excluded, it will not be optimized and will keep the same resources that were assigned before the DPO operation was initiated. If neither of the exclusion parameters is specified, all partitions and all resources in the server will be used in the optimization. Specifying partitions to be excluded can be useful if important partitions already have good resource assignment and there is no reason to optimize them. Be careful about putting too many partitions in the exclusion list, because the resources assigned to these partitions cannot be used to improve the resources assigned to the partitions being optimized. An extreme example would be excluding all but one partition in order to optimize that single partition. In this situation, only the resources already assigned to this single partition and the unassigned resources in the server would be used to generate the optimized plan, which may result in little improvement in the resource assignments. As mentioned before, you can use the calcscore option (-o calcscore) in conjunction with the exclusion parameters to predict the score.

The -x and --xid parameters work exactly the same way; they just provide two different methods of specifying the partitions to be excluded. Use -x if you prefer to identify partitions by their partition name. The --xid parameter allows partitions to be identified by partition ID (number). Note that the --xid parameter allows a range of partition IDs to be excluded; for example, --xid 5-10 would prevent optimization of partitions 5, 6, 7, 8, 9 and 10.
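The range notation can be made concrete with a short Python sketch that expands a specification such as 5-10 into the individual partition IDs it covers. The HMC does this parsing itself; the helper below is hypothetical and only shows what the range means. Whether single IDs and ranges may be mixed in one comma-separated list on the HMC is an assumption here.

```python
from typing import List


def expand_partition_ids(spec: str) -> List[int]:
    """Expand a specification such as "5-10" into a sorted list of IDs.

    Comma-separated mixes like "1,5-7" are accepted by this sketch, but
    that the HMC accepts them too is an assumption, not a documented fact.
    """
    ids = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            low_text, high_text = item.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"invalid range: {item}")
            ids.update(range(low, high + 1))  # ranges are inclusive
        else:
            ids.add(int(item))
    return sorted(ids)


print(expand_partition_ids("5-10"))  # [5, 6, 7, 8, 9, 10]
```

Expanding the list before running optmem is a simple way to double-check that a range does not accidentally exclude a partition you intended to optimize.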

Partition ordering parameters (-p and --id)
The optmem and lsmemopt commands provide an override to the default way the PowerVM hypervisor prioritizes resource assignment. If these partition ordering parameters are not specified, the hypervisor relies on its built-in prioritization. These rules order dedicated partitions ahead of shared partitions, partitions with higher uncapped weights ahead of partitions with lower uncapped weights, partitions with more processors and memory ahead of partitions with fewer resources, and so on. If there is a reason to override the default ordering, the -p and --id parameters can be used to move partitions to the front of the optimization list. Only a subset of the partitions being optimized needs to be specified. For example, if you want to optimize all the partitions on the server but want partition 5 to be given the best choice of resources and partition 7 the second best, you would specify optmem --id 5,7.
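The effect of the ordering parameters can be sketched in Python. The model below treats the three rules named above as a simple lexicographic sort key and places any explicitly listed partitions at the front; the real hypervisor prioritization includes further rules and may combine them differently, so the Partition fields, the sort key and the function name are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Partition:
    lpar_id: int
    dedicated: bool        # dedicated-processor partition?
    uncapped_weight: int   # 0 for dedicated or capped partitions (assumed)
    cores: float
    memory_gb: float


def optimization_order(partitions: Sequence[Partition],
                       priority_ids: Sequence[int] = ()) -> List[int]:
    """Return partition IDs in the order they would be optimized.

    priority_ids plays the role of the --id list: those partitions move
    to the front, in the order given. Everything else follows the assumed
    default rules.
    """
    by_id = {p.lpar_id: p for p in partitions}
    # dict.fromkeys drops duplicate IDs while keeping the given order.
    front = [by_id[i] for i in dict.fromkeys(priority_ids) if i in by_id]
    front_ids = {p.lpar_id for p in front}
    rest = sorted(
        (p for p in partitions if p.lpar_id not in front_ids),
        # Dedicated first, then higher uncapped weight, then larger size.
        key=lambda p: (not p.dedicated, -p.uncapped_weight,
                       -(p.cores + p.memory_gb)),
    )
    return [p.lpar_id for p in front + rest]


lpars = [
    Partition(1, True, 0, 8, 64),
    Partition(5, False, 128, 2, 16),
    Partition(7, False, 64, 4, 32),
    Partition(9, False, 128, 4, 64),
]
print(optimization_order(lpars))          # [1, 9, 5, 7]
print(optimization_order(lpars, [5, 7]))  # [5, 7, 1, 9]
```

The second call mirrors the --id 5,7 example: partitions 5 and 7 get first choice of resources, and the remaining partitions keep their default relative order.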

When a DPO request is performed, at the start of the optimization all of the partitions are put in a special hardware mode that allows the hypervisor to seamlessly add and remove resources from the partition. There can be some performance effects when running in this mode. One by one, the hypervisor optimizes the assigned resources from the highest priority partition to the lowest priority partition. Once all the desired resources have been assigned to an individual partition, the partition exits the special hardware mode and performance should return to usual levels. Similar to the exclusion parameters, the -p and --id parameters provide a choice in how the partitions are identified for the ordering. The -p option identifies partitions by name and --id identifies partitions by partition number.

Performance tips
One technique for speeding up a DPO operation is to power off any unneeded partitions before starting it. Since the majority of the time to perform DPO is spent copying memory, powering off a partition allows the hypervisor to skip copying that partition's memory contents.

Another technique to speed up the DPO time and provide a better optimized solution is to have unlicensed (dark) resources on your server. Unlicensed processors and memory allow the hypervisor more degrees of freedom in developing a solution. When creating the plan for DPO, the hypervisor considers all resources as licensed, creates the plan, and then un-licenses the appropriate resources when completing the DPO operation. Having unlicensed resources can improve the overall resource assignments and therefore performance across the partitions and the server.

In summary, there are times where the dynamic platform optimizer can be utilized to improve the performance of Power servers.  Understanding the basic function and utilizing the various options allows the user to tailor dynamic resource optimizations to meet their requirements.

Contacting the PowerVM Team

Have questions for the PowerVM team or want to learn more?  Follow our discussion group on LinkedIn IBM PowerVM or IBM Community Discussions