Co-author: Sachin Sant
Introduction
The IBM Power family of scale-out and scale-up servers includes workload consolidation platforms that help clients control costs and improve overall performance, availability, and energy efficiency. For more information about Power servers, see IBM Power servers.
In this blog, we introduce the concept of resource groups, a new feature of the IBM PowerVM hypervisor running on IBM Power11 servers, along with their configuration and setup. The blog also explains some of the important Linux kernel scheduler and scalability changes introduced in Linux releases that are supported on IBM Power servers.
IBM PowerVM virtualization
Before we jump into the resource group concept, here is a quick recap of some of the existing PowerVM features.
PowerVM is an enterprise-class virtualization solution that provides secure, flexible, and scalable virtualization for Power servers. PowerVM enables logical partitions (LPARs) and server consolidation. Clients can run IBM AIX, IBM i, and Linux operating systems on IBM Power servers with world-class reliability, high availability (HA), and serviceability capabilities, together with the leading performance of the Power platform. This solution provides workload consolidation that helps clients control costs and improve overall performance, availability, flexibility, and energy efficiency. Power servers, combined with PowerVM technology, help consolidate and simplify your IT environment.
Shared processors are physical processors whose processing capacity is shared among multiple LPARs. The ability to divide physical processors and share them among multiple LPARs is known as Micro-Partitioning. Partitions with shared processors are either capped or uncapped:
- A capped partition can use CPU cycles only up to its entitled capacity. Excess cycles are ceded back to the shared pool.
- Uncapped partitions share unused capacity based on a user-defined weight in the range 0 to 255. If an uncapped partition needs extra CPU cycles, it can use unused capacity in the shared pool.
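To make the uncapped weighting concrete, here is a minimal sketch of how spare capacity could be divided proportionally to partition weights. The LPAR names and the function itself are illustrative assumptions; the hypervisor's actual dispatch algorithm also accounts for each partition's entitlement and momentary demand.

```python
def distribute_excess(excess_units, weights):
    """Divide spare processing units among uncapped LPARs in
    proportion to their weights (0-255). Illustrative sketch only."""
    total = sum(weights.values())
    if total == 0:
        # A weight of 0 makes a partition effectively capped.
        return {name: 0.0 for name in weights}
    return {name: excess_units * w / total for name, w in weights.items()}

# Two uncapped LPARs compete for 2.0 spare processing units:
shares = distribute_excess(2.0, {"lpar1": 128, "lpar2": 64})
```

With weights 128 and 64, `lpar1` receives twice the share of spare capacity that `lpar2` does.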
By default, all physical processors that are not dedicated to specific LPARs are grouped in a shared processor pool. A shared processor pool is a PowerVM technology that you can use to control the amount of processor capacity that partitions can use from the available physical processors in the system. This capability isolates workloads in a shared processor pool and prevents the workload from exceeding an upper limit. This capability is also useful for software license management, where sub-capacity licensing is involved.
Shared processors provide the following benefits:
- Configuration flexibility
- Excess capacity that might be used by other partitions
- Time-sliced sub-processor allocations dispatched according to the demand and the entitled capacity
- Reduced licensing cost and increased production throughput
Up to 64 shared processor pools can be defined on Power servers. A default shared processor pool is automatically defined in the managed system. Each shared processor pool has a maximum processing unit value that is associated with it, as shown in Figure 1.
Figure 1: Multiple shared processor pools
Shared processor pools are also available on IBM Power Virtual Server, where they minimize software licensing costs and increase production throughput. Many software providers charge based on the number of available processors. Shared processor pools enable users to group applications and define a maximum number of processing units for each pool.
For more information about the benefits of shared processor pools on Power Virtual Server, see Reduce licensing costs and increase production throughput with shared processor pools.
IBM Power11 and resource groups
In a shared environment, such as a cloud or multi-tenant system, some users or applications can experience performance degradation due to imbalanced resource consumption by others.
With the introduction of Power11 processor-based servers, support for resource groups (also known as physical processor pools) is introduced. Resource groups can provide:
- Workload isolation across multiple shared and dedicated partitions
- Noisy neighbor isolation
- Improved proximity of multi-tiered workloads
A resource group is defined as a pool of cores that is isolated from other resource groups in the system. In the current design, the entire system is a single physical processor pool with dedicated VMs, the default shared pool, and defined shared pools that are all within the system-wide physical processor pool.
The resource group feature introduces the ability to define multiple groups on the server. A defined resource group isolates the cores assigned to it from other defined resource groups and from the existing default resource group. From a CPU resource perspective, shared processor virtual machines (VMs) and shared pools are now contained within the physical pool that they are assigned to.
Each physical pool has its own virtual shared processor pool. Dedicated VMs can also be assigned to a physical pool so that they can contribute their CPU resources to shared processor VMs when the VMs are marked as shared, regardless of their active or inactive status.
Figure 2: Simplified resource group configuration illustration
As shown in Figure 2, resource groups can:
- Provide isolation for assigned cores
- Contain both dedicated and shared processor partitions
- Improve workload isolation
The resource group and Linux subsystems work together to improve performance and reduce system noise, enabling efficient workload execution in isolated environments.
The PowerVM hypervisor and the Linux scheduler collaborate to deliver optimal enterprise-grade performance for diverse workloads on IBM Power servers. Let's explore some of the recent updates to the Linux scheduler on IBM Power.
Linux scheduler on Power and scalability enhancements
Linux on IBM Power is a robust and adaptable computing platform that seamlessly merges the open-source flexibility of Linux with the power, reliability, and scalability of IBM's Power architecture.
IBM Power servers provide unique capabilities to run enterprise Linux distributions with a fully open stack that benefits from the OpenPOWER ecosystem and efficient cloud-native performance through PowerVM virtualization technology. These servers amplify the reliability, security, and scalability of open-source technology with industry-leading, cloud-native deployment options. Enterprise Linux on IBM Power provides a solid foundation for your open-source hybrid cloud infrastructure, empowering you to modernize applications more efficiently.
Refer to Enterprise Linux on Power to learn how Linux can help in your modernization journey.
Refer to Linux distributions and virtualization options for Power11 servers for details on supported Linux releases on IBM Power11 servers.
Para-virtualized queued spinlock implementation
The queued spinlock is a synchronization primitive used in the Linux kernel to improve scalability and fairness and to reduce cache contention while waiting for a resource to become available. It is a type of spinlock in which waiting threads form a queue and each spins on its own queue node, rather than all threads spinning on the same lock word.
The generic implementation was not optimized for the IBM PowerPC architecture and led to high latency, CPU starvation, and even hard lockups under heavy contention on large systems (up to 256 cores). The key problems stemmed from inefficient load/store operations, excessive coherency traffic, and the lack of mechanisms to yield to the lock owner in oversubscribed environments. These inefficiencies made it difficult for large IBM Power servers to maintain low latency and fair CPU access, especially under virtualized (paravirt) scenarios.
Figure 3: Generic queued spinlock problem
To resolve these issues, the corresponding Linux kernel code was enhanced with a qspinlock implementation specific to PowerPC. PowerPC's LL/SC-style atomics, using the larx/stcx instructions, can express more complex operations, thereby improving coherency performance. The implementation also reduces the number of exclusive stores to the lock word, cutting down on coherency probes that can hurt scalability. The unlock mechanism is changed from atomic to non-atomic, lowering overhead for paravirtualized locks. Moreover, it adds support for recording the CPU ID of the lock owner, allowing waiting CPUs to yield directly to the owner, an essential feature for minimizing latency in oversubscribed environments. Finally, the new code gives tighter control over lock stealing, which reduces unfairness and contention.
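The queueing idea behind qspinlock can be sketched with a simplified MCS-style lock: each waiter spins on its own node instead of the shared lock word, so handoff is FIFO and coherency traffic on the shared word stays low. This is an illustrative sketch, not the kernel's implementation; a small internal lock stands in for the atomic exchange that larx/stcx would provide on PowerPC.

```python
import threading

class MCSLock:
    """Simplified MCS-style queued lock: waiters form a linked list
    and each spins/parks on its own node, minimizing contention on
    the shared tail pointer. Illustrative sketch only."""

    class Node:
        __slots__ = ("ready", "next")
        def __init__(self):
            self.ready = threading.Event()
            self.next = None

    def __init__(self):
        self._tail = None
        # Stand-in for an atomic exchange (larx/stcx on PowerPC).
        self._swap = threading.Lock()

    def acquire(self, node):
        node.next = None
        with self._swap:
            prev, self._tail = self._tail, node
        if prev is not None:
            prev.next = node      # link behind the previous waiter
            node.ready.wait()     # wait on our own node only

    def release(self, node):
        with self._swap:
            if self._tail is node:
                self._tail = None  # no waiters: lock is free
                return
        while node.next is None:
            pass                   # successor is still linking itself in
        node.next.ready.set()      # FIFO handoff to the next waiter
```

Because each waiter watches only its own node, releasing the lock invalidates one waiter's cacheline rather than every spinner's, which is the scalability property the PowerPC qspinlock work exploits.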
Figure 4: PowerPC qspinlock implementation
As a result, IBM Power servers—especially large-scale or virtualized ones—stand to benefit from lower scheduling latency, fewer starvation issues, and improved scalability. The reduced coherency traffic and more efficient locking mechanisms lead to better CPU utilization and system reliability.
Core level asymmetrical packing enablement
PowerVM–based IBM Power servers configured in shared processor mode present unique scheduling challenges for the Linux kernel. In these environments, the kernel has limited visibility into the physical hardware topology due to the abstraction introduced by the hypervisor. For example, in shared LPAR configurations, certain topology details—such as device-tree properties—may be unavailable, which prevents the kernel from fully understanding the underlying core layout.
Additionally, IBM Power servers often operate in an over-provisioned state, where the number of virtual CPUs (vCPUs) presented to the kernel exceeds the number of underlying physical processor cores. This can lead to suboptimal task placement decisions by the scheduler, especially under light or uneven system loads.
Figure 5: Scheduling challenges in PowerVM system
To improve scheduling behavior in these scenarios, the Linux kernel includes asymmetric packing enhancements specifically designed for PowerVM shared processor environments. These enhancements focus on aligning the kernel's scheduling strategy with known PowerVM characteristics. Asymmetric packing preferentially places tasks onto a lower-numbered set of physical cores when the system is lightly loaded, improving cache locality, reducing cross-core communication, and enhancing power efficiency.
The implementation also includes awareness of simultaneous multithreading (SMT) domains, allowing the scheduler to better group threads on sibling hardware threads within the same physical core. This further reduces scheduling mismatches and improves overall efficiency.
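The packing preference can be illustrated with a toy placement function: under light load, pick the lowest-numbered CPU that still has room, instead of spreading work to the idlest CPU. This is a hedged sketch of the idea only; the kernel's actual logic operates on scheduler domains and per-CPU load tracking, not a simple list.

```python
def pick_cpu(load, task_load, capacity=1.0):
    """Asymmetric-packing-style placement sketch: prefer the
    lowest-numbered CPU with spare capacity (packing), rather
    than the idlest CPU (spreading). Illustrative only."""
    for cpu, cpu_load in enumerate(load):
        if cpu_load + task_load <= capacity:
            return cpu
    # Every CPU is saturated: fall back to the least-loaded one.
    return min(range(len(load)), key=lambda c: load[c])
```

For example, a 0.3-load task offered CPUs with loads [0.2, 0.0, 0.0] lands on CPU 0, keeping CPUs 1 and 2 idle and available to be ceded back to the hypervisor.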
Figure 6: Core-level asymmetrical packing enablement
By aligning the Linux scheduler’s behavior with the underlying PowerVM scheduling model, the enhanced Linux code significantly improves performance, especially in underutilized or lightly loaded systems. It reduces cross-core communication, better utilizes cache locality, and minimizes power consumption. This leads to fewer context switches, lower latency, and better throughput for workloads running on PowerVM. In real-world use cases, such changes are particularly beneficial in data centers and cloud environments where VM over-provisioning is common.
Fix accuracy of steal time
On PowerVM hypervisor-based systems, stolen time (the time when a virtual processor (VP) is ready to run but not scheduled by the hypervisor) is tracked using fields in the virtual processor area, specifically enqueue_dispatch_tb and ready_enqueue_tb. These fields are updated by the hypervisor using the time base register, which runs at 512 MHz on PowerPC.
However, the Linux kernel timing infrastructure generally uses nanoseconds as its standard unit of time. This mismatch leads to inaccurate stolen time reporting, since raw time base tick values don’t match the nanosecond granularity expected by the kernel. To solve this, the Linux kernel employs a conversion function, tb_to_ns(), which scales the 512 MHz time base values to nanoseconds. This function ensures that the virtual processor area fields are correctly interpreted in the same time units as the rest of the kernel.
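The scaling involved is simple integer arithmetic, sketched below. Note this is an illustration of the unit conversion only; the kernel's tb_to_ns() avoids division by using a precomputed scale-and-shift pair.

```python
TB_FREQ_HZ = 512_000_000      # PowerPC time base frequency (512 MHz)
NSEC_PER_SEC = 1_000_000_000

def tb_to_ns(ticks: int) -> int:
    """Convert time base ticks to nanoseconds (integer sketch)."""
    return ticks * NSEC_PER_SEC // TB_FREQ_HZ
```

One full second of time base ticks (512,000,000) converts to exactly 1,000,000,000 ns, and 512 ticks correspond to one microsecond.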
The corresponding Linux kernel code was enhanced such that stolen time measurements are now accurate and consistent with the kernel's timekeeping mechanisms. Experimental validation using a Capped Shared Processor LPAR under heavy load (stress-ng at 100% virtual processor utilization) showed that the converted stolen time values match expected behavior, proving the patch effective. The benefit of this fix is improved accuracy in accounting for stolen time, which is crucial in virtualized environments. Accurate stolen time reporting helps system administrators and performance engineers better understand scheduling delays and over-subscription effects in shared processor environments. It can also lead to better CPU utilization insights, scheduling decisions, and performance tuning on PowerVM platforms.
True/false sharing in the scheduler
On large multi-core systems like IBM Power servers with 256 cores and SMT8, significant performance degradation was observed due to frequent access to shared fields in the Linux scheduler, particularly the overutilized and overload flags in the root_domain (rd) structure. These fields reside on the same cacheline, and even on platforms without energy-aware scheduling, unnecessary writes and reads to these fields (such as in enqueue_task_fair and newidle_balance) caused severe cacheline bouncing. This resulted in CPU cycles being wasted due to false sharing and frequent cache invalidation across cores, leading to inefficient scheduling and overall performance regression.
Figure 7: True / false sharing in scheduler
To address this, a series of code changes introduced two key optimizations. First, they prevent unnecessary writes to rd->overutilized (which is only needed on energy-aware scheduling platforms) and rd->overload (which is used on both energy-aware and non-energy-aware scheduling systems) by checking whether the value actually needs updating. Second, they refactor access to these flags through helper functions that abstract and control when these shared variables are touched, doing so only when strictly necessary: overload is updated regardless of energy-aware scheduling support, while overutilized usage is limited to energy-aware scheduling platforms.
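The core of the first optimization is a check-before-write pattern, sketched below with a toy stand-in for root_domain. The `writes` counter is an assumption added here purely to make the effect visible; in the kernel, each avoided store is an avoided cacheline invalidation across every core sharing that line.

```python
class RootDomain:
    """Toy stand-in for the scheduler's root_domain flags."""
    def __init__(self):
        self.overutilized = 0
        self.writes = 0   # proxy for cacheline invalidations

def set_rd_overutilized(rd, status):
    """Helper mirroring the kernel fix: store to the shared field
    only when its value actually changes."""
    if rd.overutilized != status:
        rd.overutilized = status
        rd.writes += 1
```

Calling the helper a thousand times with the same status performs exactly one store, whereas the unconditional write it replaces would have dirtied the shared cacheline on every call.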
As a result, although the raw throughput may not drastically improve, the system experiences lower latency, better scalability, and cleaner CPU usage profiles, especially under Independent Software Vendor (ISV) workloads and stress conditions. The improved CPU efficiency helps unlock the performance potential of large symmetric multiprocessing (SMP) systems by eliminating a subtle but impactful bottleneck in the kernel scheduler.
Conclusion
The IBM Power11 family introduces high-performance servers that deliver reliability, security, and flexibility you can count on. Power11 servers are built to deliver simplified, always-on operations with hybrid cloud flexibility for enterprises to maintain competitiveness in the AI era. With support for autonomous operations, Power11 delivers intelligent performance gains that reduce complexity and improve workload efficiency.
Power11 processor-based servers introduce support for resource groups. Among other benefits, resource groups provide multi-tenancy isolation in infrastructure and cloud provider environments. The resource group capabilities introduced with IBM Power11 provide:
- Improved workload isolation
- Scaling and efficiency consolidation
Additionally, a series of Linux kernel scalability and scheduler optimizations drives better efficiency and performance when running Linux on these new Power11-based enterprise servers. Linux on IBM Power11 servers supports up to 256 cores (2,048 logical CPUs), and these improvements enhance performance when running on such large systems.
Acknowledgement
We would like to extend heartfelt gratitude to team members Srikar, Shrikanth, Nysal, and Stuart for their invaluable comments and feedback. Their insights have significantly contributed to the improvement of this blog.