Power management is a trade-off between how quickly work gets done (performance) and how much power it takes to do that work (energy). The goal is to save power while optimizing performance under multiple system constraints such as thermal envelopes and power delivery limits. Every watt saved can be converted into either higher performance or lower energy consumption. This blog is the first in a series that describes the power management capabilities of the POWER9 processor.
The total power consumed by an electronic device is the sum of static power and active power. Static power, also known as DC power, is caused primarily by leakage from (a) current flowing to ground through 'imperfect switches' and (b) analog device termination. It can be approximated as voltage squared divided by resistance. Active power, also known as AC power, is caused primarily by wire and transistor switching from both clock and data transitions. It can be approximated as the product of capacitance, voltage squared, frequency and a switching factor.
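The two approximations above can be written down directly. The following is a minimal sketch, not a model of any real chip: the function names and all numeric values are made up for illustration, and only the formulas themselves come from the text.

```python
def static_power(voltage: float, resistance: float) -> float:
    """Approximate static (DC) power in watts as V^2 / R."""
    return voltage ** 2 / resistance


def active_power(capacitance: float, voltage: float,
                 frequency: float, switching_factor: float) -> float:
    """Approximate active (AC) power in watts as C * V^2 * f * switching factor."""
    return capacitance * voltage ** 2 * frequency * switching_factor


def total_power(voltage: float, resistance: float, capacitance: float,
                frequency: float, switching_factor: float) -> float:
    """Total power is the sum of the static and active terms."""
    return (static_power(voltage, resistance)
            + active_power(capacitance, voltage, frequency, switching_factor))


# Hypothetical operating point, chosen only to make the arithmetic easy to follow.
nominal = total_power(voltage=1.0, resistance=10.0, capacitance=1e-9,
                      frequency=3e9, switching_factor=0.2)
# Same circuit at 90% of the voltage: both terms scale with voltage squared.
reduced = total_power(voltage=0.9, resistance=10.0, capacitance=1e-9,
                      frequency=3e9, switching_factor=0.2)
print(f"nominal={nominal:.3f} W, reduced={reduced:.3f} W")
```

Because voltage appears squared in both terms, a modest voltage reduction lowers static and active power together, while a frequency reduction only affects the active term.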
The two primary principles used in the POWER9 processor for power management are:
- When something is not being used, turn it off. This is also called Idle State power management.
- When something is being utilized, slow it down (to save power) or speed it up (to maximize performance), based on the overall goals or policies set in the system.
In this blog we focus on Idle State power management in the POWER9 processor. It combines two key techniques to provide different levels of power saving, each with its own latency to enter and exit.
- Clock Gating: This technique saves Active Power by turning off the clocks in specific regions of the processor, so that those regions stop switching. It has low latency but offers limited power savings.
- Power Gating: This technique saves Static Power (leakage) by removing voltage from specific regions of the processor. It has higher latency, plus the overhead of restoring state at the deeper levels, but offers greater power savings.
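The difference between the two techniques can be sketched with a toy model of one processor region. This is an idealized illustration: the `Region` class, its fields and the wattages are hypothetical, and it assumes a power-gated region draws nothing at all, which real hardware only approximates.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Toy model of one gateable region (hypothetical, not POWER9 data)."""
    static_watts: float   # leakage drawn while voltage is applied
    active_watts: float   # switching power drawn while clocks are running
    clocked: bool = True
    powered: bool = True

    def power(self) -> float:
        if not self.powered:
            # Power gated: no voltage, so neither leakage nor switching.
            return 0.0
        if not self.clocked:
            # Clock gated: switching stops, leakage remains.
            return self.static_watts
        return self.static_watts + self.active_watts


running = Region(static_watts=2.0, active_watts=6.0)
clock_gated = Region(static_watts=2.0, active_watts=6.0, clocked=False)
power_gated = Region(static_watts=2.0, active_watts=6.0, powered=False)

for label, region in [("running", running),
                      ("clock gated", clock_gated),
                      ("power gated", power_gated)]:
    print(f"{label}: {region.power():.1f} W")
```

Clock gating leaves the region powered, so its state is retained and it can resume quickly. Power gating removes the leakage as well, at the cost of having to restore state on wakeup.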
A combination of clock gating and power gating at different granularities in the processor allows the POWER9 processor to define a number of unique and well-defined idle states. Idle states in the POWER9 processor architecture are known as STOP states. In theory, the POWER9 architecture supports 16 STOP states with increasing levels of power savings. The lowest-level STOP state is known as STOP0 and the highest-level STOP state is known as STOP15. This is analogous to Intel's C-states, where the number 'n' represents the depth of the idle state. Not all STOP states are implemented in POWER9. Which STOP states get implemented is governed by the energy efficiency they offer, the transition times and latency to enter and exit the state, the complexity involved and the energy management policy at system level. POWER9 implements the following subset of STOP states, each described later in this blog: STOP0, STOP1, STOP2, STOP4, STOP5 and STOP11.
Unlike the previous POWER8 architecture, the POWER9 processor supports only one instruction to enter an idle state. The POWER ISA (Instruction Set Architecture) has been revised to remove the legacy idle state instructions (nap, sleep, winkle), and a new instruction, 'stop', has been added for this purpose. Legacy code that uses the old instructions must replace them with a small sequence of instructions. To enter a valid STOP state, the Hypervisor or OS writes the targeted STOP state to a newly introduced Special Purpose Register, the PSSCR (Processor Stop Status and Control Register). It must also execute the 'stop' instruction on all threads. This simple hardware abstraction, via the POWER9 ISA, allows the Hypervisor or OS to request that the core and caches in the POWER9 chip automatically enter the deepest idle power saving state possible under the current operating conditions. Which STOP level to request, and when, is a choice made by the Hypervisor or OS based on a complex combination of system-wide power saving policies such as Idle Power Saver Mode, the performance and core requirements of the workload being run, and other system aspects such as thermal constraints, all of which are beyond the scope of this blog.
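The request flow described above (write the target level to the PSSCR, then execute 'stop' on every thread) can be sketched as a small software model. This is not hardware or firmware code and it rests on three assumptions that the text does not state: the requested level is taken to occupy the low four bits of the register value, the core is taken to have four threads, and the core is taken to settle at the shallowest level requested by any of its threads. All names are hypothetical.

```python
from typing import List, Optional

# Assumption for illustration: the requested STOP level lives in the low 4 bits.
RL_MASK = 0xF


def encode_requested_level(psscr: int, level: int) -> int:
    """Return a PSSCR-like value with the requested level set to `level`."""
    if not 0 <= level <= 15:
        raise ValueError("STOP level must be between 0 and 15")
    return (psscr & ~RL_MASK) | level


class Core:
    """Toy core: it is idle only when every thread has executed 'stop'."""

    def __init__(self, num_threads: int = 4) -> None:  # thread count is assumed
        self.requested: List[Optional[int]] = [None] * num_threads

    def stop(self, thread: int, psscr: int) -> None:
        """Model a thread executing 'stop' with the given PSSCR value."""
        self.requested[thread] = psscr & RL_MASK

    def wake(self, thread: int) -> None:
        """Model a thread waking up and running again."""
        self.requested[thread] = None

    def idle_level(self) -> Optional[int]:
        """Return the level the core can enter, or None if a thread is running."""
        if any(level is None for level in self.requested):
            return None
        # Assumption: the core can go no deeper than its shallowest request.
        return min(level for level in self.requested if level is not None)


core = Core()
for thread in range(3):
    core.stop(thread, encode_requested_level(0, 4))
print(core.idle_level())   # one thread is still running
core.stop(3, encode_requested_level(0, 2))
print(core.idle_level())   # all threads stopped
```

The point of the model is the last step: a single running thread keeps the whole core out of any STOP state, which is why the 'stop' instruction has to be executed on all threads.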
An Overview of STOP levels in POWER9:
The description below evaluates each STOP state, analyzing its composition, utility and best-case power saving. Please note that the latency numbers used in this blog are indicative and taken from idealized lab test setups. Measurements taken on a final product may vary marginally depending on the test setup, the instrumentation used, frequencies and other settings such as system policies.
In STOP0 the processor core stops dispatching instructions; however, the clock and power for the core and cache remain on. STOP0 saves about 15-20% of core power. The core can exit the STOP0 state and resume with the next instruction.
STOP1 is the STOP state most widely used by PowerVM. It is equivalent to the legacy "Nap" state but is more power efficient. STOP1 saves about 25% of the power consumed by a core. Entering and exiting STOP1 is very fast: both entry and exit are handled completely in hardware, with no intervention required from firmware. In this STOP state, both core and cache remain powered on, but the units of the core responsible for instruction fetch and execution are clocked off. Hence, the core stops executing instructions but remains accessible to the service processor and other service engines within the processor chip. It takes a few microseconds to exit the STOP1 state, and the exit does not require assistance from any of the power management on-chip offload engines.
STOP2 is the STOP state most widely used by Linux. It is equivalent to the legacy "Fast Sleep" state and saves about 50% of the power consumed by the core. In this state, both core and cache remain powered, but the core becomes inaccessible to the service processor and other power management service engines. Because of its simplicity, STOP2 is very efficient to enter and exit. It involves a small intervention from an on-chip offload engine called the Core Management Engine (CME). It takes about 9 microseconds to exit the STOP2 state. On PowerVM based systems, both STOP1 and STOP2 can be requested directly by the OS with no involvement from the Hypervisor.
STOP4 is equivalent to the legacy "Deep Sleep" state. In this STOP state, the entire core is powered off. As a result, there is some state loss. However, timing facilities are maintained. This differs from POWER8, where the timing facility was not preserved in the equivalent idle state. Because a core entering STOP4 or a deeper STOP state loses state, entry is performed by the Hypervisor, and the Hypervisor or equivalent must restore the state when waking the core. On the wakeup path, after the core is powered up, it has to go through lengthy initialization and state restoration steps. STOP4 wakeup latency is in the range of 250 - 350 microseconds. The exact latency depends on a number of factors such as the frequency of operation and the state of the core and cache at the time of the request. The STOP4 transition is managed by firmware. As a result, exiting STOP4 takes significantly more time than exiting STOP1 or STOP2. However, STOP4 saves about 60% of the power consumed by the core. If the energy management algorithms see no real work for a core for a significant period of time, the core is ideally suited for STOP4.
STOP5 is a commonly used STOP state for saving power on PowerVM/AIX based systems. Compared to STOP4, STOP5 does not save any additional power. Instead, it brings Workload Optimized Frequency (WOF) into the picture. WOF enables the cores that are still running to run at a higher frequency by using the energy saved from the powered-down core. This lets the running cores maximize performance while maintaining thermal stability. WOF uses a complex algorithm and requires coordination between multiple power management service engines, which makes STOP5 entry and exit latency higher than that of STOP4.
STOP11 is the deepest STOP state implemented in POWER9. It saves the entire core and cache power. STOP11 is a quad-level STOP state, which means a standalone core cannot enter STOP11. A quad can enter STOP11 only if all cores within the quad can enter STOP5. This allows the two L2 units, each shared by a core pair, and the L3 shared by all four cores to be powered off. As a prerequisite, all power management operations (service engines, P-State, WOF) are turned off on the quad. These actions involve coordination between multiple power management service engines. As a consequence, it takes a minimum of 75 microseconds for a quad to enter STOP11. On the wakeup path, the cache and cores need to be powered and clocked, the various service engines need to be restarted, and the cores and cache need to be reinitialized. It therefore takes a minimum of 5 ms for a quad to wake up from STOP11. This long recovery time makes STOP11 inappropriate for run-time operation. Bare-metal Linux systems do not use STOP11. PowerVM based systems use STOP11 for product features such as un-licensing cores and capacity on demand. It is also used in features such as concurrent code initialization and memory preserving initial program load (MPIPL).
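To show how the latency and power-saving figures above trade off, here is a minimal sketch of picking the deepest STOP state whose exit latency fits a given wakeup budget. It is not the policy used by PowerVM or Linux. The table uses only the indicative numbers quoted in this blog, with a few simplifications that are assumptions: STOP0 is given an exit latency of zero, the lower bound of its 15-20% saving is used, STOP4 uses the upper end of its 250 - 350 microsecond range, and STOP1 and STOP5 are left out because no numeric exit latency is given for them. The names `StopState` and `pick_stop_state` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class StopState(NamedTuple):
    name: str
    exit_latency_us: float    # indicative exit latency in microseconds
    core_power_saving: float  # fraction of core power saved
    runtime_use: bool         # suitable for run-time idle management


# Indicative figures from the text; see the lead-in for the simplifications.
STOP_STATES: List[StopState] = [
    StopState("STOP0", 0.0, 0.15, True),      # resumes with the next instruction
    StopState("STOP2", 9.0, 0.50, True),
    StopState("STOP4", 350.0, 0.60, True),    # upper end of 250 - 350 us
    StopState("STOP11", 5000.0, 1.00, False), # quad level, not for run time
]


def pick_stop_state(max_exit_latency_us: float) -> Optional[StopState]:
    """Return the run-time state with the highest saving that wakes in time."""
    candidates = [state for state in STOP_STATES
                  if state.runtime_use
                  and state.exit_latency_us <= max_exit_latency_us]
    if not candidates:
        return None
    return max(candidates, key=lambda state: state.core_power_saving)


for budget_us in (5, 100, 1000, 10000):
    choice = pick_stop_state(budget_us)
    print(f"budget {budget_us} us -> {choice.name if choice else 'none'}")
```

Even with a 10 ms budget the sketch never returns STOP11, which mirrors the point above that its recovery time and quad-level scope make it unsuitable for run-time operation.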
- Prem Shanker Jha. (Advisory Engineer, POWER Firmware Development. firstname.lastname@example.org)
- Amit J Tendolkar. (Senior Engineer, POWER Firmware Development. email@example.com)