One of the most common requirements for performance tuning is to see the stacks of a program during a slowdown. By understanding where the program commonly executes, we can infer what to tune or which code to optimize. This post is a deep dive into what a program stack is and how it's gathered. It also covers the importance of frame pointers in creating stacks, why they have historically been disabled, and a recent shift in the industry toward re-enabling them that will hopefully open new doors in performance tuning in the coming years. These concepts apply to all operating systems, but Linux will be used as an example.
Theory
The focus of this post is on native thread stacks, which are commonly gathered through native sampling profilers such as Linux perf. There are also program-generated thread stacks such as Java thread dumps. Such thread dumps are often sufficient for basic performance tuning and they don't suffer from the frame pointer issues discussed here; however, there are many cases where such thread dumps either don't see the whole picture or can't be gathered quickly enough, so this topic is still important for performance tuning of applications that produce their own thread dump output.
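For example, with a HotSpot-based JVM, a Java thread dump may be requested with standard tooling (the PID 12345 below is hypothetical; note that OpenJ9-based JVMs instead write a javacore file on SIGQUIT):
jstack 12345      # print a Java thread dump with the JDK's jstack utility
kill -3 12345     # or send SIGQUIT, which makes HotSpot print a dump to its stdout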
Thread Stacks
A program is usually made of functions, which are logical groups of instructions with inputs and outputs. In the following example, the program starts in the main function and calls the getCubeVolume function. The getCubeVolume function calculates the volume and returns it to the main function, which then prints the calculation along with some text:
public class Program {
    public static void main(String[] args) {
        System.out.println("Hello World! The volume of a 3x3x3 cube is: " + getCubeVolume(3));
    }

    public static int getCubeVolume(int sideLength) {
        return sideLength * sideLength * sideLength;
    }
}
When getCubeVolume is ready to return its result, it needs to know how to go back to the main function at the point where getCubeVolume was called. A program stack is used to manage this relationship of function executions. A stack is a data structure in computer science that has push and pop operations. Pushing something onto a stack puts an item on top of all existing items in the stack. Popping something off of a stack removes the top item in the stack.
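As a minimal sketch of these two operations in C (the fixed capacity and array representation are illustrative choices, not requirements):
#include <stdio.h>

#define CAPACITY 16

static int items[CAPACITY];
static int top = -1; /* -1 means the stack is empty */

/* Push: place a new item on top of all existing items. */
static void push(int value) {
    if (top < CAPACITY - 1)
        items[++top] = value;
}

/* Pop: remove and return the top item (-1 here signals an empty stack). */
static int pop(void) {
    return top >= 0 ? items[top--] : -1;
}

int main(void) {
    push(1); push(2); push(3);
    printf("%d %d %d\n", pop(), pop(), pop()); /* prints: 3 2 1 */
    return 0;
}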
A real world example is a stack of dishes. As dishes are ready to be washed, they could be pushed on top of a stack of dishes, and a dishwasher could iteratively pop dishes off the top of the stack to wash them. The order in which the dishes are washed is not necessarily the order in which they were used. It might take a while for the dishwasher to get to the bottom plate as long as new dirty plates are constantly added. In this analogy, the dishwasher is the CPU, and this is why the main function is always in the stack as long as the program is executing. Only after all program instructions have completed will main be able to complete.
Similarly, a program stack is made of stack frames. Each stack frame represents an executing program method. In the above example, if we paused the program during the getCubeVolume call, the program stack would be made of two frames: the main function would be the stack frame at the bottom of the stack, and the getCubeVolume function would be the stack frame at the top of the stack.
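For instance, a Java thread dump taken at that moment would show the two frames of the example program in exactly that order (the output shape below is illustrative; exact line numbers and thread details depend on the actual source file and JVM):
"main" #1 prio=5 os_prio=0 runnable
   java.lang.Thread.State: RUNNABLE
        at Program.getCubeVolume(Program.java:7)
        at Program.main(Program.java:3)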
Programs execute in a logical structure called a process, which manages memory access, security, and other aspects of a program. Programs have one or more threads, which are logical structures that manage what is executing on CPUs. Each thread has a program stack, which is an area of memory used to manage function calls. The program stack may also be used for other purposes, such as managing temporary variable memory within a function ("local", "stack frame local", or "automatic" variables).
Confusingly, the program stack commonly grows downward in memory. For example, let's say a thread has a stack that is allocated in the memory range 0x100 to 0x200. When the main function starts executing, let's say, after some housekeeping, the stack starts at 0x180. As main calls getCubeVolume, the stack will "grow" downward to, for example, 0x150 so that getCubeVolume uses the memory range 0x150 - 0x180 for itself. When getCubeVolume finishes, the stack "pops" by going back from 0x150 to 0x180.
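This can be observed with a small experiment; the following C sketch prints the address of a local variable at two call depths (addresses vary per run, and the direction of growth is architecture- and ABI-specific, though it is downward on common platforms; compile with -O0 so the calls aren't inlined):
#include <stdio.h>

/* Print the address of a local variable one call deeper than main. */
static void inner(void) {
    int local = 0;
    printf("inner local at %p\n", (void *)&local);
}

int main(void) {
    int local = 0;
    printf("main  local at %p\n", (void *)&local);
    inner(); /* typically prints a lower address than main's local */
    return 0;
}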
The Stack Pointer
CPUs may have a register that points to the top of the program stack for the currently executing thread. This register is called the stack pointer (SP), extended stack pointer (ESP), register stack pointer (RSP), or other names.
The Instruction Pointer
CPUs may have a register that points to the address of the current execution context of a program. This register is called the instruction pointer (IP), program counter (PC), extended instruction pointer (EIP), instruction address register (IAR), relative instruction pointer (RIP), or other names. Depending on the phase in the CPU's execution, this register may be pointing at the currently executing instruction, or one of the instructions that will be subsequently executed.
The Frame Pointer
CPUs may have a register that points to the bottom of the currently executing stack frame, where local variables for that function start. This register is called the frame pointer (FP), base pointer (BP), extended base pointer (EBP), register base pointer (RBP), or other names. Originally, this was used because the only other relative address available was the stack pointer, which may be constantly moving as local variables are added and removed. Thus, if a function needed access to a parameter passed into the function or one of its own local variables, it could just use a constant offset from the frame pointer.
Compilers may perform an optimization called frame pointer omission (FPO) (e.g. with gcc with -O or -fomit-frame-pointer, or by default since GCC 4.6) that uses the frame pointer register as a general purpose register instead and embeds the necessary offsets relative to the stack pointer into the program, avoiding the need for the frame pointer.
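For example, the behavior can be controlled explicitly when compiling with gcc (program.c is a hypothetical source file):
gcc -O2 -fno-omit-frame-pointer -c program.c   # keep the frame pointer even when optimizing
gcc -O2 -fomit-frame-pointer -c program.c      # explicitly omit the frame pointer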
In the unoptimized case (i.e. without FPO), a common calling convention is for each function to first push the previous function's frame pointer onto its stack, copy the current value of the stack pointer into the frame pointer, and then allocate some space for the function's local variables (Intel Syntax):
push ebp         ; save the caller's frame pointer on the stack
mov ebp, esp     ; the new frame pointer is the current top of the stack
sub esp, $LOCALS ; reserve $LOCALS bytes for this function's local variables
When the function returns, it will remove all the local stack space it used, pop the frame pointer back to the parent function's value, and return to the previous function, releasing the stack space used for the parameters passed into this function; for example (Intel Syntax):
mov esp, ebp                  ; discard this function's local stack space
pop ebp                       ; restore the caller's frame pointer
ret $INCOMING_PARAMETERS_SIZE ; return, also popping the bytes used for incoming parameters
When a function calls another function, any parameters are pushed onto the stack, then a return address is pushed onto the stack (in this conceptual example, the instruction pointer plus the size of the next two instructions), and then a jump instruction starts executing the new function. When the called function returns, it continues executing at that return address; for example (conceptual Intel Syntax; real x86 performs the push and the jump in a single call instruction):
push 1            ; push the function's parameters onto the stack
push 2
push 3
push eip + 2      ; conceptually push the return address (pseudo-assembly; eip cannot be pushed directly)
jmp getCubeVolume ; jump to the called function
Call Stack Walking
For diagnostic tools to walk a call stack ("unwind" the stack) in the non-FPO case, the tool simply has to start from the current frame pointer, which allows it to find the pushed frame pointer of the previous function on the stack, and then walk this linked list.
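Here is a minimal sketch of such a walk in C, assuming the conventional x86-64 frame layout (the saved frame pointer at [rbp] and the return address at [rbp+8]) and code compiled with -fno-omit-frame-pointer; real unwinders add many safety and sanity checks:
#include <stdio.h>
#include <stdint.h>

void walk_stack(void) {
    /* __builtin_frame_address(0) returns this function's frame pointer (GCC/Clang). */
    uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
    for (int depth = 0; fp != NULL && depth < 64; depth++) {
        uintptr_t return_address = fp[1]; /* the saved return address of this frame */
        if (return_address == 0)
            break;                        /* conventional end of the chain */
        printf("frame %d: fp=%p returns to %#lx\n", depth, (void *)fp,
               (unsigned long)return_address);
        fp = (uintptr_t *)fp[0];          /* follow the saved frame pointer to the caller */
    }
}

int main(void) {
    walk_stack();
    return 0;
}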
However, if FPO was enabled while compiling, then diagnostic tools generally cannot walk the stack since the frame pointer register is used for general purpose computation:
In some systems, where binaries are built with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs
By default, perf walks callstacks using the frame pointer register (--call-graph fp).
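For example, frame pointer based walking may also be requested explicitly, mirroring the commands shown below:
perf record --call-graph fp -F 99 -a -g -- sleep 60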
As an alternative, if programs are compiled with debugging information in the form of standards such as DWARF, which describes detailed information about the program in instances of a Debugging Information Entry (DIE) and, in particular, Call Frame Information, then some tools may be able to unwind the stack using this information (e.g. using libdw and libunwind). For example, perf may be run with --call-graph dwarf:
perf record --call-graph dwarf,65528 -F 99 -a -g -- sleep 60
However, DWARF-based call stack walking may be around 20% slower, or much more, than frame pointer based call stack walking.
Another alternative, when running on Intel Haswell and newer CPUs, is the Last Branch Record (LBR) capability, which generally has lower overhead than DWARF; for example, perf may be run with --call-graph lbr:
perf record --call-graph lbr -F 99 -a -g -- sleep 60
However, LBR has a limited depth (usually 8 or 16).
In addition, while programs such as perf support DWARF and LBR, other diagnostic tools such as eBPF-based profilers currently do not.
Finally, it's worth noting that these problems usually don't affect core dump or crash analysis, since performance overhead is not a concern there, so technologies like DWARF may be used.
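For example, a core dump may be unwound offline with gdb, where unwinding cost is irrelevant (file names are hypothetical):
gdb /path/to/program /path/to/core
(gdb) bt    # print the backtrace of the crashed thread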
The History of FPO
Historically, FPO brought large performance gains on older architectures such as 32-bit x86 that have few general purpose registers. Performance gains of FPO could be in the range of a few percent to tens of percent. As the industry moved to 64-bit architectures such as x64, the newer CPUs had many more registers; however, FPO still often shows benefits of 1-2%. In addition to the extra general purpose register and slightly fewer instructions to execute, function bodies are simply smaller and thus CPU memory cache hit rates are higher. Due to these performance benefits, and inertia, FPO became the de facto standard, particularly after GCC made it the default some years ago.
Therefore, native stack profilers and problem determination tools may produce truncated stacks, limiting performance analysis and diagnostic capabilities.
Industry Shifting?
Some industry leaders saw that we were leaving a lot of potential performance improvements on the table because it was too difficult or costly to understand what a program was doing. As Brendan Gregg notes, it's not uncommon to find massive bottlenecks worth hundreds of percent if low-overhead, high-quality stack samples can be gathered frequently:
We need frame pointers enabled by default because of performance. Enterprise environments are monitored, continuously profiled, and analyzed on a regular basis, so this capability will indeed be put to use. It enables a world of debugging and new performance tools, and once you find a 500% perf win you have a different perspective about the <1% cost.
Recently, large companies such as Google, Meta, and Netflix have shifted core workloads to re-compiled operating system distributions without FPO.
The counter-argument was that taking a ≈1-2% hit was unreasonable for customers that aren't actively watching their systems with such care. The counter-counter-argument was that, first, the ≈1-2% may be an overstatement with realistic workloads on modern CPUs (with some pathological exceptions), and, second, while it's true that few customers will actively do such deep dives, most customers use third party monitoring tools, and those tools will likely add such a capability if the operating system distribution makes it available.
This culminated in the recent and controversial change by Fedora to disable FPO starting in Fedora 38. Fedora is the upstream of Red Hat Enterprise Linux (RHEL), so if this change sticks, this could end up in the next major release of RHEL, and other Linux distributions would likely follow.
It's worth noting that a Linux distribution disabling FPO doesn't necessarily mean all applications will have FPO disabled. For example, IBM Java and Semeru still compile with FPO, and many other applications may as well (since the default optimizations enable FPO). If this change sticks and the benefits start rolling in, I expect most enterprise software will, slowly but surely, start to disable FPO, and we might even see GCC disabling FPO on CPUs with a lot of registers. However, it's much easier to re-compile an application such as Java without FPO and re-run a test than to re-compile an entire Linux distribution, so while the changes at a Linux distribution level are not total, they are a massive improvement and signal a shift in the industry.
Summary
In summary, having a low overhead way to gather native thread stacks can bring performance improvements of hundreds of percent and open up new worlds of problem determination capabilities with tools such as eBPF. Historically, a compiler optimization called frame pointer omission (FPO) has largely blocked this potential. Given modern CPUs with more general purpose registers, and realistic production measurements, many in the industry are re-considering frame pointer omission. Large industry leaders such as Google, Meta, and Netflix have already moved to re-compiling operating system distributions without FPO, and Fedora 38 is prototyping this as a new default. If the change sticks, we can hope to see whole new worlds of performance tuning and problem determination potential. Applications that don't ship with operating systems will lag for some time but will likely follow, and/or compilers may change their defaults. It's an exciting potential shift in the performance tuning and problem determination landscape.
#automation-portfolio-specialists-app-platform #Java #performance