Originally posted by: stan_kvasov
Today’s multi-core processors support many threads of execution and can provide substantial performance when running multithreaded applications. Unfortunately, multithreaded programming is difficult, and as a result, much of today’s software is still single-threaded. A single-threaded application exploits only one core and leaves a large portion of the processor unutilized.
Current microprocessor architectures pose another challenge: memory latencies are rising relative to processor clock speeds. As a result, a core pays a large performance penalty on a cache miss, which occurs when the processor has to fetch data from main memory. Fetching data from main memory can cause a stall, where the core waits until the required data is retrieved.
In the XLC v11 and XLF v13 versions of its compilers for AIX, IBM introduced a new feature that addresses both of these performance issues: assist threads. The IBM compiler can generate a software prefetch thread that runs on an idle core or SMT context and prefetches data into the processor’s shared cache. This has two advantages for sequential programs. First, the assist thread runs on an idle hardware context and therefore does not slow down the main thread. Second, the data prefetched by the assist thread is reused by the main application thread. Because the data is prefetched into a shared cache, the latency of many memory accesses is reduced, which effectively speeds up the application. Assist threads have been used to achieve up to a 2x performance improvement for applications with irregular memory access patterns.
To generate an assist thread, the compiler must know which memory accesses miss in the cache and stall the core. These accesses are called delinquent loads. Delinquent loads can be identified either by profiling an application with tools like pmcount, or automatically through the compiler’s profile-directed feedback (PDF) facilities. Only delinquent loads inside loops are targeted, since they cause repeated cache misses and have the biggest impact on performance. The compiler analyzes each loop containing a delinquent load and uses a back-slicing algorithm to prune the loop down to just its address computations. Each reduced loop becomes the body of an assist thread. Once the assist-thread code is generated, the compiler also instruments both the main and the assist threads to add synchronization between the two. At run time, the assist thread runs ahead of the main thread and prefetches data into the shared processor cache. The synchronization ensures that the assist thread neither falls behind the main thread nor runs too far ahead; either condition would cause the assist thread to pollute the shared cache and hurt performance.
Join us in the next installment to go through a working example of using assist threads.