Hi all,
There is ongoing work in the Linux community to address contention seen in the kernel when running large multi-threaded processes.
A lock is a great tool to protect a shared resource against concurrent access on a multi-threaded system. There are several types of locks and semaphores, and a large body of literature about them. On a small system, using a lock can be almost invisible because contention is bounded by the number of active CPUs.
With new systems running thousands of CPUs, introducing a lock in a hot path can have a dramatic impact, so kernel developers now try to avoid it and rely on lockless mechanisms such as RCU. But lockless code is complex and harder to maintain. Furthermore, some parts of the kernel were written a long time ago and may not scale well on large systems, and they are difficult to migrate to a lockless mechanism without rewriting sensitive parts of the kernel.
One of the major locks (or semaphores) in the kernel is the mmap_lock (https://elixir.bootlin.com/linux/v6.0-rc7/source/include/linux/mm_types.h#L557). This read-write lock protects the memory layout of a process against concurrent access by its threads. In a process with thousands of threads, it becomes a real bottleneck. The problem is not limited to large systems: Android vendors are also trying to work around this contention to make their multithreaded applications start faster.
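To make the read-write semantics concrete, here is a minimal user-space sketch in C11. It is not the kernel's implementation (the real mmap_lock is a sleeping read-write semaphore); the `toy_rwlock` type and its functions are hypothetical names invented for this illustration, and only non-blocking "try" operations are shown. The point is simply that any number of readers can hold the lock together, while a writer needs it exclusively, which is why one writer can stall thousands of reader threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy reader-writer lock (hypothetical, for illustration only).
 * state > 0 : number of readers currently holding the lock
 * state == 0: lock is free
 * state == -1: one writer holds the lock exclusively */
typedef struct {
    atomic_int state;
} toy_rwlock;

void toy_rwlock_init(toy_rwlock *l)
{
    atomic_init(&l->state, 0);
}

/* Readers share the lock: succeed unless a writer holds it. */
bool toy_read_trylock(toy_rwlock *l)
{
    int s = atomic_load(&l->state);
    while (s >= 0) {
        /* On failure, s is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
            return true;
    }
    return false;
}

void toy_read_unlock(toy_rwlock *l)
{
    int prev = atomic_fetch_sub(&l->state, 1);
    assert(prev > 0); /* must have been held by at least one reader */
    (void)prev;
}

/* A writer needs exclusivity: succeed only if nobody holds the lock. */
bool toy_write_trylock(toy_rwlock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

void toy_write_unlock(toy_rwlock *l)
{
    atomic_store(&l->state, 0);
}
```

With this sketch, two `toy_read_trylock()` calls in a row both succeed, a `toy_write_trylock()` attempted while they are held fails, and it succeeds only once both readers have released the lock.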
There have been multiple attempts to address this contention, and the kernel memory management community has been working on the issue for about a decade. The major constraint is to avoid adding complexity to memory management, which is already one of the most complex parts of the kernel.
One option currently under way addresses part of the issue, page fault handling, by providing range locking, which allows a thread to protect only a range of the process's memory layout. This was discussed at the last LSF/MM conference and presented this year at the Linux Plumbers Conference [1].
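The idea of range locking can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions, not the code of the series discussed above: the `toy_range_lock` type, the fixed-size slot table, the spinning guard flag, and all function names are hypothetical. It shows the key property: two threads working on disjoint address ranges do not block each other, while overlapping requests conflict.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_RANGES 8 /* arbitrary table size for this sketch */

/* One held half-open range [start, end) of a hypothetical address space. */
struct toy_range {
    uintptr_t start;
    uintptr_t end;
    bool held;
};

struct toy_range_lock {
    atomic_flag guard; /* tiny spinlock protecting the slot table */
    struct toy_range slots[TOY_MAX_RANGES];
};

void toy_range_lock_init(struct toy_range_lock *rl)
{
    atomic_flag_clear(&rl->guard);
    for (int i = 0; i < TOY_MAX_RANGES; i++)
        rl->slots[i].held = false;
}

/* Try to lock [start, end). Returns a slot index (>= 0) on success,
 * or -1 if the range overlaps a held range or the table is full. */
int toy_range_trylock(struct toy_range_lock *rl, uintptr_t start, uintptr_t end)
{
    int slot = -1;
    bool conflict = false;

    assert(start < end);
    while (atomic_flag_test_and_set(&rl->guard))
        ; /* spin until the table is ours */

    for (int i = 0; i < TOY_MAX_RANGES; i++) {
        struct toy_range *r = &rl->slots[i];
        if (!r->held) {
            if (slot < 0)
                slot = i; /* remember the first free slot */
        } else if (start < r->end && r->start < end) {
            conflict = true; /* the two half-open ranges overlap */
            break;
        }
    }

    if (!conflict && slot >= 0) {
        rl->slots[slot].start = start;
        rl->slots[slot].end = end;
        rl->slots[slot].held = true;
    } else {
        slot = -1;
    }

    atomic_flag_clear(&rl->guard);
    return slot;
}

/* Release a range previously returned by toy_range_trylock(). */
void toy_range_unlock(struct toy_range_lock *rl, int slot)
{
    assert(slot >= 0 && slot < TOY_MAX_RANGES);
    while (atomic_flag_test_and_set(&rl->guard))
        ;
    assert(rl->slots[slot].held);
    rl->slots[slot].held = false;
    atomic_flag_clear(&rl->guard);
}
```

In this sketch, locking [0x1000, 0x2000) and [0x3000, 0x4000) both succeed because the ranges are disjoint, whereas a request for [0x1800, 0x2800) fails until the first range is released. A real implementation would typically use an interval tree and let waiters sleep instead of a linear scan with try semantics.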
As part of the IBM kernel team, I'm deeply involved in the discussion about this series and am tracking its benefit on large PowerPC systems. The main goal is to improve SAP HANA performance at database loading time, and perhaps at runtime too.
Laurent Dufour
[1] LPC 2022 - Kernel Memory Management MC
https://lpc.events/event/16/contributions/1271/attachments/936/1844/LPC2022_mmap_lock_scalability.pdf