Part 1: Buffers and (File Page) Cache
System caches use all available memory to optimize disk IO. Knowing which memory metrics include this usage is therefore crucial to proper memory analysis.
Congratulations, you’ve won the lottery!
64K is all yours. First, you just have to tell me, how much of that do you want to reject and return to the lottery?
Or perhaps a more realistic scenario:
Congratulations, you’ve landed your dream job working for IBM in their Hybrid Cloud division! CEO Arvind Krishna calls you up and gives you a massive budget. “Don’t disappoint us, we’re counting on you!”
What do you do with the budget? Spend every penny on making the best Hybrid Cloud software? Optimize the company’s expenses, avoiding waste? Tough decisions lie ahead.
But what do prize money and budgets have to do with computer memory?
Memory is both simple and complex. On one hand it’s as simple as counting up all of the bits you can; on the other it’s as complicated as budgeting for a big organization. In my ten plus years of software performance and scale analysis I’ve found a careful balance is needed between the complexities of computer engineering and the elegant simplicity we all desire. We’re constantly being asked the equivalent of how much fuel does it take to get there?
The only answer I can give to that question with 5 nines of reliability is: it depends. Are you driving or flying? What type of vehicle? How much are you transporting? How fast do you plan to drive? We all want to get to the simple answer, 42 (see Douglas Adams for further explanation)! However, getting there generally requires a deeper understanding, or at a minimum an awareness, of the complexities and considerations needed to achieve the desired outcome(s).
Going to the Library
Let’s take a step back and continue the analogies to get a better picture of where memory’s role fits in a computer. My preferred metaphor is to imagine that the computer is a library filled with books. The CPU is a busy student, processing and analyzing book after book. Your desk would be the memory, or system RAM, where your books are laid out and open. The bookshelves would be the disk, storing the information even after the library closes for the night and your workspace is cleared. (Hopefully you don’t need an inter-library loan over the network; watch out for the latency!)
As you do your research, reading and comparing a variety of sources and books, imagine having to return a book to its shelf every time you wanted to switch to another. That would be extremely wasteful and time-consuming. Much wiser to leave as many books as you possibly can out and open on the desk. The bigger the workspace, the more books you can have out, left open on the relevant page. Eventually the desk does get full and you need to make some decisions about which books to clear away.
But it would be foolish to do this prematurely; you’ve claimed the desk, so use it!
File Page Cache
Every operating system’s kernel manages its RAM in a similar fashion to the way I’m suggesting you manage your books. Despite all of the advancements in disk speed and abilities, RAM is still going to be faster and closer to the CPU than the disk. Once something has been read from disk into RAM (a book is pulled from the shelf and placed on the desk), the kernel is going to want to keep that information in RAM as long as it is possible and helpful. This prevents wasteful IO and time, re-reading the same information from disk, over and over. Unused RAM is wasted RAM.
In the end, how much RAM should a system be using? All of it!
Unfortunately, this principle of memory use is not well known, prompting years of “Help! Linux ate my RAM!” issues as described by my all-time favorite site: www.linuxatemyram.com/
There is a difference between “used” and “unavailable”. Just because the RAM has something in it does not mean it is unavailable or that you are running low on memory. On a running, healthy, loaded (and properly sized) system we expect the RAM usage to be close to 100%. This is good. We want something populating as much of the RAM as possible to speed up IO and operations.
At a basic level, you can categorize your total memory in 3 ways:
1) Used directly for processes or applications; this is required by the processes for them to run
2) System buffers/cache to help with IO; the kernel may free or return this memory as needed for processes
3) Actually free; nothing is in it (wasted)
Generally, the only time to panic is when your buffers/cache and free approach zero, meaning the memory usage required by processes and applications to run is approaching the maximum.
You must take buffers and cache into consideration when looking at memory measurements. Recognize if a measurement does or does not include the buffers and cache; the value can have significantly different meaning depending on the answer. Remember, the cache can be reclaimed by the kernel if needed, so it is “available”.
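The three-way split above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not how any particular tool does it; the function names `categorize` and `headroom_pct` and the sample numbers (a hypothetical 32,000 MB system) are assumptions made up for this example.

```python
def categorize(total_mb, free_mb, buff_cache_mb):
    """Split total memory into the three buckets described above."""
    # Whatever is neither free nor buffers/cache is held directly by
    # processes and applications.
    process_mb = total_mb - free_mb - buff_cache_mb
    return {
        "process": process_mb,        # required by running processes
        "buff_cache": buff_cache_mb,  # kernel may reclaim this if needed
        "free": free_mb,              # holds nothing (wasted)
    }


def headroom_pct(total_mb, free_mb, buff_cache_mb):
    """Percentage of memory not required by processes (free + buffers/cache)."""
    return 100.0 * (free_mb + buff_cache_mb) / total_mb


# Hypothetical numbers: almost nothing is free, yet half of RAM is cache.
buckets = categorize(total_mb=32000, free_mb=1000, buff_cache_mb=16000)
print(buckets)
print(f"headroom: {headroom_pct(32000, 1000, 16000):.1f}%")
```

Treat the headroom figure as an upper bound: not every byte of buffers/cache can be reclaimed, and the system still needs some cache to perform well.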
But we cannot simply ignore and starve the cache. Every system needs some amount of cache, with databases in particular having a reliance on the cache for a healthy application.
Any memory discussion must also mention swap space. Swap is space on disk that has been set aside to be used as extra memory. The kernel can use it when memory is full, moving inactive memory pages over to the disk (but still considering them “memory”). Preferably, the swapped data remains inactive or infrequently used, since disk is significantly slower to access than real memory. In general I would not recommend relying on swap for holding significant data, but I have found that even a little swap can help stressed systems handle potential OOM scenarios much more smoothly. As of this writing, in Kubernetes deployments swap usage is not allowed (yet) due to the complexities it can create (see 53533 for discussions and progress).
Memory Measurement Breakdown
Two of the most commonly used command-line tools for viewing memory are free and top. Both use the same breakdown of memory metrics:
1) total (MemTotal): all memory installed or provisioned to the system
2) used (MemUsed): the total memory minus the free and buff/cache
3) free (MemFree): memory without anything in it (wasted)
4) shared: shmem, shared memory segments (often small)
5) buff/cache: the sum of the system buffers (memory for block device IO) and cache (file page cache) usage
6) available (MemAvailable): a more accurate estimate of how much memory is available on the system than simply looking at the free plus buff/cache (see below).
In order to help with the confusion around just how much memory is being “used”, the Linux kernel introduced MemAvailable.
MemAvailable: An estimate of how much memory is available for starting new applications, without swapping. Calculated from MemFree, SReclaimable, the size of the file LRU lists, and the low watermarks in each zone. The estimate takes into account that the system needs some page cache to function well, and that not all reclaimable slab will be reclaimable, due to items being in use. The impact of those factors will vary from system to system. git.kernel.org
On many systems, the MemAvailable measurement will be close to MemTotal minus MemUsed. Or conversely, MemUsed will be close to MemTotal minus MemAvailable. I like to call this MemUnavailable, although that’s not an official term I’ve seen.
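As a rough sketch, MemUnavailable can be computed from the fields the kernel publishes in /proc/meminfo. The parser below works on the text of that file; the helper names `parse_meminfo` and `mem_unavailable_kb` and all values in `SAMPLE_MEMINFO` are made up for illustration.

```python
# Hypothetical excerpt in the "Name:   value kB" layout of /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:       32000000 kB
MemFree:         1000000 kB
MemAvailable:   18000000 kB
Buffers:          300000 kB
Cached:         15000000 kB
"""


def parse_meminfo(text):
    """Return a dict of field name -> integer value from meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def mem_unavailable_kb(fields):
    """MemTotal minus MemAvailable: the informal 'MemUnavailable'."""
    return fields["MemTotal"] - fields["MemAvailable"]


info = parse_meminfo(SAMPLE_MEMINFO)
unavailable = mem_unavailable_kb(info)
print(f"MemUnavailable: {unavailable} kB "
      f"({100 * unavailable / info['MemTotal']:.1f}% of MemTotal)")
```

On a live Linux system the same functions should work on the real file, for example `parse_meminfo(open("/proc/meminfo").read())`.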
Sometimes, though, MemUsed and MemUnavailable will diverge by several GBs, or large percentages of the total memory. The next section explains why, but feel free to skip it (the details are not something everyone needs to know).
Used vs Available Details (Advanced)
More detailed statistics on memory usage (compared to the free command) are kept in /proc/meminfo.
Figure 1: free and /proc/meminfo (abridged)
As you can see, the Cached value in /proc/meminfo (25,584,276 kB) is not the same as the cache value (30,373,472 kB) in free. First, let’s look at what goes into Buffers and Cached in meminfo: namely, the active and inactive file caches, plus the shared memory, Shmem.
(meminfo) Buffers + Cached = Active(file) + Inactive(file) + Shmem
11,448 + 25,584,276 = 12,160,004 + 8,517,192 + 4,918,528 = 25,595,724
To get to the larger buffers and cache in free we add in SReclaimable. This is the part of the slab, or “in-kernel data structures cache”, that can be reclaimed.
(meminfo) Buffers + Cached + SReclaimable = (free) buffers + cache
11,448 + 25,584,276 + 4,789,196 = 11,448 + 30,373,472 = 30,384,920
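The two sums above are easy to check by hand or with a few lines of Python. The values are the ones quoted in the text (in kB); only the variable names are mine.

```python
# Values in kB, copied from the figures quoted in the text above.
buffers = 11_448
cached = 25_584_276
active_file = 12_160_004
inactive_file = 8_517_192
shmem = 4_918_528
s_reclaimable = 4_789_196
free_cache = 30_373_472  # the "cache" figure reported by free

# Identity 1: Buffers + Cached = Active(file) + Inactive(file) + Shmem
lhs = buffers + cached
rhs = active_file + inactive_file + shmem
print(lhs, rhs, lhs == rhs)

# Identity 2: adding SReclaimable to Cached gives free's cache figure
print(cached + s_reclaimable == free_cache)
print(buffers + cached + s_reclaimable)
```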
This is shown in the free source code below.
kb_main_cached = kb_page_cache + kb_slab_reclaimable;
This ultimately feeds into mem_used:
mem_used = kb_main_total - kb_main_free - kb_main_cached - kb_main_buffers;
But the question remains: why and how does mem_used differ from MemAvailable?
MemAvailable uses a more complex formula:
mem_available = (signed long)kb_main_free
- watermark_low
+ kb_inactive_file + kb_active_file
- MIN((kb_inactive_file + kb_active_file) / 2, watermark_low)
+ kb_slab_reclaimable
- MIN(kb_slab_reclaimable / 2, watermark_low);
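To make the formula easier to experiment with, here is a line-by-line Python transcription of the snippet above. The file-cache and slab values are the ones quoted earlier in this section; the MemFree and watermark_low values are hypothetical, since the text does not give them, and the function name `mem_available_kb` is my own.

```python
def mem_available_kb(free, active_file, inactive_file, slab_reclaimable,
                     watermark_low):
    """Python transcription of the mem_available formula quoted above (kB)."""
    page_cache = active_file + inactive_file
    available = free - watermark_low
    # Count the file page cache, but hold back the smaller of
    # half of it or the low watermark.
    available += page_cache - min(page_cache // 2, watermark_low)
    # Same treatment for the reclaimable slab.
    available += slab_reclaimable - min(slab_reclaimable // 2, watermark_low)
    return available


estimate = mem_available_kb(
    free=1_000_000,               # hypothetical
    active_file=12_160_004,       # from the text
    inactive_file=8_517_192,      # from the text
    slab_reclaimable=4_789_196,   # from the text
    watermark_low=50_000,         # hypothetical
)
print(f"estimated available: {estimate} kB")
```

With a small watermark like this one, both `min` calls pick the watermark, so it ends up subtracted three times in total.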
In practice watermark_low is usually relatively small, so it is the value selected by both MIN operations, and the net effect is that watermark_low is subtracted three times. Both formulas involve kb_slab_reclaimable, so it cancels out when looking for net differences. kb_main_buffers is also generally small. The remaining difference therefore comes down to kb_inactive_file + kb_active_file in MemAvailable vs Active(file) + Inactive(file) + Shmem in the Cached calculation that results in mem_used. So the net difference can usually be attributed to Shmem (shared memory), plus a little for watermark_low. See git.kernel.org for more details.
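For readers who prefer to see the cancellation written out, here is a sketch of the algebra under the same approximations as the text. The symbols are my own shorthand: T is total memory, F is free, B is buffers, C is Cached, S is reclaimable slab, A and I are the active and inactive file lists, Sh is Shmem, and w is watermark_low. It assumes the identity B + C = A + I + Sh shown earlier and that each MIN selects w.

```latex
\begin{align*}
\text{mem\_used} &= T - F - (C + S) - B \\
                 &= T - F - S - (A + I + Sh)
                    && \text{since } B + C = A + I + Sh \\
\text{mem\_available} &\approx F + A + I + S - 3w
                    && \text{when each MIN selects } w \\
\text{MemUnavailable} &= T - \text{mem\_available} \\
                      &\approx T - F - A - I - S + 3w \\
\text{MemUnavailable} - \text{mem\_used} &\approx Sh + 3w
\end{align*}
```

This is only a model of the two formulas quoted above; measured values on a live system need not match it exactly.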
In the end, the main takeaway is that some types of memory usage can be reclaimed by the system if needed.
System Examples
Let’s look at how this plays out on a real system. On this worker node we have only 1,013 MB free, yikes! Time to panic and call the perf team, right?!? But wait! There are also 17,771 MB available. Remember, the 287 MB buffers + 15,588 MB cache is mostly available to the kernel to use for processes that need it.
Figure 2: free Memory Statistics
The MemUnavailable, or MemTotal minus MemAvailable, would be 13,632 MB (not shown by free, unfortunately). Notice this is slightly less than the MemUsed of 14,514 MB. Which is correct? Both are, in their own ways, but it is best to focus on the MemUnavailable statistics as they take more of the complexities of the memory usage into account.
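The arithmetic behind these numbers can be replayed in a few lines. All values are the MB figures quoted above; MemTotal is not quoted directly, so the sketch infers it from MemUnavailable plus MemAvailable, which is an assumption on my part.

```python
# Values in MB from the worker node example above.
free_mb = 1_013
buffers_mb = 287
cache_mb = 15_588
available_mb = 17_771
unavailable_mb = 13_632   # MemTotal - MemAvailable, as stated in the text
used_mb = 14_514          # the "used" figure from the text

# MemTotal is not quoted directly; it is implied by the two values above.
total_mb = unavailable_mb + available_mb

# free-style "used": total minus free, buffers and cache.
used_from_formula = total_mb - free_mb - buffers_mb - cache_mb

print(f"implied MemTotal:   {total_mb} MB")
print(f"used (formula):     {used_from_formula} MB")
print(f"used - unavailable: {used_mb - unavailable_mb} MB")
```

The formula lands within 1 MB of the reported used value, which is about what you would expect from figures that have been rounded to whole MB.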
The MemAvailable calculation is the default used by Kubernetes when looking at memory. For example, oc adm top node reports based on the memory available calculation (worker12 below):
Figure 3: OpenShift Kubernetes top Statistics
Make sure to remember this does not include the cache usage!
Monitoring, Tracking, and Analyzing the Data
Prometheus is an open source tool that collects many of the memory statistics needed to get a clear picture of memory usage. Building on that data, Grafana has useful views and aggregations, such as the Node Exporter Full dashboard. Below is a larger worker, showing that the Cache will grow over time, as expected, filling most of the total provisioned memory on the system. MemFree might be empty, but that’s okay.
Figure 4: Grafana Memory Basics
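If you want the MemUnavailable number straight from Prometheus rather than from a dashboard, a query along these lines should work, assuming the nodes run node_exporter (which exposes the /proc/meminfo fields as node_memory_*_bytes metrics). The sketch below only builds the request URL for the Prometheus HTTP API and parses a response; the server address, the instance label, and the sample response are placeholders I made up, not real data.

```python
import json
from urllib.parse import urlencode

# Metric names as exposed by node_exporter.
UNAVAILABLE_QUERY = "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body):
    """Map instance label -> value from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hand-written sample response in the API's response shape (not real data).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "worker12:9100"},
             "value": [1700000000, "14294188032"]},
        ],
    },
})

url = build_query_url("http://prometheus.example:9090", UNAVAILABLE_QUERY)
print(url)
for instance, value in parse_instant_vector(SAMPLE_RESPONSE).items():
    print(f"{instance}: {value / 2**20:.0f} MiB unavailable")
```

To run it against a real server, fetch the URL with any HTTP client and pass the response body to `parse_instant_vector`.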
Insights from the Data
In the various examples above MemFree is nearly empty, yet MemAvailable shows a healthy system with plenty of RAM available. Looking back again at worker12 (Figures 2 and 3), we saw 45% of memory is used; only around 14GiB of the 32GiB provisioned is “unavailable”.
Surely this is wasteful and could be cut back, right? We’ve gone from “Help! Linux ate my RAM!” to “Help! I’ve over-allocated my RAM!”
As always, the answer is, it depends! Your first instinct may be to cut the worker back to 24GiB, or even as low as 16GiB. Only 14GiB is unavailable, so the load should be fine with that budget cut. However, this ignores the important role the buffers and cache play on a system. Remember, applications and databases depend on the cache to improve IO. Removing memory could impact the performance of mission-critical IO operations, slowing response times, queries, tasks, etc.
Databases in particular have an insatiable appetite for memory. Any part of the data that cannot fit in RAM will require a disk read to access. For example, consider a database with 500GB of data written to disk and 64GB of RAM on the system. After process/application usage, some of the most recently or frequently accessed of that 500GB will be able to stay in RAM. But not all!
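A back-of-the-envelope calculation shows how small that cached share can be. The 500GB and 64GB figures come from the example above; the 16GB of process usage and the function name `cacheable_fraction` are invented purely for illustration.

```python
def cacheable_fraction(data_gb, ram_gb, process_gb):
    """Upper bound on the share of on-disk data that can sit in page cache."""
    cache_gb = max(ram_gb - process_gb, 0)
    return min(cache_gb / data_gb, 1.0)


# 500 GB of data and 64 GB of RAM are from the example above;
# the 16 GB of process usage is a made-up figure for illustration.
fraction = cacheable_fraction(data_gb=500, ram_gb=64, process_gb=16)
print(f"at most {fraction:.1%} of the data fits in cache")
```

Whether that share is enough depends on how concentrated the access pattern is, which is why the answer below is, once again, "it depends".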
Where is the tipping point? Again, the answer is, it depends! Factors such as data access characteristics, data activity/age, disk IO speed and performance, query performance expectations, etc. all come into play. Every database is going to differ, and many do have built-in caches outside the system cache, but in general you’re going to need some cache set aside for every application. And in my experience, databases can easily account for well over half of your application’s hardware requirements.
Later we will dive into examples of how to analyze the cache vs process/application usage, but first we’ll need to learn about some additional memory metrics to better understand the usage.
System Memory Summary
Remember, to a system kernel, unused RAM is wasted RAM. You would never reject lottery winnings! All of it should be used for something.
Memory is also your budget from the CEO; invest every dollar wisely in ways that will help further your goals. Some of the budget is absolutely required to run. On top of that requirement is money for improving and optimizing the organization’s performance. Just because memory is used does not mean it is required; just because memory is not required does not mean it is not used.
Know what your memory statistic is telling you. Does this number include the cache or not? Knowing the answer will help you properly analyze and tune your systems.
To Be Continued…
Up next we will dive into what goes into Kubernetes and containers’ memory.