Part 1: Buffers and (File Page) Cache
System caches use all available memory to optimize disk IO. Knowing which memory metrics count this usage in their stat is therefore crucial to proper memory analysis.
Congratulations, you’ve won the lottery!
64K is all yours. First, you just have to tell me, how much of that do you want to reject and return to the lottery?
Or perhaps a more realistic scenario:
Congratulations, you’ve landed your dream job working for IBM in their Hybrid Cloud division! CEO Arvind Krishna calls you up and gives you a massive budget. “Don’t disappoint us, we’re counting on you!”
What do you do with the budget? Spend every penny on making the best Hybrid Cloud software? Optimize the company’s expenses, avoiding waste? Tough decisions lie ahead.
But what do prize money and budgets have to do with computer memory?
Memory is both simple and complex. On one hand it’s as simple as counting up all of the bits you can; on the other it’s as complicated as budgeting for a big organization. In my ten plus years of software performance and scale analysis I’ve found a careful balance is needed between the complexities of computer engineering and the elegant simplicity we all desire. We’re constantly being asked the equivalent of how much fuel does it take to get there?
The only answer I can give to that question with 5 nines of reliability is: it depends. Are you driving or flying? What type of vehicle? How much are you transporting? How fast do you plan to drive? We all want to get to the simple answer, 42 (see Douglas Adams for further explanation)! However, getting there generally requires a deeper understanding, or, at a minimum, an awareness of the complexities and considerations needed to achieve the desired outcome(s).
Going to the Library
Let’s take a step back and continue the analogies to get a better picture of where memory fits in a computer. My preferred metaphor is to imagine that the computer is a library filled with books. You, the CPU, are a busy student, processing and analyzing book after book. Your desk is the memory, or system RAM, where your books are laid out and open. The bookshelves are the disk, storing the information even after the library closes for the night and your workspace is cleared. (Hopefully you don’t need an inter-library loan over the network; watch out for the latency!)
As you do your research, reading and comparing between a variety of sources and books, imagine if you had to return a book to its shelf every time you wanted to switch to another book? That would be extremely wasteful and time-consuming. Much wiser to leave all of the books you possibly can out and open on the desk. The bigger the workspace, the more books you can have out, left open on the relevant page. Eventually the desk does get full and you need to make some decisions about which books to clear away.
But it would be foolish to do this prematurely; you’ve claimed the desk, so use it!
File Page Cache
Every operating system’s kernel manages its RAM in a similar fashion to the way I’m suggesting you manage your books. Despite all of the advancements in disk speed and abilities, RAM is still going to be faster and closer to the CPU than the disk. Once something has been read from disk into RAM (a book is pulled from the shelf and placed on the desk), the kernel is going to want to keep that information in RAM as long as it is possible and helpful. This prevents wasteful IO and time, re-reading the same information from disk, over and over. Unused RAM is wasted RAM.
In the end, how much RAM should a system be using? All of it!
Unfortunately, this criterion for using memory is not well known, prompting years of “Help! Linux ate my RAM!” issues as described by my all-time favorite site:
There is a difference between “used” and “unavailable”. Just because the RAM has something in it, does not mean it is unavailable or you are running low on memory. On a running, healthy, loaded, (properly sized) system we expect the RAM usage to be close to 100%. This is good. We want something populating as much of the RAM as possible to speed up IO and operations.
At a basic level, you can categorize your total memory in 3 ways:
1) Used directly for processes or applications; this is required by the processes for them to run
2) System buffers/cache to help with IO; the kernel may free or return this memory as needed for processes
3) Actually free; nothing is in it (wasted)
Generally, the only time to panic is when your buffers/cache and free approach zero, meaning the memory usage required by processes and applications to run is approaching the maximum.
You must take buffers and cache into consideration when looking at memory measurements. Recognize if a measurement does or does not include the buffers and cache; the value can have significantly different meaning depending on the answer. Remember, the cache can be reclaimed by the kernel if needed, so it is “available”.
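The three categories above can be sketched in a few lines. This is only an illustration of the bookkeeping, not how any tool implements it: the function names, the sample numbers, and the 5% warning threshold are all assumptions made up for the example.

```python
def memory_breakdown(total_mb, free_mb, buff_cache_mb):
    """Split total memory into the three categories described above (values in MB)."""
    # Whatever is neither free nor buffers/cache is held directly by processes.
    process_mb = total_mb - free_mb - buff_cache_mb
    return {"process": process_mb, "buff_cache": buff_cache_mb, "free": free_mb}


def headroom_is_low(breakdown, threshold=0.05):
    """Return True when free plus buffers/cache drops below a fraction of total.

    The 5% default is an arbitrary illustrative value, not a recommendation.
    """
    total_mb = sum(breakdown.values())
    headroom_mb = breakdown["free"] + breakdown["buff_cache"]
    return headroom_mb / total_mb < threshold


# Hypothetical 32,000 MB system: very little is free, but the cache is large.
healthy = memory_breakdown(total_mb=32000, free_mb=500, buff_cache_mb=16000)
# Hypothetical stressed system: processes hold nearly everything.
stressed = memory_breakdown(total_mb=32000, free_mb=300, buff_cache_mb=700)

print(healthy, headroom_is_low(healthy))
print(stressed, headroom_is_low(stressed))
```

Both hypothetical systems have almost nothing free, yet only the second one is actually short on memory.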
But we cannot simply ignore and starve the cache. Every system needs some amount of cache, with databases in particular having a reliance on the cache for a healthy application.
Any memory discussion must also mention the swap space. The swap is space on disk that has been set aside to be used as extra memory. The kernel can use it when memory is full, moving inactive pages over to the disk (but still considering them “memory”). Preferably, the swapped data remains inactive or infrequently used, since the disk is significantly slower to access than real memory. In general I would not recommend relying on swap for holding significant data, but I have found that even a little swap can help stressed systems handle potential OOM scenarios much more smoothly. As of this writing, in Kubernetes deployments swap usage is not allowed (yet) due to the complexities it can create (see 53533 for discussions and progress).
Memory Measurement Breakdown
Two of the most commonly used command-line tools for viewing memory are free and top. Both use the same breakdown of memory metrics:
1) total (MemTotal): all memory installed or provisioned to the system
2) used (MemUsed): the total memory minus the free and buff/cache
3) free (MemFree): memory without anything in it (wasted)
4) shared: shmem, shared memory segments (often small)
5) buff/cache: the sum of the system buffers (memory for block device IO) and cache (file page cache) usage
6) available (MemAvailable): a more accurate estimate of how much memory is available on the system than simply looking at the free plus buff/cache (see below).
To help with the confusion around just how much memory is being “used”, the Linux kernel introduced the MemAvailable metric:
MemAvailable: An estimate of how much memory is available for starting new applications, without swapping. Calculated from MemFree, SReclaimable, the size of the file LRU lists, and the low watermarks in each zone. The estimate takes into account that the system needs some page cache to function well, and that not all reclaimable slab will be reclaimable, due to items being in use. The impact of those factors will vary from system to system. (Source: git.kernel.org)
On many systems, MemTotal - MemAvailable will be close to MemUsed. Or conversely, MemTotal - MemUsed will be close to MemAvailable. I like to call MemTotal - MemAvailable the MemUnavailable, although that’s not an official term I’ve seen. On some systems, however, MemUsed and MemUnavailable will diverge by several GBs, or large percentages of the total memory. The next section explains why, but feel free to skip it (the details are not something everyone needs to know).
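As a rough sketch, and taking MemUnavailable to mean MemTotal - MemAvailable, the figure can be derived from /proc/meminfo-style text like this. The helper names and the sample values are made up for illustration; only the field names (MemTotal, MemFree, MemAvailable) come from the kernel.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        # Values look like "   32000000 kB"; the first token is the number.
        fields[name.strip()] = int(rest.split()[0])
    return fields


def mem_unavailable_kb(fields):
    """MemTotal - MemAvailable, the unofficial "MemUnavailable" figure."""
    return fields["MemTotal"] - fields["MemAvailable"]


# Made-up sample values; on a Linux host you could instead pass
# open("/proc/meminfo").read().
SAMPLE = """\
MemTotal:       32000000 kB
MemFree:         1000000 kB
MemAvailable:   18000000 kB
"""

info = parse_meminfo(SAMPLE)
print(mem_unavailable_kb(info), "kB unavailable")
```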
Used vs Available Details (Advanced)
More detailed statistics on memory usage (compared to the free command) are kept in /proc/meminfo.
Figure 1: free and /proc/meminfo (abridged)
As you can see, the Cached value in /proc/meminfo (25,584,276 kB) is not the same as the cache (30,373,472 kB) in free. First, let’s look at what goes into Buffers and Cached in meminfo: namely, the active and inactive file (caches), plus the shared memory.
(meminfo) Buffers + Cached = Active(file) + Inactive(file) + Shmem
11,448 + 25,584,276 = 12,160,004 + 8,517,192 + 4,918,528 = 25,595,724
To get to the larger cache value in free we add in SReclaimable. This is the part of the slab, or “in-kernel data structures cache”, that can be reclaimed.
(meminfo) Buffers + Cached + SReclaimable = (free) buffers + cache
11,448 + 25,584,276 + 4,789,196 = 11,448 + 30,373,472 = 30,384,920
This is shown in the free source code below.
kb_main_cached = kb_page_cache + kb_slab_reclaimable;
And ultimately results in the used calculation:
mem_used = kb_main_total - kb_main_free - kb_main_cached - kb_main_buffers;
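To check the arithmetic, here is a small Python rendering of those two lines using the Buffers, Cached and SReclaimable values quoted above. The function names are my own, and because MemTotal and MemFree are not quoted in the text, the two values used for them are placeholders.

```python
def free_cached_kb(page_cache_kb, slab_reclaimable_kb):
    """Mirror of kb_main_cached: page cache plus reclaimable slab."""
    return page_cache_kb + slab_reclaimable_kb


def free_used_kb(total_kb, free_kb, cached_kb, buffers_kb):
    """Mirror of mem_used: total minus free, cache and buffers."""
    return total_kb - free_kb - cached_kb - buffers_kb


# Values quoted above from /proc/meminfo (kB).
BUFFERS = 11_448
CACHED = 25_584_276
SRECLAIMABLE = 4_789_196

cache = free_cached_kb(CACHED, SRECLAIMABLE)
print(cache)            # the "cache" figure reported by free
print(BUFFERS + cache)  # buffers + cache

# Placeholders: MemTotal and MemFree are not given in the text.
ASSUMED_TOTAL = 64_000_000
ASSUMED_FREE = 2_000_000
print(free_used_kb(ASSUMED_TOTAL, ASSUMED_FREE, cache, BUFFERS))
```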
But the question remains, why and how does mem_used differ from MemUnavailable (MemTotal - MemAvailable)? MemAvailable uses a more complex formula:
mem_available = (signed long)kb_main_free - watermark_low
    + kb_inactive_file + kb_active_file
    - MIN((kb_inactive_file + kb_active_file) / 2, watermark_low)
    + kb_slab_reclaimable
    - MIN(kb_slab_reclaimable / 2, watermark_low);
In practice the watermark_low is usually relatively small, so it is the value selected by both MIN operations, netting the subtraction of the watermark_low three times. Both formulas now involve the kb_slab_reclaimable, so it cancels out when looking for net differences. The kb_main_buffers is also generally small. So the remaining difference comes down to kb_inactive_file + kb_active_file in mem_available versus Active(file) + Inactive(file) + Shmem in the Cached calculation that results in mem_used. The net difference can therefore usually be attributed to the Shmem, or shared memory, plus some for the watermark_low. See git.kernel.org for more details.
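The cancellation described above can be checked numerically. The sketch below renders the available estimate in Python the way the paragraph describes it (watermark_low subtracted three times, reclaimable slab added back) and compares it with the used calculation. The file, Shmem, Buffers, Cached and SReclaimable values are the ones quoted earlier; MemTotal, MemFree and watermark_low are placeholders I picked, and the function name is hypothetical.

```python
def estimate_available_kb(free_kb, active_file_kb, inactive_file_kb,
                          slab_reclaimable_kb, watermark_low_kb):
    """Available-memory estimate as described in the text (integer kB)."""
    file_kb = active_file_kb + inactive_file_kb
    return (free_kb - watermark_low_kb
            + file_kb - min(file_kb // 2, watermark_low_kb)
            + slab_reclaimable_kb - min(slab_reclaimable_kb // 2, watermark_low_kb))


# Quoted earlier from /proc/meminfo (kB).
ACTIVE_FILE, INACTIVE_FILE, SHMEM = 12_160_004, 8_517_192, 4_918_528
BUFFERS, CACHED, SRECLAIMABLE = 11_448, 25_584_276, 4_789_196

# Placeholders: not given in the text.
TOTAL, FREE, WATERMARK_LOW = 64_000_000, 2_000_000, 60_000

available = estimate_available_kb(FREE, ACTIVE_FILE, INACTIVE_FILE,
                                  SRECLAIMABLE, WATERMARK_LOW)
unavailable = TOTAL - available
used = TOTAL - FREE - (CACHED + SRECLAIMABLE) - BUFFERS

print("unavailable - used   =", unavailable - used)
print("Shmem + 3 * watermark =", SHMEM + 3 * WATERMARK_LOW)
```

The two printed numbers match here only because these sample values satisfy Buffers + Cached = Active(file) + Inactive(file) + Shmem exactly. Real systems carry additional terms, so treat the result as an approximation.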
In the end the main takeaway needed is to recognize that some types of memory usage can be reclaimed by the system if needed.
Let’s look at how this plays out on a real system. On this worker node we have only 1,013 MB free, yikes! Time to panic and call the perf team, right?!? But wait! There are also 17,771 MB available. Remember, the 287 MB of buffers + 15,588 MB of cache is mostly available to the kernel to use for processes that need it.
Figure 2: free Memory Statistics
The MemUnavailable, or MemTotal - MemAvailable, would be 13,632 MB (not shown by free, unfortunately). Notice this is slightly less than the MemUsed of 14,514 MB. Which is correct? Both are, in their own ways, but it is best to focus on the MemAvailable/MemUnavailable statistics as they take more of the complexities of the memory usage into account.
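The numbers for this worker can be reproduced with a few lines of arithmetic. One assumption to flag: the total is not quoted in the text, so it is rebuilt here as used + free + buffers + cache, following the definition of used given earlier. Because the inputs are rounded to whole MB, the result can land a megabyte away from the 13,632 MB quoted above.

```python
# Values in MB as quoted in the text for this worker node.
FREE, BUFFERS, CACHE = 1_013, 287, 15_588
USED, AVAILABLE = 14_514, 17_771

# Assumption: total = used + free + buffers + cache (from the definition of "used").
total = USED + FREE + BUFFERS + CACHE
unavailable = total - AVAILABLE

print(f"total       {total:>7,} MB")
print(f"unavailable {unavailable:>7,} MB")
print(f"used        {USED:>7,} MB")
print(f"used - unavailable = {USED - unavailable:,} MB")
```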
The MemAvailable calculation is the default used by Kubernetes when looking at memory. For example, oc adm top node reports based on the memory available calculation (worker12 below):
Figure 3: OpenShift Kubernetes top Statistics
Make sure to remember this does not include the cache usage!
Monitoring, Tracking, and Analyzing the Data
Prometheus is an open source tool that collects many of the memory statistics needed to get a clear picture of the memory usage. Building on that data, Grafana has useful views and aggregations, such as the Node Exporter Full dashboard. Below is a larger worker, showing that the Cache will grow over time, as expected, filling most of the total provisioned memory on the system. The MemFree might be empty, but that’s okay.
Figure 4: Grafana Memory Basics
Insights from the Data
In the various examples above the MemFree is nearly empty, yet MemAvailable shows a healthy system with plenty of RAM available. Looking back again at worker12 (Figures 2 and 3), we saw 45% of memory is used; only around 14GiB of the 32GiB provisioned is “unavailable”.
Surely this is wasteful and could be cut back, right? We’ve gone from “Help! Linux ate my RAM!” to “Help! I’ve over-allocated my RAM!”
As always, the answer is, it depends! Your first instinct may be to cut the worker back to 24GiB, or even as low as 16GiB. Only 14GiB is unavailable, so the load should be fine with that budget cut. However, this ignores the important role the buffers and cache play on a system. Remember, applications and databases depend on the cache to improve IO. Removing memory could impact the performance of mission-critical IO operations, slowing response times, queries, tasks, etc.
Databases in particular have an insatiable appetite for memory. Any part of the data that cannot fit in RAM will require a disk read to access. For example, consider a database with 500GB of storage written to disk and 64GB RAM on the system. After process/application usage, some of the most recently accessed data from that 500GB will be able to stay in RAM. But not all!
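A back-of-the-envelope version of that example, as a sketch: the 500GB and 64GB figures come from the text, while the 16GB of process/application usage and the function name are assumptions picked purely for illustration.

```python
def cacheable_fraction(data_gb, ram_gb, process_gb):
    """Upper bound on the share of on-disk data that can sit in RAM at once."""
    # RAM left over after processes is the most the cache could ever hold.
    cache_gb = max(ram_gb - process_gb, 0)
    return min(cache_gb / data_gb, 1.0)


PROCESS_GB = 16  # hypothetical process/application usage

fraction = cacheable_fraction(data_gb=500, ram_gb=64, process_gb=PROCESS_GB)
print(f"At best {fraction:.1%} of the data fits in RAM")
```

Under that assumption less than a tenth of the data can be cached at once, so how often the remaining data is touched matters a great deal.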
Where is the tipping point? Again, the answer is, it depends! Factors such as data access characteristics, data activity/age, disk IO speed and performance, query performance expectations, etc. all come into play. Every database is going to differ, and many do have built-in caches outside the system cache, but in general you’re going to need some cache set aside for every application. And in my experience, databases can easily account for well over half of your application’s hardware requirements.
Later we will dive into examples of how to analyze the cache vs process/application usage, but first we’ll need to learn about some additional memory metrics to better understand the usage.
System Memory Summary
Remember, to a system kernel, unused RAM is wasted RAM. You would never reject lottery winnings! All of it should be used for something.
Memory is also your budget from the CEO; invest every dollar wisely in ways that will help further your goals. Some of the budget is absolutely required to run. On top of that requirement is money for improving and optimizing the organization’s performance. Just because memory is used does not mean it is required; just because memory is not required does not mean it is not used.
Know what your memory statistic is telling you. Does this number include the cache or not? Doing so will help you properly analyze and tune your systems.
To Be Continued…
Up next we will dive into what goes into Kubernetes and containers’ memory.