Memory Measurements Complexities and Considerations Part 2: Kubernetes and Containers

By RILEY ZIMMERMAN posted Fri July 02, 2021 03:42 AM

  


Previously we looked at memory usage by the system as a whole, focusing on the buffers and cache usage that can grow and shrink as needed to optimize file IO.

But how much cache do we really need? And how can we tell who is using it? For this deeper dive we’re going to focus on the usage within Kubernetes environments and the containers that run there.

Kubernetes: Container Requests and Limits

In Kubernetes, each container can be deployed with requests and limits for memory and CPU (along with other custom resources). The requests help the Kubernetes scheduler spread workloads across resources; the limits place maximums, or caps, on the usage.

When considering these requests and limits, keep in mind that memory is an absolute resource: either you have it or you don’t. CPU, by contrast, is a compressible resource designed to be shared between many different processes and threads continuously; you don’t get millisecond turns at using memory. Because CPU and memory have very different characteristics, best practices for memory requests and limits may not apply to CPU. CPU throttling adds a whole different set of complexities and considerations to the processor domain.

A worker node must have the requested resource value available in order for a pod to be scheduled onto it. In this case, “available” is simply based on the node’s allocatable capacity minus the sum of all other requests already placed on the worker. This is a critical point many people do not realize: the scheduler is not looking at actual usage (as long as the node isn’t under extreme, eviction-triggering memory pressure), only at the requested values.

Why? Kubernetes does not know the difference between a pod named “big-cassandra” and a pod named “tiny-api”. It’s up to you, the developer/tester/user, to help the scheduler know what to expect from the pod once it grows to its full potential. Because of this, I often refer to the request as your budget.

For example, if you have a worker with 15GiB of RAM, you can place 15 x 1GiB requests on it (system reserves mean you probably only have around 15GiB allocatable on a 16GiB worker). You are guaranteed to at least get what you requested.
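
To make the scheduling math concrete, here is a minimal sketch (hypothetical numbers) of the memory fit check the scheduler effectively performs: the sum of requests already on the node plus the new pod’s request must stay within the node’s allocatable memory; actual usage is never consulted.

package main

import "fmt"

func main() {
    const gib = uint64(1) << 30

    // Hypothetical worker from the example above: ~15GiB allocatable.
    allocatable := 15 * gib

    // Sum of memory requests already scheduled on this node.
    existingRequests := 14 * gib

    // The incoming pod requests 1GiB; its actual usage is irrelevant here.
    newRequest := 1 * gib

    fits := existingRequests+newRequest <= allocatable
    fmt.Println("pod fits:", fits) // true: 14GiB + 1GiB <= 15GiB
}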

A limit may be larger than the requested value. You can request 1GiB but limit at 2GiB. Once the container exceeds 2GiB, it will be killed. When you set the limit larger than the memory request, you are counting on some level of safe overallocation on the worker nodes to handle the extra memory you didn’t ask/budget/request for. This can be dangerous and lead to evictions and/or OOM kills if nodes experience too much memory pressure.
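
As an illustration, here is a minimal sketch of that 1GiB request / 2GiB limit combination expressed with the Kubernetes Go API types (k8s.io/api/core/v1); it is the programmatic equivalent of the resources section you would declare in a pod spec.

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Request 1GiB (the scheduling budget), limit 2GiB (the kill threshold).
    res := corev1.ResourceRequirements{
        Requests: corev1.ResourceList{
            corev1.ResourceMemory: resource.MustParse("1Gi"),
        },
        Limits: corev1.ResourceList{
            corev1.ResourceMemory: resource.MustParse("2Gi"),
        },
    }

    fmt.Printf("request=%s limit=%s\n",
        res.Requests.Memory().String(), res.Limits.Memory().String())
}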

For example, you request 1GiB and limit at 2GiB. You then run 15 of these containers on a 15GiB worker. Your maximum usage potential is 30GiB based on the sum of the limits. Yet the system only has 15GiB; container and process evictions or kills will happen if the total usage goes near the 15GiB total.

But wait! I thought we wanted all of the RAM on a system to be used? We still do, just in a safe way that allows containers to optimize their usage while playing nicely with their neighbors. To do this, we need to look at the memory statistics Kubernetes collects and uses.

Prometheus Kubernetes Memory Stats

Prometheus and cAdvisor gather their container-level metrics from the control group (cgroup) memory statistics located in /sys/fs/cgroup/memory/. Most are in memory.stat, but some have their own files, as documented in the kernel’s cgroup memory documentation.



Figure 5: cgroup Memory
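
As an illustration, here is a minimal sketch (assuming cgroup v1, as on the nodes described here) of reading those same counters directly; each container’s cgroup is a subdirectory under this path with its own copies of these files.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    // Root memory cgroup; a container would have its own subdirectory.
    const base = "/sys/fs/cgroup/memory"

    // memory.stat holds most counters as "name value" pairs,
    // e.g. cache, rss, mapped_file, inactive_file.
    stats := map[string]uint64{}
    f, err := os.Open(base + "/memory.stat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        if len(fields) != 2 {
            continue
        }
        v, _ := strconv.ParseUint(fields[1], 10, 64)
        stats[fields[0]] = v
    }

    // Total usage lives in its own file.
    usage, _ := os.ReadFile(base + "/memory.usage_in_bytes")

    fmt.Println("cache:", stats["cache"], "rss:", stats["rss"],
        "inactive_file:", stats["inactive_file"],
        "usage:", strings.TrimSpace(string(usage)))
}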

kube_pod_container_resource_requests_memory_bytes (Request): The memory request used by the Kubernetes scheduler for deploying pods to nodes.

kube_pod_container_resource_limits_memory_bytes (Limit): The memory limit at which the container will be killed.

container_memory_cache (Cache): Page cache usage of the container, used to help with (file) IO.

The cache usage is counted within the container’s cgroup, which can be restricted in size by the limit. A container is not allowed to use all of the memory on the system for its cache unless you do not set a limit. This may come as a surprise to some people at first (myself included). However, it is working as designed; we want to be able to safely account for our own container’s usage when we limit it. See kubelet counts active page cache against memory.available (maybe it shouldn’t?).

container_memory_rss (RSS): At a high level, memory usage not related to the file cache. RSS is used by the kernel for out of memory (OOM) scores and killing of processes when memory hits the limit.

The technical description for the container RSS metric is not exactly the same as the “resident set size” of a process:

Note: Only anonymous and swap cache memory is listed as part of ‘rss’ stat. This should not be confused with the true ‘resident set size’ or the amount of physical memory used by the cgroup.
‘rss + mapped_file’ will give you resident set size of cgroup.
(Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.) github.com/torvalds/linux

Anonymous memory is “A page of memory that is not associated with a file on a file system.” — kernelnewbies.org. These are the anon stats in the cgroup.

And “Shared pages that have a reserved slot in backing storage are considered to be part of the swap cache.” — kernel.org

Possibly more detail than you need to know. Just remember RSS is basically the non-cache usage.

container_memory_working_set_bytes (WSS): The active memory usage by the container, calculated by subtracting the inactive file usage from the total memory usage.

// From cAdvisor: the working set is the total usage minus the inactive
// file cache, clamped at zero if the subtraction would go negative.
workingSet := ret.Memory.Usage
if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
    if workingSet < v {
        workingSet = 0
    } else {
        workingSet -= v
    }
}

The amount of working set memory, this includes recently accessed memory, dirty memory, and kernel memory. Working set is <= “usage”. github.com/google/cadvisor

The net takeaway: active cache is counted as part of the working set. This is important because the working set is what Kubernetes watches and compares to the memory limit, evicting the pod when under memory pressure.

Working set is a game changer when analyzing memory usage. Now we can view how much is actually, actively being used by a container. While there is no active RSS (active anonymous) or active cache (active file) metric in Prometheus today, we can at least subtract the working set from the total to get the inactive cache (inactive file).
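
In other words, using the metric names above:

container_memory_usage_bytes - container_memory_working_set_bytes ==
inactive cache (inactive file)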

container_memory_mapped_file: Size of any mapped files, including tmpfs (filesystem all in memory) and shmem (memory shared between multiple processes). For mapped files, parts of a file are pulled into the page cache so they can be accessed as an array by the program, optimizing reads and writes. As previously referenced:

Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.
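
As a small illustration of that array-like access, here is a minimal Go sketch (hypothetical file path) that memory-maps a file for reading; the mapped pages live in the page cache and are counted under mapped_file for the owning cgroup.

package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    // Hypothetical file; mapping it pulls its pages into the page cache.
    f, err := os.Open("/etc/hostname")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fi, _ := f.Stat()
    data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
        syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(data)

    // The file contents are now addressable as an ordinary byte slice.
    fmt.Printf("%s", data)
}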

container_memory_usage_bytes (Total): Total memory usage of a container, regardless of when it was accessed.

container_memory_usage_bytes == 
container_memory_rss + container_memory_cache +
container_memory_swap + container_memory_kernel
github.com/google/cadvisor/issues/1744

Unfortunately, container_memory_kernel is not exposed in Prometheus as of today. Also, note the equation does not include the specialized memory space container_memory_mapped_file, which is already counted in the cache.

Default Kubernetes Metrics

Because it includes all active memory, the working set is the default statistic in Kubernetes; details are documented in issue 227. When you view your pods with commands such as oc adm top pods, the result shows the working set. Graphs in OpenShift’s default Grafana metrics views also show the working set. So in these cases, the active cache usage is included in your results.

However, this is not the same as the memory usage shown by node-level commands such as oc adm top node. In the node-level metrics, as previously explained, the memory shown is based on MemTotal minus MemAvailable (the “MemUnavailable” statistic). This excludes the cache usage.

Summing up your containers’ working sets will not give you MemUnavailable. Beyond the difference in cache counting, there is also system usage outside of containers to account for at the node level.
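
As a rough summary of the two views:

node “used” (oc adm top node) == MemTotal - MemAvailable (cache excluded)
pod “used” (oc adm top pods) == working set (active cache included)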

Knowing which you are looking at is critical as the meaning is very different!

To Be Continued…

Up next are real world example measurements of the Kubernetes memory metrics we discussed here.


#rss
#Kubernetes
#wss
#MEMORY
