Virtual memory: staying on under pressure
Memory management in Linux isn’t exactly the stuff of black magic, and it’s now time to get it working the way you want.
Think back to when you were put in charge of a web server. One day (actually, night) you get a wake-up call saying your server is dropping connections. You log in via SSH and quickly find the reason: there’s no web server running on the box. Still, you look in the Apache/nginx/whatever else logs and see no fatal errors. The process has just disappeared. Something strange is going on here.
If this ever happens to you, look in dmesg. Chances are you’ll see the infamous OOM killer message. This TLA stands for Out Of Memory, and this is the last resort mechanism the kernel employs to grab some memory when it absolutely needs to. Most likely, your web server wasn’t the culprit, but the victim. Why did Linux choose to sacrifice it, and what can you do to avoid yet another wake-up call? In this Administeria, you’ll dig into how Linux manages memory, and which knobs are available for you to tweak this process.
It started in userspace
The bulk of memory management occurs in the kernel, which isn’t the simplest thing in the world. And complex things are best learnt by example, they say. So, consider a process that allocates 16MB of memory:
    char *buf = (char *)malloc(16 * 1024 * 1024);
    if (!buf) {
        /* Unlikely to happen */
    }
Think of a relevant construction from your favourite language if you don’t like C. Either way, the process may not have that much memory at hand, so it needs to go to the kernel and ask for some pages (4K blocks, to keep things simple). There are two main syscalls available for this purpose: brk() and mmap(). Which one is chosen depends on the allocator and how much memory it wants to allocate. brk() grows the process heap, which is contiguous: you can’t release the first 99MB of a 100MB chunk when you don’t need it any more. mmap() plays such tricks easily, so it’s the preferred way for large memory allocations. In glibc, you convey the exact meaning of “large” via mallopt(3), and by default it starts as low as 128K.
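The threshold mentioned above can be moved from within the program. Here is a minimal sketch, assuming glibc (mallopt() and M_MMAP_THRESHOLD aren’t portable); the wrapper name set_mmap_threshold() is made up for this example.

```c
#include <limits.h>
#include <malloc.h>   /* mallopt(), M_MMAP_THRESHOLD (glibc-specific) */
#include <stddef.h>

/*
 * Hypothetical helper: ask glibc to serve requests smaller than `bytes`
 * from the brk() heap and requests of at least `bytes` via mmap().
 * Returns 1 on success and 0 on error, mirroring mallopt() itself.
 */
int set_mmap_threshold(size_t bytes)
{
    if (bytes > INT_MAX)
        return 0;       /* mallopt() takes an int value */
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

To keep a 16MB request on the brk() path you would call something like set_mmap_threshold(32 * 1024 * 1024) before the first large malloc(). According to mallopt(3), glibc refuses values above an internal maximum (32MB on 64-bit systems), so check the return value.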
So your 16MB chunk would almost certainly come through mmap(). However, this syscall has a rather non-trivial kernel implementation. brk() is much simpler; in fact, the kernel says it’s “a simplified do_mmap() which only handles anonymous maps”. So, to keep things simple again, let’s assume you’ve set M_MMAP_THRESHOLD so that a 16MB allocation fits in. Then the real work would happen in the do_brk() kernel function, and a few relevant excerpts are shown below:
    struct mm_struct *mm = current->mm;
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;
    if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
        return -ENOMEM;
mm points to a so-called memory descriptor of the current process. Before allowing the memory request to proceed, the kernel does a few checks. First, it ensures the process doesn’t have more memory map areas than the vm.max_map_count sysctl setting permits. The kernel default is 65530, and you never have to tweak this knob under normal circumstances. If you wonder how many maps a typical process has, check /proc/$PID/maps or use the pmap command. Second, the kernel validates that it has enough free memory available. This is a bit trickier than it sounds, and you’ll see the details in a second.
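If you’d rather count the maps programmatically, the sketch below relies on the fact that /proc/$PID/maps prints one mapping per line. The function name count_maps() is an invention for this example, not a library call.

```c
#include <stdio.h>

/*
 * Count the mappings listed in a maps file such as /proc/self/maps.
 * Each mapping occupies exactly one line, so counting newlines is enough.
 * Returns -1 if the file can't be opened (for instance, /proc isn't mounted).
 */
long count_maps(const char *path)
{
    FILE *f = fopen(path, "r");
    long lines = 0;
    int c;

    if (!f)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}
```

Calling count_maps("/proc/self/maps") and comparing the result with the output of sysctl vm.max_map_count shows how far a process is from the limit. For ordinary programs the gap should be several orders of magnitude.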
Then, the kernel “allocates” you some memory. If you wonder why this word comes quoted, look at the code:
    struct vm_area_struct *vma, *prev;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma_link(mm, vma, prev, rb_link, rb_parent);
    mm->total_vm += len >> PAGE_SHIFT;
    mm->data_vm += len >> PAGE_SHIFT;
When it comes to memory allocations, Linux is a lazy beast. It doesn’t actually assign you any resources, but merely remembers that it’s okay if you touch bytes between addr and addr + len. It also accounts for your request in the total virtual memory size and data segment size for the process. Top, htop and ps can display these counters for you to inspect. So if you see a large value in VIRT or DATA columns in htop, it doesn’t mean that the process really consumes so much memory; it only says the program has requested it. This brings us to another interesting topic…
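You can watch this laziness from inside a process. The sketch below reads the VmSize and VmRSS fields that /proc/self/status exposes (they roughly correspond to what htop shows as VIRT and RES); both function names are hypothetical helpers written for this article.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find a "Key:   12345 kB" line in /proc/<pid>/status-style text and
 * return the number, or -1 if the key is missing or has no numeric value.
 */
long status_field_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;

            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;     /* step past the newline */
    }
    return -1;
}

/* Read one field for the current process; -1 if /proc is unavailable. */
long self_status_kb(const char *key)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return status_field_kb(buf, key);
}
```

Print self_status_kb("VmSize") and self_status_kb("VmRSS") before a 16MB malloc(), right after it, and once more after a memset() over the buffer. You should see VmSize jump straight after the allocation, while VmRSS catches up only when the memory is actually touched.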
If a program is unlikely to use all the memory it requested at once, why not let it ask for more memory than what’s available? That’s the basic idea behind memory overcommit. This isn’t the same as swapping, where you trade access time for more memory because infrequently used pages go to disk, not RAM. Overcommitting is a bit like banking: your total debt can be more than you have at any given moment, but the hope is your creditors won’t come after you all at the same time. The kernel implements overcommit checks in the __vm_enough_memory() function. The double underscore denotes that it’s private, with security_vm_enough_memory_mm() acting as a public facade. The latter summons Linux Security Modules (LSM) to decide if a particular allocation should be treated as a superuser one.
How Linux decides if it has enough memory depends on the vm.overcommit_memory sysctl setting. The default value is 0. It’s not “disable” as you might think but “guess”, which is somewhat misleading. In this mode, the kernel estimates how many pages it could free if urged to. This includes pages that are free now, the page cache which is shrunk first under pressure, and kernel caches explicitly marked as reclaimable. Shared memory and reserved pages are excluded, and for non-superuser requests, vm.admin_reserve_kbytes of memory (typically 8MB) are also set aside. This is to leave root some breathing room even if the system is low on memory. Otherwise, it would be hard for the admin to spawn a shell and fix it. If the resulting amount is more than requested, Linux thinks it would be able to fulfil the request when the time comes.
Setting vm.overcommit_memory = 2 (“never”) disables overcommit. In this case, the kernel evaluates the request against a static limit. You set one either via vm.overcommit_kbytes as an absolute value, or with vm.overcommit_ratio, as a percentage of the total RAM. The default is 50 per cent. Swap space is also included with no restrictions. The kernel honours vm.admin_reserve_kbytes as well, but also reserves three per cent of the process’s total virtual memory or vm.user_reserve_kbytes (128MB), whichever is smaller, for ordinary users.
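The static limit is easy to work out by hand. The following sketch mirrors the rule described above; it ignores huge pages and the admin and user reserves for simplicity, and the function name commit_limit_kb() is made up for illustration.

```c
/*
 * Approximate commit limit, in kB, under vm.overcommit_memory = 2.
 * A non-zero overcommit_kbytes replaces the ratio-based share of RAM;
 * swap is always added in full. Huge pages and the reserves are
 * deliberately left out of this sketch.
 */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int overcommit_ratio,
                                   unsigned long long overcommit_kbytes)
{
    unsigned long long allowed;

    if (overcommit_kbytes)
        allowed = overcommit_kbytes;
    else
        allowed = ram_kb * overcommit_ratio / 100;

    return allowed + swap_kb;
}
```

For a box with 8GB of RAM, 2GB of swap and the default ratio of 50, this gives 6GB. On a live system, the CommitLimit line in /proc/meminfo reports the kernel’s own figure, and Committed_AS shows how much of it is already promised.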
The final option, vm.overcommit_memory = 1, is the simplest one. It means “always”, and in this case, the request is always granted. This sounds dangerous, but it’s helpful if your program expects large parts of memory to be zero and untouched. This could be the case for some number crunchers, such as code working with sparse matrices.
Choosing an overcommit strategy is a balancing job. With “guess”, memory allocations rarely fail but processes get killed when you least expect it. With “never”, OOM rarely occurs and the system behaves more predictably, but all programs must prepare for their malloc()s to fail and handle it gracefully.
Getting you some pages
So what happens if you try to dereference a pointer you’ve just got from malloc()? In the described scenario, the kernel hasn’t mapped memory to the process address space yet, so the CPU would trigger a page fault. This is a hardware exception, much like division by zero, which indicates that you’ve touched memory you don’t have access to. Then, the page fault handler in the kernel comes into play.
The page fault handler is a complex story which begins with the do_page_fault() function. Linux needs to check if the process really stepped into a forbidden or non-allocated area. The VMA structure you saw above is the essential part of this check. If that’s the case, the process receives the infamous SIGSEGV. If not, the page is either in swap or was allocated but not mapped yet. In other words, it’s time for the kernel to fulfil its promise and get it to you.
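These faults are cheap to observe from userspace. The sketch below maps some anonymous memory, touches every page and asks getrusage(2) how many minor faults the process took in between. The function name faults_while_touching() is an assumption of this example.

```c
#define _DEFAULT_SOURCE         /* MAP_ANONYMOUS under strict -std= flags */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory, write one byte to every page and
 * return how many minor page faults the process took while doing so.
 * Returns -1 if the mapping or the rusage query fails.
 */
long faults_while_touching(size_t bytes)
{
    struct rusage before, after;
    long page = sysconf(_SC_PAGESIZE);
    size_t off;
    char *buf;

    if (page <= 0)
        return -1;
    buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap(buf, bytes);
        return -1;
    }

    for (off = 0; off < bytes; off += (size_t)page)
        buf[off] = 1;           /* first write to a page makes the kernel map it */

    if (getrusage(RUSAGE_SELF, &after) != 0) {
        munmap(buf, bytes);
        return -1;
    }
    munmap(buf, bytes);
    return after.ru_minflt - before.ru_minflt;
}
```

With 4K pages you might expect about one fault per page touched, but don’t rely on an exact number: transparent huge pages, for example, can satisfy many small pages with a single fault.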
The kernel memory allocator is yet another complex topic. At the lowest level, it implements the algorithm known as the “buddy system”, which is exposed via alloc_pages() and friends. In our case, it’s reached via alloc_page_vma(), which does what it says: allocates a page for a VMA. At the end of the day, all routes lead to __alloc_pages_nodemask(), which is “the heart of the zoned buddy allocator”, as the kernel says in comments.
This function tries to fulfil the request from a free list first, much like malloc() does in userspace. Free lists track pages that were released recently, and it’s cheapest to get a suitable chunk there if possible. If not, the allocator runs its costly algorithm, which is irrelevant for now. But it could also fail if memory is exhausted. If so, the allocator tries to reclaim some pages. This is the third complex topic you’ve encountered in this article, so it’s enough to say it’s where Linux swaps out unused pages to disk.
Imagine this didn’t help much either, yet the allocation isn’t permitted to fail. Linux has tried everything already, so it unwraps the doomsday device, and calls the out_of_memory() function.
The OOM killer
In fact, there are several ways Linux can get there. Besides failed page allocations, the OOM killer is invoked when processes in a memory cgroup (see LXF236) hit the limit. Or you can trigger it manually with Alt+SysRq+F: this is how I created the screenshot shown above.
Either way, Linux needs to choose what to kill. Things are simpler if the vm.oom_kill_allocating_task sysctl is set: if so, the culprit is also the victim. This won’t work, though, if the allocating process is either init or a kernel thread, belongs to another cgroup, doesn’t have memory on the required node or is exempted from OOM (more on this in a second). So the kernel needs some heuristics to select “a bad process”.
In short, you want to kill the least important process that takes up a lot of memory. Linux translates this into a numerical value dubbed an “OOM score”. The process with maximum score wins (if you consider this a win, of course) and gets sacrificed. The baseline is the number of pages a process has in its RSS and swap, together with how big its hardware page tables are. Root processes have their score reduced by 3 per cent so they’re less likely to be killed.
It’s important that you can adjust the OOM score manually via /proc/$PID/oom_score_adj. This file stores a value between -1,000 and 1,000, and each point counts as if the process had allocated 0.1 per cent of the total pages available. Setting it negative makes the process less likely to be killed, and -1,000 prevents Linux from choosing the process as a victim altogether. This is how you help essential system processes escape the OOM killer’s hug of death.
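To get a feel for the numbers, here is a simplified sketch of the scoring rules as described above. It isn’t the kernel’s code: the function name and parameters are invented for this example, and clamping the result to 1 for eligible processes is an assumption that keeps them distinct from exempt ones.

```c
#define OOM_SCORE_ADJ_MIN (-1000)

/*
 * Rough OOM score, in pages. All sizes are page counts; total_pages is
 * the amount of memory the score is measured against.
 */
long oom_badness_sketch(long rss_pages, long swap_pages, long pgtable_pages,
                        int is_root, int oom_score_adj, long total_pages)
{
    long points;

    if (oom_score_adj == OOM_SCORE_ADJ_MIN)
        return 0;                       /* never selected as a victim */

    points = rss_pages + swap_pages + pgtable_pages;
    if (is_root)
        points -= points * 3 / 100;     /* 3 per cent discount for root */

    /* each adjustment point is worth 0.1 per cent of all pages */
    points += (long)oom_score_adj * (total_pages / 1000);

    return points > 0 ? points : 1;     /* assumption: eligible, but last in line */
}
```

On a machine with a million pages, a process holding 100,000 of them scores 100,000; setting its oom_score_adj to 100 doubles that, while -500 pushes it to the very bottom of the list. The kernel’s own current value for a process is available in /proc/$PID/oom_score.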
Memory management in Linux isn’t exactly a lightweight read, but with some understanding, it’s certainly possible to make it work the way you want it to. And no one wants another early morning wake-up call on exactly the same problematic topic, do they?
Pmap reveals which regions a process maps, along with their sizes and permission bits. [anon] is where malloc() goes.
Htop is where process-related data meets a neat user interface. Did you notice the plasma-shell spans 8GB, but really uses only 217MB?
With Elixir, you can explore Linux kernel sources from within the comfort of your favourite web browser. Great for quick overviews.
The OOM killer has just decided one doesn’t really need Plasma Shell to run a Kubuntu desktop. These things happen.