Vir­tual mem­ory: stay­ing on un­der pres­sure

Mem­ory man­age­ment in Linux isn’t ex­actly the stuff of black magic, and it’s now time to get it work­ing the way you want.

Linux Format - Tutorials

Think back to when you were made responsible for a web server. One day (actually, night) you get a wake-up call saying your server is dropping connections. You log in via SSH and quickly find the reason: there’s no web server running on the box. Still, you look in the Apache/nginx/whatever else logs and see no fatal errors. The process has just disappeared. Something strange is going on here.

If this ever hap­pens to you, look in dmesg. Chances are you’ll see the in­fa­mous OOM killer mes­sage. This TLA stands for Out Of Mem­ory, and this is the last re­sort mech­a­nism the ker­nel em­ploys to grab some mem­ory when it ab­so­lutely needs to. Most likely, your web server wasn’t the cul­prit, but the vic­tim. Why did Linux choose to sac­ri­fice it, and what can you do to avoid yet another wake-up call? In this Ad­min­is­te­ria, you’ll dig into how Linux man­ages mem­ory, and which knobs are avail­able for you to tweak this process.

It started in userspace

The bulk of memory management occurs in the kernel, which isn’t the simplest thing in the world. And complex things are best learnt by example, they say. So, consider a process that allocates 16MB worth of memory:

    char *buf = (char *)malloc(16 * 1024 * 1024);
    if (!buf) {
        /* Unlikely to happen */
    }

Think of a relevant construction from your favourite language if you don’t like C. Either way, the process may not have that much memory at hand, so it needs to go to the kernel and ask for some pages (4K blocks, to keep things simple). There are two main syscalls available for this purpose: brk() and mmap(). Which one is chosen depends on the allocator and how much memory it wants to allocate. brk() grows the process heap, which is contiguous: you can’t release the first 99MB of a 100MB chunk when you don’t need it any more. mmap() plays such tricks easily, so it’s the preferred way for large memory allocations. In glibc, you convey the exact meaning of “large” via mallopt(3), and by default it’s 128K.
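To see the brk()/mmap() split from userspace, here is a minimal sketch that assumes glibc on Linux. The helper names break_growth_for_malloc() and set_mmap_threshold() are made up for this illustration; sbrk(0) and mallopt(M_MMAP_THRESHOLD, ...) are the real glibc calls they wrap.

```c
#include <malloc.h>   /* mallopt(), M_MMAP_THRESHOLD (glibc-specific) */
#include <stdlib.h>   /* malloc() */
#include <unistd.h>   /* sbrk() */

/*
 * Hypothetical helper: allocate `bytes` with malloc() and report how far
 * the program break moved. A result close to `bytes` suggests the heap
 * grew via brk(); a small result suggests the block came from elsewhere,
 * typically mmap(). The block is deliberately not freed, so that free()
 * cannot move the break back before we measure it.
 * Returns -1 if malloc() fails.
 */
long break_growth_for_malloc(size_t bytes)
{
    char *before = sbrk(0);        /* current end of the heap */
    void *block = malloc(bytes);
    char *after = sbrk(0);

    if (block == NULL)
        return -1;
    return (long)(after - before);
}

/*
 * Hypothetical helper: ask glibc to serve requests below `bytes` from the
 * heap rather than from mmap(). mallopt() returns 1 on success, 0 on error.
 */
int set_mmap_threshold(size_t bytes)
{
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

On a typical 64-bit glibc system, you would expect break_growth_for_malloc(16 * 1024 * 1024) to report little or no growth with the default settings, and roughly 16MB once set_mmap_threshold() has been called with a larger value such as 17MB. Other allocators may behave differently.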

So your 16MB chunk would almost certainly come through mmap(). However, this syscall has a rather non-trivial kernel implementation. brk() is much simpler; in fact, the kernel says it’s “a simplified do_mmap() which only handles anonymous maps”. So, to keep things simple again, let’s assume you’ve set M_MMAP_THRESHOLD so that a 16MB allocation fits in. Then the real work would happen in the do_brk() kernel function, and a few relevant excerpts are shown below:

    struct mm_struct *mm = current->mm;

    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
        return -ENOMEM;

mm points to a so-called memory descriptor of the current process. Before allowing the memory request to proceed, the kernel does a few checks. First, it ensures the process doesn’t have more memory map areas than the vm.max_map_count sysctl setting permits. The default value is 65530 and you shouldn’t need to tweak this knob under normal circumstances. If you wonder how many maps a typical process has, check /proc/$PID/maps or use the pmap command. Second, the kernel validates that it has enough free memory available. This is a bit trickier than it sounds, and you’ll see the details in a second.
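If you’d rather count mappings from code than eyeball pmap output, the following sketch compares a process’s map count with the limit. It assumes procfs is mounted at /proc; count_mappings() and read_max_map_count() are hypothetical helper names, not kernel or libc APIs.

```c
#include <stdio.h>

/*
 * Count the lines in a maps file such as /proc/self/maps.
 * Each line describes one mapping. Returns -1 if the file can't be opened.
 */
long count_mappings(const char *maps_path)
{
    FILE *f = fopen(maps_path, "r");
    long count = 0;
    int c;

    if (f == NULL)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            count++;
    fclose(f);
    return count;
}

/*
 * Read the vm.max_map_count limit through procfs.
 * Returns -1 if the file is missing or unreadable.
 */
long read_max_map_count(void)
{
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    long limit = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &limit) != 1)
        limit = -1;
    fclose(f);
    return limit;
}
```

Calling count_mappings("/proc/self/maps") for an ordinary program should give a number far below the limit, which is why the knob rarely needs attention.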

Then, the ker­nel “al­lo­cates” you some mem­ory. If you won­der why this word comes quoted, look at the code:

    struct vm_area_struct *vma, *prev;

    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma_link(mm, vma, prev, rb_link, rb_parent);

    mm->total_vm += len >> PAGE_SHIFT;
    mm->data_vm += len >> PAGE_SHIFT;

When it comes to mem­ory al­lo­ca­tions, Linux is a lazy beast. It doesn’t ac­tu­ally as­sign you any re­sources, but merely re­mem­bers that it’s okay if you touch bytes be­tween addr and addr + len. It also ac­counts for your re­quest in the to­tal vir­tual mem­ory size and data seg­ment size for the process. Top, htop and ps can dis­play these coun­ters for you to in­spect. So if you see a large value in VIRT or DATA col­umns in htop, it doesn’t mean that the process re­ally con­sumes so much mem­ory; it only says the pro­gram has re­quested it. This brings us to another in­ter­est­ing topic…
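You can watch this laziness from userspace. The sketch below is an illustration under assumptions: it reads the VmRSS line from /proc/self/status, and the helpers resident_kb() and measure_lazy_allocation() are invented for this example. The idea is that resident memory barely moves after malloc() and only grows once every page has been written to.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Resident set size of this process in kB (VmRSS), or -1 if unavailable. */
long resident_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        if (sscanf(line, "VmRSS: %ld", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

/*
 * Allocate `bytes`, then write one byte to every page.
 * *after_alloc receives the RSS (kB) right after malloc(),
 * *after_touch the RSS once each page has been written to.
 * Returns 0 on success, -1 on failure. The buffer is freed before returning.
 */
int measure_lazy_allocation(size_t bytes, long *after_alloc, long *after_touch)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *buf = malloc(bytes);   /* volatile keeps the writes */
    size_t off;

    if (buf == NULL || page <= 0)
        return -1;
    *after_alloc = resident_kb();
    for (off = 0; off < bytes; off += (size_t)page)
        buf[off] = 1;   /* the first write makes the kernel back the page */
    *after_touch = resident_kb();
    free((void *)buf);
    return (*after_alloc < 0 || *after_touch < 0) ? -1 : 0;
}
```

For a 16MB request, the difference between the two readings should be close to 16MB, although the kernel’s RSS counters are approximate, so don’t expect an exact match.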

Mem­ory over­com­mit

If a program is unlikely to use all the memory it requested at once, why not let it ask for more memory than what’s available? That’s the basic idea behind memory overcommit. This isn’t the same as swapping, where you trade access time for more memory because infrequently used pages go to disk rather than staying in RAM. Overcommitting is a bit like banking: your total debt can be more than you have at any given moment, but the hope is your creditors won’t come after you all at the same time. The kernel implements overcommit checks in the __vm_enough_memory() function. The double underscore denotes that it’s private, with security_vm_enough_memory_mm() acting as a public facade. The latter summons Linux Security Modules (LSM) to decide if a particular allocation should be treated as a superuser one.

How Linux de­cides if it has enough mem­ory de­pends on the vm.over­com­mit_mem­ory sysctl set­ting. The de­fault value is 0. It’s not “dis­able” as you might think but “guess”, which is some­what mis­lead­ing. In this mode, the ker­nel es­ti­mates how many pages it could free if urged to. This in­cludes pages that are free now, the page cache which is shrunk first un­der pres­sure, and ker­nel caches ex­plic­itly marked as re­claimable. Shared mem­ory and re­served pages are ex­cluded, and for non-su­pe­ruser re­quests, vm.ad­min_re­serve_k­bytes of mem­ory (typ­i­cally 8MB) are also set aside. This is to leave root some breath­ing room even if the sys­tem is low on mem­ory. Other­wise, it would be hard for the admin to spawn a shell and fix it. If the re­sult­ing amount is more than re­quested, Linux thinks it would be able to ful­fil the re­quest when the time comes.

Setting vm.overcommit_memory = 2 (“never”) disables overcommit. In this case, the kernel evaluates the request against a static limit. You set one either via vm.overcommit_kbytes as an absolute value, or with vm.overcommit_ratio, as a percentage of the total RAM. The default is 50 per cent. Swap space is also included with no restrictions. The kernel honours vm.admin_reserve_kbytes as well, but also reserves three per cent of the process’s total virtual memory or vm.user_reserve_kbytes (128MB), whichever is smaller, for ordinary users.
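The arithmetic behind the “never” mode is easy to sketch. The functions below are an illustration of the rules as described here, not kernel code: commit_limit_kb() and user_reserve_kb() are hypothetical names, huge pages are ignored, and the three per cent figure is the rounded one quoted above, so the kernel’s own rounding may differ slightly.

```c
/*
 * Static limit for "never" mode, in kB: a share of RAM plus all of swap.
 * A non-zero overcommit_kbytes takes the place of the percentage, since
 * the two sysctls are alternatives to each other.
 */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int overcommit_ratio,
                                   unsigned long long overcommit_kbytes)
{
    unsigned long long allowed = overcommit_kbytes
        ? overcommit_kbytes
        : ram_kb * overcommit_ratio / 100;

    return allowed + swap_kb;
}

/*
 * Reserve kept back from ordinary users, in kB: three per cent of the
 * process's virtual size or vm.user_reserve_kbytes, whichever is smaller.
 */
unsigned long long user_reserve_kb(unsigned long long process_vm_kb,
                                   unsigned long long user_reserve_kbytes)
{
    unsigned long long three_per_cent = process_vm_kb * 3 / 100;

    return three_per_cent < user_reserve_kbytes
        ? three_per_cent
        : user_reserve_kbytes;
}
```

For example, a box with 8GB of RAM, 2GB of swap and the default ratio of 50 gets 4GB + 2GB = 6GB of committable memory. You can compare your own figures with the CommitLimit and Committed_AS lines in /proc/meminfo.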

The final option, vm.overcommit_memory = 1, is the simplest one. It means “always”, and in this case, the request is always granted. This sounds dangerous, but it’s helpful if your program expects large parts of memory to be zero and untouched. This could be the case for some number crunchers, such as code working with sparse matrices.

Choosing an overcommit strategy is a balancing act. With “guess”, memory allocations rarely fail but processes get killed when you least expect it. With “never”, OOM rarely occurs and the system behaves more predictably, but all programs must be prepared for their malloc() calls to fail and handle it gracefully.

Get­ting you some pages

So what would happen if you try to dereference a pointer you’ve just got from malloc()? In the described scenario, the kernel hasn’t mapped memory to the process address space yet, so the CPU would trigger a page fault. This is a hardware exception, much like division by zero, which indicates that you’ve touched memory you don’t have access to. Then, the page fault handler in the kernel comes into play.

The page fault handler is a complex story which begins with the do_page_fault() function. Linux needs to check if the process really stepped into a forbidden or non-allocated area, and the VMA structure you saw above is the essential part of this check. If that’s the case, the process receives the infamous SIGSEGV. If not, the page is either in swap or was allocated but not mapped yet. In other words, it’s time for the kernel to fulfil its promise and get it to you.
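Page faults of the harmless kind are easy to count from userspace. Here is a minimal sketch that relies on getrusage(2) and its ru_minflt counter; the helper names minor_faults() and faults_for_first_touch() are invented for this example.

```c
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far, or -1 on error. */
long minor_faults(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_minflt;
}

/*
 * Allocate `bytes`, write to each page once and return how many minor
 * faults that took. Returns -1 on failure.
 */
long faults_for_first_touch(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *buf = malloc(bytes);   /* volatile keeps the writes */
    long before, after;
    size_t off;

    if (buf == NULL || page <= 0)
        return -1;
    before = minor_faults();
    for (off = 0; off < bytes; off += (size_t)page)
        buf[off] = 1;   /* each first touch lands in the fault handler */
    after = minor_faults();
    free((void *)buf);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

With 4K pages, a fresh 16MB buffer would take about one fault per page. The count can be much lower if the kernel backs the region with transparent huge pages.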

The kernel memory allocator is yet another complex topic. At the lowest level, it implements the algorithm known as the “buddy system”, which is exposed via alloc_pages() and friends. In our case, it’s reached via alloc_page_vma(), which does what it says: allocates a page for a VMA. At the end of the day, all routes lead to __alloc_pages_nodemask(), which is “the heart of the zoned buddy allocator”, as the kernel says in comments.
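The “buddy” in the name comes from a neat bit of arithmetic: blocks are always 2^order pages long, and the two halves of a larger block differ in exactly one bit of their page frame number. The sketch below shows the generic technique only, not the kernel’s implementation; the function names are invented for the illustration.

```c
/*
 * Smallest order whose block of 2^order pages can hold `pages` pages.
 * The loop stops at the widest shift an unsigned long allows.
 */
unsigned int order_for_pages(unsigned long pages)
{
    unsigned int order = 0;

    while (order < 8 * sizeof(unsigned long) - 1 && (1UL << order) < pages)
        order++;
    return order;
}

/*
 * Page frame number of the buddy of the block that starts at `pfn`.
 * Flipping bit `order` moves between the two halves of the parent block.
 */
unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/*
 * Start of the combined block once a block and its buddy are both free
 * and get merged into one block of order + 1.
 */
unsigned long merged_start(unsigned long pfn, unsigned int order)
{
    return pfn & ~(1UL << order);
}
```

A request for three pages is rounded up to an order-2 block of four. The order-2 block at page frame 8 has its buddy at 12, and if both are free they merge into an order-3 block starting at 8.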

This function tries to fulfil the request from a free list first, much like malloc() does in userspace. Free lists track pages that were released recently, and it’s cheapest to get a suitable chunk there if possible. If not, the allocator runs its costly algorithm, which is irrelevant for now. But it could also fail if memory is exhausted. If so, the allocator tries to reclaim some pages. This is the third complex topic that you’ve encountered in this article, so it’s enough to say that this is where Linux swaps out unused pages to disk.

Imag­ine this didn’t help much ei­ther, yet the al­lo­ca­tion isn’t per­mit­ted to fail. Linux has tried every­thing al­ready, so it un­wraps the dooms­day de­vice, and calls the out­_of_mem­ory() func­tion.

The OOM killer

In fact, there are several ways Linux can get there. Besides failed page allocations, the OOM killer is invoked when processes in a memory cgroup (see LXF236) hit the limit. Or you can trigger it manually with Alt+SysRq+F: this is how I created the OOM killer screenshot for this article.

Either way, Linux needs to choose what to kill. Things are simpler if the vm.oom_kill_allocating_task sysctl is set: if so, the culprit is also the victim. This won’t work out though if the allocating process is either init or a kernel thread, belongs to another cgroup, doesn’t have memory on the required node or is exempted from OOM (more on this in a second). So the kernel needs some heuristics to select “a bad process”.

In short, you want to kill the least im­por­tant process that takes up a lot of mem­ory. Linux trans­lates this into a nu­mer­i­cal value dubbed an “OOM score”. The process with max­i­mum score wins (if you con­sider this a win, of course) and gets sac­ri­ficed. The base­line is the num­ber of pages a process has in its RSS and swap, to­gether with how big its hard­ware page ta­bles are. Root pro­cesses have their score re­duced by 3 per cent so they’re less likely to be killed.

It’s im­por­tant that you can ad­just the OOM score man­u­ally via /proc/$PID/oom_s­core_adj . This file stores a value be­tween -1,000 and 1,000, and each one point here counts as if a process had al­lo­cated 0.1 per cent of the to­tal pages avail­able. Set­ting it neg­a­tive makes the process less likely to be killed, and -1,000 pre­vents Linux from choos­ing the process as a vic­tim al­to­gether. This is how you make es­sen­tial sys­tem pro­cesses slip off the OOM killer’s hugs of death.
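To make the scoring tangible, here is a rough sketch of the arithmetic as this article describes it, plus a helper for adjusting the knob. Treat it as an illustration rather than the kernel’s code: oom_points(), oom_exempt() and set_oom_score_adj() are hypothetical names, and the details of the real heuristic vary between kernel versions. Only the /proc/<pid>/oom_score_adj file itself is the real interface.

```c
#include <stdio.h>

/* A process with oom_score_adj of -1000 is never picked as a victim. */
int oom_exempt(int oom_score_adj)
{
    return oom_score_adj == -1000;
}

/*
 * Score sketch: RSS, swap and page-table pages form the baseline, root
 * processes get 3 per cent off, and each oom_score_adj point is worth
 * 0.1 per cent of all pages in the system.
 */
long oom_points(long rss_pages, long swap_pages, long pgtable_pages,
                int is_root, int oom_score_adj, long total_pages)
{
    long points = rss_pages + swap_pages + pgtable_pages;

    if (is_root)
        points -= points * 3 / 100;
    points += (long)oom_score_adj * (total_pages / 1000);
    return points;
}

/*
 * Write `adj` to /proc/<pid>/oom_score_adj.
 * Returns 0 on success, -1 for an out-of-range value or an I/O error.
 * Lowering the value typically requires root privileges.
 */
int set_oom_score_adj(long pid, int adj)
{
    char path[64];
    FILE *f;
    int ok;

    if (adj < -1000 || adj > 1000)
        return -1;
    snprintf(path, sizeof path, "/proc/%ld/oom_score_adj", pid);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", adj) > 0;
    if (fclose(f) != 0)   /* the write may only fail when flushed */
        ok = 0;
    return ok ? 0 : -1;
}
```

On a machine with a million pages, a process holding 100,000 of them and an adjustment of 500 ends up at 600,000 points in this model, while an adjustment of -100 cancels its footprint entirely.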

Mem­ory man­age­ment in Linux isn’t ex­actly a light­weight read, but with some un­der­stand­ing, it’s cer­tainly pos­si­ble to make it work the way you want it to. And no one wants another early morn­ing wake-up call on ex­actly the same prob­lem­atic topic, do they?

Pmap re­veals which re­gions a process maps, along with their sizes and per­mis­sion bits. [anon] is where mal­loc() goes.

Htop is where process-re­lated data meets neat user in­ter­face. Did you no­tice the plasma-shell spans 8GB, but re­ally uses only 217MB?

With Elixir, you can ex­plore Linux ker­nel sources from within the com­fort of your favourite web browser. Great for quick over­views.

OOM killer has just de­cided one doesn’t re­ally need Plasma Shell to run Kubuntu desktop. These things hap­pen.
