Kernel Column

Linux User & Developer

Jon Masters on the latest updates

The legendary Jon Masters’ final column, where he summarises the latest happenings in the Linux kernel community

Linus Torvalds announced the release of Linux 4.19-rc1 (Release Candidate 1), and with it the closure of the 4.19 merge window. The merge window is the period of time (almost always two weeks long) during which disruptive changes and new features are allowed into the kernel. This is followed by a number of weekly RCs and a final release about two months later. The latest merge window was “fairly frustrating” to Linus because “4.19 looks [to him] to be a pretty big release”, and it came right as a new hardware security vulnerability also needed to be patched (more on that below). Linus also drew attention to fixes for a nasty TLB shootdown bug (also covered below) that consumed his time as he was handling other merge window activity. He added that, “while this isn’t the biggest release we’ve had (4.9 still keeps that crown), this does join 4.12 and 4.15 as one of the bigger kernel releases”. He later announced a follow-up RC2, which contained mostly networking and GPU driver fixes.

L1 terminal fault explained

Another month, another vulnerability disclosure impacting the ‘speculative’ design of modern microprocessors. As we’ve mentioned in the past, contemporary high-performance processors (such as those in your laptop, or in a server) try to predict future work that they will need to perform before they know which direction a program will take in its operation.

This is known as speculation, and it’s supposed to behave as a black box: either the work done ends up being needed, and becomes part of the ‘architectural state’ of the machine, or it is thrown away and its results are not supposed to be visible to programmers or users. Since this is the intent, various additional optimisations are common in which the processor will also speculate that certain other things are true, such as the result of a permission check (as in Meltdown).

As Meltdown demonstrated, if the speculative activity can be (indirectly) observed, certain other performance optimisations relying upon speculation become security issues, potentially allowing unprivileged users to gain access to sensitive information. The latest problem concerns a design optimisation Intel made to its handling of page tables. These are used by a part of the processor known as the Memory Management Unit (MMU) to handle the mapping of programmer-visible virtual addresses into underlying hardware physical addresses. The abstraction of virtual memory is what allows every program to think it has unlimited amounts of RAM, with the kernel being able to swap memory in and out from disk whenever the amount of free RAM is low. The kernel marks memory pages that are swapped out as invalid (‘not present’).
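As a rough illustration of what the MMU does on every memory access, a toy two-level page-table walk might look like the C below. The 32-bit layout, bit positions and the `walk` helper are invented for the sketch; this is not kernel code, and real hardware does all of this internally:

```c
#include <stdint.h>

#define PAGE_SHIFT   12
#define PAGE_MASK    ((1u << PAGE_SHIFT) - 1)
#define PTE_PRESENT  0x1u   /* 'not present' pages have this bit clear */

typedef uint32_t pte_t;

/* Walk: top-level table -> second-level table -> physical frame.
 * Returns 0 on a fault (missing table or 'not present' entry, e.g.
 * a page that has been swapped out), 1 on a successful translation. */
static int walk(pte_t **top, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t idx1 = vaddr >> 22;                       /* top 10 bits  */
    uint32_t idx2 = (vaddr >> PAGE_SHIFT) & 0x3ffu;    /* next 10 bits */
    uint32_t off  = vaddr & PAGE_MASK;                 /* page offset  */

    pte_t *second = top[idx1];
    if (!second)
        return 0;                       /* fault: no second-level table */
    pte_t pte = second[idx2];
    if (!(pte & PTE_PRESENT))
        return 0;                       /* fault: page swapped out */
    *paddr = (pte & ~PAGE_MASK) | off;  /* frame bits plus offset */
    return 1;
}
```

The leaf entry (the PTE) is where the validity check happens, which is exactly the step Intel’s processors speculate past.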

Every running task or process has its own set of page tables, as does the kernel itself. Whenever virtualisation is in use, the hypervisor (KVM, say) also uses its own page tables to translate what the guest virtual machine sees as its memory into real underlying memory in the host. Thus, two separate sets of translations may be required. Page table walks, as they are known, can be slow, so the processor uses various optimisations to speed them up, including a small internal cache of recent translations known as a TLB (Translation Lookaside Buffer). Even then, walking through page tables is so slow that the processor might speculate about the result ahead of time.

In Intel’s case, it speculates that the page table entry (the leaf node in a page table walk) is valid before completing the validity check. This allows carefully crafted applications to cause the creation of malicious ‘not present’ page table entries (PTEs) that look to Linux just like swapped-out memory. Various forms of attack exist, with the most dangerous relying upon a second bug that affects virtualisation. When the hardware sees a ‘not present’ page table entry, it will skip the second page-table walk in the hypervisor and treat the partially translated address as being a hypervisor physical memory address, allowing a specially crafted malicious VM to read hypervisor memory.
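The kernel-side fix for this is reported in sysfs as ‘PTE Inversion’: a not-present PTE still carries address bits that the processor may speculatively use, so Linux flips those bits to make them point far outside populated RAM. A minimal sketch of the idea, with illustrative constants and helper names rather than the kernel’s actual ones:

```c
#include <stdint.h>

#define PTE_PRESENT   0x1ull
#define PTE_ADDR_MASK 0x000ffffffffff000ull  /* frame bits, 12..51 on x86-64 */

/* Mark a PTE not-present and invert its frame bits, so any speculative
 * use of the stale address bits misses populated (cacheable) memory.
 * Illustrative sketch only, not the kernel's implementation. */
static inline uint64_t pte_mark_not_present(uint64_t pte)
{
    pte &= ~PTE_PRESENT;
    pte ^= PTE_ADDR_MASK;
    return pte;
}

/* Flipping the same bits again recovers the original frame address. */
static inline uint64_t pte_restore_present(uint64_t pte)
{
    pte ^= PTE_ADDR_MASK;
    return pte | PTE_PRESENT;
}
```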

Pulling off an attack is currently difficult (ready exploits are not yet in the wild), and relies upon causing data of interest to also be loaded into the processor’s level 1 data cache (L1D). Patched kernels carefully avoid accidentally creating malicious ‘not present’ pages, and mitigate the hypervisor attack by flushing the L1 data cache whenever they begin to run potentially untrusted virtual machines.

This can be controlled using a new ‘l1tf’ kernel command-line parameter, and the status of any vulnerability and possible active mitigation is visible in /sys/devices/system/cpu/vulnerabilities/l1tf.
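The one-line string in that sysfs file follows the kernel’s usual vulnerability-reporting convention: ‘Not affected’, ‘Mitigation: …’, or a ‘Vulnerable…’ variant. A hypothetical helper for classifying it, should a tool want to check programmatically, might look like:

```c
#include <string.h>

/* Hypothetical classifier for the contents of
 * /sys/devices/system/cpu/vulnerabilities/l1tf (and the other files in
 * that directory, which use the same prefixes). */
enum l1tf_status { L1TF_NOT_AFFECTED, L1TF_MITIGATED, L1TF_VULNERABLE };

static enum l1tf_status classify_l1tf(const char *s)
{
    if (strncmp(s, "Not affected", 12) == 0)
        return L1TF_NOT_AFFECTED;
    if (strncmp(s, "Mitigation:", 11) == 0)
        return L1TF_MITIGATED;
    return L1TF_VULNERABLE;   /* "Vulnerable", possibly with detail */
}
```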

Linus mentioned that a nasty TLB shootdown bug had been found during the 4.19 merge window. As we learned above, TLBs are used to cache the translations between virtual addresses used by applications (or the kernel) and those seen by the physical hardware. Since every running application has its own (virtual) view of memory, these TLB entries need to be maintained such that only those translations that are supposed to be active are seen by an application accessing memory. Otherwise, it would be possible for a memory access by one application to interfere with another, or for it to see a stale translation into memory that has since been recycled for another use.

As a consequence, the kernel routinely performs ‘shootdowns’ of TLB entries as it performs page table maintenance operations. The exact process differs from one architecture to another, but usually involves specially privileged processor instructions, as is the case on x86. Unfortunately, some time ago, an attempt was made to optimise the process of tearing down user process page tables, and the result was that x86 might – in some cases – not do the necessary TLB flush. This resulted in a very hard-to-debug issue that also had security implications.
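Why a skipped flush is dangerous can be modelled with a toy one-entry ‘TLB’: after a mapping is torn down, a stale cached translation keeps resolving until something explicitly flushes it. This model is purely illustrative; real shootdowns use privileged instructions (INVLPG on x86) and cross-CPU interrupts:

```c
#include <stdint.h>

/* A one-entry model of a TLB: one cached virtual-to-physical
 * translation (VPN -> PFN, i.e. virtual page to physical frame). */
struct tlb_entry { uint64_t vpn, pfn; int valid; };

static struct tlb_entry tlb;

static void tlb_fill(uint64_t vpn, uint64_t pfn)
{
    tlb.vpn = vpn; tlb.pfn = pfn; tlb.valid = 1;
}

/* The 'shootdown': invalidate the cached translation. */
static void tlb_flush(void) { tlb.valid = 0; }

/* Returns 1 and the cached frame on a hit, 0 on a miss. A hit after
 * the page tables changed is exactly the stale-translation bug. */
static int tlb_lookup(uint64_t vpn, uint64_t *pfn)
{
    if (tlb.valid && tlb.vpn == vpn) { *pfn = tlb.pfn; return 1; }
    return 0;
}
```

In the buggy x86 path, the step corresponding to `tlb_flush()` was sometimes skipped after the page tables had already been freed.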

It’s interesting to see how the change originally occurred through an attempt to make PowerPC-specific code generic across architectures. Linus was grumpy because the net result had been that “x86 had unintentionally lost the TLB flush we used to have”.

The problem had then existed for a long time because it only affected a rarely used ‘slow path’ – triggering an exploit required that the machine be intentionally run out of memory. The thread is titled “Remove tlb_remove_table() non-concurrent condition”, and any theoretical security impact is also fixed in the updated patches.

Meanwhile, Pu Wen (of Hygon) posted several rounds of updated patches for the Hygon Dhyana family of x86 processors. These are based upon AMD’s Zen microarchitecture as a result of a joint venture between AMD and a Chinese group known as Chengdu Haiguang IC Design Co. – Hygon. Dhyana appears to use AMD’s Zen cores on a custom SoC (system-on-chip), so much so that the only real changes needed to enable support for these chips in Linux are adding some new IDs, and replacing CPUID detection of ‘AuthenticAMD’ with ‘HygonGenuine’.
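CPUID leaf 0 returns that 12-byte vendor string packed into the EBX, EDX and ECX registers, in that order. A small decoder (assuming a little-endian host, as on x86) shows how strings such as ‘AuthenticAMD’ and ‘HygonGenuine’ are reassembled:

```c
#include <stdint.h>
#include <string.h>

/* Decode the CPUID leaf-0 vendor string from the three registers that
 * hold it: EBX (bytes 0-3), EDX (bytes 4-7), ECX (bytes 8-11).
 * Assumes a little-endian host, which is always true on x86. */
static void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx,
                         char out[13])
{
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
}
```

Most of the Dhyana enablement is just teaching the various vendor checks in the kernel to accept the new string alongside AMD’s.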

Atish Patra (of Western Digital) posted various cleanups to the SMP (Symmetric Multi-Processing) supporting code for the RISC-V architecture. These patches focus on one implementation (from SiFive), which is believed to be used in some future Western Digital disk drives, but more generally they are good examples of the rate of progress. Apparently, the work was inspired by ARM’s SMP code.

Quentin Perret (of ARM) posted version 6 of a long-running patch series titled Energy Aware Scheduling. The EAS (Energy Aware Scheduler) attempts to divide a machine into performance domains built from various components, such as processor cores. The energy requirements of these different performance domains vary, as does the amount of computation they can provide. The basic idea of the EAS, then, is to prefer to schedule newly awaking tasks onto energy-efficient cores, and to migrate them onto higher-performance, higher-energy cores when needed. Various logic in the patch set ensures that once a ‘tipping point’ is reached and the machine is sufficiently loaded, it falls back to the traditional scheduling philosophy in use today. EAS is not specific to ARM; in fact, it is hoped it will be supported across many different architectures. Once enabled, /sys/devices/system/cpu gains an ‘energy_model’ directory showing the energy cost and relative performance of the various performance domains.
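The placement idea can be sketched roughly as follows: each performance domain advertises a capacity and a relative energy cost, a waking task goes to the cheapest domain that still has room for it, and when nothing fits the scheduler falls back to its normal behaviour. The numbers and data structure here are invented for illustration and are not the scheduler’s actual model:

```c
#include <stddef.h>

/* Hypothetical performance domain: available capacity, capacity in use,
 * and a relative energy cost per unit of work. */
struct perf_domain {
    unsigned capacity;
    unsigned load;
    unsigned energy;
};

/* Pick the lowest-energy domain the task still fits in, or return -1
 * when the machine is loaded past the 'tipping point' and placement
 * should fall back to the traditional scheduling path. */
static int eas_pick(const struct perf_domain *pd, size_t n,
                    unsigned task_util)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (pd[i].load + task_util > pd[i].capacity)
            continue;                       /* does not fit here */
        if (best < 0 || pd[i].energy < pd[best].energy)
            best = (int)i;
    }
    return best;
}
```

On a big.LITTLE-style system this naturally keeps light tasks on the efficient cores and only spills onto the big, power-hungry cores when demand requires it.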

The basic idea of the Energy Aware Scheduler is to prefer to schedule newly awaking tasks onto energy-efficient cores
