Jon Masters on the latest updates
The legendary Jon Masters’ final column, where he summarises the latest happenings in the Linux kernel community
Linus Torvalds announced the release of Linux 4.19-rc1 (Release Candidate 1), and with it the closure of the 4.19 merge window. The merge window is the period of time (almost always two weeks long) during which disruptive changes and new features are allowed into the kernel. This is followed by a number of weekly RCs and a final release about two months later. The latest merge window was “fairly frustrating” to Linus because “4.19 looks [to him] to be a pretty big release”, and it came right as a new hardware security vulnerability also needed to be patched (more on that below). Linus also drew attention to fixes for a nasty TLB shootdown bug (also covered below) that consumed his time as he was handling other merge window activity. He added that, “while this isn’t the biggest release we’ve had (4.9 still keeps that crown), this does join 4.12 and 4.15 as one of the bigger kernel releases”. He later announced a follow-up RC2, which contained mostly networking and GPU driver fixes.
L1 terminal fault explained
Another month, another vulnerability disclosure impacting the ‘speculative’ design of modern microprocessors. As we’ve mentioned in the past, contemporary high performance processors (such as those in your laptop, or in a server) try to predict future work that they will need to perform before they know which direction a program will take in its operation.
This is known as speculation, and it’s supposed to behave as a black box: either the work done ends up being needed, and becomes part of the ‘architectural state’ of the machine, or it is thrown away and its results are supposed to not be visible to programmers or users. Since this is the intent, various additional optimisations are common in which the processor will also speculate that certain other things are true, such as the result of a permission check (as in Meltdown).
As Meltdown demonstrated, if the speculative activity can be (indirectly) observed, certain other performance optimisations relying upon speculation become security issues, potentially allowing unprivileged users to gain access to sensitive information. The latest problem concerns a design optimisation Intel made to its handling of page tables. These are used by a part of the processor known as the Memory Management Unit (MMU) to handle the mapping of programmer-visible virtual addresses into underlying hardware physical addresses. The abstraction of virtual memory is what allows every program to think it has unlimited amounts of RAM, with the kernel being able to swap in and out memory from disk whenever the amount of free RAM is low. The kernel marks memory pages that are swapped as invalid (‘not present’).
Every running task or process has its own set of page tables, as does the kernel itself. Whenever virtualisation is in use, the Hypervisor (KVM, say) also uses its own page tables to translate what the guest virtual machine sees as its memory into real underlying memory in the host. Thus, two separate sets of translations may be required. Page table walks, as they are known, can be slow, so the processor uses various optimisations to speed them up, including a small internal cache of recent translations known as a TLB (Translation Lookaside Buffer). Even then, walking through page tables is so slow that the processor might speculate about the result ahead of time.
In Intel’s case, it speculates that the page table entry (the leaf node in a page table walk) is valid before completing the validity check. This allows carefully crafted applications to cause the creation of malicious ‘not present’ page table entries (PTEs) that look to Linux just like swapped-out memory. Various forms of attack exist, with the most dangerous relying upon a second bug that affects virtualisation. When the hardware sees a ‘not present’ page table entry, it will skip the second page-table walk in the hypervisor and treat the partially translated address as being a hypervisor physical memory address, allowing a specially crafted malicious VM to read hypervisor memory.
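To make the ‘not present’ case concrete, here is a toy model of a single page table entry. It is a sketch under stated assumptions rather than a description of real hardware: the bit positions (present flag in bit 0, frame address in bits 12 to 51) follow the usual x86-64 4KB page layout, the function names are hypothetical, and the ‘speculative’ lookup simply models the flaw described above, in which the address bits are used before the present bit has been checked.

```python
PRESENT = 1 << 0                        # bit 0: the entry is valid
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF    # bits 12..51: physical frame address


def make_pte(phys_addr, present):
    """Build a toy PTE from a physical address and a present flag."""
    return (phys_addr & ADDR_MASK) | (PRESENT if present else 0)


def architectural_lookup(pte):
    """What software should observe: a 'not present' entry yields no address."""
    if pte & PRESENT:
        return pte & ADDR_MASK
    return None  # stands in for a page fault


def speculative_lookup(pte):
    """Modelled flaw: the address bits are used without checking PRESENT."""
    return pte & ADDR_MASK


if __name__ == "__main__":
    pte = make_pte(0x1234000, present=False)
    print("architectural:", architectural_lookup(pte))
    print("speculative:  ", hex(speculative_lookup(pte)))
```

The point of the model is only that a ‘not present’ entry still carries address bits, which is why the contents of such entries matter on affected processors.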
Pulling off an attack is currently difficult: ready-made exploits are not yet in the wild, and an attack relies upon causing data of interest to also be loaded into the processor’s level 1 data cache (L1D). Patched kernels carefully avoid accidentally creating malicious ‘not present’ pages, and mitigate against the hypervisor attack by flushing the L1 data cache whenever beginning to run potentially untrusted virtual machines.
This can be controlled using a new ‘l1tf’ kernel command line option, and the status of any vulnerability and possible active mitigation is visible in /sys/devices/system/cpu/vulnerabilities/l1tf.
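A quick way to check a machine is to read that sysfs file. The following is a minimal sketch: the path is the one given above, while the helper names are hypothetical and the classification assumes the report begins with ‘Not affected’, ‘Vulnerable’ or ‘Mitigation’, as the kernel’s other vulnerability files do. Anything else, including a missing file on an unpatched kernel, is reported as unknown.

```python
from pathlib import Path

# Path quoted in the column; absent on kernels without the L1TF patches.
L1TF_STATUS = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def read_l1tf_status(path=L1TF_STATUS):
    """Return the kernel's one-line report, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def summarise(status):
    """Classify the report by its leading words (assumed wording)."""
    if status is None:
        return "unknown"
    if status.startswith("Not affected"):
        return "not affected"
    if status.startswith("Mitigation"):
        return "mitigated"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


if __name__ == "__main__":
    status = read_l1tf_status()
    print(summarise(status), "-", status)
```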
Linus mentioned that a nasty TLB shootdown bug had been found during the 4.19 merge window. As we learned above, TLBs are used to cache the translations between virtual addresses used by applications (or the kernel) and those seen by the physical hardware. Since every running application has its own (virtual) view of memory, these TLB entries need to be maintained such that only those translations that are supposed to be active are seen by an application accessing memory. Otherwise, it would be possible for a memory access by one application to interfere with another, or for it to see a stale translation into memory that was since recycled for another use.
As a consequence, the kernel routinely performs ‘shootdowns’ of TLB entries as it performs page table maintenance operations. The exact process differs from one architecture to another, but usually involves specially privileged processor instructions, as is the case on x86. Unfortunately, some time ago, an attempt was made to optimise the process of tearing down user process page tables and the result was that x86 might – in some cases – not do the necessary TLB flush. This resulted in a very hard to debug issue that also had security implications.
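The danger of a missed flush is easier to see in a toy model. The sketch below is purely illustrative: the ToyMMU class and its methods are hypothetical, a dictionary stands in for both the page tables and the TLB, and a KeyError stands in for a page fault. It bears no relation to the kernel’s actual code, but it shows how a translation can outlive the mapping it was cached from when the shootdown is skipped.

```python
class ToyMMU:
    """Toy model: a page table plus a TLB that caches its translations."""

    def __init__(self):
        self.page_table = {}   # virtual page -> physical frame
        self.tlb = {}          # cached copies of recent translations

    def map(self, vpage, frame):
        self.page_table[vpage] = frame

    def translate(self, vpage):
        # A TLB hit never consults the page table at all.
        if vpage in self.tlb:
            return self.tlb[vpage]
        frame = self.page_table[vpage]   # KeyError stands in for a page fault
        self.tlb[vpage] = frame
        return frame

    def unmap(self, vpage, shootdown=True):
        del self.page_table[vpage]
        if shootdown:
            self.tlb.pop(vpage, None)    # the step the buggy path could skip


if __name__ == "__main__":
    mmu = ToyMMU()
    mmu.map(0x10, 7)
    mmu.translate(0x10)                  # populates the TLB
    mmu.unmap(0x10, shootdown=False)     # mapping gone, cached entry remains
    print("stale frame still visible:", mmu.translate(0x10))
```

With the default shootdown=True, the same final translate() call raises KeyError instead, which is the behaviour the kernel relies upon before a frame is recycled.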
It’s interesting to see how the change had originally occurred through an attempt to make PowerPC-specific code generic across architectures. Linus was grumpy because the net result had been that “x86 had unintentionally lost the TLB flush we used to have”.
The problem had then existed for a long time because it only affected a rarely used ‘slow path’ – triggering an exploit required that the machine intentionally ran out of memory. The thread is titled “Remove tlb_remove_table() non-concurrent condition”, and any theoretical security impact is also fixed in the updated patches.
Meanwhile, Pu Wen (of Hygon) posted several rounds of updated patches for the Hygon Dhyana family of x86 processors. These are based upon AMD’s Zen microarchitecture as a result of a joint venture between AMD and a Chinese group known as Chengdu Haiguang IC Design Co. – Hygon. Dhyana appears to use AMD’s Zen cores on a custom SoC (chip), so much so that the only real changes needed to enable support for these chips in Linux are adding some new IDs, and extending CPUID vendor checks for ‘AuthenticAMD’ to also accept ‘HygonGenuine’.
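From userspace, the CPUID vendor string shows up as the vendor_id field of /proc/cpuinfo on x86 Linux. The sketch below illustrates the kind of check involved; it is not taken from the Hygon patches, and the function names are hypothetical. Only the two vendor strings come from the text above.

```python
AMD_VENDOR = "AuthenticAMD"
HYGON_VENDOR = "HygonGenuine"


def vendor_from_cpuinfo(text):
    """Return the first vendor_id value found in /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "vendor_id":
            return value.strip()
    return None


def takes_amd_code_path(vendor):
    """Hypothetical check: treat Hygon parts like their AMD relatives."""
    return vendor in (AMD_VENDOR, HYGON_VENDOR)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            vendor = vendor_from_cpuinfo(f.read())
    except OSError:
        vendor = None
    print(vendor, takes_amd_code_path(vendor))
```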
Atish Patra (of Western Digital) posted various cleanups to the SMP (Symmetric Multi-Processing) supporting code for the RISC-V architecture. These patches focus on one implementation (from SiFive), which is believed to be used in some future Western Digital disk drives, but more generally are good examples of the rate of progress. Apparently, the work was inspired by ARM’s SMP code.
Quentin Perret (of ARM) posted version 6 of a long-running patch series titled Energy Aware Scheduling. The EAS (Energy Aware Scheduler) attempts to divide a machine into performance domains built from various components, such as processor cores. The energy requirements of these different performance domains vary, as does the amount of computation they can provide. The basic idea of the EAS, then, is to prefer to schedule newly awaking tasks onto energy-efficient cores, and to migrate them onto higher performance, higher energy cores when needed. Various logic in the patch set ensures that once a ‘tipping point’ is reached and the machine is sufficiently loaded, it will fall back to the traditional scheduling philosophy in use today. EAS is not specific to ARM and in fact it is hoped it will be supported across many different architectures. Once enabled, /sys/devices/system/cpu gains an ‘energy_model’ directory showing the energy cost and relative performance of the various performance domains.
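The placement idea can be sketched in a few lines. This is a toy illustration and not the algorithm in the patch series: the Domain class, the arbitrary capacity and energy units, the 80 per cent tipping point and the ‘least loaded’ fallback are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    capacity: int        # compute the domain can provide (arbitrary units)
    energy_cost: float   # energy per unit of work (arbitrary units)
    load: int = 0        # work currently placed on the domain


TIPPING_POINT = 0.8      # assumed: fraction of total capacity in use


def least_loaded(domains):
    """Stand-in for traditional load spreading: ignore energy entirely."""
    return min(domains, key=lambda d: d.load / d.capacity)


def pick_domain(domains, task_load):
    """Choose where a newly awaking task of size task_load should run."""
    total_capacity = sum(d.capacity for d in domains)
    total_load = sum(d.load for d in domains)
    if total_load / total_capacity >= TIPPING_POINT:
        return least_loaded(domains)     # past the tipping point
    fitting = [d for d in domains if d.load + task_load <= d.capacity]
    if not fitting:
        return least_loaded(domains)
    # Below the tipping point, prefer the cheapest domain the task fits on.
    return min(fitting, key=lambda d: d.energy_cost)


if __name__ == "__main__":
    little = Domain("little", capacity=100, energy_cost=1.0)
    big = Domain("big", capacity=300, energy_cost=3.0)
    print(pick_domain([little, big], 20).name)    # small task
    print(pick_domain([little, big], 150).name)   # too large for 'little'
```

In this model a small task lands on the efficient domain, a task too large for it goes to the faster one, and once overall utilisation crosses the threshold the energy figures are no longer consulted.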