Cgroups: “c” for “control”
Resources are sometimes limited, so how do you go about enforcing these limits? Here’s one way to moderate the appetites of Linux processes…
Those of you who keenly follow technology trends will know that containers are the new big thing. You use them to build, ship and deploy pieces of code, usually in the form of microservices. Ignore the slightly ironic tone, because containerisation is indeed a useful approach. Yet we still feel alone in calling it a marketing buzzword. Why? Because there’s no such thing as a container in Linux.
At their core, containers can be distilled down into two concepts. One is namespaces, and it’s how they achieve isolation. For instance, a process namespace ensures that pid 1 in your container and pid 1 in your host OS are two different things. The second is control groups, or “cgroups”. They’re the mechanism that ensures fair resource consumption, whatever you deem fair. In fact, cgroups have many uses beyond containers. In this month’s Administeria, we’ll explore how they work in systemd, Docker and friends.
Tag them
Cgroups are themselves two-fold. They’re hierarchical namespaces of labels that you can attach to processes (actually, it’s the other way around). And they’re also controllers that enforce resource limits per label hierarchy. In this way, child labels are kept within limits set by their parents, so you can set limits per user session and per service within each session.
Hierarchical structures map to filesystem trees naturally, and that’s the route cgroups take. The kernel provides you with the “cgroup” filesystem, which you can mount wherever you like, yet /sys/fs/cgroup (itself a tmpfs) is a typical choice today. With systemd, you have it mounted by default. If not, do it yourself as with any other filesystem:
$ sudo mount -t cgroup -o cpu cgroup /some/where
Here, “cpu” is the controller’s name. You can also use “all” to mount all available controllers. If this command says “cgroup already mounted”, systemd or something else has already mounted the controller for you. If controllers A, B and C are mounted, it’s okay to remount all three, but not A (or B, or C) alone. To unmount a controller, it mustn’t be “busy” – that is, it mustn’t have any processes assigned. Long story short: if you already have /sys/fs/cgroup, stick with it for this tutorial.
To complicate matters a little more, there are two cgroups versions: v1 and v2. The cgroups(7) man page explains the difference, but for now it’s enough to know that in cgroups v2, all controllers are mounted in a single unified hierarchy. Systemd does this under /sys/fs/cgroup/unified. You can’t mount a controller under both v1 and v2 hierarchies at once. Moreover, cgroups v2 treats threads differently and provides a better events mechanism, but that’s another story…
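A quick, unprivileged way to see which flavour is in play on your machine is to list the mounted cgroup filesystems. This is a minimal sketch, assuming a Linux host with /proc available:

```shell
# fstype "cgroup" marks a v1 hierarchy (one mount per controller group);
# "cgroup2" marks the unified v2 hierarchy. In /proc/self/mounts,
# field 3 is the filesystem type and field 2 the mount point.
awk '$3 == "cgroup" || $3 == "cgroup2" { print $3, $2 }' /proc/self/mounts
```

On a systemd box you’ll typically see a dozen v1 mounts under /sys/fs/cgroup plus one cgroup2 entry for the unified hierarchy.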
With cgroups mounted at /sys/fs/cgroup or elsewhere, you can start creating labels (hierarchies) and assigning them processes:
$ sudo mkdir /sys/fs/cgroup/cpu/my-group
$ echo $$ | sudo tee /sys/fs/cgroup/cpu/my-group/cgroup.procs
14130
The last line is how you write to a root-owned file while not being root. That’s it: you’ve moved the current process (likely a shell) to /my-group. See the following:
$ cat /proc/self/cgroup
...
4:cpu,cpuacct:/my-group
4 is an opaque hierarchy ID number. cpu and cpuacct are controller names: they go side by side as systemd mounts them together on my system.
Cgroups are inherited across forks, so if you spawn a command from the shell, it ends up in the same cgroup:
$ tail -n 2 /sys/fs/cgroup/cpu/my-group/cgroup.procs
14130
14439
14130 is the shell’s PID, and when we ran “tail” to read processes in the shell’s cgroup, it saw its own PID (14439) added automatically.
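You can verify this inheritance without root, by comparing the shell’s cgroup membership with that of a freshly spawned child. A minimal sketch, assuming a Linux host:

```shell
# A child created by fork/exec starts out in its parent's cgroups,
# so the two /proc/<pid>/cgroup listings should match exactly.
parent=$(cat /proc/self/cgroup)
child=$(sh -c 'cat /proc/self/cgroup')
[ "$parent" = "$child" ] && echo "inherited"
# prints: inherited
```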
Cap them
Cgroups provide a controller per resource type, such as CPU or memory. Process hierarchies are directories; controllers are files within these directories. You write to them to enforce resource limits and read them back to identify current settings or stats. There are a dozen cgroup v1 controllers (see the boxout, right), so it would be tiresome to cover them all here. Instead, we’ll take a look at some simpler ones. For everything else, please refer to the Linux kernel’s documentation.
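To see which v1 controllers your kernel knows about, and which are currently enabled, read /proc/cgroups; no root needed:

```shell
# /proc/cgroups lists every v1 controller compiled into the kernel:
# columns are subsystem name, hierarchy ID, number of cgroups, enabled flag.
cat /proc/cgroups
# Print just the enabled controller names (the header line starts with #):
awk '!/^#/ && $4 == 1 { print $1 }' /proc/cgroups
```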
Let’s begin with the already-mentioned “cpu” controller. As we saw, systemd mounts it together with “cpuacct”. The “cpu” controller manages so-called “CPU shares” via the cpu.shares file. Each cgroup normally has 1,024 shares, and you increase or decrease this number to assign it more or less CPU time. This only works when a CPU is busy, though. On an idle chip, a process’s time is not restricted: no competition, no limits. You can see if processes in the group were ever throttled, and for how long, in the cpu.stat file:
$ cat /sys/fs/cgroup/cpu/my-group/cpu.stat
nr_periods 0
nr_throttled 0
throttled_time 0
nr_periods is how many times the scheduler needed to enforce resource limits. nr_throttled is the number of times processes in the group were actually limited. The numbers aren’t the same, as the scheduler might have decided to throttle something else. Finally, throttled_time is for how long (in nanoseconds) the group was throttled. The laptop which we currently use as a typewriter is not under heavy load (we don’t type that fast), so we have zeros in all these fields.
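The three counters are trivial to boil down to a one-line summary with awk. This sketch feeds in the sample output above rather than a live cpu.stat, so it runs anywhere:

```shell
# Turn cpu.stat key/value pairs into a human-readable summary line.
summarise() {
  awk '$1 == "nr_throttled" { n = $2 }
       $1 == "throttled_time" { ns = $2 }
       END { printf "throttled %d times, %d ns total\n", n, ns }'
}
printf 'nr_periods 0\nnr_throttled 0\nthrottled_time 0\n' | summarise
# prints: throttled 0 times, 0 ns total
```

On a live system, you’d pipe the real file in instead: summarise < /sys/fs/cgroup/cpu/my-group/cpu.stat.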
“cpuacct” is an example of a controller that doesn’t cap anything. Instead, it carries out CPU accounting across the hierarchy: stats are accumulated in a top-down manner. So if you have, say, /services, /services/NetworkManager, /services/cups and so on, you instantly obtain stats for each service and for all services (as opposed to user programs) together. In other words, “cpu” sets resource limits and “cpuacct” reports the actual consumption: no wonder they go together.
At each level of the cpuacct-managed hierarchy, there’s a cpuacct.usage file containing the number of nanoseconds processes in this group and its children spent on the CPU. So cpuacct.usage at the root of the hierarchy gives the CPU time allotted to all threads in the system. Other cpuacct.* files aggregate per-processor or user/system stats: again, the details are in the kernel docs.
Limiting memory
These days, there’s no such resource as memory, they say. Nevertheless, there’s the “memory” controller limiting it for you. Linux strongly believes in memory overcommitment though, so it’s not a problem to ask for a lot of memory: the problem is when you start using it.
There are a few levers that you can pull to limit normal process memory, kernel memory and swap space. A file called memory.limit_in_bytes configures process memory limits, memory.memsw.limit_in_bytes adds swap space to the mix, and memory.kmem.limit_in_bytes limits kernel memory. The latter typically includes kernel-mode stacks, objects the kernel allocated on the process’s behalf (such as open files) and sockets. A finer-grained control is available for sockets, so you can limit the memory used for TCP buffers.
While the limit is expressed in bytes, it’s possible to use other units such as MB or GB when you set them:
$ sudo mkdir /sys/fs/cgroup/memory/my-group
$ echo 4G | sudo tee /sys/fs/cgroup/memory/my-group/memory.limit_in_bytes
$ cat /sys/fs/cgroup/memory/my-group/memory.limit_in_bytes
4294967296
Note that it’s not possible to impose limits on the root memory cgroup.
What happens if a process exceeds its cgroup limits? The basic procedure is the same as for an ordinary, unconstrained process. First, the kernel tries to reclaim some memory, swapping pages out if need be. If this doesn’t help, the OOM killer is invoked to shut down the most insolent process in the cgroup (hopefully).
It’s also possible to unset memory limits. To do so, just write -1 to the corresponding file. Reading the value back yields 9223372036854771712. That’s exactly half of 2^64, minus one page. Some may say it’s the maximum userspace address a 64-bit app would ever have (not on today’s processors, though). Others may just call it unlimited.
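You can check that arithmetic in the shell: 2^63 is half of 2^64, and subtracting one 4,096-byte page gives exactly the value read back (assuming 4KiB pages):

```shell
# 2^63 - 4096, computed as (2^63 - 1) - 4095 to stay inside
# signed 64-bit shell arithmetic.
echo $(( 9223372036854775807 - 4095 ))
# prints: 9223372036854771712
```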
A real-world example
Now you have some idea of how to manage cgroups manually. This helps to build a complete picture (which is always a good thing), but it’s of little practical use: you don’t manage cgroups this way in production. Instead, you employ some higher-level knobs such as those in systemd or Docker.
Let’s start with the former. If the host kernel has the “cpu” controller enabled (most stock kernels today do), then systemd will automatically add a cgroup for each service that it starts:
$ ls -1 /sys/fs/cgroup/cpu/system.slice
...
networking.service
systemd-journald.service
...
The actual output spans more than 60 lines on a Kubuntu 17.10 system. This already has a nice side-effect. Remember that all cgroups receive the same 1,024 CPU shares by default. This means all services get the same amount of computing power, which Lennart deems to be a sane default. If you don’t agree, just set CPUShares= to something else in the corresponding service unit configuration file. Memory limits are imposed in much the same way: systemd.resource-control(5) has all the details.
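For example, to halve a service’s CPU shares and cap its memory, you could add a snippet like this to the service’s unit configuration (the file path here is a hypothetical drop-in override; the directive names are from systemd.resource-control(5)):

```ini
# /etc/systemd/system/myservice.service.d/limits.conf (hypothetical path)
[Service]
CPUShares=512
MemoryLimit=1G
```

After editing, run systemctl daemon-reload and restart the service for the new limits to take effect.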
None of the services actually redefine limits on our systems. So, let’s have a sneak peek at how Docker manages resources for us. We have version 17.09.1-ce installed. We took a random image from those available locally and spawned a container, which received the ID 4d1f57c49ef5, like this:
$ docker run --rm=true --cpu-shares=512 --memory=4G ...
Looking under /sys/fs/cgroup reveals that Docker has created a /docker/<container ID> hierarchy for each of the controllers:
$ find /sys/fs/cgroup -name 4d1f57c49ef5\*
...
/sys/fs/cgroup/memory/docker/4d1f57c49ef5...
/sys/fs/cgroup/cpu,cpuacct/docker/4d1f5...
...
Checking the corresponding controller files shows that Docker has applied the limits we specified on the command line:
$ cat /sys/fs/cgroup/cpu,cpuacct/docker/4d1f5.../cpu.shares
512
$ grep "" /sys/fs/cgroup/memory/docker/4d1f57c49ef5.../*.limit_in_bytes
/sys/fs/cgroup/memory/docker/4d1f57c49ef5.../memory.kmem.limit_in_bytes:9223372036854771712
/sys/fs/cgroup/memory/docker/4d1f57c49ef5.../memory.kmem.tcp.limit_in_bytes:9223372036854771712
/sys/fs/cgroup/memory/docker/4d1f57c49ef5.../memory.limit_in_bytes:4294967296
Note that only process memory is limited. Our system lacks swap limit support, so that stays unrestricted, and we chose not to limit kernel memory either.
Now you understand what cgroups are and how they can be useful – even for those of us who are not following industry trends and are still deploying with packages. If you want your service to consume no more than a given share of CPU or a set amount of memory, run on specific cores or be light on disk I/O, then cgroups should be your first stop.