Linux Format

Master ZFS and btrfs

Pick from two of the most talked-up next-gen filesystems for your RAID array.


Last issue, we created a glorious NAS box with 24TB of drives set up as a RAID 6 array formatted as ext4 [see Homebrew your own NAS, p46, LXF192]. This issue, we’ll show you how to set up an alternative filesystem.

While ext4 is fine for volumes up to 100TB, even principal developer Ted Ts’o admitted that the filesystem is just a stop-gap to address the shortcomings of ext3 while maintaining backwards compatibility. Ext4 first appeared in the kernel in 2008; up until then the most exciting filesystem around was ReiserFS. It had some truly next-gen features, including combined B+ tree structures for file metadata and directory lists (similar to btrfs). However, interest in this filesystem flagged just a touch when its creator, Hans Reiser, was found guilty of murdering his wife. Development of its successor, Reiser4, continues in his absence, but the developers have no immediate plans for kernel inclusion.

However, we now have a new generation of filesystems, providing superior data integrity and extreme scalability. They break a few of the old rules too: traditional ideologies dictate that the RAID layer (be it in the form of a hardware controller or a software manager such as mdadm) should be independent of the filesystem and that the two should be blissfully ignorant of each other. But by integrating them we can improve error detection and correction – if only at the cost of traditionalists decrying ‘blatant layering violations’.

The (comparatively) new kids on the block are btrfs (B-tree filesystem: pronounced ‘butter-FS’ or ‘better-FS’), jointly developed by Oracle, Red Hat, Intel, SUSE and many others, and ZFS, developed at Sun Microsystems prior to its acquisition by Oracle. ZFS code was originally released in 2005 as part of OpenSolaris, but that project was disbanded in 2010 and Oracle’s development of ZFS in Solaris is now closed source. Open source development continues as a fork, but since ZFS is licensed under the CDDL, and hence incompatible with the GPL, it’s not possible to incorporate support into the Linux kernel directly. However, support via a third-party module is still kosher and this is exactly what the ZFS on Linux project (http://zfsonlinux.org) does. This project is largely funded by the Lawrence Livermore National Laboratory, which has sizeable storage requirements, so ZFS can support file sizes up to 16 exabytes (2²⁴ TB) and volumes up to 256 zettabytes (2³⁸ TB).

Being an out-of-tree module, ZFS will be sensitive to kernel upgrades. DKMS-type packages will take care of this on Debian-based Linux distros, Fedora, CentOS, and so on, but for other distros you’ll need to rebuild the module every time you update your kernel. Failure to do so will be problematic if your root filesystem is on ZFS. Ubuntu users will want to add the PPA zfs-native/stable and then install the package ubuntu-zfs. The ZFS on Linux homepage has packages and information for everyone else.
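On Ubuntu, that boils down to something like the following (a sketch, assuming sudo and the PPA and package names mentioned above):

$ sudo add-apt-repository ppa:zfs-native/stable
$ sudo apt-get update
$ sudo apt-get install ubuntu-zfs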

Let’s cover the common ground first. One quite startling feature is that neither of these filesystems requires disks to be partitioned. In ZFS parlance you can set up datasets within a single-drive zpool; these offer more isolation than directories and can have quotas and other controls imposed. Likewise you can mimic traditional partitions using subvolumes within btrfs. In both cases the result is much more flexible – the ‘neopartitions’ are much easier to resize or combine since they are purely logical constructs. ZFS actively discourages its use directly on partitions, whereas btrfs largely doesn’t care.

Both of the filesystems incorporate a logical volume manager, which allows the filesystem to span multiple drives and contain variously named substructures. Both also have their own RAID implementations, although, confusingly, their RAID levels don’t really tie in with the traditional ones. ZFS has three levels of parity RAID, termed RAID-Z1, -Z2 and -Z3. These are, functionally, the same as RAID 5, RAID 6 and what would be RAID 7, meaning they use 1, 2 and 3 drives for parity and hence can tolerate that many drives failing. RAID 5 and 6 are supported in btrfs, but it would be imprudent to use them in a production environment, since that part of the codebase is significantly less mature than the rest.

RAID 0, 1 and 10 support is stable in both filesystems, but again the levels have a slightly different interpretation. For example, a conventional RAID 1 array on three 1TB drives would mirror the data twice, making for a usable capacity of 1TB. With btrfs, though, RAID 1 means that each block is mirrored once on a different drive, making (in the previous example) for a usable capacity of 1.5TB at the cost of slightly less redundancy. You can also use multiple drives of different sizes with btrfs RAID 1, but there may be some unusable space (hence less than half of the total storage present is available) depending on the combinatorics. Additionally, btrfs enables you to specify different RAID levels for data and metadata; ZFS features mirroring in much the same manner as RAID 1, but it does not call it that.
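For instance, a btrfs RAID 1 volume across three drives, with the data and metadata profiles set explicitly, could be created like this (a sketch; the device names are placeholders for your own):

# mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd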

Mirroring with both of the filesystems is actually more advanced than traditional RAID, since errors are detected and healed automatically. If a block becomes corrupted (but still readable) on one drive of a conventional RAID 1 mirror and left intact on another, then mdadm has no way of knowing which drive contains the good data; half of the time the good block will be read, and half of the time you’ll get bad data. Such errors are called silent data errors and are a scourge – after all, it’s much easier to tell when a drive stops responding outright, which is the failure mode RAID is designed to guard against. ZFS stores SHA-256 hashes of each block and btrfs uses CRC32C checksums of both metadata and data. Both detect and silently repair discrepancies when a dodgy block is read. One can, and should, periodically perform a scrub of one’s next-generation volumes. This is an online check (no need to unmount your pools), which runs in the background and does all the detecting and repairing for you.
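Kicking off a scrub is a one-liner in both camps (a sketch: tank is the example pool name used later in this article, and /mnt/data is a placeholder btrfs mountpoint):

# zpool scrub tank
# btrfs scrub start /mnt/data

You can check on progress with zpool status tank or btrfs scrub status /mnt/data respectively.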

All this CoW-ing (Copy-on-Writing) around can lead to extreme fragmentation, which would manifest itself through heavy disk thrashing and CPU spikes, but there are safeguards in place to minimise this. ZFS uses a slab allocator with a large 128K block size, while btrfs uses B-trees. In both approaches the idea is the same: to pre-allocate sensible regions of the disk to use for new data. Unlike btrfs, ZFS has no defragmentation capabilities, which can cause serious performance issues if your zpools become full of the wrong kind of files, but this is not likely to be an issue for home storage, especially if you keep your total storage at less than about 60% capacity. If you know you have a file that is not CoW-friendly, such as a large file that will be subject to lots of small, random writes (let’s say it’s called ruminophobe), then you can set the extended attribute C on it, which will revert to the traditional overwriting behaviour:

$ chattr +C ruminophobe

This flag is valid for both btrfs and ZFS, and in fact any CoW-supporting filesystem. You can apply it to directories as well, but this will affect only files added to that directory after the fact. Similarly, one can use the c attribute to turn on compression. This can also be specified at the volume level, using the compress mount option. Both offer zlib compression, which you shouldn’t enable unless you’re prepared to take a substantial performance hit. Btrfs offers LZO, which won’t do you much harm even if you’re storing lots of already-compressed data. ZFS offers the LZJB and LZ4 algorithms, as well as the naïve ZLE (Zero Length Encoding) scheme and the ability to specify zlib compression levels.
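In practice that looks something like this (a sketch: tank/stuff is the ZFS dataset we create later, and the btrfs device and mountpoint are placeholders):

# zfs set compression=lz4 tank/stuff
# mount -o compress=lzo /dev/sdb /mnt/data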

Note that while both btrfs and ZFS are next-generation filesystems, and their respective feature sets do intersect significantly, they are different creatures and as such have their own advantages and disadvantages, quirks and oddities.

Let’s talk about ZFS, baby

The fundamental ZFS storage unit is called a vdev. This may be a disk, a partition (not recommended), a file or even a collection of vdevs, for example a mirror or RAID-Z set up with multiple disks. By combining one or more vdevs, we form a storage pool or zpool. Devices can be added on demand to a zpool, making more space available instantly to any and all filesystems (more correctly ‘datasets’) backed by that pool. The image below shows an example of the ZFS equivalent of a RAID 10 array, where data is mirrored between two drives and then striped across an additional pair of mirrored drives. Each mirrored pair is also a vdev, and together they form our pool.

Let’s assume you’ve got the ZFS module installed and enabled, and you want to set up a zpool striped over several drives. You must ensure there is no RAID information present on the drives, otherwise ZFS will get confused. The recommended course of action is then to find out the IDs of those disks. Using the /dev/sdX names will work, but these are not necessarily persistent, so instead do:

# ls -l /dev/disk/by-id

and then use the relevant IDs in the following command, which creates a pool called tank:

# zpool create -m <mountpoint> tank <ids>

If your drives are new (post-2010), then they probably have 4kB sectors, as opposed to the old-style 512 bytes. ZFS can cope with either, but some newer drives emulate the old-style behaviour so people can still use them in Windows 95, which confuses ZFS. To force the pool to be optimally arranged on newer drives, add -o ashift=12 to the above command. You also don’t have to specify a mountpoint: in our case, omitting it would just default to /tank. Mirrors are set up using the keyword mirror, so the RAID 10-style pool in the diagram (where we didn’t have room to use disk IDs, but you really should) could be set up with:

# zpool create -o ashift=12 mirrortank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

We can use the keyword raidz1 to set RAID-Z1 up instead, replacing 1 with 2 or 3 if you want double or triple parity. Once created, you can check the status of your pool with:

# zpool status -v tank
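For example, a double-parity pool over four drives might be created like so (a sketch: bigtank and the abbreviated by-id names are made up for illustration):

# zpool create -o ashift=12 bigtank raidz2 /dev/disk/by-id/ata-diskA /dev/disk/by-id/ata-diskB /dev/disk/by-id/ata-diskC /dev/disk/by-id/ata-diskD

Such a pool keeps working with any two of the four drives failed.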

You can now add files and folders to your zpool, as you would any other mounted filesystem. But you can also add filesystems (a different, ZFS-specific kind), zvols, snapshots and clones. These four species are collectively referred to as datasets, and ZFS can do a lot with datasets. A filesystem inside a ZFS pool behaves something like a disk partition, but is easier to create and resize (resize in the sense that you limit its maximum size with a quota). You can also set compression on a per-filesystem basis.

Let’s create a simple filesystem called stuff. Note that our pool tank does not get a leading / when we’re referring to it with the ZFS tools. We don’t want it to be too big, so we’ll put a quota of 10GB on there too, and finally check that everything went OK:

# zfs create tank/stuff
# zfs set quota=10G tank/stuff
# zfs list
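You can also query properties individually, which is handy for checking that a quota or compression setting took effect (a sketch, using the dataset just created):

# zfs get quota,compression tank/stuff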

A zvol is a strange construction: it’s a virtual block device. A zvol is referred to by a /dev node, and like any other block device you can format it with a filesystem. Whatever you do with your zvol, it will be backed by whatever facilities your zpool has, so it can be mirrored, compressed and easily snapshotted. We’ve already covered the basics of snapshots (see Have a CoW, man), but there are some ZFS-specific quirks. For one, you can’t snapshot folders, only filesystems. So let’s do a snapshot of our stuff filesystem, and marvel at how little space it uses:
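Creating a zvol is just a matter of passing a size with -V; here’s a minimal sketch (the 5GB size and the vol0 name are arbitrary) that formats one as ext4:

# zfs create -V 5G tank/vol0
# mkfs.ext4 /dev/zvol/tank/vol0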

# zfs snapshot tank/stuff@snapshot0
# zfs list -t all

The @ (arobase) syntax is kind of similar to how a lot of systemd targets work, but let’s not digress. You can call your snapshot something more imaginative than snapshot0 – it’s probably a good idea to include a date, or some indication of what was going on when the snapshot was taken. Suppose we now do something thoughtless resulting in our stuff dataset becoming hosed. No problem: we can roll back to the time of snapshot0 and try not to make the same mistake again. The zfs diff command will even show files that are new (+), modified (M) or deleted (-) since the snapshot was taken:

# zfs diff tank/stuff@snapshot0
M       /pool/stuff
+       /pool/stuff/newfile
-       /pool/stuff/oldfile
# zfs rollback tank/stuff@snapshot0

Snapshots are read-only, but we can also create writable equivalents: the final member of the dataset quartet, called clones.
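Making one is another one-liner (a sketch; the stuff-clone name is just an example):

# zfs clone tank/stuff@snapshot0 tank/stuff-clone

The clone behaves like an ordinary dataset, but initially shares its blocks with the snapshot it was created from.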

It would be remiss of us not to mention that ZFS works best with lots of memory. Some recommendations put this as high as a GB per TB of storage, but depending on your purposes you can get away with less. One reason for this is ZFS’s Adaptive Replacement Cache. This is an improvement on the patented IBM ARC mechanism, and owing to its consideration of both recent and frequent accesses (shown in the diagram on p49) it provides a high cache hit rate. By default it uses up to 60% of available memory, but you can tune this with the module option zfs_arc_max, which specifies the cache limit in bytes. If you use the deduplication feature then you really will need lots of memory – more like 5GB to the TB – so we don’t recommend it. A final caveat: use ECC memory. All the benefits offered by ZFS checksums will be at best useless and at worst harmful if a stray bit is flipped while they’re being calculated. Memory errors are rare but they do happen, whether it’s dodgy hardware or stray cosmic rays to blame.
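To cap the cache at, say, 4GB, you could set the option in a modprobe configuration file and then reload the module or reboot (a sketch; the file name is just a common convention):

# echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf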

Btrfs me up, baby

As well as creating a new btrfs filesystem with mkfs.btrfs, one can also convert an existing ext3/4 filesystem. Obviously, this cannot be mounted at the time of conversion, so if you want to convert your root filesystem then you’ll need to boot from a Live CD or a different Linux. Then use the btrfs-convert command. This will change the partition’s UUID, so update your fstab accordingly. Your newly converted partition contains an image of the old filesystem, in case something went wrong. This image is stored in a btrfs subvolume, which is much the same as the ZFS filesystem dataset.
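The conversion itself is a single command, and can be undone for as long as you keep that saved image around (a sketch, assuming the ext partition to convert is /dev/sdb1; substitute your own and take a backup first):

# btrfs-convert /dev/sdb1
# btrfs-convert -r /dev/sdb1

The second command rolls the partition back to its original ext filesystem.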

As in ZFS, you can snapshot only subvolumes, not individual folders. Unlike ZFS, however, the snapshot is not recursive, so if a subvolume itself contains another subvolume, then the latter will become an empty directory in the snapshot. Since a snapshot is itself a subvolume, snapshots of snapshots are also possible. It’s a reasonable idea to have your root filesystem inside a btrfs subvolume, particularly if you’re going to be snapshotting it, but this is beyond the scope of this article.

Subvolumes are created with:

# btrfs subvolume create <subvolume-name>

They will appear in the root of your btrfs filesystem, but you can mount them individually using the subvol=<subvolume-name> parameter in your fstab or mount command. You can snapshot them with:

# btrfs subvolume snapshot <subvolume-name> <snapshot-name>

You can force the snapshot to be read-only using the -r option. To roll back a snapshot:

# btrfs subvolume snapshot <snapshot-name> <subvolume-name>

If everything is OK then you can delete the original subvolume.
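Deleting a subvolume (including a snapshot you no longer need) is done with the delete subcommand:

# btrfs subvolume delete <subvolume-name>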

Btrfs filesystems can be optimised for SSDs by mounting with the keywords discard and ssd. Even if set up on a single drive, btrfs will still default to mirroring your metadata – even though it’s less prudent than having it on another drive, it still might come in handy. With more than one drive, btrfs will default to mirroring metadata in RAID 1.
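On an SSD, then, the mount might look like this (a sketch; the device and mountpoint are placeholders):

# mount -o ssd,discard /dev/sdb /mnt/data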

One can do an online defrag of all file data in a btrfs filesystem, thus:

# btrfs filesystem defragment -r -v /

You can also use the autodefrag btrfs mount option. The other piece of btrfs housekeeping of interest is btrfs balance. This will rewrite data and metadata, spreading them evenly across multiple devices. It is particularly useful if you have a nearly full filesystem and add a new device to it.
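For example, adding a drive and then rebalancing might look like this (a sketch; /dev/sde and the /mnt/data mountpoint are placeholders):

# btrfs device add /dev/sde /mnt/data
# btrfs balance start /mnt/data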

Obviously, there’s much more to both filesystems. The Arch Linux wiki has great guides to btrfs (http://bit.ly/BtrfsGuide) and ZFS (http://bit.ly/ZFSGuide).

ZFS will stripe data intelligently depending on available space: after a 3TB write and then a 1.5TB write, all drives are half-full (or half-empty, depending on your outlook).

Caching in ZFS: two lists, for recently and frequently used data, share the same amount of memory. Most recently used (MRU) data is stored to the left and falls into the ghost list if not accessed. Memory is apportioned according to how often ghost...
