Linux Format

Benchmarking

Pop quiz, hotshot: what’s the fastest filesystem? John Lane runs some tests and uses gnuplot to help bring clarity to the storage table…


Linux users are blessed with a plethora of storage options beyond those selected by the typical installer. Reviews abound on what’s available: the latest SSD or hard drive, choices of file system, the pros and cons of encryption. But how do the options compare, and how can you test your own system before stepping off the well-trodden ext4 path?

In this tutorial we’ll compare several filesystems, from the de facto standard ext4 through alternatives such as XFS, JFS and the ill-fated Reiserfs. We’ll include those Microsoft filesystems we can’t avoid encountering, NTFS and vfat, and also test the oft-called next-gen offerings that are Btrfs and ZFS.

We’ll run some tests using tools that most distributions include by default, and then we’ll look at what the kernel developers use: the Flexible I/O Tester, or just fio. Benchmarking produces lots of numbers, so we’ll use gnuplot to produce graphs to help make sense of it all.

But before we begin, a word of warning: these tools perform both read and write operations and are capable of overwriting your precious data. It’s best to benchmark devices before anything of value is stored on them.

Quick tests

Most Linux distributions include a tool called hdparm that you can use to run a quick and simple benchmark. It will quickly give you an idea of how fast Linux can access a storage device. It times device reads, either buffered disk reads (with its -t command-line option) or cached reads (-T) or both.

The former reads through the kernel’s page cache to the disk without prior caching of data (which demonstrates how fast the disk can deliver data), whereas the latter reads pre-cached data without disk access (see man hdparm; we introduce the page cache in the box on page 74):

$ sudo hdparm -t -T /dev/sdX
 Timing cached reads: 30596 MB in 1.99 seconds = 15358.63 MB/sec
 Timing buffered disk reads: 334 MB in 3.00 seconds = 111.29 MB/sec
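hdparm can also time O_DIRECT reads, which bypass the page cache entirely; the script we build shortly records these as separate columns. You can try it yourself:

$ sudo hdparm -tT --direct /dev/sdX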

You need permission to read from the device you’re testing (which we specify as /dev/sd<X> – replace <X> to match yours). You can either use sudo to run as root or arrange for your user to be appropriately entitled, typically by being a member of the disk group (our examples use sudo for simplicity).
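As a sketch, on distributions whose device nodes belong to the disk group, you could add your user to that group (you’ll need to log in again for the new membership to take effect):

$ sudo usermod -aG disk "$USER"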

When commands like hdparm only produce human-readable reports, you will need to extract the important information and format it for use by other applications such as gnuplot. The awk command-line utility is most useful for this, and it’s worth taking a moment to learn some of its syntax if this is new to you – it will serve you well (we looked at it in Linux Format issues LXF193, LXF191 and LXF177).
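For a minimal taste, this one-liner picks the buffered disk read speed (the MB/sec figure, field 11 of that report line) out of hdparm’s output:

$ sudo hdparm -t /dev/sdX | awk '/Timing buffered/ {print $11}'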

Do it again...

Whilst it’s fine to run a test once to get a quick measure, it’s best, as with any experiment, to take the average of several measurements. So we run each benchmark multiple times. You could use something like this shell script to run hdparm a few times and format the results ready for input to gnuplot:

#!/bin/bash
# filename hdparm_awk
echo {,d}{c,b}_{total,time,speed}
for ((i=10; i>0; i--)) {
  {
    echo -n . >&2
    sudo hdparm -tT "$1"
    sudo hdparm -tT --direct "$1"
  } | awk '
    /Timing cached/          { c_total=$4;  c_time=$7;  c_speed=$10 }
    /Timing buffered/        { b_total=$5;  b_time=$8;  b_speed=$11 }
    /Timing O_DIRECT cached/ { dc_total=$5; dc_time=$8; dc_speed=$11 }
    /Timing O_DIRECT disk/   { db_total=$5; db_time=$8; db_speed=$11 }
    END {
      printf "%s %s %s %s %s %s %s %s %s %s %s %s\n",
        c_total, c_time, c_speed, b_total, b_time, b_speed,
        dc_total, dc_time, dc_speed, db_total, db_time, db_speed
    }'
}

The script begins by writing a header row to identify the data samples that follow. It then repeats the tests 10 times, each iteration launching hdparm twice – with and without the --direct option. The results of each test are presented as one output row formed of 12 data values delimited by whitespace, which is the format that gnuplot works with. You can run the script for each device you want to benchmark:

$ ./hdparm_awk /dev/nvme0n1 > hdparm-raw-plain-nvme0n1.log
$ ./hdparm_awk /dev/sda > hdparm-raw-plain-sda.log

You can then use gnuplot to produce a benchmark bar chart from those log files. You can write a gnuplot script for this task:

#!/usr/bin/gnuplot -c
FILES=ARG1
COLS=ARG2
set terminal png size 800,600 noenhanced
set output 'benchmark.png'
set style data histogram
set style histogram gap 1
set style fill solid border -1
set boxwidth 0.8
set style histogram errorbars
set key on autotitle columnhead
label(s) = substr(s, strstrt(s, '-')+1, strstrt(s, '.log')-1)
columnheading(f,c) = system("awk '/^#/ {next}; {print $".c.";exit}' ".f)
do for [f in FILES] {
  set print f.'.stats'
  print label(f).' mean min max'
  do for [i in COLS] {
    stats f using 0+i nooutput
    print columnheading(f,i).' ', \
      STATS_mean, STATS_min, STATS_max
  }
  unset print
}

plot for [f in FILES] f.'.stats' \
  using 2:3:4 title columnhead(1), \
  '' using (0):xticlabels(1) with lines

Assuming no prior gnuplot experience, a little explanation is in order. The first line is the usual shebang, which is what enables you to run it from the command line. The -c argument tells gnuplot that arguments may follow, which gnuplot makes available to the script as ARG1, ARG2 and so on. Next, some settings prepare the output file and style the chart. A couple of helper functions follow: label extracts a substring from the log file’s name to use as a chart label, and columnheading is self-explanatory – it reads a column’s heading from the log file.

The first loop generates statistical values from the input data: average (mean), minimum and maximum values are written to new files, which the second loop uses to plot a bar graph of averages with minimum-maximum error bars. The script expects two arguments, each a space-delimited string: a list of log files and a list of column numbers:

$ plot_chart.gp "$(ls *.log)" '3 6 12'

This would chart the data in columns three, six and 12 of the given files. The script is really a starting point that you could take further, perhaps labelling the axes or styling it differently. There’s plenty of documentation available at https://gnuplot.org, or you can seek out the second edition of Gnuplot in Action by Philipp K. Janert (Manning Publications) to learn more about gnuplot’s capabilities.
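For example, these standard gnuplot settings (a small sketch you might add before the plot command) would title the chart, label the y-axis and add a grid:

set title 'Storage benchmark'
set ylabel 'Speed (MB/s)'
set grid ytics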

Destroyer of Disks

The data dump utility dd copies data from one place to another. It can be used to benchmark simulated streaming: continuous writing of large data blocks. The basic command for such a sequential write test is shown below:

$ dd if=/dev/zero of=~/testfile bs=1M count=1K conv=fdatasync

Here, we give an input with little-to-no overhead (if=/dev/zero), a temporary output file on the filesystem to be tested (of=~/testfile), a block size (bs=1M) and a number of blocks (count=1K) for a total write of 1GB, which is a reasonable size to test with. You can use larger sizes, but the block size can’t exceed the amount of memory you have. You also need sufficient free space on the device being tested to accommodate the temporary file.

Sizes are specified in bytes unless another unit is given. Here we’re using binary units based on powers of two (you may use other units – see man dd). The final parameter, conv=fdatasync, waits for all data to be written to the disk. A typical result obtained using this command might look like this:

1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 14.5366 s, 73.9 MB/s

If you were to omit the sync argument then the reported speed would be wildly misleading (perhaps 1GB per second), revealing only how quickly the Linux kernel can cache the data, which is, by design, fast. The command would complete before the data had been completely written, so the result would not represent the true write speed. Sometimes having the command complete as soon as possible is most desirable, but not when benchmarking.

The sync argument requests that dd issue a sync system call to ensure the data it wrote has been committed to the storage device. It could instead request that the kernel sync after writing each block (specify oflag=dsync), but this would be considerably slower, perhaps less than 10MB per second. Using conv=fdatasync syncs once, after all the data has been written. This is how most real-world applications behave and it is therefore the most realistic benchmark that dd can provide. You can also bypass the page cache by adding oflag=direct, as long as the target supports it (the ZFS filesystem doesn’t).
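You can sketch a sequential read benchmark in much the same way, reading the test file back into the bit bucket; iflag=direct bypasses the page cache so that you measure the disk rather than memory (again, ZFS doesn’t support it):

$ sudo dd if=~/testfile of=/dev/null bs=1M iflag=direct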

You can use dd to compare the performance of a block device with one that’s encrypted, do that with various filesystems in place, and also compare them with the raw performance attainable without a filesystem. You first need to prepare the target by creating an encrypted device mapper (if required) and making a filesystem. As an example, we prepare an encrypted ext4 filesystem like this:

$ sudo cryptsetup luksFormat /dev/sdX <(echo 'passphrase')
$ sudo cryptsetup open /dev/sdX dm_sdX
$ sudo mkfs.ext4 /dev/mapper/dm_sdX
$ sudo mount /dev/mapper/dm_sdX /mnt

Skip the cryptsetup parts when you don’t need encryption, and skip the filesystem part when you don’t need a filesystem. It’s worth repeating at this point that these operations are destructive and that you shouldn’t perform them on devices holding precious data.

You can vary these preparation steps for each target you’d like to test, which we set as a variable, $target, that we can refer to later. If testing a filesystem, the target is the path to a non-existent temporary file on the mounted filesystem:

target=/mnt/testfile

Use the device or device mapper path (under /dev) to test the raw device without a filesystem:

target=/dev/mapper/dm_sdX
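Between targets, undo the preparation in reverse; assuming the mount point and mapper name used above:

$ sudo umount /mnt
$ sudo cryptsetup close dm_sdX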

With the preparation done, you can go ahead and run your benchmarks (your user ID must be permitted to write to the target, otherwise prepend sudo):

$ sudo dd if=/dev/zero of="$target" bs=1M count=1K conv=fdatasync 2>&1 | \
  awk -F, '/copied/ {
    split($1, bytes, / /)
    split($3, seconds, / /)
    printf("%d %f %f\n", bytes[1], seconds[2], bytes[1] / seconds[2])
  }'

We redirect the standard error (2>&1) because that’s where dd writes its reports; redirecting onto standard output enables those reports to pass through the pipe into awk. As well as providing you with another opportunity to practise your awk-fu, this reports the number of bytes written and the time in seconds taken to write them. A third column presents the speed in bytes per second. You can wrap it all in a loop similar to the earlier example to repeat the test multiple times.
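Such a wrapper might look like this minimal sketch (dd_bench is a name of our choosing; pass the target path as its argument):

#!/bin/bash
# filename dd_bench - hypothetical wrapper, takes the target path as $1
echo 'bytes seconds speed'
for ((i=10; i>0; i--)); do
  echo -n . >&2
  sudo dd if=/dev/zero of="$1" bs=1M count=1K conv=fdatasync 2>&1 | \
    awk -F, '/copied/ {
      split($1, bytes, / /)
      split($3, seconds, / /)
      printf("%d %f %f\n", bytes[1], seconds[2], bytes[1] / seconds[2])
    }'
done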

We tested a raw block device and eight filesystems, 10 iterations each, repeated those tests on LUKS-encrypted devices, and fed the resulting 18 log files into gnuplot; we plot the third column, which is the reported speed value, like so:

$ plot_chart.gp "$(ls dd-*.log)" 3

All about the IOPS

Storage benchmarking usually measures what’s known as input/output operations per second (or IOPS), a measure of work done vs time taken.

But IOPS are meaningless in isolation. Overall, performance benchmarks should also consider system configuration, response time (or latency) and application workload. Putting all of this together calls for something more sophisticated, and that’s where fio comes in. The Flexible I/O Tester is maintained and used by the Linux kernel developers and comes with the Torvalds seal of approval: “It does things right, including writing actual pseudo-random contents, which shows if the disk does some de-duplication (aka optimise for benchmarks): http://freecode.com/projects/fio. Anything else is suspect, so you should forget about bonnie or other traditional tools.”

fio is a command-line application that you should be able to install from your distro’s repository. On Ubuntu you would do:

$ sudo apt install fio

Fio can simulate different kinds of application behaviour. Our benchmark uses it to measure IOPS for a workload that demands a combination of random and sequential reads and writes. Fio accepts jobs: collections of parameters chosen to simulate a desired I/O workload, given either as command-line arguments or as a job file.

A job file is a text file in the classic INI layout that presents the same parameters as would be specified as command-line arguments. Except for a few control parameters that can only be given on the command line, all parameters may be given either on the command line or in a job file, but command-line arguments take precedence.

The documentation describes the job file format, but it’s pretty self-explanatory: jobs are defined in sections with their names in brackets, and comment lines may begin with # or ;. A special [global] section may define parameters applicable to all jobs.

A basic command line might look like this:

$ fio --name basic-benchmark --size 1M

or as a job file, say basic_benchmark.fio, containing the following:

[basic benchmark]
size=1M

that you’d run like this:

$ fio basic_benchmark.fio

Both methods produce the same result, which fio can report in a verbose human-readable format or as something more machine-readable. Its terse semicolon-delimited format can be fed to gnuplot to produce benchmark charts.

Output formats are specified using a command-line option (not in a job file) and multiple outputs may be combined:

$ fio --output-format=normal,terse basic_benchmark.fio > basic-benchmark.log

All output is sent to standard output, which we redirect into a file to be queried afterwards, for example using awk like so:

$ awk -F\; '/^3;/ {printf "%s:\t%i read IOPS\t%i write IOPS\n", $3, $8, $49}' basic-benchmark.log
basic benchmark: 6736 read IOPS 0 write IOPS

The match expression ensures we only interpret the terse output data lines: the terse version number is the first field and we look for version 3. The terse format reports 130 data fields and is described in the fio HOWTO, but it doesn’t index them, which makes it difficult to work with. However, an index can be found elsewhere on GitHub (https://git.io/fio-fields) and this is most helpful. We’re interested in IOPS for our benchmark, which we find in field 8 for reads and in field 49 for writes. Other interesting attributes you may like to investigate include timings, latencies and bandwidth.

Our job file has a series of tests that we run in sequence:

[global]
size=1m
rwmix_write=25
wait_for_previous=true
filename=/mnt/fiotest.tmp
ioengine=libaio

[sequential-read]
bs=1m
rw=read

[sequential-write]
bs=1m
rw=write

; … see the full file at https://pastebin.com/xhxVjsCi

[random-32.4K-read-write]
bs=4k
rw=randrw
iodepth=32

Defaults in the [global] section apply to all jobs in addition to their own settings. The wait_for_previous setting ensures the jobs run one after the other. They include sequential read (rw=read) and write (rw=write), and random read (randread), write (randwrite) and read/write (randrw) tests, which are performed using various block sizes (bs) and, lastly, a multi-threaded (iodepth=32) test. Read/write operations are one write for every three reads (expressed as a percentage, rwmix_write=25). We test both buffered and direct (by adding --direct=1 to the command line) and repeat for the filesystems we’re interested in, with and without LUKS encryption, as sketched below. This is a mere example of how you might benchmark with fio and use gnuplot to present your results. Fio has myriad options that you can apply to model specific workloads. Its documentation explains them and there are some example job files in its Git repository. And if you would like to learn more about designing charts like ours, look out for issue LXF246.
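A runner for those repeats might look like this minimal sketch; benchmark.fio stands in for our job file, and each filesystem is assumed to have been prepared and mounted at /mnt beforehand, as described earlier:

#!/bin/bash
# hypothetical runner: one buffered and one O_DIRECT pass per prepared filesystem
for fs in ext4 xfs btrfs; do
  # ...prepare and mount the filesystem on /mnt here, as shown earlier...
  fio --output-format=terse benchmark.fio > "fio-${fs}.log"
  fio --output-format=terse --direct=1 benchmark.fio > "fio-${fs}-direct.log"
done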

Fio is complex. Be sure to read both the HOWTO and its main documentation, because neither contains all of the information you need to fully understand it. See https://github.com/axboe/fio. You may need to install the user tools for the filesystems that you wish to test, and you’ll need cryptsetup if you want encryption. Everything should be in your repo; for example, on Ubuntu:

$ sudo apt install cryptsetup btrfs-progs zfsutils-linux jfsutils xfsprogs reiserfsprogs

Ask dd to sync, but not after every write - once when it’s finished is fine: conv=fdatasync.
Sequentially writing a gigabyte with dd yields surprising results!
Fio is extremely verbose, but we can use a terse option to extract what we need in a script.
