Functional graphing

John Lane can make data shine without touching a mouse. It's just one of the Linux specialist's many skills that make his dinner parties so memorable.

2018-11-20

When there’s data to interpret and no one else can help, don’t lose the plot! If you can find him, maybe you can hire John Lane for a crash course on gnuplot.

They say a picture is worth a thousand words, something you’ll appreciate if you need to compare buckets of numerical data. We did this in LXF243, when we made storage benchmarks, and we used gnuplot to do it. In this tutorial we'll show how you can use gnuplot to plot and chart your own data.

gnuplot is a tool for presenting data. It has nothing to do with the GNU Project; the name was a compromise that could just as easily have been llamaplot (see gnuplot.info/faq para 1.2). It doesn’t use the GNU GPL, instead preferring its own open source license.

We’ll use gnuplot to present some input data as a “plot” that reveals how it trends and changes, and as a “chart” that illustrates differences and similarities.

You should be able to install gnuplot from your distro’s package repository:

    sudo apt install gnuplot    # Debian, Ubuntu, etc
    sudo dnf install gnuplot    # Fedora, CentOS, etc
    sudo pacman -S gnuplot      # Arch Linux

Plotting data

We’ll need some sample data to illustrate our examples. gnuplot expects data to be supplied in a text file with one row per sample and one column per sampled parameter. We’ll use the example you can see in the screenshot (below). It shows the first few samples (one per row) of the time taken (in seconds) by an algorithm, similarly coded in C, Ruby, Python and Perl, that computes the mathematical constant pi to 15 decimal places using the Nilakantha infinite series, named after the 15th-century Indian mathematician Nilakantha Somayaji. What matters here is the data as input for learning gnuplot, rather than the algorithms that produced it.
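The benchmark programs themselves aren't reproduced in this tutorial. As a rough sketch of the series they evaluate, here is a minimal Python version. The function name nilakantha_pi and the fixed term counts are our own assumptions for illustration; the benchmarked programs may well have used a different stopping rule to reach 15 decimal places.

```python
import math


def nilakantha_pi(terms):
    """Approximate pi using the first `terms` terms of the Nilakantha series.

    pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
    """
    total = 3.0
    sign = 1.0
    for k in range(1, terms + 1):
        n = 2 * k  # first factor of the denominator: 2, 4, 6, ...
        total += sign * 4.0 / (n * (n + 1) * (n + 2))
        sign = -sign  # the series alternates between adding and subtracting
    return total


if __name__ == "__main__":
    # Show how the error shrinks as more terms are added.
    for terms in (1, 10, 100, 1000):
        approx = nilakantha_pi(terms)
        print(terms, approx, abs(approx - math.pi))
```

Timing repeated runs of a function like this, one run per row and one language per column, is the kind of measurement that could fill a data file such as ours.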

The initial rows, beginning with a #, are comments that gnuplot ignores. Each sample’s data values are separated by whitespace (tabs and/or spaces) and the first non-comment row is an optional header; its values may be used to label the plot.

This shows how gnuplot expects input data to be structured. The format lends itself well to any kind of sampled data – perhaps periodic samples from various sensors hooked up to your latest Raspberry Pi project?
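If your samples come from a script rather than a benchmark, producing this layout takes only a few lines. The following Python sketch is one way to do it; the function name format_gnuplot_data and the sensor readings are made up for illustration and don't come from our benchmark data.

```python
def format_gnuplot_data(header, rows, comments=()):
    """Return text laid out the way gnuplot expects: optional comment lines,
    one header row, then one whitespace-separated row per sample."""
    lines = ["# " + comment for comment in comments]
    lines.append("\t".join(header))
    for row in rows:
        lines.append("\t".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical readings, invented for this example only.
    text = format_gnuplot_data(
        header=["temperature", "humidity"],
        rows=[(21.5, 40.1), (21.7, 39.8), (22.0, 39.5)],
        comments=["made-up sensor samples, one row per reading"],
    )
    print(text, end="")
```

Redirect the output to a file and it can be handed to gnuplot in the same way as our benchmark data.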

It’s simple to plot this using gnuplot. Open a terminal and enter gnuplot to launch its interactive command-line interface. You may type commands directly at the gnuplot> prompt, like this:

    plot for [i=1:4] 'pi-nilakantha-15.dat' using 0:i with lines title columnheader

The plot command is one of gnuplot's main commands. It plots the given data, in this case contained in the file pi-nilakantha-15.dat, using column 0 for the x-coordinates and column i for the y-coordinates, with lines connecting the data points and a title taken from the columnheader. That’s repeated for values of i between 1 and 4, once for each of the four data columns in the file. Column 0 is the row number, beginning at 0 and incrementing for each row.

The plot shows four lines, one per column, with the row number along the x-axis and the data value along the y-axis. Each line is coloured and a key is displayed using the column names taken from the header row. These are defaults, but gnuplot is flexible and we’ll demonstrate how it can be customised.

We could have specified each of the plotted lines separately instead of using for:

    plot 'pi-nilakantha-15.dat' using 0:1 with lines title columnheader, \
         '' using 0:2 with lines title columnheader, \
         '' using 0:3 with lines title columnheader, \
         '' using 0:4 with lines title columnheader

This shows how plot accepts a series of comma-separated instructions. It shows how you can omit the filename on subsequent uses: gnuplot replaces the empty string ('') with the previously used filename (pi-nilakantha-15.dat in this case). And it shows how you can use the backslash line-continuation character (\) to break a statement across multiple lines.

You can also abbreviate many gnuplot commands. The above example may be shortened to

    p for [i=1:4] 'pi-nilakantha-15.dat' u 0:i w l t columnh

which is much less readable and can be confusing when learning, so our examples are written out in full. Unfortunately many examples and tutorials, including the gnuplot documentation, use the shorthand form. It’s possible to look up gnuplot abbreviations (https://superuser.com/q/508644), but it isn’t as easy as it could be.

Should you wish, you can annotate the plot with a title and axis labels:

    set title "Pi Computation comparison"
    set xlabel "Sample number"
    set ylabel "Execution time (seconds)"

If these are set before performing the plot they’ll appear on the graph. The title appears above the graph.

Variables, functions and macros

If you have multiple datasets (we used a few algorithms to calculate pi) you might want to plot all of them. We can store our datasets’ names in a variable:

    datasets = "nilakantha simpson viete"

It’s just a simple string variable, but we can use it to repeat commands for each of those datasets:

    do for [dataset in datasets] {
        plot for [i=1:4] 'pi-'.dataset.'-15.dat' using 0:i with lines title columnheader
        pause 5 "Displaying ".dataset
    }

The commands listed between the braces are performed with the variable dataset set to each (whitespace-separated) word within datasets. We use it to build the filename by concatenating substrings with the . operator, which obviously relies on our datasets being stored in appropriately named files. We also use pause to allow a little time between plots so each one can be seen. This also shows how multi-line blocks (delimited by braces) can be entered: the command-line prompt changes to more> while within a block.

Similar to variables, we may define functions. They enable us to define expressions that evaluate to a string or numeric result. We can use a function to build a dataset’s filename:

    file(dataset,extn) = 'pi-'.dataset.'-15.'.extn

Here, given a dataset name and a file extension, we return the formatted filename. Functions in gnuplot must take this form: a name, some arguments and an expression that returns a string or numeric value. You can’t write a function using multiple statements (you can’t use a do block), nor one that changes settings, uses iterators or does other things that would usually make sense in a function. So gnuplot functions are somewhat limited (https://stackoverflow.com/a/27835753), but can still be useful.

Macros offer an alternative to functions, although they don’t accept parameters. gnuplot simply replaces the macro with its value. We can use a macro to avoid repeatedly typing our plot command:

    draw_plot = "plot for [i=1:4] file(dataset,'dat') using 0:i with lines title columnheader"

What we have here is a simple assignment of a string to a variable. It looks like a variable and that’s really what it is. But if you precede the name with an @ then it’s replaced by its value before the command is executed. Like this:

    do for [dataset in datasets] { @draw_plot; pause 5 }

And you can nest them, too:

    draw_plots = "do for [dataset in datasets] { @draw_plot; pause 5 }"

This enables you to draw your plots with a simple @draw_plots. Notice how gnuplot permits using a semicolon to place multiple commands on the same line, which is useful for macros because they can’t contain line breaks. We’ll use functions and macros throughout the tutorial to avoid repeated typing.

Simple stats

The next thing we want to do is chart the data to highlight the differences between the datasets. We can use gnuplot’s stats command to produce averages, minima and maxima for them and then plot those on a histogram.

The stats command summarises one or two columns and reports to the terminal. You can see this with stats 'pi-nilakantha-15.dat'. It also sets variables that you can use in plots; the integrated help explains these: help stats. To demonstrate, we could have used stats to obtain the number of columns as a variable to use in the plot command instead of hard-coding it to 4. One of a number of variables it sets is called STATS_columns, so we’d have:

    stats 'pi-nilakantha-15.dat' nooutput; plot for [i=1:STATS_columns] ...

The use of nooutput suppresses the default report the command would otherwise produce, so that it only sets its variables.

If you try the above with our example file you may notice a warning about bad data on line 4. This happens because the column headers are not data values that stats can process. The warning may be ignored, or you can tell gnuplot that there is header data in the file:

    set key autotitle columnhead

The limitation of stats to the first two columns needs to be overcome when, like ours, the data has more than two columns that need to be considered. Once again we employ for and using to report one column at a time; we “print” the data we need, one row per column:

    do for [i=1:4] {
        stats 'pi-nilakantha-15.dat' using i nooutput
        print i, STATS_mean, STATS_min, STATS_max
    }

which produces:

    1 1.8755842 1.873858 1.879206
    2 21.7233386516571 21.4641873836517 22.1631166934967
    3 97.6620037794113 93.8680853843689 101.46163392067
    4 43.4605897188187 42.7861561775208 45.4818699359894

We want to annotate the rows with the original data columns they were derived from so that we can label the charts clearly. But stats doesn’t write a variable containing the column heading (there is no STATS_columnheader). However, we can write a function:

    columnheading(f,c) = system("awk '/^#/ {next}; {print $".c.";exit}' ".f)

This uses gnuplot’s system command to execute a shell command. In this case we use awk to extract the header of the required column c from file f. It’s a shame gnuplot can’t do this itself, because shelling out introduces dependencies that may not be portable (a concern if you care that gnuplot also runs on macOS and Windows). The awk command skips all comment lines, prints the required word from the next row and then exits, returning its output to the gnuplot session.

We also need to somehow capture the output from stats. We can use set print to have gnuplot write to a file instead of the terminal; all future print output will go to the named file. We can write a header line and use stats as previously described to write the report to the file. We write the dataset name as the heading of the first column so it’s available later on when charting. We can wrap all of that up and do it for all of our datasets:

    do for [dataset in datasets] {
        set print file(dataset,'stats')
        print dataset.' mean min max'
        do for [i=1:4] {
            f = file(dataset,'dat')
            stats f using i nooutput
            print columnheading(f,i).' ', STATS_mean, STATS_min, STATS_max
        }
    }
    unset print

The final statement restores print output to the terminal. We can compress all of that into a macro if we collapse it into one line and end each statement with a semicolon:

    build_stats = "do for [dataset in datasets] { set print file(dataset,'stats'); print dataset.' mean min max'; do for [i=1:4] { f = file(dataset,'dat'); stats f using i nooutput; print columnheading(f,i).' ', STATS_mean, STATS_min, STATS_max } }; unset print"
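If you’d like to cross-check the figures gnuplot writes to the stats files, the same summary is easy to compute independently. This Python sketch is our own illustration rather than part of the gnuplot workflow: the function name column_summary is hypothetical, the sample values are made up, and it assumes the file has a header row as ours does.

```python
def column_summary(lines):
    """Return a (heading, mean, minimum, maximum) tuple per column.

    `lines` is any iterable of text lines, such as an open file.
    Assumes the first non-comment row is a header row.
    """
    header = None
    columns = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment rows
        if header is None:
            header = fields  # first non-comment row names the columns
            columns = [[] for _ in header]
            continue
        for values, field in zip(columns, fields):
            values.append(float(field))
    if header is None:
        return []
    return [
        (name, sum(values) / len(values), min(values), max(values))
        for name, values in zip(header, columns)
        if values
    ]


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    sample = ["# example data", "C Ruby", "1.0 10.0", "2.0 20.0", "3.0 30.0"]
    for name, mean, low, high in column_summary(sample):
        print(name, mean, low, high)
```

Passing an open file instead of the sample list, for example column_summary(open('pi-nilakantha-15.dat')), should give one row per data column in the same order as the stats loop above.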

Charting bars

We can now make a histogram from the stats data which, by default, is clustered. This means each row of data is represented by a group of bars with one bar per column, each group separated by a space equivalent to the width of two bars.

    plot for [i=2:4] file('nilakantha','stats') using i with histogram title columnheader

The avid statistician may point out that what we have here is a bar chart, because it displays categorical data, whereas a histogram displays the distribution of quantitative data such as people’s ages and is often binned (grouped into, say, age ranges). Histogram is sometimes said to be a portmanteau of historical diagram; however the gnuplot display type that gives the output we need is called histogram, so that’s what we use here.
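To make the distinction concrete, binning means counting how many values fall into each range before anything is drawn. A minimal Python sketch, using made-up ages and a hypothetical helper called bin_counts:

```python
from collections import Counter


def bin_counts(values, width):
    """Count how many values fall into each bin of the given width.

    Returns (bin_start, count) pairs sorted by bin start.
    """
    counts = Counter((value // width) * width for value in values)
    return sorted(counts.items())


if __name__ == "__main__":
    ages = [23, 37, 41, 29, 35, 52, 38, 44, 27, 31]  # invented sample
    for start, count in bin_counts(ages, 10):
        print(f"{start}-{start + 9}: {count}")
```

Our stats files need no such step because each row is already a category (one per language), which is why the result is strictly a bar chart.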

Our chart looks quite bland but it can be improved with styling. We’ll make a few changes to illustrate the possibilities, but also see the documentation.

    set style data histogram
    set style histogram gap 1
    set style fill solid border -1
    set boxwidth 0.8

We begin by changing the default data style, which is how plot renders data when we don’t specify it. Making histogram the default means we can omit with histogram from our plot command line. Next we set the gap between the histogram clusters, changing it from the default, 2, to 1. Then we set a solid fill with a solid border using the default line type, solid black (-1). Finally we set the box width to 0.8 to leave a small gap between the bars (the width ranges between 0 and 1).

You may view available line types and other display attributes with the built-in terminal test. Enter test at the prompt. The output depends on the terminal so, if you plan to export plots to PNG or PDF, configure that first.

Another way to present the data uses error bars:

    set style histogram errorbars
    set key off autotitle columnhead
    plot 'pi-8.stats' using 2:3:4 title columnheader, \
         '' using (0):xticlabels(1) with lines

This presents one bar per category representing its mean value, but marks the minimum and maximum extremes on top using an error bar. The key has only one entry (the mean) so we set key off to disable it. We still need to tell gnuplot that there’s a header row, hence retaining autotitle columnhead.

A final embellishment labels the x-axis, but it needs some explaining. This additional instruction to plot specifies (0) for the y-axis and, for the x-axis, xticlabels(1), a function that sets the tick labels for the x-axis from the first column in the file. The instruction has to draw something, so we specify with lines: the resulting line runs along y=0, where it is obscured by the x-axis and therefore invisible (had we drawn with points you’d see them!).

We can wrap all of this together to display all datasets in a clustered histogram with error bars:

    plot for [dataset in datasets] file(dataset,'stats') \
         using 2:3:4 title columnheader(1), \
         '' using (0):xticlabels(1) with lines

Define this as a macro so that we can use it later:

    plot_histogram = "plot for [dataset in datasets] file(dataset,'stats') using 2:3:4 title columnheader(1), '' using (0):xticlabels(1) with lines"

It’s a good time to explain how using works in a little more detail; it depends on the data plotting style (from set style or specified using with). Two-dimensional plots expect, like our first one, a pair of x and y coordinates but the x value may be omitted: using 1 is the same as using 0:1 and produces a plot of row number (the inherent column 0) along the x-axis against the value in column 1 along the y-axis.

The histogram style expects using to be given one value to specify which column to represent as vertical bars reaching up the y-axis. The errorbars style expects additional values – we give a total of 3 to specify the columns containing the column value, and the minimum and maximum extremes of its error bar.

Combining plots

We can take all the plots we have produced so far and use gnuplot’s multiplot feature to bring them together. We must first define a grid for the plot layout to use:

    set multiplot layout 2,2

This defines a 2x2 grid that can accommodate four plots, sufficient for our three dataset plots and the clustered histogram. We just need to plot them in turn:

    do for [dataset in datasets] { @draw_plot }
    @plot_histogram

Finally, close the multiplot:

    unset multiplot

This is the most basic multiplot usage where plots are output in a left to right and top to bottom order.

help multiplot will reveal how it’s possible to control layout flow or use absolute positioning, even overlaying one plot within another.

For plotting discrete data we’ve introduced a lot of gnuplot’s features but there’s plenty more to discover. You can learn a lot from the interactive help alone, and the user manual, available from www.gnuplot.info, is very comprehensive although more of a reference than a tutorial. There is also a book, Gnuplot in Action, which is in its second edition having been updated for version five. We found the book an easier read for the novice gnuplotter.

Gnuplot likes sample data arranged in rows and columns. Comments and a header row are optional.

Error bars offer a clutter-free way to enrich a plot.

The clustered histogram displays multiple measurements for each sample, but the default rendering is rather bland.

Multiplots use a grid system to align and arrange plots.

Adding a little style can help perk up your graphs.
