Linux Format

Functional graphing

John Lane can make data shine without touching a mouse. It's just one of the Linux specialist's many skills that make his dinner parties so memorable.


When there’s data to interpret and no one else can help, don’t lose the plot! If you can find him, maybe you can hire John Lane for a crash course on gnuplot.

They say a picture is worth a thousand words, something you’ll appreciate if you need to compare buckets of numerical data. We did this in LXF243, when we made storage benchmarks, and we used gnuplot to do it. In this tutorial we'll show how you can use gnuplot to plot and chart your own data.

gnuplot is a tool for presenting data. It has nothing to do with the GNU Project, the name being a compromise that could just as easily have been llamaplot (see gnuplot.info/faq para 1.2). It doesn’t use the GNU GPL, instead preferring its own open source license.

We’ll use gnuplot to present some input data as a “plot” that reveals how it trends and changes, and as a “chart” that illustrates differences and similarities.

You should be able to install gnuplot from your distro’s package repository:

    sudo apt install gnuplot    # Debian, Ubuntu, etc
    sudo dnf install gnuplot    # Fedora, CentOS, etc
    sudo pacman -S gnuplot      # Arch Linux

Plotting data

We’ll need some sample data to illustrate our examples. gnuplot expects data to be supplied in a text file with one row per sample and one column per sampled parameter. We’ll use the example you can see in the screenshot (below). It illustrates the first few samples (one per row) of the time taken (in seconds) by an algorithm, coded similarly in C, Ruby, Python and Perl, that computes the mathematical constant pi to fifteen decimal places using the Nilakantha infinite series, which is named after the 15th-century mathematician. For our purposes, though, the data matters for learning about gnuplot rather than for what it says about the algorithms that produced it.

The initial rows, beginning with a #, are comments that gnuplot ignores. Each sample’s data values are separated by whitespace (tabs and/or spaces) and the first non-comment row is an optional header; its values may be used to label the plot.

This shows how gnuplot expects input data to be structured. The format lends itself well to any kind of sampled data – perhaps periodic samples from various sensors hooked up to your latest Raspberry Pi project?
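The layout described above can be tried without creating a file. The following is a minimal sketch that assumes gnuplot 5 or later (for inline datablocks); the name $readings, the column names and the readings themselves are invented purely for illustration.

```gnuplot
# Hypothetical sensor readings, invented only to show the layout.
$readings << EOD
# Comment rows start with a hash and are ignored
temperature humidity
21.5 40.1
21.7 40.4
22.0 39.8
EOD

# Column 0 is the row number; the header row supplies the key titles.
plot for [i=1:2] $readings using 0:i with lines title columnheader
```

Replacing $readings with a quoted filename plots the same layout from a file on disk.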

It’s simple to plot this using gnuplot. Open a terminal and enter gnuplot to launch its interactive command-line interface. You may type commands directly at the gnuplot> prompt, like this:

    plot for [i=1:4] 'pi-nilakantha-15.dat' using 0:i with lines title columnheader

The plot command is one of gnuplot's main commands. It plots the given data, in this case contained in the file pi-nilakantha-15.dat, using the data from columns 0 for x-coordinates and i for y-coordinates to plot with lines connecting the data points and a title from the columnheader. That’s repeated for values of i between 1 and 4 – for each of the four data columns in the file. Column 0 is the row number, beginning at 0 and incrementing for each row.

The plot shows four lines, one per column, with the row number along the x-axis and the data value along the y-axis. Each line is coloured and a key is displayed using the column names taken from the header row. These are defaults, but gnuplot is flexible and we’ll demonstrate how it can be customised.

We could have specified each of the plotted lines separately instead of using for:

    plot 'pi-nilakantha-15.dat' using 0:1 with lines title columnheader, \
         '' using 0:2 with lines title columnheader, \
         '' using 0:3 with lines title columnheader, \
         '' using 0:4 with lines title columnheader

This shows how plot accepts a series of comma-separated instructions. It also shows how you can omit the filename on subsequent uses: gnuplot replaces the empty string ('') with the previously used filename (pi-nilakantha-15.dat in this case). And it shows how you can use the backslash line-continuation character (\) to break a statement across multiple lines.

You can also abbreviate many gnuplot commands. The above example may be shortened to

    p for [i=1:4] 'pi-nilakantha-15.dat' u 0:i w l t columnh

which is much less readable and can be confusing when learning, so our examples are written out using their full form. Unfortunately many examples and tutorials, including the gnuplot documentation, use the shorthand form. It’s possible to look up gnuplot abbreviations (https://superuser.com/q/508644), but it isn’t as easy as it could be.

Should you wish, you can annotate the plot with a title and axis labels:

    set title "Pi computation comparison"
    set xlabel "Sample number"
    set ylabel "Execution time (seconds)"

If these are set before performing the plot they’ll appear on the graph. The title appears above the graph.

Variables, functions and macros

If you have multiple datasets (we used a few algorithms to calculate pi) you might want to plot all of them. We can store our datasets’ names in a variable:

    datasets = "nilakantha simpson viete"

It’s just a simple string variable, but we can use it to run commands for each of those datasets:

    do for [dataset in datasets] {
        plot for [i=1:4] 'pi-'.dataset.'-15.dat' using 0:i with lines title columnheader
        pause 5 "Displaying ".dataset
    }

The commands listed between the braces are performed with the variable dataset set to each (whitespace-separated) word within datasets. We use it to build the filename by concatenating substrings with the . operator, which obviously relies on our datasets being stored in appropriately named files. We also use pause to allow a little time between plots so each one can be seen. This also shows how multi-line blocks (delimited by braces) can be entered: the command-line prompt changes to more> while within a block.
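Word iteration can be tried on its own before attaching it to a plot. This is a small sketch in which the variable names and the list contents are made up; words() and word() are gnuplot's built-in string functions for counting and picking words.

```gnuplot
# A hypothetical list of names, separated by whitespace.
names = "alpha beta gamma"

# The block runs once per word.
do for [name in names] {
    print "dataset: ".name
}

print words(names)      # how many words the string holds: 3
print word(names, 2)    # the second word: beta
```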

Similar to variables, we may define functions. They enable us to define expressions that evaluate to a string or numeric result. We can use a function to build a dataset’s filename:

    file(dataset,extn) = "pi-".dataset."-15.".extn

Here, given a dataset name and a file extension, we return the formatted filename. Functions in gnuplot must take this form: a name, some arguments and an expression that returns a string or numeric value. You can’t program a function using multiple statements (you can’t use a do block), nor one that changes settings, uses iterators or does other things that would usually make sense in a function. So gnuplot functions are somewhat limited (https://stackoverflow.com/a/27835753), but can still be useful.
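A quick way to check a function is to print its result. The sketch below repeats the file() definition from the text and adds smaller(), a hypothetical numeric function included only to show that the ternary operator can stand in for an if/else inside a single expression.

```gnuplot
# String function from the text: builds a filename from its two arguments.
file(dataset, extn) = "pi-".dataset."-15.".extn
print file("nilakantha", "dat")     # pi-nilakantha-15.dat

# Hypothetical numeric function; the ternary operator replaces if/else.
smaller(a, b) = a < b ? a : b
print smaller(3, 7)                 # 3
```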

Macros offer an alternative to functions, although they don’t accept parameters. gnuplot simply replaces the macro with its value. We can use a macro to avoid repeatedly typing our plot command:

    draw_plot = "plot for [i=1:4] file(dataset,'dat') using 0:i with lines title columnheader"

What we have here is a simple assignment of a string to a variable. It looks like a variable and that’s really what it is. But if you precede the name with an @ then it’s replaced by its value before the command is executed. Like this:

    do for [dataset in datasets] { @draw_plot; pause 5 }

And you can nest them, too:

    draw_plots = "do for [dataset in datasets] { @draw_plot; pause 5 }"

This enables you to draw your plots with a simple @draw_plots. Notice how gnuplot permits using a semicolon to place multiple commands on the same line, which is useful for macros because they can’t contain line breaks. We’ll use functions and macros throughout the tutorial to avoid repeated typing.
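Macros can also hold just a fragment of a command. As a minimal sketch, the hypothetical line_style macro below stores styling options that are pasted into a plot of two built-in functions; the commented-out set macros line is, as far as we know, only needed on gnuplot 4.x.

```gnuplot
# set macros    # uncomment on gnuplot 4.x; version 5 expands macros by default

# Hypothetical macro holding a fragment of a plot command.
line_style = "with linespoints linewidth 2"

# @line_style is replaced by the string before the command is parsed.
plot sin(x) @line_style, cos(x) @line_style
```

Because expansion happens before parsing, the macro must be defined on an earlier line than the one that uses it.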

Simple stats

The next thing we want to do is chart the data to highlight the differences between the datasets. We can use gnuplot’s stats command to produce averages, minima and maxima for them and then plot those on a histogram.

The stats command summarises one or two columns and reports to the terminal. You can see this with stats 'pi-nilakantha-15.dat'. It also sets variables that you can use in plots – the integrated help explains these: help stats. To demonstrate, we could have used stats to obtain the number of columns as a variable to use in the plot command instead of hard-coding it to 4. One of a number of variables it sets is called STATS_columns, so we’d have:

    stats 'pi-nilakantha-15.dat' nooutput; plot for [i=1:STATS_columns] ...

The use of nooutput suppresses the default report the command would otherwise produce, so that it only sets its variables.
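To see which variables stats sets without needing a data file, here is a small sketch using an inline datablock (gnuplot 5 or later). The datablock name $values and its numbers are invented, chosen so the results are easy to check by hand.

```gnuplot
# Hypothetical two-column data.
$values << EOD
1 10
2 20
3 30
4 40
EOD

# Summarise column 2 quietly, then inspect the variables it set.
stats $values using 2 nooutput
print STATS_records           # rows read: 4
print STATS_min, STATS_max    # extremes of column 2: 10 and 40
print STATS_mean              # average of column 2: 25
```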

If you try the above with our example file you may notice a warning about bad data on line 4. This happens because the column headers are not data values that stats can process. The warning may be ignored, or you can tell gnuplot that there is header data in the file:

    set key autotitle columnhead

The limitation of stats to at most two columns needs to be overcome when, like our data, there are more than two columns that need to be considered. Once again we employ for and using to report one column at a time; we “print” the data we need, one row per column:

    do for [i=1:4] { stats 'pi-nilakantha-15.dat' using i nooutput; print i, STATS_mean, STATS_min, STATS_max }

which produces:

    1 1.8755842 1.873858 1.879206
    2 21.7233386516571 21.4641873836517 22.1631166934967
    3 97.6620037794113 93.8680853843689 101.46163392067
    4 43.4605897188187 42.7861561775208 45.4818699359894

We want to annotate the rows with the original data columns they were derived from so that we can label the charts clearly. But stats doesn’t write a variable containing the column heading (there is no STATS_columnheader). However, we can write a function:

    columnheading(f,c) = system("awk '/^#/ {next}; {print $".c.";exit}' ".f)

This uses gnuplot’s system command to execute a shell command. In this case we use awk to extract the header of the required column c from file f. It’s a shame gnuplot can’t do this itself, because shelling out introduces dependencies that may not be portable (if you’re concerned about that, gnuplot also runs on macOS and Windows). The awk command skips all comment lines, prints the required word from the next row and then exits, returning its output to the gnuplot session.

We also need to somehow capture the output from stats. We can use set print to have gnuplot write to a file instead of the terminal, and all future print output will go to the named file. We can write a header line and use stats as previously described to write the report to the file. We write the dataset name into the header position of the first column so it’s available later on when charting. We can wrap all of that up and do it for all of our datasets:

    do for [dataset in datasets] {
        set print file(dataset,'stats')
        print dataset.' mean min max'
        do for [i=1:4] {
            f = file(dataset,'dat')
            stats f using i nooutput
            print columnheading(f,i).' ', STATS_mean, STATS_min, STATS_max
        }
    }
    unset print

The final statement restores print output to the terminal. We can compress all of that into a macro if we collapse it into one line and end each statement with a semicolon:

    build_stats = "do for [dataset in datasets] { set print file(dataset,'stats'); print dataset.' mean min max'; do for [i=1:4] { f = file(dataset,'dat'); stats f using i nooutput; print columnheading(f,i).' ', STATS_mean, STATS_min, STATS_max } }; unset print"
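Redirecting print is easy to try in isolation. This minimal sketch writes two lines to a file and then restores terminal output; the filename summary.txt and the printed values are placeholders.

```gnuplot
# Send print output to a file (hypothetical name) instead of the terminal.
set print 'summary.txt'
print 'name value'
print 'example', 1.5

# Close the file and restore output to the terminal.
unset print
print 'back on the terminal'
```

By default set print overwrites the file; adding the append keyword after the filename adds to it instead.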

Charting bars

We can now make a histogram from the stats data which, by default, is clustered. This means each row of data is represented by a group of bars with one bar per column, each group separated by a space equivalent to the width of two bars.

    plot for [i=2:4] file('nilakantha','stats') using i with histogram title columnheader

The avid statistician may point out that what we have here is a bar chart because it displays categorical data, whereas a histogram displays the distribution of quantitative data such as people’s ages and is often binned (grouped into, say, age ranges). Histogram is often said to be a portmanteau of “historical diagram”; however, the gnuplot display type that gives the output we need is called histogram, so that’s what we use here.

Our chart looks quite bland but it can be improved with styling. We’ll make a few changes to illustrate the possibilities, but also see the documentation.

    set style data histogram
    set style histogram gap 1
    set style fill solid border -1
    set boxwidth 0.8

We begin by changing the default data style, which is how plot renders data when we don’t specify it. Making histogram the default means we can omit with histogram from our plot command line. Next we set the gap between the histogram clusters, changing it from the default, 2, to 1. Then we set a solid fill with a solid border using the default line type, solid black (-1). Finally we set the box width to 0.8 to leave a small gap between the bars (the width ranges between 0 and 1).
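The styling commands can be tried against inline data. The sketch below assumes gnuplot 5 or later; the datablock $summary and its values are invented and only mimic the shape of the stats files (a label column followed by mean, min and max).

```gnuplot
# Hypothetical summary table in the same shape as the stats files.
$summary << EOD
label mean min max
first 2.0 1.5 2.5
second 3.0 2.0 4.0
EOD

set style data histogram
set style histogram clustered gap 1
set style fill solid border -1
set boxwidth 0.8
set yrange [0:*]    # start the bars at zero

# One bar per column in each cluster; column 1 labels the clusters.
plot for [i=2:4] $summary using i:xticlabels(1) title columnheader(i)
```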

You may view available line types and other display attributes with the built-in terminal test. Enter test at the prompt. The output is based on the terminal so, if you plan to export plots to PNG or PDF, configure that first.
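As a sketch of configuring the terminal first, the following writes the test page to a PNG file. It assumes your gnuplot build includes the pngcairo terminal (set terminal on its own lists what is available), and the output filename is a placeholder.

```gnuplot
# Assumes the pngcairo terminal is available in this build.
set terminal pngcairo size 800,600
set output 'terminal-test.png'    # hypothetical filename

# Draw the test page for the selected terminal.
test

# Close the output file so the image is written out completely.
set output
```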

Another way to present the data uses error bars:

    set style histogram errorbars
    set key off autotitle columnhead
    plot 'pi-8.stats' using 2:3:4 title columnheader, \
         '' using (0):xticlabels(1) with lines

This presents one bar per category representing its mean value, but marks the minimum and maximum extremes on top using an error bar. The key has only one entry (the mean) so we set key off to disable it. We still need to tell gnuplot that there’s a header row, hence retaining autotitle columnhead.

A final embellishment labels the x-axis, but it needs some explaining. This additional instruction to plot specifies (0) for the y-axis and, for the x-axis, xticlabels(1) – a function that sets the tick labels for the x-axis from the first column in the file. We specify with lines because the instruction still draws something: a line along y=0, which is unwanted but is obscured by the x-axis and therefore invisible (had we drawn with points you’d see them!).

We can wrap all of this together to display all datasets in a clustered histogram with error bars:

    plot for [dataset in datasets] file(dataset,'stats') \
         using 2:3:4 title columnheader(1), \
         '' using (0):xticlabels(1) with lines

Define this as a macro so that we can use it later:

    plot_histogram = "plot for [dataset in datasets] file(dataset,'stats') using 2:3:4 title columnheader(1), '' using (0):xticlabels(1) with lines"

It’s a good time to explain how using works in a little more detail; it depends on the data plotting style (from set style or specified using with). Two-dimensional plots expect, like our first one, a pair of x and y coordinates but the x value may be omitted: using 1 is the same as using 0:1 and produces a plot of row number (the inherent column 0) along the x-axis against the value in column 1 along the y-axis.

The histogram style expects using to be given one value to specify which column to represent as vertical bars reaching up the y-axis. The errorbars style expects additional values – we give a total of 3 to specify the columns containing the column value, and the minimum and maximum extremes of its error bar.
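The three-value form of using can be seen in a self-contained sketch. The datablock $extremes and its numbers are invented; each row gives a label, a bar height, and the lower and upper ends of its error bar. There is no header row here, so no autotitle setting is needed.

```gnuplot
# Hypothetical rows: label, value, minimum, maximum.
$extremes << EOD
first 2.0 1.5 2.5
second 3.0 2.0 4.0
third 2.5 2.2 3.1
EOD

set style data histogram
set style histogram errorbars gap 1
set style fill solid border -1
set key off
set yrange [0:*]

# using value:min:max, with tick labels taken from column 1.
plot $extremes using 2:3:4:xticlabels(1)
```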

Combining plots

We can take all the plots we have produced so far and use gnuplot’s multiplot feature to bring them together. We must first define a grid for the plot layout to use:

    set multiplot layout 2,2

This defines a 2x2 grid that can accommodate four plots, sufficient for our three dataset plots and the clustered histogram. We just need to plot them in turn:

    do for [dataset in datasets] { @draw_plot }
    @plot_histogram

Finally, close the multiplot:

    unset multiplot

This is the most basic multiplot usage where plots are output in a left to right and top to bottom order.
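The grid behaviour is easiest to see with built-in functions rather than data files. This is a minimal sketch; the overall title and the four functions are arbitrary choices.

```gnuplot
# A 2x2 grid, filled left to right and then top to bottom.
set multiplot layout 2,2 title "Four panels"

plot sin(x)         # top left
plot cos(x)         # top right
plot x**2           # bottom left
plot exp(-x**2)     # bottom right

# Leave multiplot mode so later plots use the whole canvas again.
unset multiplot
```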

help multiplot will reveal how it’s possible to control layout flow or use absolute positioning, even overlaying one plot within another.

For plotting discrete data we’ve introduced a lot of gnuplot’s features but there’s plenty more to discover. You can learn a lot from the interactive help alone, but the user manual – available from www.gnuplot.info – is very comprehensive, although more of a reference than a tutorial. There is also a book, Gnuplot in Action, which is in its second edition having been updated for version five. We found the book an easier read for the novice gnuplotter.

Gnuplot likes sample data arranged in rows and columns. Comments and a header row are optional.
Error bars offer a clutter-free way to enrich a plot.
The clustered histogram displays multiple measurements for each sample, but the default rendering is rather bland.
Multiplots use a grid system to align and arrange plots.
Adding a little style can help perk up your graphs.
