Linux Format

R

Mihalis Tsoukalos provides an introduction to the statistical programming language, and explores how to use it to analyse system-monitoring data.

- Mihalis Tsoukalos enjoys statistics and visualising monitoring data. As well as being a mathematician, he’s a Unix admin, a programmer and a database administrator.

Mihalis Tsoukalos provides an introduction to the statistical programming language in order to analyse system data. The things he can do with a box plot would make your eyes pop.

R is a GNU project based on S, a statistics-specific language and environment developed at the famous Bell Labs. Essentially, you can think of R as the free version of the S language. The R system distribution supports a large number of statistical procedures, including linear and generalised linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing.

In order to be as generic as possible, we will only use the command-line version of R in this tutorial, but don’t be afraid of the language even if you don’t feel comfortable with mathematics, since there are GUIs for it that you can use. The most popular of these (and my personal favourite) is RStudio (www.rstudio.com).

Installing R

You can install R on a Debian 7 system by typing:

    # apt-get install r-base

You can then run R and go to its shell by just typing R at the Unix shell. The following output shows how to do simple calculations in R:

    > 4+5
    [1] 9
    > 4*4
    [1] 16
    > 4^3
    [1] 64
    > 4 ^ 10
    [1] 1048576
    > 4/3
    [1] 1.333333

R can automatically read data from structured text files using the read.table() command. The single most useful R command for getting a general overview of a data set is the summary() command:

    > data <- read.table("uptime.data", header=TRUE)
    > summary(data)
         X1min             X5min             X15min
     Min.   :0.00000   Min.   :0.01000   Min.   :0.05000
     1st Qu.:0.00000   1st Qu.:0.01000   1st Qu.:0.05000
     Median :0.00000   Median :0.01000   Median :0.05000
     Mean   :0.02028   Mean   :0.02491   Mean   :0.05553
     3rd Qu.:0.00000   3rd Qu.:0.02000   3rd Qu.:0.05000
     Max.   :2.47000   Max.   :2.15000   Max.   :1.05000

You can find more information about the read.table() command by typing help(read.table).
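If you don't have a data file to hand, here is a minimal sketch that writes a tiny sample file and reads it back. It assumes that uptime.data is a whitespace-separated file whose header row reads 1min 5min 15min, which is what the X1min-style column names in the summary suggest; the three data rows and the load_data name are invented for illustration.

```r
# Hypothetical sample in the layout the summary() output above suggests
sample_file <- tempfile(fileext = ".data")
writeLines(c("1min 5min 15min",
             "0.00 0.01 0.05",
             "0.08 0.03 0.05",
             "2.47 2.15 1.05"), sample_file)

load_data <- read.table(sample_file, header = TRUE)

# Column names that start with a digit are not valid R names,
# so read.table() prefixes them with "X"
names(load_data)
summary(load_data)
```

The "X" prefix comes from read.table()'s default check.names=TRUE behaviour, which explains why the columns are called X1min rather than 1min.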

Creating new R functions

First, let’s look at the R code required to implement two new functions: one for finding Fibonacci numbers and one for finding the factorial of an integer. When defining your own functions, make sure that they have unique names. The code for calculating Fibonacci numbers is as follows:

    myFibo = function(i) {
        if ( i == 0) {
            return(0)
        }
        if ( i == 1) {
            return(1)
        }
        if ( i == 2) {
            return(1)
        }
        return (myFibo(i-1) + myFibo(i-2))
    }

The code should look pretty familiar to you. As you can see, you don't need to initialise or declare variables. Be careful though, because sometimes this can cause bugs or other nasty problems.

After saving the code, you can load it using source(), provided that your working directory is the directory where fibonacci.R is located (otherwise type the full path):

    > source("fibonacci.R")
    > myFibo(4)
    [1] 3
    > myFibo(15)
    [1] 610
    > myFibo(26)
    [1] 121393
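The recursive version recomputes the same values many times, so it slows down quickly as the argument grows. As a point of comparison, here is a sketch of an iterative variant; myFiboIter is a hypothetical name and is not part of the article's code.

```r
# Iterative Fibonacci: walks up from F(0) and F(1) instead of recursing
myFiboIter <- function(i) {
  if (i < 0) stop("i must be non-negative")
  a <- 0   # F(0)
  b <- 1   # F(1)
  for (k in seq_len(i)) {
    tmp <- a + b
    a <- b
    b <- tmp
  }
  a
}

myFiboIter(4)
myFiboIter(15)
myFiboIter(26)
```

The three calls should give the same results as the myFibo() session above.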

If everything is OK, R will print no output after executing the source() command. You can also see that R automatically prints the result value of the function. In case of an error in your R code, R prints a helpful error message for you:

    > source("fibonacci.R")
    Error in source("fibonacci.R") :
      fibonacci.R:5:16: unexpected numeric constant
    4: {
    5:         return 0
                      ^

The R code that we can use for finding the factorial of an integer is as follows:

    myFactorial = function(i) {
        if ( i == 0 ) {
            return(1)
        }
        if ( i < 0 ) {
            return(-1)
        }
        result = 1
        for ( k in 1:i ) {
            result = k*result
        }
        return(result)
    }

This time the implementation is a little different as it uses a for loop instead of recursion. The for loop uses a slightly different syntax compared to other programming languages but it is easy to understand and remember.
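For comparison, the same calculation can be written without an explicit loop. The sketch below uses base R's prod() and seq_len(); myFactorialProd is a hypothetical name, and it keeps the article's convention of returning -1 for negative input.

```r
# Loop-free factorial built from base R functions
myFactorialProd <- function(i) {
  if (i < 0) {
    return(-1)        # same convention as myFactorial() for negatives
  }
  prod(seq_len(i))    # seq_len(0) is empty, and prod() of nothing is 1
}

myFactorialProd(0)
myFactorialProd(5)
factorial(5)          # base R's own factorial(), for checking
```

Because seq_len(0) is empty, this version needs no special case for zero.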

Creating your own packages

Next, let’s organise the code by putting it into an R package. Packages are a good way of organising your own code. A function name only has to be unique within its own package, but it's still good practice to avoid names that overlap with those in other packages. If names do overlap between different packages, put the package name in front of the function name in order to call the function: for example, LinuxFormat::function() instead of just function().

The steps for generating an R package that we want to call LinuxFormat and that will include the myFibo() and myFactorial() functions are as follows:

    > ls()
    character(0)
    > source("        # pressing Tab
    fibonacci.R  factorial.R  sort.R
    > source("factorial.R")
    > source("fibonacci.R")
    > ls()
    [1] "myFactorial" "myFibo"
    > package.skeleton("LinuxFormat")
    Creating directories ...
    Creating DESCRIPTION ...
    Creating NAMESPACE ...
    Creating Read-and-delete-me ...
    Saving functions and data ...
    Making help files ...
    Done.
    Further steps are described in './LinuxFormat/Read-and-delete-me'.

The last R command creates a new directory called LinuxFormat, the same name as that of the R package, and you can peruse the contents with:

    $ ls -lR LinuxFormat/
    LinuxFormat/:
    total 20
    -rw-r--r-- 1 mtsouk mtsouk  284 Nov  4 10:18 DESCRIPTION
    drwxr-xr-x 2 mtsouk mtsouk 4096 Nov  4 10:18 man
    -rw-r--r-- 1 mtsouk mtsouk   31 Nov  4 10:18 NAMESPACE
    …

The LinuxFormat package will automatically have the two functions in it, because of the two source() calls. You will need to install the package as root so anyone can use it on your Linux system:

    # R CMD INSTALL LinuxFormat
    * installing to library '/usr/local/lib/R/site-library'
    * installing *source* package 'LinuxFormat' ...
    ** R
    …
    * DONE (LinuxFormat)

The following command and its output show that the package was successfully installed:

    # ls -l /usr/local/lib/R/site-library
    total 4
    drwxr-xr-x 6 root staff 4096 Nov  4 10:33 LinuxFormat

Warning: before installing the package you must edit both the myFactorial.Rd and myFibo.Rd files and fill in the \title fields. If you don’t do this, you will get an error message and the installation will fail.

From now on, you can use the new package as follows:

    > require(LinuxFormat)
    Loading required package: LinuxFormat
    > ls(getNamespace("LinuxFormat"))
    [1] "myFactorial" "myFibo"
    > ls()
    character(0)
    > myFibo(12)
    [1] 144
    > LinuxFormat::myFibo(12)
    [1] 144
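If you would rather experiment without cluttering your working directory, package.skeleton() also accepts an explicit list of objects, an environment to find them in and a target path. The sketch below is only an illustration under those assumptions: LinuxFormatDemo, myDouble and demo_env are hypothetical names, and the skeleton is written to R's temporary directory.

```r
# Keep the demo function in its own environment rather than the workspace
demo_env <- new.env()
assign("myDouble", function(i) 2 * i, envir = demo_env)   # hypothetical function

pkg_parent <- tempdir()
utils::package.skeleton(name = "LinuxFormatDemo", list = "myDouble",
                        environment = demo_env, path = pkg_parent,
                        force = TRUE)

# Inspect what was generated
pkg_dir <- file.path(pkg_parent, "LinuxFormatDemo")
list.files(pkg_dir, recursive = TRUE)

# The generated help page contains the \title field that the warning
# above says must be filled in before installing
rd_lines <- readLines(file.path(pkg_dir, "man", "myDouble.Rd"))
grep("^\\\\title", rd_lines, value = TRUE)
```

Nothing is installed here; the R CMD INSTALL step shown earlier is still needed to make a package available system-wide.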

Next, let’s look at how to use R to analyse real data. I’m going to be analysing a set of system-monitoring data I collected, but you could use your own data set.

The pairs() command offers a very handy way of finding relations between variables. If you use ggplot2, a powerful R package for generating graphics that deserves an article in its own right, you can also use ggpairs() from the companion GGally package. It is an improved version of pairs() that calculates and adds the coefficient of correlation in the output. This is a statistical term used to describe the strength of the relationship between two variables. In simple terms, the closer the value of the coefficient of correlation is to 0, the weaker the relationship between the two variables; that is, the closer they are to being uncorrelated. The closer the value of the coefficient is to +1 or -1, the stronger the correlation is between the variables. A positive coefficient of correlation shows that if one variable increases, the other variable tends to increase as well. A negative coefficient indicates that if one variable increases, the other one tends to decrease.
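To see the coefficient of correlation as a number rather than a picture, base R provides cor(). The following sketch uses made-up vectors (x, rising and falling are illustrative names, not monitoring data) that are perfectly linearly related, so the coefficients sit at the two extremes.

```r
x <- 1:10
rising  <- 3 * x + 2   # increases whenever x increases
falling <- 20 - x      # decreases whenever x increases

cor(x, rising)    # perfectly positive relationship
cor(x, falling)   # perfectly negative relationship

# cor() also accepts a whole data frame and returns a matrix
# with one coefficient per pair of columns
toy <- data.frame(x = x, rising = rising, falling = falling)
cor(toy)
```

Calling cor() on a data frame of load averages gives the same coefficients that ggpairs() prints in its panels.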

You can generate output from pairs() and ggpairs() with the following commands:

    > data <- read.table("uptime.data", header=TRUE)
    > pairs(data)
    > require(ggplot2)
    > require(GGally)
    > require(CCA)
    > ggpairs(data)

In the image (on p89), you can see the output of the pairs() command applied to my system-monitoring data. It shows that the variables X5min and X15min are 'more related' than variables X1min and X15min. In other words, the load average values of a Linux system change more drastically per minute than per five minutes or per fifteen minutes.

Let’s save the graphical output generated by R into a new file named filename.png. First, you need to open a device using png(), bmp() or pdf(). Then you plot what you want, using the commands you want. Finally, if you’re using R remotely, you close the device. Note that the final command below is not necessary in R scripts:

    png(filename="filename.png")
    # You now execute the plotting commands you want
    dev.off()
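Here is a complete open, plot, close cycle as a minimal sketch. It uses pdf() rather than png() only because the PDF device needs no extra graphics libraries, and it writes to a temporary file; the outfile name and the plotted values are placeholders.

```r
outfile <- tempfile(fileext = ".pdf")   # placeholder output location

pdf(file = outfile)                     # 1. open the device
plot(1:10, (1:10)^2,                    # 2. draw something
     main = "Device test", xlab = "x", ylab = "x squared")
invisible(dev.off())                    # 3. close the device to flush the file

file.exists(outfile)
```

Until the device is closed, the file may be incomplete, which is why the closing call matters in an interactive session.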

Next, let’s generate a box plot. This is a good way of showing the distribution, variation and median value of a data set at a glance. The bottom and top of the box represent the first and third quartiles of the data set, while the horizontal line in the middle shows the median. The ‘whiskers’ projecting above and below the box indicate variability outside the quartiles, while circles above or below the whiskers themselves indicate outlying data values.

Box plots excel in visualising metrics, such as visitor time on a page and time to serve a page. As an example, the plot shown at the foot of the page uses multiple samples of the three load average values, taken from the uptime command. It was generated using the following R commands:

    > data <- read.table("uptime.data", header=TRUE)
    > boxplot(data, ylab="Uptime Value", xlab="Sample values", col="lightblue", border="blue", main="Box Plot of Load Averages")
    > grid()

The first command reads data from an external file and saves it to a new variable called data. The second command generates the box plot using the data set of values. The last command draws a grid on screen for beautifying the output.
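The numbers behind a box plot can be inspected directly with boxplot.stats(), which boxplot() uses to work out where the box, whiskers and outliers go. The sample vector below is invented so that one value is an obvious outlier; the 1.5 multiplier mentioned in the comment is the function's default coef argument.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up sample with one extreme value

bs <- boxplot.stats(x)

# Five numbers: lower whisker, bottom of box, median, top of box, upper whisker
bs$stats

# Values lying more than 1.5 box-heights beyond the box are reported
# separately; boxplot() draws these as circles
bs$out
```

For this sample the box runs from 3 to 8 with the median at 5.5, the whiskers stop at 1 and 9, and 50 is flagged as an outlier.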

Heat maps

A heat map is a way of visualising a table of numbers in which you substitute the real values with coloured cells. They’re useful for finding highs and lows, and maybe patterns. Heat maps are best for relatively small data sets. Don’t try to use them to visualise more than 500 values or so, as this requires a more detailed knowledge of R.

Monitoring data for multiple computers is a good candidate for a heat map. The map shown on the top right of the page can be generated from my data set using the following commands:

    > data <- read.table("mapData", header=TRUE)
    > data_matrix <- data.matrix(data)
    > head(data)
      X1min X5min X15min X30min
    1   0.5  0.01   0.05    1.1
    2   0.5  0.01   0.05    1.3
    3   0.5  0.01   0.05    1.3
    4   0.5  0.01   0.05    1.2
    5   0.5  0.03   0.05    0.9
    6   0.5  0.01   0.05    1.3
    > heatmap(data_matrix, col = heat.colors(32), Rowv=NA, Colv=NA, margins=c(7,10))

To draw a heat map using different colours, use cm.colors, topo.colors or terrain.colors instead of heat.colors.
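As a self-contained illustration, the sketch below builds a small table of invented load averages for four hypothetical hosts and draws it with one of the alternative palettes. Two things here are choices of this sketch rather than of the article: pdf(NULL) opens a throwaway device so that nothing is written to disk, and scale="none" turns off the row scaling that heatmap() applies by default, so the colours reflect the raw values.

```r
# Invented numbers for four hypothetical hosts
loads <- data.frame(X1min  = c(0.5, 1.2, 0.1, 2.4),
                    X5min  = c(0.4, 1.0, 0.2, 1.9),
                    X15min = c(0.3, 0.8, 0.2, 1.1),
                    row.names = c("web1", "web2", "db1", "db2"))
load_matrix <- data.matrix(loads)

palette32 <- terrain.colors(32)   # 32 colour codes, as with heat.colors(32)

pdf(NULL)                         # throwaway device: draw without saving
heatmap(load_matrix, col = palette32, Rowv = NA, Colv = NA,
        scale = "none", margins = c(7, 10))
invisible(dev.off())
```

Replacing pdf(NULL) with png() or pdf() and a file name saves the image instead.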

Automation and data sorting using R

In my previous article [See Tutorials, p70, LXF192], I explained how to extract monitoring data as text files and process them manually using R. This time, let’s create R scripts to automate the process.

First, let’s generate heat maps. Each generated image file will have a unique name in order to keep historical data.

The script file, heatmap.R, is as follows:

    #!/usr/bin/env Rscript
    now <- format(Sys.time(), "%b%d%H%M%S")
    file_base <- "heatMap"
    outputfile <- paste(file_base, "-", now, ".png", sep="")
    data <- read.table("mapData", header=TRUE)
    png(filename=outputfile, width=1280, height=800)
    data_matrix <- data.matrix(data)
    heatmap(data_matrix, col = heat.colors(32), Rowv=NA, Colv=NA, margins=c(7,10))

Rscript is a front-end for scripting with R and is very handy for running R code using cron. If you make heatmap.R an executable file, like you would do with a Bash script (chmod 755), you can run it as a cron job without any problems!
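One detail worth considering for unattended runs is the file-name stamp. The script's "%b%d%H%M%S" pattern uses the abbreviated month name, which depends on the locale and does not sort chronologically in a directory listing. The sketch below shows an all-numeric alternative; this is a suggestion of this example, not what the script above does.

```r
# All-numeric timestamp: year, month, day, then time of day
now <- format(Sys.time(), "%Y%m%d-%H%M%S")
outputfile <- paste0("heatMap-", now, ".png")
outputfile
```

Files named this way list in the order they were created, which helps when keeping historical images.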

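We can also implement the famous Bubble sort algorithm in R. The code below is relatively slow but it’s easy to understand even if you’re not familiar with sorting:

    mySort = function(set) {
        len = length(set)
        # Nothing to sort for empty or single-element input; without this
        # check, 1:(len-1) would count downwards and index past the data
        if (len < 2) {
            return(set)
        }
        found = 1
        # Keep making passes until a pass completes without any swap
        while (found == 1) {
            found = 0
            for (k in (1:(len-1))) {
                if (set[k] > set[k+1]) {
                    # Swap neighbours that are in the wrong order
                    temp = set[k]
                    set[k] = set[k+1]
                    set[k+1] = temp
                    found = 1
                }
            }
        }
        return(set)
    }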

We use the (1:(len-1)) shortcut in this implementation, which generates all the index values that the inner loop needs for bubble sort to work:

    > len = 10
    > (1:(len-1))
    [1] 1 2 3 4 5 6 7 8 9
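The colon shortcut has one edge case worth knowing about: when the upper bound is smaller than the lower bound, it counts downwards instead of producing an empty sequence. The short sketch below shows this for a one-element input and compares it with base R's seq_len(), which is a common alternative.

```r
len <- 1

1:(len - 1)        # counts down from 1 to 0, so a loop would run twice
seq_len(len - 1)   # empty sequence, so a loop would not run at all

for (k in seq_len(len - 1)) {
  print(k)         # never reached when len is 1
}
```

This is why sorting code built on 1:(len-1) should either check the length first or use seq_len().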

In other words, the variable k takes all the values from the new set that is created using the (1:(len-1)) shortcut, one by one. To test the implementation, you can create a test data set with 100 values from 0 to 1,000 using:

    > test_vec = round(runif(100, 0, 1000))
    > mySort(test_vec)

The system.time() command can help you find out the time it took an operation to finish. It’s similar to the Unix time command. The session below sorts a small set with mySort() and then times R’s built-in sort() on the same set; the system.time() output should look something like this:

    > set = c(1, 3,4, 0, -1)
    > mySort(set)
    [1] -1  0  1  3  4
    > system.time(sort(set))
       user  system elapsed
      0.000   0.000   0.001
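To compare the bubble sort with the built-in sort() on equal terms, both can be timed on the same random input. The sketch below repeats a compact copy of the bubble sort, with a length check added, so that it runs on its own; the seed and the vector size of 500 are arbitrary choices, and the timings will differ from machine to machine.

```r
# Compact copy of the bubble sort so this block is self-contained
mySort <- function(set) {
  len <- length(set)
  if (len < 2) return(set)
  found <- TRUE
  while (found) {
    found <- FALSE
    for (k in 1:(len - 1)) {
      if (set[k] > set[k + 1]) {
        temp <- set[k]
        set[k] <- set[k + 1]
        set[k + 1] <- temp
        found <- TRUE
      }
    }
  }
  set
}

set.seed(42)                            # arbitrary, only for repeatability
test_vec <- round(runif(500, 0, 1000))

# Assigning inside system.time() keeps the result as well as the timing
bubble_time  <- system.time(bubble_sorted  <- mySort(test_vec))
builtin_time <- system.time(builtin_sorted <- sort(test_vec))

identical(bubble_sorted, builtin_sorted)   # both orders should agree
bubble_time["elapsed"]
builtin_time["elapsed"]
```

Checking the two results against each other is also a quick correctness test for the hand-written sort.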

Checking server security using R

Processing log files that contain web server data can be a very demanding job, but R deals with it comfortably! To import a log file into R:

    > LOGS = read.table("logfile.log", sep=" ", header=F)

As an example, I’m going to analyse a log file from a WordPress site. I want to monitor POST /wp-login.php HTTP/1.1, POST /wp-login.php HTTP/1.0, GET /wp-login.php HTTP/1.1 and GET /wp-login.php HTTP/1.0 requests, which indicate brute-force hack attempts.

Only columns V4 and V6 are of interest to me, so I first filter the matching requests into a new HACK variable and then drop the other columns from it, as follows:

    > names(LOGS)
     [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
    > HACK = subset(LOGS, V6 %in% c("POST /wp-login.php HTTP/1.1", "POST /wp-login.php HTTP/1.0", "GET /wp-login.php HTTP/1.0", "GET /wp-login.php HTTP/1.1"))
    > names(HACK)
     [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
    > HACK[1:3] <- list(NULL)
    > names(HACK)
    [1] "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"
    > HACK$V5 <- NULL
    > HACK[3:5] <- list(NULL)
    > HACK[3:4] <- list(NULL)
    > names(HACK)
    [1] "V4" "V6"
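The filtering and the column selection can also be done in a single indexing step by naming the columns to keep. The sketch below demonstrates this on three invented log lines; it assumes the log uses the common combined format, which splits into the ten columns listed above when read with sep=" ". The addresses, sizes and user agents are made up.

```r
# Three invented lines in the assumed combined log format
log_lines <- c(
  '203.0.113.5 - - [02/Nov/2014:09:15:01 +0000] "POST /wp-login.php HTTP/1.1" 200 3512 "-" "Mozilla/5.0"',
  '203.0.113.9 - - [03/Nov/2014:11:02:44 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
  '203.0.113.5 - - [04/Nov/2014:10:18:00 +0000] "GET /wp-login.php HTTP/1.0" 200 3512 "-" "Mozilla/5.0"'
)
LOGS <- read.table(text = log_lines, sep = " ", header = FALSE,
                   stringsAsFactors = FALSE)

wanted <- c("POST /wp-login.php HTTP/1.1", "POST /wp-login.php HTTP/1.0",
            "GET /wp-login.php HTTP/1.0",  "GET /wp-login.php HTTP/1.1")

# Rows: requests that match; columns: only the timestamp and the request
HACK <- LOGS[LOGS$V6 %in% wanted, c("V4", "V6")]

names(HACK)
nrow(HACK)
```

Selecting by name avoids having to track how the column positions shift after each deletion.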

Next, I can extract the day of the week from column V4 and generate a bar plot:

    > newV4 <- strptime(HACK$V4 , format('[%d/%b/%Y:%H:%M:%S'))
    > day = format(newV4, "%A")
    > barplot( table(factor(day, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))), xlab="Day of Week", ylab="Total Count", col="orange", border="lightblue", main="WordPress Hack Attempts!")
    > grid()
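The timestamp parsing can be tried out on a few hand-written values before pointing it at a real log. In the sketch below the three timestamps are invented, and the temporary switch to the C locale is an assumption of this example: both %b (month abbreviation) and %A (weekday name) depend on the locale, so English names are only guaranteed if the locale is set accordingly.

```r
# Make %b and %A use English names for the duration of the example
old_locale <- Sys.getlocale("LC_TIME")
invisible(Sys.setlocale("LC_TIME", "C"))

# Invented timestamps in the same shape as column V4
stamps <- c("[02/Nov/2014:09:15:01",
            "[04/Nov/2014:10:18:00",
            "[09/Nov/2014:23:59:59")

parsed <- strptime(stamps, format = "[%d/%b/%Y:%H:%M:%S", tz = "UTC")
day <- format(parsed, "%A")

# Fixing the factor levels keeps empty days in the table, in week order
week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
counts <- table(factor(day, levels = week))
counts

invisible(Sys.setlocale("LC_TIME", old_locale))   # restore the original locale
```

Passing counts to barplot() produces the same kind of chart as the command above, with one bar per weekday even when a day has no entries.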

As you might expect, the plot (shown at the top of the page) indicates that most hack attempts happen on Sundays, when the system is not being monitored.

That concludes our exploration of R and its use in analysing system data. Always remember that a heat map or a histogram is just another drawing: your own data will give meaning to every plot you create.

A heat map is a good and visually impressive way to present your data. Heat maps are great for analysing latency and utilisation monitoring data.
A box plot is a good way of showing the distribution, variation and median value of a data set at a glance.
Output from the pairs() command (discussed overleaf). The ggpairs() function from the GGally package, which builds on ggplot2, improves the output.
A bar plot showing hack attempts on a WordPress site, generated automatically from a log file using R.
