APC Australia

An introduction to R – Part 2

It’s one of the top-two machine-learning languages and easier to learn than you might think. Darren Yates continues our machine-learning masterclass with a deeper look at R.


Machine-learning is fast becoming the buzzphrase of the age. Few areas of endeavour have escaped its charms over recent years. It’s the subset of artificial intelligence that mixes computer science, statistics and mathematics, promising to make sense of the mountains of data we generate every day. All the major software players have cloud-based machine-learning services on offer. But while it can all seem quite daunting, the top two machine-learning tools continue to be the programming languages Python and R. We introduced you to R last month via Kaggle’s Titanic competition. This month, we begin looking at how to code your own R scripts from scratch.

SETTING UP R ON YOUR PC

If you missed last month’s issue, we’re using a combination of the R programming language and the RStudio integrated development environment (IDE). Both are open-source and freely available for Windows, macOS and Linux. We’re using the Windows versions here. To simplify installation, you need to install R first, then RStudio. Grab R from the CSIRO mirror at cran.csiro.au and download the free RStudio Desktop from rstudio.com/products/rstudio/download. When you’ve installed them, fire up RStudio.

GETTING STARTED

The R language may have been designed specifically as a statistical language, but in some ways, it’s actually not too dissimilar to Python. R allows you to write code files or ‘scripts’ and its structure or ‘syntax’ has touches of Java and Python about it. An R script file has the suffix ‘.R’, but in practice, it’s just a standard text file. You can create it with Notepad if you like, but we’ll use the RStudio code editor instead.

With RStudio open, you’ll see the code editor panel on the left. To create a new R script at any time, from the main menu, select ‘File > New File > R Script’ or press Ctrl-Shift-N.

CODING BASICS

Now R supports many of the coding basics you’ll find in other programming languages. For example, you can assign values to variables, print, read from and write to files on your system storage, create reusable code-blocks called ‘functions’, and execute if-else statements and for-loops. You can also import external libraries or ‘packages’ that contain functions to perform new tasks. You can even take advantage of your CPU’s multiple cores and write code that runs in parallel.

Because it’s designed to deal with lots of data, R also has extra tricks up its sleeve, such as special functions that allow you to process data much faster than standard for-loops.

Let’s start off with variables. Like Python, you do not need to declare a data type when creating a variable – you just assign the value you want to it. For example:

runs <- 6996
innings <- 70
sport <- 'cricket'

The first two assign whole numbers (R stores plain numbers like these as its general ‘numeric’ type, so they could equally be decimal-point or ‘real’ values), the third a string of characters. The assignment symbol is created using the less-than (<) key followed by the minus key (-). You can perform maths during assignment, such as:

average <- runs/innings
print(paste("Batting average:", average, "runs per innings"))

R’s print command functions a little differently to Python: you can’t use the addition (+) sign to concatenate or join variables. Instead, you use the paste() function with commas (,) separating variables and literals (a literal is a number or text string you want printed as it appears in the code). In the case above, we start by printing the text “Batting average:”, followed by the value stored in the ‘average’ variable and finishing with the text “runs per innings”. You can type these code lines yourself or you can load up the ‘batting_1.r’ script from this month’s source code pack, which you’ll find on our website at apcmag.com/magstuff. To run the code, select ‘File > Open File’, choose the file, then press the ‘Source’ button at the top-right of the code editor panel.
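As a rough sketch of how paste() joins its arguments, the lines below reuse the runs and innings values from above. The round() call, the ‘sep’ example and the paste0() variant are our own additions for illustration and are not part of ‘batting_1.r’.

```r
# Values from the batting example above
runs <- 6996
innings <- 70
average <- runs / innings

# paste() joins its arguments with a single space by default
msg <- paste("Batting average:", round(average, 2), "runs per innings")
print(msg)

# The 'sep' argument changes the separator; paste0() uses none at all
label <- paste("runs", "innings", sep = "/")
tight <- paste0("avg=", round(average, 2))
print(label)
print(tight)
```

Rounding before pasting keeps the printed average to two decimal places rather than the long default.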

IF-ELSE

If you’re familiar with IFTTT, the online conditional-action service (ifttt.com), then you already get the idea of conditional statements – if ‘this’ then ‘that’. You can call it ‘code branching’ because the code execution changes or ‘branches’ depending on whether the condition (the ‘this’ bit) is true or false. Try this example:

testScore <- as.double(readline(prompt="Enter test score:"))
if (testScore >= 50) {
  print("You passed!")
} else {
  print("You need to do the course again.")
}

Here, we’re using the readline() function to get input from the user. We also use the as.double() function to convert the characters into a decimal-point or ‘double’ value and store it in the variable ‘testScore’. Next we introduce our if-else conditional statement. First, we check the value of testScore and if it’s 50 or greater (>=), we print ‘You passed!’, otherwise (else), we print ‘You need to do the course again.’ Note that each code block is enclosed in curly brackets {}. The key to operation here is that if the conditional statement (testScore >= 50) is true, then the first code-block runs. If it’s false, the code following the ‘else’ keyword runs instead. So, if-else code-blocks are ‘one or the other’ type statements – you can’t execute both. You’ll find this in the source code pack as ‘testscore_1.r’.
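To see both branches without typing at the console, here is a minimal sketch that swaps readline() for fixed scores. The variable names ‘result’ and ‘result2’ are our own and don’t appear in ‘testscore_1.r’.

```r
# Fixed score instead of readline(), so the script runs without user input
testScore <- 49.5

if (testScore >= 50) {
  result <- "You passed!"
} else {
  result <- "You need to do the course again."
}
print(result)

# A boundary value: 50 itself satisfies >= 50, so the first block runs
testScore <- 50
if (testScore >= 50) {
  result2 <- "You passed!"
} else {
  result2 <- "You need to do the course again."
}
print(result2)
```

Note that ‘else’ sits on the same line as the closing bracket of the first block, which is how R expects it when the statement is run at the top level of a script.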

FUNCTIONS

Like Python, almost every command in R is essentially a function and you can write your own custom functions to simplify your script structure. If you’re coming from Python, the syntax is a little different, but not confusingly so:

your_function <- function(parameter_1, parameter_2, …) {
  … some function statements …
  return(some_value)
}

This is the simplest form of function structure and starts with your function name, followed by the assignment symbol and the keyword ‘function’. Following the keyword is a list of parameter variables that contain values you want to process in your function. Following this is an opening curly bracket ‘{’ signalling the start of your function code, which ends at the matching closing bracket ‘}’. Finally, the last thing you typically have is a return() statement, which returns the value from some variable that you can then do something else with. For example, we can turn our batting average example into a function like this:

calculateAverage <- function(runs, innings) {
  average <- runs/innings
  return(average)
}

You could also simplify it down to just:

calculateAverage <- function(runs, innings) {
  return(runs/innings)
}

We can incorporate the function return straight back into the print statement:

print(paste("Batting average:", calculateAverage(6996, 70), "runs per innings"))

You’ll find this version of the code in our source code pack as ‘function_1.R’. Note that the parameters to calculateAverage() in the print() statement above are just literals. They could be variables as well, in which case the values of the variables are passed to the function and everything continues as before. Our example is overkill for something this simple, but deliberately so, just so you can see the form and how it works.
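As a small sketch of passing variables rather than literals, the lines below call the same function with two variables. The names ‘careerRuns’ and ‘careerInnings’ and the named-argument call are our own illustration, not part of ‘function_1.R’.

```r
calculateAverage <- function(runs, innings) {
  return(runs / innings)
}

# Variables instead of literals: their values are passed to the function
careerRuns <- 6996
careerInnings <- 70
avg <- calculateAverage(careerRuns, careerInnings)
print(paste("Batting average:", round(avg, 2), "runs per innings"))

# Arguments can also be matched by parameter name, in any order
sameAvg <- calculateAverage(innings = careerInnings, runs = careerRuns)
```

Naming the arguments makes it harder to swap runs and innings by accident.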

VECTORS

So far, each of our variables has held just a single value. However, more useful in machine learning is a sequence of values. In Python, you might call this a ‘list’ and in Java, an ‘ArrayList’. In R, the most basic form of list is called a ‘vector’. Vectors are easy to create. Here’s an example of five temperature readings in degrees F:

tempsF <- c(32, 67, 80, 88, 212)

The c() function allows us to combine values into a single vector and we store them in the vector ‘tempsF’. Suppose we now want to convert these into a vector of degrees C. We could write a function ‘convertFtoC’ to look like this:

convertFtoC <- function(temps) {
  tempCvec <- NULL
  for (x in 1:length(temps)) {
    tempC <- (temps[x] - 32) * (5 / 9)
    tempCvec <- c(tempCvec, tempC)
  }
  return(tempCvec)
}
print(convertFtoC(tempsF))

The original vector of values is ‘tempsF’, but these values are fed to the function, where they now become the internal ‘temps’ vector. We then use a for-loop to iterate over each element, convert it and store it in the temporary variable ‘tempC’. To add it to the tempCvec vector, we use a concatenating trick: c() tacks each new tempC value onto the end of the existing tempCvec vector and the result is assigned back to tempCvec. The source code for this is in the source code pack as ‘temp_1.R’.
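One common variation on the loop above is to reserve the result vector once and fill it by position, rather than rebuilding it with c() on every pass. The sketch below is our own hypothetical ‘convertFtoC2’, not a file from the source code pack, and it assumes every input is a plain number.

```r
tempsF <- c(32, 67, 80, 88, 212)

# Hypothetical variant of convertFtoC(): reserve the result vector once,
# then fill each slot by index rather than growing it with c()
convertFtoC2 <- function(temps) {
  tempCvec <- numeric(length(temps))   # vector of zeros, same length as input
  for (x in seq_along(temps)) {
    tempCvec[x] <- (temps[x] - 32) * (5 / 9)
  }
  return(tempCvec)
}

tempsC <- convertFtoC2(tempsF)
print(round(tempsC, 1))
```

Here seq_along(temps) produces the index positions 1 to length(temps), and it also behaves sensibly if the vector happens to be empty.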

FORGET LOOPS

Yet while the code works, for-loops in R are usually quite slow. You won’t notice it in this example, but you would if the vector had 10 million temperatures to convert. However, we can do the same thing much faster without a for-loop using the lapply() function in the file ‘temp_2.R’:

tempsC <- lapply(tempsF, function(x) { return((x - 32) * (5 / 9)) })

The lapply() function allows you to iterate over a vector or ‘list’, performing tasks on each element in that vector. The function also returns a list with the same length as the input vector. So starting on the right-side of the assignment, the input vector is the original ‘tempsF’ vector and following that we have an anonymous function, ‘function(x)’. The ‘x’ is a temporary variable holding each value from vector tempsF in turn. We take each value, convert it to degrees C and return that value to the new list ‘tempsC’. The lapply() function automatically calculates the input vector’s element count so we don’t have to. The result is a much cleaner and faster function.
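Because lapply() hands back a list rather than a plain vector, it can help to see what to do with the result. The sketch below is our own addition and is not part of ‘temp_2.R’. It flattens the list with unlist(), then shows two alternatives that also exist in base R: sapply() and arithmetic applied to the whole vector at once.

```r
tempsF <- c(32, 67, 80, 88, 212)

# lapply() returns a list with one element per input value
tempsC <- lapply(tempsF, function(x) { return((x - 32) * (5 / 9)) })
print(class(tempsC))    # "list"

# unlist() flattens that list back into an ordinary numeric vector
tempsCvec <- unlist(tempsC)

# sapply() does both steps in one call when each result is a single value
tempsCvec2 <- sapply(tempsF, function(x) (x - 32) * (5 / 9))

# R arithmetic also works on whole vectors directly
tempsCvec3 <- (tempsF - 32) * (5 / 9)

print(round(tempsCvec, 1))
```

All three approaches give the same five values. Which one suits best depends on whether you want a list or a vector back.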

GIVE IT A GO

If you’ve coded before, then at least some of this probably looks familiar. You might even be wondering when we’re going to get to some real machine-learning. Like most things, there’s little point rushing in before we’re ready, so it’s important to get a feel for R with the basics first. Next month, we crank things up a notch by looking at R’s nuts and bolts for machine learning in matrices and dataframes, as well as how to use them with the apply() functions. In the meantime, have a play around with RStudio!

Functions are excellent for creating neat and reusable code.
Use the left-arrow assignment symbol to assign values to variables.
The if-else conditional statement allows you to branch your R code.
R script files are just text files – you can even write them in Notepad.
The better way to work on vector values is to use the lapply() function.
A vector is a sequence of like values you can iterate over using a for-loop.
Grab the R programming language for your system and install it first!
Once the R language is installed, download and install RStudio Desktop.
