Code machine learning apps with R

It competes for the title of world’s favourite machine-learning language, but it can also be a little intimidating. Darren Yates introduces the statistical language known as ‘R’.

2018-11-01

Ask any 10 machine-learning engineers or data scientists their favourite language for doing machine-learning and you’ll likely get half-a-dozen answers. You’ll get votes for Google’s TensorFlow, votes for Java, even Excel has its fans. But the two languages that always seem to float to the top of any poll are Python and R. Just prior to our new machine-learning masterclass, we wrapped up a couple of years looking at Python programming in our ‘Learn to code Python’ masterclass. But while Python is an excellent general-purpose language that covers many bases, R is much more focussed on statistics and machine-learning. It’s fast, it has some very clever features and, if you’re seriously interested in machine-learning, at least understanding the basics of the language must feature somewhere on your bucket list. This month, we begin our introduction to R, installing and using it, by revisiting the Kaggle Titanic competition.

OPEN-SOURCE SOFTWARE

R is an open-source language, making it highly popular, but like Python, it’s also highly extensible, which is a geeky way of saying you can almost endlessly expand it with new functions and libraries called ‘packages’. It’s also an interpreted, rather than compiled, language, making it more like Python than C++. Instead of apps, you write ‘scripts’ that perform tasks ‘top to bottom’ in a single run. Despite being an interpreted language, R can still be remarkably quick, by tapping into packages written in other high-speed languages such as C++. It also has a few tricks that enable it to process data much faster than you can with traditional for-loops.
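To give a flavour of those tricks, here is a minimal sketch of what is usually called ‘vectorisation’: applying an operation to a whole vector in one go rather than looping over it element by element. The variable names are purely illustrative and any speed difference you see will vary from machine to machine.

```r
# Illustrative data: 100,000 random numbers (all names here are hypothetical)
set.seed(1)
x <- runif(1e5)

# Traditional for-loop: square each element one at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorised version: the whole vector is squared in a single call
squares_vec <- x^2

# Check that both approaches give the same answer
print(all.equal(squares_loop, squares_vec))
```

Wrapping either version in system.time() is a simple way to compare how long each takes on your own machine.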

DOWNLOAD AND INSTALL

To begin, R is available for Windows, macOS and Linux (Unix, more broadly) operating systems and you can download the latest version (3.5.1 at time of writing) from the R Project for Statistical Computing at r-project.org via one of its many mirrors around the world (including Australia). Choose the version for your operating system — and yes, there are both 32-bit and 64-bit versions.

Again, R is a bit like Python in that just as Python includes a basic development environment called IDLE, R has its own graphical user interface (GUI) called RGui. However, personally, I have to admit to preferring RStudio: it has many more features that make it easier to write R scripts, from smart code prompts to displaying stored objects. RStudio is also free and open-source, and the two, R and RStudio, work incredibly well together.

INSTALL R FIRST

There is just one caveat: during installation, RStudio likes to see an R distribution already installed, as it tries to set up links and working folders within the install process. So, just to make your life easier, ensure you install R first, before installing RStudio. You’ll find the ‘Open Source License’ version of RStudio Desktop at rstudio.com/products/rstudio/download. You may also notice an ‘RStudio Server’ version; you won’t need this for our introduction. Just stick with the desktop version. Again, there are versions for Windows, macOS, plus installers for Debian 8 and 9-based Linux distros, as well as Fedora 19+. If you prefer, there are non-install zip versions available as well.

We’ll be working with the Windows version, although I personally use both the Windows and Linux versions. An external library I currently need for another project is known not to compile correctly in Windows; however, for our introduction, this won’t be a problem. Still, if you find machine-learning grabs your interest, good coding skills won’t go astray.

GETTING STARTED

Once you have the two installations complete, launch RStudio and you’ll be greeted by the full-screen development environment. If you’ve ever coded Python, you might feel what you see is vaguely familiar. First up, the main screen panel will appear as the console. This is where you can run one-off R commands at the ‘>’ prompt. It’s also where you can access numerous example scripts that are bundled with RStudio, so, on the command prompt (‘>’), type demo() and press Enter. This gives you a list of demos. To see what the R language can do, type demo(graphics) then press Enter and follow the console prompts. R includes a number of excellent and easy to use graphics packages built-in and ready to go. The graphics demo has numerous impressive plots and will give you an idea of what’s possible once you get some R language under your belt. You’ll see them appear in the ‘Plots’ panel on the lower-right.

SIMPLE R CODING

As we mentioned, the console can also handle simple one-line coding, so let’s try a few things. Type the following lines, one at a time, each followed by the Enter key:

    a <- "hello"
    b <- "from apc magazine"
    c <- 42
    d <- c/2
    cat(a, b)
    if (d < 25) cat(d, "is less than 25")

The first two lines assign text strings ‘hello’ and ‘from apc magazine’ to the variables ‘a’ and ‘b’, respectively. The left-arrow (‘<-’) is commonly used for assignment in R (yes, you can use the more recognised equals-sign ‘=’, however, there are certain cases where it doesn’t have the same result, so use the arrow-assignment symbol instead). Following those lines, variables ‘c’ and ‘d’ are assigned the number 42 and the value of variable ‘c’ divided by two, respectively. To join multiple string variables together and print the result to the console, you can use the cat() function, by listing each variable separated by a comma as parameters.
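One of those cases is inside a function call. The sketch below uses a throwaway function and variable name of our own (‘half’ and ‘apc_value’ are not part of any package) to show that ‘=’ inside the brackets only labels an argument, whereas ‘<-’ also creates a variable in your workspace.

```r
# A tiny helper function with one argument (all names here are hypothetical)
half <- function(apc_value) apc_value / 2

# '=' inside the call only says which argument receives 42;
# no variable called 'apc_value' is created in the workspace
half(apc_value = 42)
exists_after_equals <- exists("apc_value")

# '<-' inside the call assigns 42 to 'apc_value' and then passes it on,
# so the variable is left behind in the workspace afterwards
half(apc_value <- 42)
exists_after_arrow <- exists("apc_value")

print(c(equals = exists_after_equals, arrow = exists_after_arrow))
```

At the top level of a script the two behave the same, which is why the difference is easy to miss.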

The last line is a simple if-conditional statement. In this case, if the value of variable ‘d’ is less than 25, we print the value of d, followed by the string ‘is less than 25’. Because this if-statement exists on a single line, we don’t need to use curly-brackets {} to surround statements. If the code following the if-statement evaluating to ‘true’ is more than a single line, that’s when you surround the code with curly-brackets. If you’ve ever done any coding in C, C++ or Java, this should be pretty familiar.
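As a minimal sketch of the multi-line case, here is the same test written with curly-brackets and an ‘else’ branch. The variable ‘message_text’ is our own illustrative name, and paste() is the base R function that joins its arguments into a single string rather than printing them.

```r
d <- 42 / 2

# More than one statement follows the condition, so curly-brackets are needed
if (d < 25) {
  message_text <- paste(d, "is less than 25")
  cat(message_text, "\n")
} else {
  message_text <- paste(d, "is 25 or more")
  cat(message_text, "\n")
}
```

This version is best saved in a script rather than typed line by line at the console, since the console can treat the ‘if’ as finished before it reaches the ‘else’.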

PREDICTING THE TITANIC WITH R

Obviously, there’s a lot more to learning the R language than just a few example lines of code, but to give you a more practical idea of how R can be used, we’re going to retrace our steps a bit. Over the last couple of months, we’ve been jumping on the Kaggle ‘Titanic: Machine Learning from Disaster’ competition (kaggle.com/c/titanic) to get a feel for how to use machine-learning; in this case, to predict which passengers survived the sinking of the Titanic. The Python language, along with the excellent ‘scikit-learn’ machine-learning library, was our tool-of-choice on those previous occasions, but this month, we’re going to have another go at the competition, using R via RStudio instead.

GETTING THE TITANIC DATASET

Kaggle’s Titanic competition is fast becoming a ‘hello, world’ moment for many machine-learning beginners and the dataset used for this competition is available as an R package you can download from CRAN (Comprehensive R Archive Network). CRAN houses thousands of R packages and RStudio can download and install those packages directly using the ‘install.packages()’ function from the console command-prompt. So, fire up RStudio and at the console command-prompt, enter these commands, each followed by the Enter key:

    install.packages('titanic')
    install.packages('randomForest')

Within a few seconds, the Titanic dataset package will be downloaded and automatically unpacked. The RandomForest classification algorithm library will also be installed. With that, we can now get to work and begin writing our first R script.
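If you’d like to confirm that both packages arrived safely, one option is a quick check like the sketch below. It relies on base R’s requireNamespace() function; the helper name ‘missing_packages’ is our own invention, not something provided by R or RStudio.

```r
# Hypothetical helper: returns the names of any packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("titanic", "randomForest")
still_missing <- missing_packages(needed)

if (length(still_missing) == 0) {
  cat("Both packages are ready to use\n")
} else {
  cat("Still to install:", still_missing, "\n")
}
```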

WRITING AND RUNNING AN R SCRIPT

To begin writing a script, go to RStudio’s top menu and select File > New File > R Script. This will bring up a new R Editor panel at the top-left of the RStudio frame. Now, we went into a fair amount of detail last time in developing a Python script using the RandomForest algorithm for the Kaggle competition, so we’ll be a little more economical this time around. In any case, you’ll find our Titanic R script on the website at apcmag.com/magstuff. Unzip it and load it into RStudio. To run it, hit the ‘Source’ button at top-right of the editor panel.

SIX SECTIONS

If you look at the source script, we’ve broken it down into six sections. The first section imports the libraries we’ll need into our script: the complete Titanic dataset we’ve just installed, plus the ‘randomForest’ library. The second section joins the two subsets, ‘titanic_train’ and ‘titanic_test’, into one complete dataset, since the ‘titanic’ library provides the full Titanic dataset as these two separate subsets. The difference between the two is that ‘titanic_train’ has an extra attribute called ‘Survived’, which records whether or not each passenger survived the sinking. The ‘titanic_test’ subset doesn’t have this attribute. Our task is to learn from the ‘titanic_train’ subset a set of rules or ‘model’ that summarises the relationship between the various attributes and the ‘Survived’ attribute (also called the class attribute). Then, using that model, we predict the fate of all the passengers in the ‘titanic_test’ subset.
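The joining step might look something like the sketch below. This is not the APC script itself: to keep it runnable without any packages, it uses two tiny made-up data frames (‘toy_train’ and ‘toy_test’) as stand-ins. With the package loaded via library(titanic), you would use ‘titanic_train’ and ‘titanic_test’ in their place.

```r
# Toy stand-ins for titanic_train / titanic_test (all values are made up)
toy_train <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, 1, 0),
  Pclass      = c(3, 1, 3, 2),
  Sex         = c("male", "female", "female", "male"),
  Age         = c(22, 38, NA, 35),
  Embarked    = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  PassengerId = 5:6,
  Pclass      = c(3, 2),
  Sex         = c("male", "female"),
  Age         = c(NA, 47),
  Embarked    = c("Q", "S"),
  stringsAsFactors = FALSE
)

# The test subset has no 'Survived' column, so add one filled with NA
# to make the columns match before stacking the rows
toy_test$Survived <- NA

# rbind() stacks data frames row-wise, matching columns by name
full_set <- rbind(toy_train, toy_test)

print(dim(full_set))
```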

Section 3 is the reason for joining those two subsets. Some of the attributes have missing values and in this section, we perform some very simple ‘missing value imputation’ work on the lot. In other words, we’re making statistical decisions as to what we think the missing values should be. There are no perfect rules for doing this and for simplicity’s sake, we make a few assumptions. For starters, for passenger records where a value for ‘Age’ is missing, we use the average or ‘mean’ of the other passengers’ ages. The attribute ‘Embarked’ indicates the location at which a passenger boarded the ship. Any passenger that has this attribute value missing we assume to have boarded at Southampton (‘S’).
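Those two imputation rules can be sketched in a few lines of base R. The data frame below is a made-up stand-in for the combined dataset, and the check for empty strings is an assumption on our part: missing ‘Embarked’ values may be stored as blank text rather than as R’s NA marker, so the sketch treats both as missing.

```r
# Toy data with gaps (all values are made up)
full_set <- data.frame(
  Age      = c(22, 38, NA, 35, NA, 47),
  Embarked = c("S", "C", "", "S", "Q", NA),
  stringsAsFactors = FALSE
)

# Replace missing ages with the mean of the known ages
mean_age <- mean(full_set$Age, na.rm = TRUE)
full_set$Age[is.na(full_set$Age)] <- mean_age

# Treat both NA and empty strings as missing, and assume Southampton ('S')
embarked_missing <- is.na(full_set$Embarked) | full_set$Embarked == ""
full_set$Embarked[embarked_missing] <- "S"

print(full_set)
```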

Our plan is to use the RandomForest ensemble classification algorithm and it requires us to also turn any character-based attribute value into numbers. We do this with the ‘Sex’ and ‘Embarked’ attributes.
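One simple way to do that conversion is sketched below on a made-up data frame. The particular number codes are an assumption for illustration only; what matters is that the same mapping is applied to every row.

```r
# Toy data (all values are made up)
full_set <- data.frame(
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = FALSE
)

# Map 'male' to 0 and anything else ('female') to 1
full_set$Sex <- ifelse(full_set$Sex == "male", 0, 1)

# match() returns each value's position in the lookup vector:
# 'S' becomes 1, 'C' becomes 2 and 'Q' becomes 3
full_set$Embarked <- match(full_set$Embarked, c("S", "C", "Q"))

print(full_set)
```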

With our missing value imputation and numerical conversion done, Section 4 sees us split the full dataset back into its original ‘train’ and ‘test’ subset components. We also remove the ‘Survived’ attribute we added earlier from the ‘test’ subset, restoring this subset to its original state.
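A minimal sketch of that split, again on made-up data, is shown below. It splits by row position, which assumes the training rows were stacked first when the subsets were joined. The name ‘set_test’ is our own guess to pair with the script’s ‘set_train’.

```r
# Toy combined set: the first four rows came from 'train', the last two from 'test'
full_set <- data.frame(
  PassengerId = 1:6,
  Survived    = c(0, 1, 1, 0, NA, NA),
  Pclass      = c(3, 1, 3, 2, 3, 2)
)
n_train <- 4   # with the real data this would be nrow(titanic_train)

# Split by row position
set_train <- full_set[seq_len(n_train), ]
set_test  <- full_set[-seq_len(n_train), ]

# Drop the placeholder 'Survived' column from the test subset
set_test$Survived <- NULL

print(names(set_test))
```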

MODEL BUILDING

Section 5 is where we start to get serious: we use the Random Forest classification algorithm to build our model from the ‘train’ subset. After seeding the random number generator with ‘3801’ (tinyurl.com/y9bfc5my), we build the model with the Random Forest algorithm, using just the ‘Age’, ‘Sex’, ‘Pclass’ and ‘Embarked’ attributes from the ‘set_train’ subset. The model itself is safely stored in the variable ‘model_rFor’.

The next two code lines give us a quick plot on the bottom-right panel. It shows the prediction error-rate for our model, in other words, the proportion of passenger outcomes the model gets wrong from the training subset. The black line shows the overall error rate of around 0.2 (20% error rate). The green and red lines break this down further, showing the error rate for predicting survivors is nearly 0.4 (40%), but for predicting those who perished it is well under 0.1 (10%). In other words, the model is much better at predicting those who died than those who survived.
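The model-building and plotting steps could be sketched as follows, assuming the ‘randomForest’ package is installed. This is our reconstruction rather than the APC script, and it trains on randomly generated stand-in data, so the error rates it produces mean nothing beyond showing the mechanics. The rates that randomForest reports by default are ‘out-of-bag’ estimates, calculated for each tree on the training rows that tree did not see.

```r
library(randomForest)

# Synthetic stand-in for the prepared training subset (values are random)
set.seed(3801)
n <- 200
set_train <- data.frame(
  Age      = round(runif(n, 1, 70)),
  Sex      = sample(0:1, n, replace = TRUE),
  Pclass   = sample(1:3, n, replace = TRUE),
  Embarked = sample(1:3, n, replace = TRUE)
)
# Make survival loosely depend on Sex so the forest has something to learn
set_train$Survived <- ifelse(runif(n) < 0.2 + 0.5 * set_train$Sex, 1, 0)

# as.factor() tells randomForest this is classification, not regression
model_rFor <- randomForest(
  as.factor(Survived) ~ Age + Sex + Pclass + Embarked,
  data = set_train
)

# Plot the error rates against the number of trees, then add a legend
plot(model_rFor)
legend("topright", colnames(model_rFor$err.rate), col = 1:3, fill = 1:3)

# The final row of err.rate holds the overall and per-class error rates
print(tail(model_rFor$err.rate, 1))
```

In the ‘err.rate’ table, the ‘OOB’ column is the overall rate and the columns named ‘0’ and ‘1’ are the rates for passengers who perished and survived, respectively.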

PREDICT AND SUBMIT

Finally, in Section 6, we take the model we’ve just created and test it on the ‘test’ dataset. There are 418 passengers in this data subset and we have no idea whether they survived or not. We use our model to make predictions and store those results in a data-frame (essentially, a table whose columns are equal-length vectors) called ‘prediction_result’. The last step is taking that data-frame, formatting and writing it as a .csv file called ‘titanic_rfor_output_v2.csv’. Once the file is generated, you’ll find it sitting in R’s current working directory, which is typically your ‘documents’ folder on Windows. The file is already in Kaggle-required format, so all you need to do now is log into your Kaggle account (if you don’t have one, sign up, it’s free), head to kaggle.com/c/titanic/submit, scroll down, then drag and drop your .csv file into the ‘Upload Submission File’ box on the page. After that, you can add a description and press the ‘Make Submission’ button at the bottom of the page. Within a few seconds, the file will upload and you should get a score back.
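The formatting and file-writing part can be sketched as below. The passenger IDs and predictions here are stand-in values: in a real script the predictions would presumably come from calling predict() on the trained model with the test subset as its ‘newdata’ argument. The sketch also writes to a temporary folder rather than your working directory.

```r
# Stand-in values: in a real script the IDs come from the test subset
# and the predictions from predict() applied to the trained model
passenger_ids <- 892:897
predictions   <- factor(c(0, 1, 0, 0, 1, 0))

# Kaggle expects exactly two columns: PassengerId and Survived
prediction_result <- data.frame(
  PassengerId = passenger_ids,
  Survived    = as.integer(as.character(predictions))
)

# row.names = FALSE stops R adding an extra, unnamed index column
out_file <- file.path(tempdir(), "titanic_rfor_output_v2.csv")
write.csv(prediction_result, out_file, row.names = FALSE)

# Read the first few lines back to check the layout
print(readLines(out_file, n = 3))
```

Converting the predictions with as.character() before as.integer() matters because the predictions are stored as a factor, and converting a factor straight to integer returns its internal level codes (1 and 2) instead of the 0 and 1 labels.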

HOW DID WE GO?

We scored 0.78947. That’s much better than the 0.69856 for our first Python run, but not as good as the 0.81339 score from last month. Still, 0.78947 puts us 3,306th out of 9,542 places, or in the top 35%. The lower score isn’t R’s fault: our script is pretty basic. Ultimately, this is all just to give you a very tiny taste of what this excellent language is capable of. If the idea of machine-learning and data science appeals to you, R really should feature highly on your bucket list.