APC Australia

Predicting survivors of the Titanic disaster using AI

Machine learning can be applied to many tasks, including predicting the survivors of the Titanic. Darren Yates shows how to code a solution online using Python on Kaggle.


Machine learning can feel like a hard nut to crack at first. Its combination of computer science, statistics and mathematics can seem impenetrable. Yet, it’s like any skill — you have to start slow, learn the basics and just start doing it. The more you do it, the better you get at it. The machine learning online community is growing constantly and there are plenty of good resources available to help speed up your journey. This month, we’re combining two of those resources — the online platform Kaggle, plus the popular Python programming language — to try to predict the survivors of the Titanic.

KAGGLE

The boom in machine learning has seen numerous data-mining communities spawn in the last decade and Kaggle is probably the largest. What makes Kaggle that little bit special is that it started in Melbourne in 2010. It was also one of the first to crowd-source competitions, offering prize money for the most accurate model that summarises the relationships within data supplied by third parties.

However, Kaggle also offers some interesting learning competitions that anyone can enter. The one we’re getting involved with this month is predicting the survivors of the Titanic. The story of the sinking of the world’s most famous ship on its maiden voyage from Southampton to New York, with the tragic loss of around 1,500 passengers and crew, is well known. The Kaggle competition ‘Titanic: Machine Learning from Disaster’ (www.kaggle.com/c/titanic) aims to use machine learning to predict which passengers survived. There’s no prize money, but it’s still a great introduction — what’s more, there’s lots of help available at kaggle.com/c/titanic#tutorials.

THE TASK

Our task is to create a set of rules or ‘model’ learned from a set of data called the ‘training’ dataset, then test that model against a second set of data called the ‘test’ dataset. The goal is to see how accurately we can predict which of the passengers in the ‘test’ dataset survived the Titanic using just machine learning. We then upload the model’s prediction results to Kaggle’s leaderboard to see how well we did.

However, rather than use the Weka data-mining app we’ve looked at previously, we’re building this model from scratch on the Kaggle website itself using the Kernel Notebook.

GETTING ON KAGGLE

To take part in this competition, you need to sign up to Kaggle — it’s free with no ongoing costs. Once you’ve signed up, you’ll see a menu list across the top. Click on Kernels and, when the Kernels page appears, click on the ‘New Kernel’ button near the top-right of the screen. This will allow you to create a machine learning script Kaggle calls a ‘kernel’. You’ll now get the choice of two kernel types — Script or Notebook. A ‘script’ executes once from top to bottom, whereas a ‘notebook’ is a collection of mini-scripts or ‘cells’ you can run individually. Select ‘notebook’.

LOADING TITANIC DATASETS

The Notebook editor gives you two main panels initially — a large code editor on the left and a control panel on the right with tabs for data, settings and versions. First up, we need to grab the two Titanic datasets, so click on the ‘Add a Data Source’ button on the right and, when the Add Data Source window appears, type ‘Titanic’ into the search bar, click on the Competitions menu option, then select the ‘Titanic: Machine Learning from Disaster’ entry. This brings you back to the main Notebook editor window, with the datasets now listed. Click on the ‘train.csv’ file in the Data tab and you’ll get a file summary, including the filepath ‘../input/train.csv’. You use this filepath to import the dataset(s) into your code.

WHAT WE’RE DOING

There are any number of ways to write an algorithm to determine the survival of passengers in the ‘test’ dataset, but to keep things manageable, we’ll work our way through a simple implementation that gives reasonable results. Once the algorithm is complete, we’ll create a CSV file with two columns — passenger ID and their survival state (1 = saved, 0 = not) — and submit that to the Kaggle Titanic competition to see how we do.
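To make that end point concrete, here’s a minimal sketch of how such a file could be written with pandas. The ID and survival values below are hypothetical placeholders for what the trained model will eventually produce.

import pandas as pd

# Hypothetical values for illustration: the real IDs come from the
# 'test' dataset and the 0/1 values from the trained model.
passenger_ids = [892, 893, 894]
survived = [0, 1, 0]

submission = pd.DataFrame({'PassengerId': passenger_ids,
                           'Survived': survived})
# index=False stops pandas writing an extra row-number column;
# the competition expects exactly two columns.
submission.to_csv('submission.csv', index=False)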

IMPORTING DATASETS

import warnings
warnings.filterwarnings('ignore')

import pandas as pd  # missing from the original listing, but needed for read_csv below

from sklearn import tree
from sklearn.model_selection import train_test_split  # 'cross_validation' in older scikit-learn

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

# Append the test rows below the training rows in one combined dataframe
# (older pandas: train.append(test, ignore_index=True, sort=True)).
fullset = pd.concat([train, test], ignore_index=True, sort=True)

print("Done.")

To run a Notebook cell, click inside the cell and a green ‘play’ button appears in the left-side border. Click it and the cell code runs. For the code above, the first two lines import the ‘warnings’ library and turn off the warning messages that pop up fairly often. We also import the pandas library as ‘pd’, which supplies the dataframe structure used throughout. The ‘sklearn’ library is Python’s excellent scikit-learn library that provides mountains of machine-learning tools. The two we import are the decision tree module (‘tree’) and the ‘train_test_split’ function, which older scikit-learn releases kept in the ‘cross_validation’ sub-library and newer ones keep in ‘model_selection’. After that, we read the train.csv file into a dataframe called ‘train’ and the test.csv file into a dataframe called ‘test’. A dataframe is a two-dimensional list or array, much like a spreadsheet, that can contain lists of data of different types — for example, a list of ages (integers) and a list of days of the week (strings).
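For instance, here’s a tiny hand-built dataframe mixing those two types (the column names are made up purely for illustration):

import pandas as pd

# Two columns of different types in one dataframe: integers and strings.
df = pd.DataFrame({'Age': [22, 38, 26],
                   'Day': ['Mon', 'Tue', 'Wed']})
print(df.dtypes)  # Age: int64, Day: object (pandas' string type)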

Once loaded, we want to combine these datasets into one complete dataset to make it easier to fix or ‘clean’ the data, so we append the ‘test’ dataframe to the ‘train’ dataframe and store the result in a new dataframe called ‘fullset’. When that’s done, we print ‘Done.’ to the screen, as Notebook cells don’t give an obvious indication when they’re complete.
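The ‘tree’ and ‘train_test_split’ imports don’t get used until later, but as a rough sketch of where they’re heading, here’s how they fit together. This assumes the ‘fullset’ dataframe from the cell above, and the feature columns chosen here are illustrative only, not the article’s final choice.

from sklearn.model_selection import train_test_split
from sklearn import tree

# Recover the labelled portion of 'fullset': only the original
# training rows carry a 'Survived' value; the appended test rows
# hold NaN there.
known = fullset[fullset['Survived'].notnull()]

# Illustrative features only: Pclass and SibSp are numeric and have
# no missing values, so they work without any cleaning.
X = known[['Pclass', 'SibSp']]
y = known['Survived']

# Hold back 20% of the labelled rows to estimate accuracy.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = tree.DecisionTreeClassifier()
model.fit(X_train, y_train)
print(model.score(X_val, y_val))  # fraction of held-out rows predicted correctly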

DATASET ATTRIBUTES

We’ve mentioned it before, but just to recap, a dataset is typically a spreadsheet, where the rows are independent events and the columns are features or ‘attributes’ of those events. If we look at the train.csv dataset, each row is a passenger and you’ll see there are 12 attributes:

PassengerId — the identifier of the passenger
Survived — the class attribute (1 = survived, 0 = did not)
Pclass — the passenger’s ticket class (1st, 2nd or 3rd)
Name — the passenger’s name
Sex — the passenger’s sex
Age — the passenger’s age in years
SibSp — the number of siblings or spouses aboard
Parch — the number of parents or children aboard
Ticket — the ticket number
Fare — the fare paid
Cabin — the cabin number
Embarked — the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
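A quick way to see these attributes for yourself is a short inspection in a new cell (this assumes the ‘train’ dataframe loaded earlier):

print(train.columns.tolist())  # the 12 attribute names
train.info()  # column types and non-null counts, which reveal missing values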

[Screenshot captions]
We do some basic missing value imputation before splitting the data.
All Kaggle notebooks begin with basic code in the first block or ‘cell’.
Copy each source code block into a new cell and click the green play button.
Our model has some over-fitting, but both attributes are important.
We create new dataset splits (the blue + buttons create new cells).
