Predicting survivors of the Titanic disaster using AI
Machine learning can be applied to many tasks, including predicting the survivors of the Titanic. Darren Yates shows how to code a solution online using Python on Kaggle.
Machine learning can feel like a hard nut to crack at first. Its combination of computer science, statistics and mathematics can seem impenetrable. Yet, it’s like any skill — you have to start slow, learn the basics and just start doing it. The more you do it, the better you get at it. The machine learning online community is growing constantly and there are plenty of good resources available to help speed up your journey. This month, we’re combining two of those resources — the online platform Kaggle, plus the popular Python programming language, to try and predict the survivors of the Titanic.
KAGGLE
The boom in machine learning has seen numerous data-mining communities spawn in the last decade and Kaggle is probably the largest. What makes Kaggle that little bit special is that it started in Melbourne in 2010. It was also one of the first to crowd-source competitions — prize money is offered for the most accurate model that summarises the relationships within data supplied by third parties.
However, Kaggle also offers some interesting learning competitions that anyone can enter. The one we’re getting involved with this month is predicting the survivors of the Titanic. The story of the sinking of the world’s most famous ship on its maiden voyage from Southampton to New York with the tragic loss of 1,500 passengers and crew is well known. The Kaggle competition ‘Titanic: Machine Learning from Disaster’ (www.kaggle.com/c/titanic) aims to use machine learning to predict which passengers survived the Titanic. There’s no prize money, but it’s still a great introduction — what’s more, there’s lots of help available at kaggle.com/c/titanic#tutorials.
THE TASK
Our task is to create a set of rules or ‘model’ learned from a set of data called the ‘training’ dataset, then test that model against a second set of data called the ‘test’ dataset. The goal is to see how accurately we can predict which of the passengers in the ‘test’ dataset survived the Titanic using just machine learning. We then upload the model’s prediction results to Kaggle’s leaderboard to see how well we did.
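As a taste of what that train-then-test workflow looks like in code, here’s a hedged sketch using scikit-learn’s decision tree on a tiny made-up dataset — the features and values are invented for illustration, not the real Titanic data:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy 'training' data: each row is [passenger class, sex (0=male, 1=female)]
# -- invented values for illustration only
X_train = [[3, 1], [3, 0], [1, 1], [1, 0]]
y_train = [1, 0, 1, 0]  # the class attribute: 1 = survived, 0 = did not

# Learn a model (here, a decision tree) from the training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Predict survival for unseen 'test' passengers
X_test = [[2, 0], [2, 1]]
predictions = model.predict(X_test)
print(predictions)
```

In this toy data only the sex column separates survivors from non-survivors, so the tree splits on it and predicts 0 for the first test passenger and 1 for the second — the same fit-then-predict pattern we’ll use on the real datasets.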
However, rather than use the Weka data-mining app we’ve looked at previously, we’re building this model from scratch on the Kaggle website itself using the Kernel Notebook.
GETTING ON KAGGLE
To take part in this competition, you need to sign up to Kaggle — it’s free with no ongoing costs. Once you’ve signed up, you’ll see a menu list across the top. Click on Kernels and when the Kernels page appears, click on the ‘New Kernel’ button near the top-right of the screen. This will allow you to create a machine learning script Kaggle calls a ‘kernel’. You’ll now get the choice of two kernel types — Script or Notebook. A ‘script’ executes once, top to bottom, whereas a ‘notebook’ is a collection of mini-scripts or ‘cells’ you can run individually. Select ‘notebook’.
LOADING TITANIC DATASETS
The Notebook editor gives you two main panels initially — a large code editor on the left and a control panel on the right with tabs for data, settings and versions. First up, we need to grab the two Titanic datasets, so click on the ‘Add a Data Source’ button on the right and when the Add Data Source window appears, type ‘Titanic’ into the search bar, click on the Competitions menu option, then select the ‘Titanic: Machine Learning from Disaster’ entry. This brings you back to the main Notebook editor window, with the datasets now listed. Click on the ‘train.csv’ file in the Data tab and you’ll get a file summary, including the filepath ‘../input/train.csv’. You use this filepath to import the dataset(s) into your code.
WHAT WE’RE DOING
There are any number of ways to write an algorithm to determine the survival of passengers in the ‘test’ dataset, but to keep things manageable, we’ll work our way through a simple implementation that gives reasonable results. Once the algorithm is complete, we’ll create a CSV file with two columns — passenger ID and their survival state (1 = saved, 0 = not) — and submit that to the Kaggle Titanic competition and see how we do.
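As a hedged sketch of that final step, the submission file can be written out with pandas — the passenger IDs and survival values below are placeholders; in the real kernel they come from the ‘test’ dataset and our model’s predictions:

```python
import pandas as pd

# Placeholder results: in the real kernel, PassengerId comes from test.csv
# and Survived from the model's predictions (1 = saved, 0 = not)
submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 1],
})

# Kaggle expects exactly these two columns, with no extra index column
submission.to_csv("submission.csv", index=False)
print(submission)
```

Passing index=False matters — without it, pandas writes its row index as an extra first column and Kaggle will reject the file.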
IMPORTING DATASETS
import warnings
warnings.filterwarnings('ignore')
from sklearn import tree
from sklearn.cross_validation import train_test_split
import pandas as pd
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
fullset = train.append(test, ignore_index=True, sort=True)
print("Done.")
To run a Notebook cell, click inside the cell and a green ‘play’ button appears in the left-side border. Click it and the cell code runs. For the code above, the first two lines import the ‘warnings’ library and turn off the warning messages that pop up fairly often. The ‘sklearn’ library is Python’s excellent scikit-learn library that provides mountains of machine-learning tools. The two we import are the Decision Tree (tree) function and the ‘train_test_split’ function from the ‘cross validation’ sub-library. We also import the pandas library (as ‘pd’), which we need to read CSV files. After that, we read the train.csv file into a dataframe called ‘train’ and the test.csv file into a dataframe called ‘test’. A dataframe is a two-dimensional list or array, much like a spreadsheet, that can contain lists of data of different types — for example, a list of ages (integers) and a list of days of the week (strings).
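To illustrate the mixed column types a dataframe can hold, here’s a minimal made-up example (not the Titanic data):

```python
import pandas as pd

# A dataframe works like a small spreadsheet: each column can have
# its own data type
df = pd.DataFrame({
    "Age": [22, 38, 26],           # integers
    "Day": ["Mon", "Tue", "Wed"],  # strings
})
print(df.dtypes)
```

Printing df.dtypes shows one type per column — an integer type for ‘Age’ and a general object type for the ‘Day’ strings.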
Once loaded, we want to combine these datasets into one complete dataset to make it easier to fix or ‘clean’ the data, so we append the ‘test’ dataframe to the ‘train’ dataframe and store the result in a new dataframe called ‘fullset’. When that’s done, we print ‘Done.’ to the screen, as Notebook cells don’t give an obvious indication when they’re complete.
DATASET ATTRIBUTES
We’ve mentioned it before, but just to recap, a dataset is typically a spreadsheet, where the rows are independent events and the columns are features or ‘attributes’ of those events. If we look at the train.csv dataset, each row is a passenger and you’ll see there are 12 attributes:
PassengerId — the identifier of the passenger
Survived — the class attribute,