APC Australia

Improving your Kaggle Titanic scores

Understanding your data better can make a big difference to your results. Darren Yates looks at ways to improve classification accuracy scores on Kaggle’s Titanic competition.


Last month, we introduced you to Kaggle, the Australian-created website for data scientists and machine-learning enthusiasts alike to learn, play and solve global problems with data via the site’s many competitions. We launched into the ‘Titanic: Machine Learning from Disaster’ competition as a great way to take baby steps into the world of machine learning. The task is to predict, from their various attributes, which of a group of passengers survived the ill-fated ship’s maiden voyage. Our initial run-through last time got us started with Kaggle and, in particular, with using its online ‘notebook’ machine-learning code editor. We built our first model and submitted our results to the Titanic competition, achieving a score of 0.69856 (out of 1.0), which placed us in 9,772nd position out of 10,449 entries. This month, we’re looking to improve the model’s classification accuracy and move up that leaderboard a bit!

THINKING IN DATA

The Titanic competition leaderboard is a constantly moving target, updated continually with new entries, but even so, a score of 0.69856 and 9,772nd position isn’t far off the ‘wooden spoon’ award. Still, our first model was a pretty simple affair — we built a single decision tree classifier using just the passengers’ age and gender to determine whether or not they survived. If you remember, we said previously that tossing a coin will get you 50% accuracy (0.5), so a score of 0.69856 is a fair improvement, but being in 9,772nd position shows we have many other entries ahead of us on the leaderboard.

The Titanic dataset is basically a two-page spreadsheet — one partition contains the 891 passengers of the ‘train’ dataset and a second part called the ‘test’ dataset has another 418 passengers. Over 2,200 passengers were on the ship when it sank, so these two datasets account for a little more than half of passenger numbers. The difference between the two sets is that we know the outcomes of the 891 passengers in the ‘train’ dataset; we’re not told how the passengers fared in the ‘test’ dataset, although the Kaggle team knows the outcome. Our job is to learn a set of rules or ‘model’ from the ‘train’ dataset that we can then apply to predict the fate of the passengers in the ‘test’ set. Both the ‘train’ and ‘test’ dataset partitions contain 11 features or ‘attributes’ of each passenger:

PassengerId — the identifier of the passenger

Pclass — the class of fare purchased by the passenger

Name — the passenger’s name

Sex — the passenger’s gender

Age — the passenger’s age

SibSp — the number of siblings or spouses also on board

Parch — the number of parents or children also on board

Ticket — the passenger’s fare ticket number

Fare — the price paid for travel

Cabin — the passenger’s cabin number

Embarked — the location at which the passenger boarded

The ‘train’ dataset has a 12th attribute called ‘Survived’ — whether the passenger survived (1) or not (0).

History tells us those rescued were predominantly women and children, so basing our initial decision tree classifier on the ‘Sex’ and ‘Age’ attributes of the passengers wasn’t a silly idea. However, the fact that we only scored 0.69856 shows these two attributes alone don’t tell the full story.

Kaggle allows users to publish their models — what Kaggle calls ‘kernels’ — along with explanations of why certain options were chosen. There are some very generous ‘Kagglers’ who have offered some brilliant explorations, backed up by statistical modelling — we’ve made a small list of some must-read kernels at the end of this article. The ideas we’ve used here commonly feature on Kaggle and give you a guide to how to go about improving classification accuracy.

DATA CLEANING

A common problem you’ll face with many real-world datasets is that of missing data. It’s a problem because machine-learning algorithms tend not to work well with what they can’t see. For starters, both the ‘train’ and ‘test’ datasets have records missing values for the ‘Age’ attribute — 177 in the training set, 86 in the test set. Last time, we raced through this and simply used Python’s fillna() function to fill missing Age attribute values with ‘zero’. It’s simple, but not necessarily the best move, since we’re effectively changing the distribution of passengers by adding 263 newborns.

Another option we could’ve taken is to simply delete the records with missing data, but that’s a sledgehammer approach (we also can’t do this with the ‘test’ dataset, otherwise our submission results will be incomplete). A more reasoned approach is to calculate or ‘impute’ each passenger’s age from other passengers who share the same gender and ticket class (‘Pclass’), taking an average age from those passengers. However, the easiest solution is to use the median — the age value of the middle record when ordered from lowest to highest. The median can also be used to fill in the one missing ‘Fare’ value.
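As a rough guide, here’s a minimal pandas sketch of both approaches. It assumes the combined train-and-test DataFrame is called ‘fullset’, as it is elsewhere in our script; you’d pick one approach or the other, not both:

import pandas as pd

# Easiest option: fill missing Age and Fare values with the column median.
fullset['Age'] = fullset['Age'].fillna(fullset['Age'].median())
fullset['Fare'] = fullset['Fare'].fillna(fullset['Fare'].median())

# Alternatively, impute Age from passengers sharing the same gender and
# ticket class, using each group's median age for the missing values.
fullset['Age'] = fullset.groupby(['Sex', 'Pclass'])['Age'].transform(
    lambda ages: ages.fillna(ages.median()))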

ATTRIBUTE SELECTION

As we’ve said, we used only two attributes last time — ‘Sex’ and ‘Age’ — yet there are 11 attributes available to us. One thing we could do straightaway is simply add in the other attributes and create a new model — after all, the more data we can use, the better our model, right? Well, actually, no — not always. Of the 11 attributes in the ‘train’ dataset, ‘Cabin’ is missing 687 out of 891 values — there’s no point in trying to guess them. Also, the ‘PassengerId’ attribute is unique for every passenger — there’s no way to categorise passengers by it, so it’s of little use, too. The ‘Ticket’ attribute similarly looks random, so it’s of no obvious help. The ‘Name’ attribute at first glance appears similar to ‘PassengerId’ in that every passenger has a different name, so there’s no obvious way to group them. But that’s not strictly true, and we’ll revisit this attribute shortly.

For now, that leaves us with seven attributes — Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.

What we can do now is build a new model from these seven attributes, submit the results to Kaggle and see how we go. We did just that, and our score improved to 0.71770, but that only moved us up from 9,772nd to 9,604th position. Clearly, there’s more to do.
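For the curious, the model-building step looks roughly like this minimal scikit-learn sketch. It’s illustrative rather than a copy of our downloadable script: the ‘train’ and ‘test’ DataFrame names are assumptions, and we one-hot encode the text attributes because scikit-learn’s trees need numeric input:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# Convert the categorical attributes (Sex, Embarked) into numeric dummy columns.
X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])
# Keep the two partitions' columns aligned in case a category appears in only one.
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, train['Survived'])
predictions = model.predict(X_test)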

ATTRIBUTE ENGINEERING

Remember the ‘Name’ attribute? While every passenger obviously has one, each name value also hides information we can group on. For starters, it hides the title of each passenger. The vast majority have standard titles of ‘Mr’, ‘Mrs’, ‘Miss’ or ‘Master’, but there are also rarer titles such as ‘Sir’, ‘Dr’, ‘Countess’ and so on. To get better access to this data, we can create or ‘engineer’ a new attribute called ‘Title’ that builds five categories — one each for ‘Mr’, ‘Mrs’, ‘Miss’ and ‘Master’, plus another for the ‘rare’ titles.

We use Python’s string split function to split off the left and right sides at particular points of each name value to leave us with just the passenger’s title:

fullset['Title'] = fullset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

The ‘data cleaning and attribute engineering’ code block in our source code (details below) and the code lines starting with fullset['Title'] deal with creating this attribute.
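The grouping step that follows the split is a one-liner. Here’s an illustrative version; the exact list of ‘common’ titles is our assumption, and our downloadable script may bucket the odd title differently:

# Collapse everything outside the four common titles into a single 'Rare'
# category, giving the five Title categories described above.
common_titles = ['Mr', 'Mrs', 'Miss', 'Master']
fullset['Title'] = fullset['Title'].apply(
    lambda title: title if title in common_titles else 'Rare')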

The ‘SibSp’ and ‘Parch’ attributes tell us the number of siblings and spouses, and parents and children, each passenger had — from these, we can build another attribute called ‘FamilySize’, denoting the size of the family the passenger is from. (Again, because of space limitations, we’re skipping over how the importance of this data was discovered — read the kernels at the end of this story for more.) The code line for this is:

fullset[“FamilySize”] = fullset[“SibSp”] + fullset[“Parch”] + 1

Now we can build another decision tree model, this time replacing the ‘SibSp’ and ‘Parch’ attributes with ‘FamilySize’ and adding in the ‘Title’ attribute — that gives us Age, Sex, Pclass, Fare, Title, Embarked and FamilySize. Again, we submitted the results of the model to Kaggle and our accuracy score rose to 0.73205, moving us to 9,388th position.

“Decision tree algorithms are great for machine-learning for many reasons, but one of the main ones is that they’re easy to use.”

ONE TREE OR MANY TREES?

Decision tree algorithms are great for machine learning for many reasons, but one of the main ones is that they’re easy to use — if you can follow a roadmap, you can generally read a decision tree. The problem, though, is that one decision tree can only represent one view of the data — the dominant view. For other less-dominant but informationally rich views, one tree isn’t enough. We looked briefly a couple of issues ago at the concept of ‘decision forests’, which combine multiple decision trees together. One of the many algorithms available to Python is ‘Random Forest’, a brilliant algorithm developed by Leo Breiman in 2001 — it’s quite fast, can offer high accuracy and can build a forest of decision trees as large as you need to cover many different data views. If you’re serious about understanding machine learning, this is one of the classification algorithms you should get a decent handle on (we’ll look at it in a future masterclass).

By simply replacing the ‘decision tree’ algorithm with ‘random forest’, we can broaden our model to consider different possibilities.
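In scikit-learn terms, the swap is essentially a one-line change. Here’s a hedged sketch using the engineered attribute list; the variable names are illustrative, and we assume the engineered ‘Title’ and ‘FamilySize’ columns have already been copied back into the ‘train’ and ‘test’ DataFrames:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ['Age', 'Sex', 'Pclass', 'Fare', 'Title', 'Embarked', 'FamilySize']

X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

# The only real change from the single-tree model: a forest of 100 trees.
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, train['Survived'])
predictions = model.predict(X_test)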

FINAL KAGGLE SUBMISSION

The proof of the power of Random Forest is in its results. Again, we build our model, produce our results and submit them back to Kaggle. This time, however, the combination of attribute engineering and changing from ‘decision tree’ to ‘random forest’ has boosted our score from 0.73205 to 0.81339. It might not seem like a huge gain, but in terms of accuracy, it’s actually a pretty decent boost. What’s more, it hasn’t hurt us on the Kaggle leaderboard, either — from 9,388th position, we’ve rocketed up more than 8,800 places to 578th spot, sitting inside the top 6% of results.
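If you’re wondering what the submission file itself looks like, it’s just two columns. A minimal sketch, assuming ‘predictions’ holds the Random Forest output for the ‘test’ partition:

import pandas as pd

# Kaggle's Titanic submission format: a PassengerId column and a Survived column.
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': predictions.astype(int),
})
submission.to_csv('titanicModelPrediction.csv', index=False)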

HOW YOU DO IT

Now that we’ve explained what we did, for those who missed last month’s masterclass, here’s a brief ‘how to do it yourself’. First, sign up to Kaggle (www.kaggle.com) — it’s free. Next, across the top menu, select ‘Kernels’, then click on the white ‘New Kernel’ button near the top-right of the fresh ‘Kernels’ page. You’ll then be asked to select the kernel type — last month, we selected ‘notebook’; this time around, we’re going for ‘script’. The difference between the two is that a notebook allows you to run your code in sections or ‘cells’, whereas a ‘script’ is a top-down code editor that runs everything in one go. That suits us more this time.

Once the new Script editor appears, you’ll see a ‘Data’ groupbox on the right side with an ‘Add Dataset’ button. Press it. On the new ‘Add Data Source’ window, click on the ‘Competitions’ menu list and select ‘Titanic: Machine Learning from Disaster’ — this will load the dataset and eventually bring you back to the editor. Now head over to our website at www.apcmag.com/magstuff and download the Titanic competition script file ‘titanic_script.zip’. Unzip it and copy the contents of the ‘titanic_script.py’ file into the script editor. Click at the top left and enter a name for your script, then click back in the main text area and the blue ‘Commit & Run’ button should light up. Press it. It’ll grey out while your code executes and the output file is created. When it turns blue again, press the double-left arrow in the top-left of the window. This takes you back to a summary window of your script. Under the header, you’ll see a horizontal menu. If your code has executed correctly, you’ll see an ‘Output’ menu option. Click on it. Your output file ‘titanicModelPrediction.csv’ will appear on the left. To the right, you’ll find a ‘Submit to Competition’ button. Press it once and you’ll automatically send that file for scoring. Within a few seconds, you should get your score back. The random selection method we’re using for training records means we generate a slightly different model each time the script runs. This also varies the resulting dataset file you send back, so you may see some variation in the accuracy result.

One thing to remember — Kaggle allows you 10 submissions per competition per day. If you’re on the Australian east coast, that number resets at 10am each morning.

WHAT TO READ

Kaggle allows users to publish their ‘kernel’ code on the Kaggle.com website. If you’re interested in learning more, but a little stuck on where to start, we’d suggest reading through these kernels:

Exploring Survival on the Titanic — kaggle.com/mrisdal/exploring-survival-on-the-titanic

A Data Science Framework: To Achieve 99% Accuracy — kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

Titanic Data Science Solutions — kaggle.com/startupsci/titanic-data-science-solutions

HAVE A GO

So, there you go. We started this month languishing in the bottom 6% of the Titanic leaderboard at 9,772nd and we finish it inside the top 6% at 578th spot — that’ll do. Because of the way Kaggle calculates scores for the public leaderboard, we could spend another six months chasing a last few percentage points, but it’s time for us to move on and look at other areas. Meanwhile, the best way to learn is by doing. See you next time.

IMAGE CAPTIONS

Use the Add Dataset and Commit & Run buttons to run your source script.

The code for data cleaning and attribute engineering isn’t too complicated.

Kaggle also runs competitions for third parties offering decent prizes.

Kaggle limits you to 10 submissions per competition per day.

You can also copy our code section-by-section into a Kaggle Notebook.

Use the ‘Submit to Competition’ button via the Output menu screen.

You’ll find our Titanic script source code at www.apcmag.com/magstuff.

Megan Risdal’s Titanic kernel is a great read for learning feature engineering.

Kaggle’s Titanic competition is a great way to get into machine learning.

Our new code pushed us up into 578th spot and the top 6% of results.
