Improving your Kaggle Titanic scores

Understanding your data better can make a big difference to your results. Darren Yates looks at ways to improve classification accuracy scores on Kaggle’s Titanic competition.


Last month, we introduced you to Kaggle, the Australian-created website for data scientists and machine-learning enthusiasts alike to learn, play and solve global problems with data via the site’s many competitions. We launched into the ‘Titanic: Machine Learning from Disaster’ competition as a great way to take baby steps into the world of machine learning. The task is to predict, from their various attributes, which of a group of passengers survived the ill-fated ship’s maiden voyage. Our initial run-through last time got us started with Kaggle and, in particular, with its online ‘notebook’ machine-learning code editor. We built our first model and submitted our results to the Titanic competition, achieving a score of 0.69856 (out of 1.0), which placed us in 9,772nd position out of 10,449 entries. This month, we’re looking to improve the model’s classification accuracy and move up that leaderboard a bit!

THINKING IN DATA

The Titanic competition leaderboard is a continually moving target, constantly updated with new entries, but even so, a score of 0.69856 and 9,772nd position isn’t far off the ‘wooden spoon’ award. Still, our first model was a pretty simple affair — we built a single decision tree classifier using just the passengers’ age and gender to determine whether or not they survived. If you remember, we said previously that tossing a coin will get you 50% accuracy (0.5), so a score of 0.69856 is a fair improvement, but being in 9,772nd position shows we have many other entries ahead of us on the leaderboard.

The Titanic dataset is basically a two-page spreadsheet — one partition contains the 891 passengers of the ‘train’ dataset and a second part called the ‘test’ dataset has another 418 passengers. More than 2,200 people were on the ship when it sank, so these two datasets account for a little more than half of those on board. The difference between the two sets is that we know the outcomes of the 891 passengers in the train dataset; we’re not told how the passengers fared in the ‘test’ dataset, although the Kaggle team knows the outcome. Our job is to learn a set of rules or ‘model’ from the ‘train’ dataset that we can then apply to predict the fate of the passengers in the ‘test’ set. Both the ‘train’ and ‘test’ dataset partitions contain 11 features or ‘attributes’ of each passenger:

PassengerId — the identifier of the passenger
Pclass — the class of fare purchased by the passenger
Name — the passenger’s name
Sex — the passenger’s gender
Age — the passenger’s age
SibSp — the number of siblings or spouses also on board
Parch — the number of parents or children also on board
Ticket — the passenger’s fare ticket number
Fare — the price paid for travel
Cabin — the passenger’s cabin number
Embarked — the location at which the passenger boarded

The ‘train’ dataset has a 12th attribute called ‘Survived’ — whether the passenger survived (1) or not (0).

History tells us those rescued were predominantly women and children, so basing our initial decision tree classifier on the ‘Sex’ and ‘Age’ attributes of the passengers wasn’t a silly idea. However, the fact that we only scored 0.69856 shows these two attributes alone don’t tell the full story.
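
You can check that intuition against the data yourself. As a quick sketch, assuming the standard Kaggle competition input paths (yours may differ), a one-line groupby shows the survival rate for each gender in the ‘train’ partition:

import pandas as pd

# A quick sanity check of the 'women and children first' story;
# the '../input/' path is the usual Kaggle layout, not a given.
train = pd.read_csv('../input/train.csv')
print(train.groupby('Sex')['Survived'].mean())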

Kaggle allows users to publish their models — what Kaggle calls ‘kernels’ — along with explanations on why certain options were chosen. There are some very generous ‘Kagglers’ who have offered some brilliant explorations, backed up by statistical modelling — we’ve made a small list of some must-read kernels at the end of this article. The ideas we’ve used here commonly feature on Kaggle and give you a guide of how to go about improving classification accuracy.

DATA CLEANING

A common problem you’ll face with many real-world datasets is that of missing data. It’s a problem because machine-learning algorithms tend not to work well with what they can’t see. For starters, both the ‘train’ and ‘test’ datasets have records missing values for the ‘Age’ attribute — 177 in the training set, 86 in the test set. Last time, we raced through this and simply used Python’s fillna() function to fill missing Age attribute values with zero. It’s simple, but not necessarily the best move, since we’re effectively changing the distribution of passengers by adding 263 newborns.

Another option we could’ve taken is to simply delete the records with missing data, but that’s a sledgehammer approach (we also can’t do this with the ‘test’ dataset, otherwise our submission results will be incomplete). A more reasoned approach is to calculate or ‘impute’ each passenger’s age by taking the average age of other passengers who share the same gender and ticket class (‘Pclass’). However, the easiest solution is to use the median — the age value of the middle record when ordered from lowest to highest. The median can also be used to fill in the one missing ‘Fare’ value.
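
As a minimal sketch of that cleaning step in pandas, assuming the standard Kaggle input paths and our script’s habit of combining both partitions into one ‘fullset’ DataFrame (the variable names here follow our code, but your setup may differ):

import pandas as pd

# Combine the 'train' and 'test' partitions so the same fixes
# apply to both; 'fullset' mirrors the name used in our script.
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
fullset = pd.concat([train, test], ignore_index=True, sort=False)

# The easy option: fill the 263 missing ages with the overall median.
fullset['Age'] = fullset['Age'].fillna(fullset['Age'].median())
# The single missing 'Fare' value gets the same treatment.
fullset['Fare'] = fullset['Fare'].fillna(fullset['Fare'].median())

# The more reasoned alternative (use one approach, not both): impute
# from passengers sharing the same gender and ticket class.
# fullset['Age'] = fullset.groupby(['Sex', 'Pclass'])['Age'].transform(
#     lambda s: s.fillna(s.median()))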

ATTRIBUTE SELECTION

As we’ve said, we used only two attributes last time — ‘Sex’ and ‘Age’ — yet there are 11 attributes available to us. One thing we could do straightaway is simply add in the other attributes and create a new model — after all, the more data we can use, the better our model, right? Well, actually, no — not always. Of the 11 attributes in the ‘train’ dataset, ‘Cabin’ is missing 687 out of 891 values — there’s no point in trying to guess them. Also, the ‘PassengerId’ attribute is unique for every passenger — there’s no way to categorise such values, so they’re of little use, too. The ‘Ticket’ attribute similarly looks random, so it’s of no obvious help. The ‘Name’ attribute at first glance appears similar to ‘PassengerId’ in that every passenger has a different name, so there’s no obvious way to group them. But that’s not strictly true, and we’ll revisit this attribute shortly.

For now, that leaves us with seven attributes — Pclass, Sex, Age, SibSp, Parch, Fare and Embarked.

What we can do now is build a new model from these seven attributes, submit the results to Kaggle and see how we go. This we did, and our score improved to 0.71770, but this only moved us up from 9,772nd to 9,604th position. Clearly, there’s more to do.
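
For reference, here’s a rough sketch of that seven-attribute decision tree, continuing from the cleaned ‘fullset’ above; the simple integer encodings for ‘Sex’ and ‘Embarked’ are our own choices, not the only option:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Decision trees in scikit-learn need numbers, so map the two
# categorical attributes to integer codes first.
fullset['Embarked'] = fullset['Embarked'].fillna('S')  # most common port
fullset['Sex'] = fullset['Sex'].map({'male': 0, 'female': 1})
fullset['Embarked'] = fullset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# Rows with a 'Survived' value are the 891 'train' passengers; the
# rest are the 418 'test' passengers we need predictions for.
train = fullset[fullset['Survived'].notna()]
test = fullset[fullset['Survived'].isna()]

model = DecisionTreeClassifier(random_state=1)
model.fit(train[features], train['Survived'].astype(int))

submission = pd.DataFrame({'PassengerId': test['PassengerId'].astype(int),
                           'Survived': model.predict(test[features])})
submission.to_csv('titanicModelPrediction.csv', index=False)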

ATTRIBUTE ENGINEERING

Remember the ‘Name’ attribute? While every passenger obviously has one, each name value also hides information we can group on. For starters, it hides the title of each passenger. The vast majority have standard titles of ‘Mr’, ‘Mrs’, ‘Miss’ or ‘Master’, but there are also rarer titles such as ‘Sir’, ‘Dr’, ‘Countess’ and so on. To get better access to this data, we can create or ‘engineer’ a new attribute called ‘Title’ that builds five categories — one each for ‘Mr’, ‘Mrs’, ‘Miss’ and ‘Master’, plus another for the ‘rare’ titles.

We use Python’s string split function to split off the left and right sides at particular points of each name value to leave us with just the passenger’s title:

fullset['Title'] = fullset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

The ‘data cleaning and attribute engineering’ code block in our source code (details below) and the code lines starting with fullset['Title'] deal with creating this attribute.
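
Folding the raw titles into the five categories described above might then look something like the sketch below; treating everything outside the big four as ‘Rare’ is our simplification (many kernels also fold ‘Mlle’ into ‘Miss’, ‘Mme’ into ‘Mrs’, and so on):

# Keep the four common titles as-is, bucket the rest under 'Rare'.
common = ['Mr', 'Mrs', 'Miss', 'Master']
fullset['Title'] = fullset['Title'].where(fullset['Title'].isin(common), 'Rare')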

The ‘SibSp’ and ‘Parch’ attributes tell us the number of siblings and spouses, parents and children each passenger had — from this, we can build another attribute called ‘FamilySize’, denoting the size of the family the passenger is from. (Again, because of space limitations, we’re skipping over how the importance of this data was discovered; read the kernels at the end of this story for more.) The codeline for this is:

fullset["FamilySize"] = fullset["SibSp"] + fullset["Parch"] + 1

Now we can build another decision tree model, this time replacing the ‘SibSp’ and ‘Parch’ attributes with ‘FamilySize’ and adding in the ‘Title’ attribute — that gives us Age, Sex, Pclass, Fare, Title, Embarked and FamilySize. Again, we submitted the results of the model to Kaggle and our accuracy score rose to 0.73205, moving us to 9,388th position.
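
Assuming the ‘Title’ and ‘FamilySize’ columns built above, the only changes to the earlier sketch are a numeric encoding for ‘Title’ and the revised feature list; the rest of the pipeline (re-splitting the partitions, fitting and submitting) stays the same:

# 'Title' needs a numeric encoding before a tree can split on it;
# pandas category codes are one simple choice among several.
fullset['Title'] = fullset['Title'].astype('category').cat.codes

# Swap 'SibSp' and 'Parch' for 'FamilySize' and add 'Title'.
features = ['Age', 'Sex', 'Pclass', 'Fare', 'Title', 'Embarked', 'FamilySize']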

“Decision tree algorithms are great for machine learning for many reasons, but one of the main ones is that they’re easy to use.”

ONE TREE OR MANY TREES?

Decision tree algorithms are great for machine learning for many reasons, but one of the main ones is that they’re easy to use — if you can follow a roadmap, you can generally read a decision tree. The problem, though, is that one decision tree can only represent one view of the data — the dominant view. For other less-dominant but informationally-rich views, one tree isn’t enough. We looked briefly a couple of issues ago at the concept of ‘decision forests’, which combine multiple decision trees together. One of the many algorithms available to Python is ‘Random Forest’, a brilliant algorithm developed by Leo Breiman in 2001 — it’s quite fast, can offer high accuracy and can build a forest of decision trees as large as you need to cover many different data views. If you’re serious about understanding machine learning, this is one of the classification algorithms you should get a decent handle on (we’ll look at it in a future masterclass).

By simply replacing the ‘decision tree’ algorithm with ‘random forest’ instead, we can broaden our model to consider different possibilities.
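
In scikit-learn, that swap is essentially a one-line change. The sketch below continues from the earlier snippets and grows 100 trees, which is a common choice rather than our exact setting:

from sklearn.ensemble import RandomForestClassifier

# Drop-in replacement for the single decision tree; n_estimators
# sets how many trees the forest grows.
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(train[features], train['Survived'].astype(int))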

FINAL KAGGLE SUBMISSION

The proof of the power of Random Forest is in its results. Again, we build our model, produce our results and submit them back to Kaggle. This time, however, the combination of attribute engineering and changing from ‘decision tree’ to ‘random forest’ has boosted our score from 0.73205 to 0.81339. It might not seem like a huge gain, but in terms of accuracy, it’s actually a pretty decent boost. What’s more, it hasn’t hurt us on the Kaggle leaderboard, either — from 9,388th position, we’ve rocketed up more than 8,800 places to 578th spot, sitting inside the top 6% of results.

HOW YOU DO IT

Now that we’ve explained what we did, for those who missed last month’s masterclass, here’s a brief ‘how to do it yourself’. First, sign up to Kaggle (www.kaggle.com) — it’s free. Next, across the top menu, select ‘Kernels’, then click on the white ‘New Kernel’ button near the top-right of the fresh ‘Kernels’ page. You’ll then be asked to select the kernel type — last month, we selected ‘notebook’; this time around, we’re going for ‘script’. The difference between the two is that a notebook allows you to run your code in sections or ‘cells’, whereas a ‘script’ is a top-down editor that runs your code in one complete pass. That suits us more this time.

Once the new Script editor appears, you’ll see a ‘Data’ groupbox on the right side with an ‘Add Dataset’ button. Press it. On the new ‘Add Data Source’ window, click on the ‘Competitions’ menu list and select ‘Titanic: Machine Learning from Disaster’ — this will load the dataset and eventually bring you back to the editor. Now head over to our website at www.apcmag.com/magstuff and download the Titanic competition script file ‘titanic_script.zip’. Unzip it and copy the contents of the ‘titanic_script.py’ file into the script editor. Click at the top left and enter a name for your script, click back in the main text area and the blue ‘Commit & Run’ button should light up. Press it. It’ll grey out while your code executes and the output file is created. When it colours blue again, press the double-left arrow in the top-left of the window. This takes you back to a summary window of your script. Under the header, you’ll see a horizontal menu. If your code has executed correctly, you’ll see an ‘Output’ menu option. Click on it. Your output file ‘titanicModelPrediction.csv’ will appear on the left. To the right, you’ll find a ‘Submit to Competition’ button. Press it once and you’ll automatically send that file for scoring. Within a few seconds, you should get your score back. The random selection method we’re using for training records means we generate a slightly different model each time it runs. This also varies the resulting dataset file you send back, so you may see some variation in the accuracy result.

One thing to remember — Kaggle allows you 10 submissions per competition per day. If you’re on the Australian east coast, that number resets at 10am each morning.

WHAT TO READ

Kaggle allows users to publish their ‘kernel’ code on the Kaggle.com website. If you’re interested in learning more, but a little stuck on where to start, we’d suggest reading through these kernels:

Exploring Survival on the Titanic — kaggle.com/mrisdal/exploring-survival-on-the-titanic

A Data Science Framework: To Achieve 99% Accuracy — kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

Titanic Data Science Solutions — kaggle.com/startupsci/titanic-data-science-solutions

HAVE A GO

So, there you go. We started this month languishing in the bottom 6% of the Titanic leaderboard at 9,772nd and we finish it inside the top 6% at 578th spot — that’ll do. Because of the way Kaggle calculates scores for the public leaderboard, we could spend another six months chasing a last few percentage points, but it’s time for us to move on and look at other areas. Meanwhile, the best way to learn is by doing. See you next time.

Image captions:
Use the ‘Add Dataset’ and ‘Commit & Run’ buttons to run your source script.
The code for data cleaning and attribute engineering isn’t too complicated.
Kaggle also runs competitions for third parties offering decent prizes.
Kaggle limits you to 10 submissions per competition per day.
You can also copy our code section by section into a Kaggle Notebook.
Use the ‘Submit to Competition’ button via the Output menu screen.
You’ll find our Titanic script source code at www.apcmag.com/magstuff.
Megan Risdal’s Titanic kernel is a great read for learning feature engineering.
Kaggle’s Titanic competition is a great way to get into machine learning.
Our new code pushed us up into 578th spot and the top 6% of results.
