Computer vision and artificial neural networks

Can a computer really think like a human brain and recognise objects in an image? Darren Yates explains how it's done using machine learning.

APC Australia – Contents

Any way you think about it, driverless cars are pretty clever tech. Even just the ability of a car's on-board computer to recognise obstacles in real-time and direct the vehicle to safely avoid them is clever stuff. However, it's only one application of the tech broadly known as 'computer vision' that's making waves. Facial recognition is fast becoming a hot-button issue, but putting the ethics, security and privacy issues aside for the moment, how does it actually work? It's a huge, complex research area globally, so this month, we're looking at the basics of computer vision and taking our first peek into another broad buzzword area in tech at the moment – 'deep learning'.


If there's one part of the human body we're still far from figuring out, it's the brain. That grey matter inside your skull is a complex network of cells or 'neurons' joined together by connections called 'synapses'. This structure is commonly said to be the inspiration for a type of machine learning popularly known as 'deep learning'. Google's TensorFlow framework is built around it, and Facebook uses it in face recognition.

At its simplest, deep learning attempts to model this brain-like structure through a group of machine learning techniques collectively called 'artificial neural networks' (ANNs), which are basically an interconnected network of artificial neurons. There are different types of neurons, but a neuron in machine learning is just a mathematical building-block or equation that takes in several input parameters, weighs up those inputs according to different criteria and produces an output. Now don't get the idea that deep learning fell from the sky yesterday; it didn't – the neuron I've just described is called the 'perceptron', and it was developed back in 1957.


If you want to draw it out, you can see an example of a perceptron in Figure 1. It has four inputs with values x0, x1, x2 and x3, respectively. Each of these inputs also has a weight, w0, w1, w2 and w3, usually a decimal number between 0 and 1 that's multiplied with its associated input to determine how much importance we give it. These values are then added together to give a single number – if that number is less than or equal to a certain threshold, the output of the perceptron is considered '0'; if it's greater than that threshold, the output is set to '1'.


We can use this perceptron to help us decide whether to buy the iPhone XS Max phone or the Pixel 3 XL. Let's say there are four features we're interested in – screen size (x0), price (x1), battery life (x2) and CPU speed (x3). All four inputs are binary, which means if we decide the Pixel 3 XL wins on a feature, we'll score it '1', if the XS wins, it scores a '0'. However, not all four features have the same importance to us – for example, price is more important than CPU speed and battery life more important than screen size. So, we create weight values for these inputs – the higher the weighting, the more important it is. Let's say we give screen size (w0) a weight of 0.2, price (w1) 0.8, battery life (w2) 0.6 and CPU speed (w3) 0.4. Now we look at the phone specs:

The iPhone XS Max has a larger screen, so the x0 input is '0' (i.e. the XS Max wins).

The Pixel 3 XL is cheaper, so the x1 input is '1'.

The Pixel 3 XL has longer battery life, so the x2 input is '1'.

The iPhone XS Max has greater speed, so the x3 input is '0'.

We'll set our threshold level to '1' – that means if the weighted sum is greater than '1', we buy the Pixel 3 XL; otherwise, it's the iPhone XS Max. The way we calculate the perceptron output is by doing this:

output = (w0 × x0) + (w1 × x1) + (w2 × x2) + (w3 × x3)

= (0.2 × 0) + (0.8 × 1) + (0.6 × 1) + (0.4 × 0) = 0 + 0.8 + 0.6 + 0 = 1.4

Since the output of 1.4 is greater than one, we buy the Pixel 3 XL. No doubt you can see that the final result depends highly on the weights and the threshold level – change these around and you get a completely different result. Technically, there's no limit to the number of inputs you can include, either.
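The whole worked example fits in a few lines of R (a minimal illustration – the `perceptron` function and the variable names are ours, not from any library):

```r
# Perceptron for the phone-buying example: four binary inputs, fixed weights.
# An input of '1' means the Pixel 3 XL wins that feature; '0' means the XS Max does.
perceptron <- function(x, w, threshold = 1) {
  weighted_sum <- sum(w * x)   # w0*x0 + w1*x1 + w2*x2 + w3*x3
  if (weighted_sum > threshold) 1 else 0
}

x <- c(0, 1, 1, 0)             # screen, price, battery, CPU results from the article
w <- c(0.2, 0.8, 0.6, 0.4)     # importance weights

sum(w * x)                     # weighted sum: prints 1.4
perceptron(x, w)               # 1, i.e. buy the Pixel 3 XL
```

Try flipping a weight or two – say, make screen size 0.9 and price 0.1 – and you'll see the decision change, exactly as described above.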

Sure, this might be a simple example. Nevertheless, it's how we often go through the buying process – we weigh up the factors most important to us, compare features and make a decision. A perceptron just makes this a more mathematical process.


In reality (thankfully), the neurons in our brains aren't quite that simple. Still, where it gets really interesting from a machine-learning viewpoint is when you start combining these perceptrons together to form a network. One of the simplest artificial neural networks is called the 'multilayer perceptron' or 'feed-forward neural network' shown in Figure 2. It consists of three layers of perceptrons – the first layer is the 'input layer', which takes the input values and applies each value to each input perceptron. That's followed by the 'hidden layer'. The perceptrons in this layer take their inputs from the output of each of the input layer perceptrons. Finally, the outputs from the hidden layer are added together at the 'output layer', where the final output is taken. It is possible to have more than one hidden layer, but in reality, few applications need a second one.

How many perceptrons do you need in total? Basically, you need one perceptron for each input on your input layer, and often you only need one perceptron at the output. There's no hard and fast rule for setting the number of hidden-layer perceptrons, but typically you choose fewer than the input layer and more than the output layer. In the drawn example, we've chosen two.
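A forward pass through such a network can be sketched in base R. This is a toy illustration with made-up random weights (four inputs, two hidden neurons, one output); we've used a sigmoid 'squashing' function in place of the perceptron's hard threshold, which is the usual choice in trained networks:

```r
# Forward pass through a tiny multilayer perceptron: 4 inputs -> 2 hidden -> 1 output.
sigmoid <- function(z) 1 / (1 + exp(-z))

forward <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)   # two hidden-layer activations
  sigmoid(W2 %*% hidden + b2)        # single output, always between 0 and 1
}

set.seed(42)                              # arbitrary weights, purely for illustration
W1 <- matrix(runif(8, -1, 1), nrow = 2)   # 2 x 4 input-to-hidden weights
b1 <- runif(2, -1, 1)
W2 <- matrix(runif(2, -1, 1), nrow = 1)   # 1 x 2 hidden-to-output weights
b2 <- runif(1, -1, 1)

forward(c(0, 1, 1, 0), W1, b1, W2, b2)    # a value between 0 and 1
```

'Training' a network like this means nudging those weight matrices until the outputs match the answers you want – which is what the nnet package does for us later on.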


As we mentioned before, computer vision is a popular application of ANNs. The data science competition website, Kaggle, has an open competition called 'Digit Recognizer'. It's based on the popular MNIST dataset of handwritten images. It consists of 70,000 images; each image is 28x28 pixels (784 pixels in total) and shows a handwritten digit from 0 to 9. The image data has been translated so that each image is now a single row in the dataset, and each column covers one of those 784 pixels with a value between 0 and 255 representing the 8-bit greyscale colour of that pixel. Add in a class attribute – a label of the actual digit the image's pixels represent – and you have 785 columns. The Kaggle version consists of 42,000 training images and 28,000 test images. Our task is to learn a model from the 42,000 training images, then use that model to predict the digits drawn in each of the 28,000 test images (which have no digit labels), submit the predictions to Kaggle and see how good our model is.
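To see how that flattening works, here's a quick sketch in R using a fake 28x28 image (the pixel values are random, purely for illustration – a real MNIST row comes straight out of train.csv):

```r
# How one MNIST image becomes a row of the dataset: a 28x28 grid of
# 8-bit greyscale values (0-255) is flattened into 784 columns, plus a label.
set.seed(1)
image  <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28)  # fake 'digit'
pixels <- as.vector(t(image))   # flatten row by row: 784 values
label  <- 7                     # the digit this image is supposed to show

row <- c(label, pixels)
length(row)                     # 785 columns, matching the Kaggle layout
```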

Over the last few months, we've used the R programming language and RStudio IDE to create submissions for Kaggle competitions. These are both free and open-source – you'll find R at the CSIRO's CRAN mirror and the free RStudio Desktop at rstudio.com/products/rstudio/download. You need to install R before RStudio, then launch RStudio. You'll find the dataset for this competition at kaggle.com/c/digit-recognizer/data. You also need a Kaggle account to get the data – it's free to sign up.


Head to www.kaggle.com/c/digit-recognizer/data and half-way down, you'll see a 'Download All' button on the right side. Click it. The file is 15MB. Unzip it and you'll see three files – sample_submission.csv, test.csv and train.csv. The first is the format your submission file must take to be accepted by Kaggle's input systems. train.csv is the file you create your model from, while test.csv is the one you test your model against.


We don't have enough space to go through the whole process this month, so download our R-script from the website at apcmag.com/magstuff. Load it into RStudio. Before you run it, you'll need to run a couple of commands in RStudio's console area:

install.packages('caret')
install.packages('nnet')

The first command installs the 'caret' package, which includes an excellent confusion matrix function, while the second provides us a feed-forward neural network library.


Once the library packages are installed, press the 'Source' button on the top-right of the code editor panel in RStudio and it'll go to work. Our source script is only single-threaded – that means it'll take a bit of time (about five minutes) to build the neural network model. Just briefly, our R-script takes the training dataset and splits it 66:34 into two subsets, one for training the model and one for testing or 'validating' it. The confusion matrix tells us how well our neural network model did on predicting the digit images in the validation subset. Once that's done, we then let the model loose on the actual test dataset. When complete, we write the results automatically to a .csv file. You submit that file to Kaggle and see how you go on the leaderboard. This, we did – but only managed 2,623rd position out of 2,718. That's pretty low, but not really surprising. This initial script does work, but so that you're not sitting around forever waiting for it, we designed it more for speed than accuracy. Next time, we'll explain how you tune a neural network to improve accuracy – and get a better Kaggle score!
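In outline, the steps just described look something like this (a rough sketch under our own naming, not our exact script – it assumes train.csv and test.csv sit in your working directory, and the hidden-layer size of 10 is just one reasonable choice):

```r
library(caret)   # createDataPartition(), confusionMatrix()
library(nnet)    # nnet() feed-forward neural network

train <- read.csv("train.csv")
train$label <- as.factor(train$label)   # treat digits as classes, not numbers

# Split the training data 66:34 into training and validation subsets
idx   <- createDataPartition(train$label, p = 0.66, list = FALSE)
model <- nnet(label ~ ., data = train[idx, ],
              size = 10, MaxNWts = 10000, maxit = 100)

# Confusion matrix: how well did we predict the validation subset?
preds <- predict(model, train[-idx, ], type = "class")
print(confusionMatrix(factor(preds, levels = levels(train$label)),
                      train[-idx, ]$label))

# Let the model loose on the unlabelled test set and write the submission file
test <- read.csv("test.csv")
out  <- data.frame(ImageId = 1:nrow(test),
                   Label   = predict(model, test, type = "class"))
write.csv(out, "submission.csv", row.names = FALSE)
```

The resulting submission.csv is what you drag onto the Kaggle submission page.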

Numbers on the diagonal show how well our predictions match.

A perceptron takes several inputs, weights them and creates a single output.

A feed-forward neural network features several multi-connected perceptrons.

Our first neural network R-script written in RStudio.

Kaggle's Digit Recognizer comp is a great way to practise your ANNs.

The MNIST dataset is the 'hello world' of computer vision learning.

Drag your output csv file to the Kaggle page, press the Submit button below.

Run the two 'install.packages' console commands before you run the script.
