Computer vision and artificial neural networks
Can a computer really think like a human brain and recognise objects in an image? Darren Yates explains how it’s done using machine-learning.
Any way you think about it, driverless cars are pretty clever tech. Even just the ability of a car’s on-board computer to recognise obstacles in real-time and direct the vehicle safely around them is clever stuff. However, it’s only one application of the tech broadly known as ‘computer vision’ that’s making waves. Facial recognition is fast becoming a hot-button issue but, putting the ethics, security and privacy questions aside for the moment, how does it actually work? It’s a huge, complex research area globally, so this month, we’re looking at the basics of computer vision and taking our first peek into another broad buzzword area in tech at the moment – ‘deep learning’.
DEEP LEARNING 101
If there’s one part of the human body we’re still far from figuring out, it’s the brain. That grey matter inside your skull is a complex network of cells or ‘neurons’ joined together by connections called ‘synapses’. This structure is commonly said to be the inspiration for a type of machine-learning popularly known as ‘deep learning’. Google offers it as a service known as ‘TensorFlow’ and Facebook uses it in face recognition.
At its simplest, deep learning attempts to model this brain-like structure through a group of machine-learning techniques collectively called ‘artificial neural networks’ (ANNs), which are basically an interconnected network of artificial neurons. There are different types of neurons, but a neuron in machine-learning is just a mathematical building-block or equation that takes in several input parameters, weighs up those inputs according to different criteria and produces an output. Now don’t get the idea that deep learning fell from the sky yesterday; it didn’t – the neuron I’ve just described is called the ‘perceptron’, and it was developed by Frank Rosenblatt back in 1957.
If you want to draw it out, you can see an example of a perceptron in Figure 1. It has four inputs with values x0, x1, x2 and x3, respectively. Each of these inputs also has a weight – w0, w1, w2 and w3 – usually a decimal number between 0 and 1 that’s multiplied by its associated input to determine how much importance we give it. These weighted values are then added together to give a single number – if that number is less than or equal to a certain threshold, the output of the perceptron is considered ‘0’; if it’s greater than that threshold, the output is set to ‘1’.
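That’s simple enough to sketch in a few lines of code. Here it is in Python (our scripts later in the article use R – this is just an illustration, and the input and weight values you pass in are whatever your problem needs):

```python
# A perceptron: multiply each input by its weight, sum the results,
# then output 1 if the sum clears the threshold, 0 otherwise.
def perceptron(inputs, weights, threshold):
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0
```

With four inputs and four weights, this is exactly the Figure 1 structure.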
HOW IT WORKS
We can use this perceptron to help us decide whether to buy the iPhone XS Max phone or the Pixel 3 XL. Let’s say there are four features we’re interested in – screen size (x0), price (x1), battery life (x2) and CPU speed (x3). All four inputs are binary, which means if we decide the Pixel 3 XL wins on a feature, we’ll score it ‘1’, if the XS wins, it scores a ‘0’. However, not all four features have the same importance to us – for example, price is more important than CPU speed and battery life more important than screen size. So, we create weight values for these inputs – the higher the
weighting, the more important it is. Let’s say we give screen size (w0) a weight of 0.2, price (w1) 0.8, battery life (w2) 0.6 and CPU speed (w3) 0.4. Now we check the phones’ specs:
The iPhone XS Max has a larger screen, so the x0 input is ‘0’ (i.e. the XS Max wins).
The Pixel 3 XL is cheaper, so the x1 input is ‘1’.
The Pixel 3 XL has longer battery life, so the x2 input is ‘1’.
The iPhone XS Max has greater CPU speed, so the x3 input is ‘0’.
We’ll set our threshold level to ‘1’ – that means if the weighted sum is greater than ‘1’, we buy the Pixel 3 XL; otherwise, it’s the iPhone XS Max. The way we calculate the perceptron output is by doing this:
output = w0x0 + w1x1 + w2x2 + w3x3
= (0.2 x 0) + (0.8 x 1) + (0.6 x 1) + (0.4 x 0) = 0 + 0.8 + 0.6 + 0 = 1.4
Since the output of 1.4 is greater than one, we buy the Pixel 3 XL. No doubt you can see that the final result depends highly on the weights and the threshold level – change these around and you get a completely different result. Technically, there’s no limit to the number of inputs you can include, either.
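Here’s that same calculation as a short Python sketch, along with a second, made-up set of weights showing how reshuffling your priorities flips the decision:

```python
weights = [0.2, 0.8, 0.6, 0.4]   # screen size, price, battery life, CPU speed
inputs  = [0, 1, 1, 0]           # Pixel 3 XL wins on price and battery life

# Weighted sum, then compare against our threshold of 1.
score = sum(w * x for w, x in zip(weights, inputs))
decision = 'Pixel 3 XL' if score > 1 else 'iPhone XS Max'   # score is 1.4

# Hypothetical reshuffle: CPU speed now matters most, price much less.
alt_weights = [0.2, 0.4, 0.6, 0.8]
alt_score = sum(w * x for w, x in zip(alt_weights, inputs))
alt_decision = 'Pixel 3 XL' if alt_score > 1 else 'iPhone XS Max'  # score is 1.0
```

Same phones, same specs – different weights, different phone in your pocket.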
Sure, this might be a simple example. Nevertheless, it’s how we often go through the buying process – we weigh up the factors most important to us, compare features and make a decision. A perceptron just makes this a more mathematical process.
In reality (thankfully), the neurons in our brains aren’t quite that simple. Still, where it gets really interesting from a machine-learning viewpoint is when you start combining these perceptrons to form a network. One of the simplest artificial neural networks is the ‘multilayer perceptron’ or ‘feed-forward neural network’ shown in Figure 2. It consists of three layers of perceptrons – the first is the ‘input layer’, which takes the input values and applies each value to each input perceptron. That’s followed by the ‘hidden layer’, whose perceptrons take their inputs from the outputs of the input-layer perceptrons. Finally, the outputs from the hidden layer are combined at the ‘output layer’, where the final output is taken. It’s possible to have more than one hidden layer – in fact, stacking many hidden layers is what puts the ‘deep’ in deep learning – but for a simple task like ours, one hidden layer is often enough.
How many perceptrons do you need in total? Basically, you need one perceptron for each input on your input layer and often, you only need one perceptron at the output. However, there’s no hard and fast rule for setting the number of hidden layer perceptrons, but typically you choose fewer than the input layer and more than the output layer. In this drawn example, we’ve chosen two.
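You can sketch the Figure 2 layout in Python by wiring three layers of that same thresholded neuron together. The weights and thresholds below are invented purely for illustration – in a real network, they’d be learned from training data:

```python
# One thresholded neuron, as before.
def neuron(inputs, weights, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

# Four inputs -> two hidden perceptrons -> one output perceptron.
def feed_forward(inputs):
    hidden = [
        neuron(inputs, [0.5, 0.5, 0.5, 0.5], 0.9),  # hidden perceptron 1
        neuron(inputs, [0.9, 0.1, 0.1, 0.9], 0.5),  # hidden perceptron 2
    ]
    return neuron(hidden, [0.6, 0.6], 0.5)          # output perceptron

result = feed_forward([1, 1, 0, 0])
```

Each hidden perceptron sees every input, and the output perceptron sees every hidden output – that’s the ‘multi-connected’ part.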
As we mentioned before, computer vision is a popular application of ANNs. The data science competition website, Kaggle, has an open competition called ‘Digit Recognizer’. It’s based on the popular MNIST dataset of handwritten images. MNIST consists of 70,000 images; each image is 28x28 pixels (784 pixels in total) and shows a handwritten digit from 0 to 9. The image data has been flattened so that each image is now a single row in the dataset and each column covers one of those 784 pixels, with a value between 0 and 255 representing the 8-bit greyscale shade of that pixel. Add in a class attribute – a label giving the actual digit the image represents – and you have 785 columns. The Kaggle version consists of 42,000 training images and 28,000 test images. Our task is to learn a model from the 42,000 training images, use that model to predict the digits drawn in each of the 28,000 test images (which have no digit labels), submit the predictions to Kaggle and see how good our model is.
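That flattening step is worth seeing in code. Here’s a Python sketch using a dummy all-zero image and a made-up label (in Kaggle’s train.csv, the label sits in the first column):

```python
# A 28x28 greyscale image as a grid of pixel values (0-255); all zeros here.
image = [[0] * 28 for _ in range(28)]
label = 7                                   # dummy label for this image

# Flatten the grid row by row into one long list of 784 pixel values.
flat = [pixel for row in image for pixel in row]
record = [label] + flat                     # 1 label + 784 pixels = 785 columns
```

Every one of the 42,000 training images becomes one such 785-value row.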
Over the last few months, we’ve used the R programming language and RStudio IDE to create submissions for Kaggle competitions. These are both free and open-source – you’ll find R at the CSIRO’s CRAN mirror (cran.csiro.au) and the free RStudio Desktop at rstudio.com/products/rstudio/download. You need to install R before RStudio, then launch RStudio. You’ll find the dataset for this competition at kaggle.com/c/digit-recognizer/data. You also need a Kaggle account to get the data – it’s free to sign up.
GRAB THE DATA
Head to www.kaggle.com/c/digit-recognizer/data and half-way down, you’ll see a ‘Download All’ button on the right side. Click it. The file is 15MB. Unzip it and you’ll see three files – sample_submission.csv, test.csv and train.csv. The first shows the format your submission file must take to be accepted by Kaggle’s input systems; train.csv is the file you create your model from; test.csv is the one you test your model against.
GRAB THE SOURCE
We don’t have enough space to go through the whole process this month, so download our R-script from the website at http://apcmag.com/magstuff. Load it into RStudio. Before you run it, you’ll need to run a couple of commands in RStudio’s console area:

install.packages('caret')
install.packages('nnet')
The first command installs the ‘caret’ package, which includes an excellent confusion matrix function, while the second provides us a feed-forward neural network library.
RUN THE CODE
Once the library packages are installed, press the ‘Source’ button on the top-right of the code editor panel in RStudio and it’ll go to work. Our source script is only single-threaded – that means it’ll take a bit of time (about five minutes) to build the neural network model. Just briefly, our R-script takes the training dataset and splits it 66:34 into two subsets, one for training the model and one for testing or ‘validating’ it. The confusion matrix tells us how well our neural network model did at predicting the digit images in the validation subset. Once that’s done, we let the model loose on the actual test dataset. When complete, the script automatically writes the results to a .csv file. You submit that file to Kaggle and see how you go on the leaderboard. This, we did – but only managed 2,623rd position out of 2,718. That’s pretty low, but not really surprising. This initial script does work but, so that you’re not sitting around forever waiting for it, we designed it more for speed than accuracy. Next time, we’ll explain how you tune a neural network to improve accuracy – and get a better Kaggle score!
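To get a feel for what the script is doing under the hood, here’s a pure-Python sketch of the 66:34 split and the confusion matrix idea. The labels are random stand-ins, not real MNIST data, and caret’s confusionMatrix function does considerably more than this:

```python
import random

random.seed(1)
labels = [random.randrange(10) for _ in range(100)]   # stand-in digit labels

# 66:34 split into training and validation subsets.
cut = int(len(labels) * 0.66)
train, validate = labels[:cut], labels[cut:]

# A confusion matrix counts (actual, predicted) pairs;
# the diagonal holds the correct predictions.
def confusion_matrix(actual, predicted, n_classes=10):
    m = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

# With perfect predictions, every count lands on the diagonal.
m = confusion_matrix(validate, validate)
accuracy = sum(m[i][i] for i in range(10)) / len(validate)
```

Any count that lands off the diagonal is a digit the model got wrong – which is exactly what you look for when tuning.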
Numbers on the diagonal show how well our predictions match.
A perceptron takes several inputs, weights them and creates a single output.
A feed-forward neural network features several multi-connected perceptrons.
Our first Neural Network R-script written in RStudio.
Kaggle’s Digit Recognizer comp is a great way to practise your ANNs.
The MNIST dataset is the ‘hello world’ of computer vision learning.
Drag your output csv file to the Kaggle page, press the Submit button below.
Run the two ‘install.packages’ console commands before you run the script.