How image recognition works
Machine-learning can learn almost anything – even to recognise images. Darren Yates explains how, using the Python programming language and the Scikit-Learn library.
Image recognition is one of the hottest areas right now, not just in ‘artificial intelligence’ circles, but in tech in general. It’s used in everything from detecting postal addresses on mail to faces in the street and weeds in the paddock. But how does it work? How does AI not only distinguish between different images, but recognise what an image contains?
PATTERNS AND ALGORITHMS
This will sound dopey, but go with me for a sec. Take a look at your family photos and you’ll recognise your family members. Why? Because you’ve seen them before – your brain has been ‘trained’ to recognise those people by associating different images with different people.
Machine-learning is no different. Most machine-learning algorithms can’t do much on their own – they need to be trained with examples of the things you want them to learn. In the process of that training, they produce a recipe or ‘model’ that captures what was learned. The fuel for machine-learning is data, and invariably this data looks like a standard spreadsheet. Across the top you have columns, where each column represents a particular feature of the objects you want the algorithm to learn. For example, say we’re trying to learn different vehicle types from their features or ‘attributes’ – those attributes would include engine size, number of doors, number of wheels and so on. One of those columns is the category or ‘class’ the object belongs to – for example, truck, car, motorcycle and so on. Each row of the spreadsheet represents one complete example of an object. To be able to recognise or ‘classify’ different objects, a machine-learning algorithm is essentially looking for patterns in the data.
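As a sketch of that spreadsheet idea in Python, here’s a toy version of the vehicle table – the feature values are invented purely for illustration:

```python
# A toy version of the 'spreadsheet' described above: each row is one
# vehicle, each column a feature, and the last entry is the class label.
# Columns: engine size (litres), doors, wheels, class.
vehicles = [
    (5.0, 2, 6, "truck"),
    (1.8, 4, 4, "car"),
    (0.6, 0, 2, "motorcycle"),
]

# A learning algorithm would hunt for patterns across many rows like
# these -- for example, "2 wheels usually means motorcycle".
for engine, doors, wheels, label in vehicles:
    print(f"{engine}L, {doors} doors, {wheels} wheels -> {label}")
```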
LOTS OF PIXELS
Any digital image is a series of pixels. If you have a basic 320x240-pixel image, you have a series of dots or ‘pixels’ arranged in a 320 column-by-240 row grid, where each dot is a particular colour. If it’s a standard digital photo, it’ll likely be a 24-bit image, meaning that each pixel has three separate one-byte (256 levels) values of red, green and blue. The three bytes are typically combined into one 24-bit number representing one of 16.7 million possible colours.
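Here’s a quick Python sketch of how those three one-byte channels combine into a single 24-bit number – the function names are just for illustration:

```python
# Pack three 8-bit colour channels (0-255 each) into one 24-bit value:
# red in the top byte, green in the middle, blue at the bottom.
def pack_rgb(r, g, b):
    return (r << 16) | (g << 8) | b

# Reverse the process: pull the three bytes back out of the 24-bit number.
def unpack_rgb(colour):
    return (colour >> 16) & 0xFF, (colour >> 8) & 0xFF, colour & 0xFF

white = pack_rgb(255, 255, 255)
print(hex(white))            # 0xffffff
print(unpack_rgb(white))     # (255, 255, 255)
print(2 ** 24)               # 16777216 -- the '16.7 million' colours
```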
If we want a machine-learning algorithm to learn to recognise a series of 320x240-pixel images, we need to create a ‘record’ (spreadsheet row) for each image such that all of those pixels sit on a single row. In this case, we have to take all 240 rows of pixels and place them side-by-side, so that instead of a 320x240-cell spreadsheet, we create one row with 76,800 columns, plus one extra ‘class’ column or ‘attribute’ that says what the image is.
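In Python, with the NumPy library (assumed to be installed), that flattening step is a one-liner:

```python
import numpy as np

# Simulate a 240-row x 320-column greyscale image -- one value per pixel.
# (A real colour photo would carry three values per pixel.)
image = np.zeros((240, 320))

# Lay all 240 rows side by side to form a single record of 76,800 columns,
# exactly as described above.
row = image.reshape(-1)
print(row.shape)   # (76800,)
```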
So, there are two things happening – the algorithm has to learn the patterns within those 76,800 columns that distinguish one image from another, and it has to associate those patterns with the different class values.
HAND-WRITTEN DIGITS
A really simple example of this is available in the Python Scikit-Learn library, called ‘digits’. It’s a series of 1,797 images of hand-written digits 0 to 9. Each image is 8x8 pixels, so pretty tiny, but in order to be usable in machine-learning, this ‘dataset’ exists as a series of 1,797 rows, each with 64 columns, plus one ‘class’ column that labels the row with the digit it is supposed to be. Once an algorithm has created a model from these images, that model can be used to identify or ‘classify’ similar images it has not previously seen.
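If you have Scikit-Learn installed, you can inspect the dataset for yourself:

```python
from sklearn.datasets import load_digits

# Load the bundled 'digits' dataset: 1,797 hand-written digit images,
# already flattened to 64-column rows, with the labels in 'target'.
digits = load_digits()
print(digits.data.shape)       # (1797, 64) -- one row per image
print(digits.target[:10])      # the class labels of the first ten images
print(digits.images[0].shape)  # (8, 8) -- the same data kept as grids
```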
This is an example of what’s called ‘supervised’ learning, because the algorithm is essentially given the answers in the data. Think back to when you learned maths at school – you were given examples to do and those examples had answers. In machine-learning, this is the ‘training’ phase. Once you learned how to do a particular maths task, you had to prove you knew how to do it in an exam by providing your own answers. Not surprisingly, this is called the ‘testing’ phase in machine-learning too.
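That train-then-test cycle can be sketched in a few lines of Scikit-Learn code – assuming the library is installed; the 50/50 split and the gamma value here follow the convention of the official hand-written digits demo:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()

# Split the 1,797 labelled images: half for training, half for testing.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# 'Training' phase: the algorithm learns from examples with answers.
model = SVC(gamma=0.001)
model.fit(X_train, y_train)

# 'Testing' phase: the model supplies its own answers for unseen images.
print(f"accuracy: {model.score(X_test, y_test):.2f}")
```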
GET THE CODE
Grab a copy of Python for your PC from python.org/downloads and install it. Launch a command prompt in your user folder (\users\ followed by your username). In the command prompt, type the following, hitting the Enter key after each line:

pip install -U scikit-learn
pip install -U matplotlib
Once you’ve done that, head over to the scikit-learn website and the ‘Recognizing hand-written digits’ page (tinyurl.com/y5tchmgw). Scroll down to the bottom of the page and click on the ‘Download Python source code… py’ button. When the download is complete, open up the IDLE integrated development environment, select File, Load and load up the file you just downloaded. When you’re ready, press the F5 key or select ‘Run’, ‘Run module’ from the menu.
After a few seconds, you’ll get output
on the Python Shell window, plus a second window labelled ‘Figure 1’ with some admittedly dire-looking handwritten digits.
WHAT’S IT ALL MEAN?
Looking at ‘Figure 1’ first, the top row shows four of the ‘training’ images – these are 8x8-pixel images with a ‘class’ value identifying the number each image is supposed to show. Underneath are examples of the ‘testing’ images – these show an image, plus the predicted ‘class’ value the learned model thinks the image represents. Or, in other words, the number the model thinks the image looks like.
Now looking back at the Python Shell output, the first thing produced is an accuracy-by-class report. It shows how accurately the model recognises the images for each of their possible values, 0 through to 9. You can see down the ‘precision’ column that accuracy was at worst 0.93 (93%) for digit ‘8’ and at best 1.0 (100%) for digit ‘0’. The overall or ‘weighted’ average is 0.97 (97%). The ‘support’ column is the number of test images of that particular digit, with a total of 899 images used for testing.
Underneath is what’s called a ‘confusion matrix’, not because it causes confusion, but as a way of understanding the difference between what an image really is (going down the rows) and what the model thinks or ‘predicts’ it is (across columns). Think of it as the difference between the correct answers in an exam and the answers you give. Ideally, you should have numbers only going down the centre-diagonal and the rest should be ‘0’. This would be 100% accuracy.
TRY IT YOURSELF
The algorithm used in this example is called a ‘support vector machine’ (SVM), but you could use a decision tree or a multi-tree algorithm such as ‘RandomForest’ to get similar results. Image recognition has applications in many areas, so it’s well worth learning at least the basics of how it works.
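As a sketch of how little changes when you swap algorithms – again assuming Scikit-Learn is installed, with illustrative hyperparameter choices:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Swap the SVM for a multi-tree 'RandomForest' -- the rest of the
# train-then-test workflow is identical, which is much of the appeal.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(f"RandomForest accuracy: {forest.score(X_test, y_test):.2f}")
```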