Linux Format

Machine learning

Jonni Bidwell challenges you to be anything but agog at the combined forces of neural networks and open source learning


Humanity has a problem: smart machines. Jonni Bidwell is the man with the cure as he explores how open source AI is taking over the world.

Most people will have heard of machine learning (ML). If you’ve been anywhere near the internet lately you’ll almost certainly have been fed data that was combobulated using machine-learning algorithms, and your usage will have fed into training models to advance other such algorithms.

In 1958, when the Mark 1 Perceptron (an early neural network for image recognition) was built, pundits declared it “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” The field has advanced a long way since then, and though we haven’t yet seen conscious machines (or a Skynet-esque robot uprising), there have been some amazing results. ML has contributed to drug discovery, finding exoplanets and better Netflix recommendations for everyone.

Thanks to the availability of huge datasets, advances in hardware such as GPUs and TPUs, and the development of tools (especially FOSS ones) that greatly simplify training models, ML is now fairly ubiquitous. Much of the growth has been spurred by that of the industrial-academic complex. Boffins have been studying formal ML techniques from a theoretical standpoint since the 50s (and one finds germs of the field some 200 years earlier). Yet the hardware (or more correctly, the budget to build the hardware) needed to do anything practical was only available to the likes of IBM or government labs. But now, through new partnerships, innovations and that dragon, cloud computing, access has become much more democratised.

The general rubric “artificial intelligence” covers lots of techniques and easily leads into science fiction. Somewhere beneath this heading we find the discipline of machine learning, which has been enjoying a lot of attention while remaining firmly within the realms of science fact.

ML, in the loosest possible sense, refers to any system that uses data to solve a problem, rather than relying on any pre-programmed rules or algorithms. This may be something fairly parochial, such as finding the best line through a set of points, or something more advanced, like computer vision or sentiment analysis.

ML turns traditional programming on its head. Instead of supplying some input to an algorithm that then outputs an answer, ML takes a set of data points (pairs of inputs and outputs) and comes up with something like an algorithm connecting them, so that when fed a new data point it should be able to classify it (match it with an output) correctly.

We shouldn’t mock the line-fitting example – not when it works, anyway. Without any knowledge of what the data represents, the machine has come up with a general rule that can be applied to new data to produce meaningful results. The more you think about it, the more you realise how powerful it is. Imagine the amazement of the Pythagoreans when their leader unveiled the relationship between sides and hypotenuse.
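To make that concrete, here’s a minimal sketch of the line-fitting example using NumPy (the data points are invented purely for illustration): the ‘training’ step learns a slope and intercept, and the learned rule is then applied to a point the machine has never seen.

```python
import numpy as np

# Training data: pairs of inputs and outputs
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2])

# 'Learn' a rule connecting them: the best-fitting straight line
slope, intercept = np.polyfit(x, y, deg=1)

# Apply the learned rule to a new data point
new_x = 6.0
print(f"Predicted y for x={new_x}: {slope * new_x + intercept:.2f}")
```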

Looking past the headlines

There’s a great deal of hype surrounding ML. It can solve a lot of problems and will be able to solve many more in the future, but you’re unlikely to get rich with a cobbled-together stocks and shares trading bot. Nor is it some terrible instrument that will end humanity. We’re constantly being reminded that all our jobs will be automated before too long. Production line workers and drivers are often singled out, but in all likelihood, given time, very few professions will be spared from this wave of automation. Automated journalism (mon Dieu! – Ed) is already used for stories involving numbers and stats. The first reporting on a 2014 earthquake was written by LA Times journalist Ken Schwencke’s Quakebot system.

So should our hard-working scriveners at Linux Format Towers be worried? It’s hard to codify how to come up with good, original tutorial ideas, or which topics to cover and how to connect them in a feature. Then there’s a lot of behind-the-scenes work: setting up test environments, fighting with different ways of doing things and running into unexpected bugs. And then there’s the actual writing itself, where one draws from one’s extensive literary prowess while at the same time contributing one’s own unique wry (and vaguely pretentious – Ed) style. It’ll be a while before machines can do all of this, but the field of Natural Language Processing is advancing swiftly. Also advancing is the decline of print media, so the bean-counters will probably get to us before the robots do.

Open source learning

As with so much data science work (see our interview with Amy Boyle in LXF207) Python and NumPy, which adds efficient structures for numerical computation, are the FOSS weapons of choice. However, the big name in machine learning is Google’s TensorFlow, an open source library that can be used with Python, C++, OpenCL or any combination of these and more. On top of TensorFlow you can use the Keras library for neural networks. Theano was popular and pioneered some of the techniques that are commonplace in tools today (in particular, transparent GPU acceleration), but development ceased last year. Our egghead-at-large Mihalis will introduce some of these tools over on page 88. And now we’ll attempt to demystify neural networks, a central concept in ML…

Inspired by the billions of neurons in animal brains and the squillions of connections between them, researchers have built their own (albeit smaller) networks that, in their own artificial way, can learn to perform specific tasks. Analogies to biological brains break down pretty quickly, but that doesn’t stop them being incredibly powerful and useful.

At one end of our network, input data is represented by a layer (the input layer) of artificial neurons. That layer then feeds to another layer, with each neuron being connected to all neurons in the previous layer. Each neuron outputs some value depending on the input it receives, and that value feeds into all neurons in the next layer. There’s also a final layer – the output layer – that represents the network’s decision.

For the network to be useful, the effect of each neuron on those of the next layer needs to be manipulated. Starting from some initial configuration we feed the network data, and if we don’t like the conclusions then we tweak the network to improve the results. After sufficient training, if our neural network is a success, we should be able to input new data that wasn’t part of the training set and have it categorised correctly.

This is quite tricky to get a handle on, but it becomes clearer when illustrated by example. And the classic example is handwriting analysis, or at least the subset of it that concerns recognising the digits 0 to 9. As Mihalis mentions in his coding tutorial, there’s a canonical dataset (the MNIST database) that consists of 70,000 images of hand-drawn digits in the form of 28x28 greyscale images. This has become the de facto standard for building ‘my first neural network’.
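If you have the TensorFlow-bundled Keras to hand, grabbing that dataset is a one-liner (a quick sketch – the download happens automatically on first use, and it arrives already split into training and test sets):

```python
from tensorflow import keras

# 60,000 training images and 10,000 test images, each 28x28 greyscale,
# plus the matching labels (the digit each image actually shows)
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)
```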

The problem in hand

Here, each neuron in our input layer stores the greyscale value of a particular pixel. So that layer consists of a not-inconsiderable 784 neurons, each having 256 possible values (they are eight-bit images). Brushing over what happens in the middle layers for now, let’s consider the output layer. We want to be able to classify an input image as one of ten digits, so in our final layer let’s have 10 nodes (we’ll stop calling them neurons now), each of which measures ‘closeness’ to each digit.
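In Keras terms, that description translates into something like the following (the size and number of hidden layers here are plucked out of the air purely for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(784,)),              # one input neuron per pixel
    layers.Dense(128, activation="relu"),    # a hidden layer (more on these shortly)
    layers.Dense(10, activation="softmax"),  # ten output nodes, one per digit
])
model.summary()  # prints the layer sizes and the number of tweakable parameters
```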

In a classic example of 0-indexed list confusion then, if our network is confident that we fed it an image of a 1, then the value of the second node in the output layer should be much higher than the values of the other nine nodes. Since this isn’t an exact process there’ll be some inherent noise; the other outputs won’t be exactly zero. We’ve all encountered hastily scrawled 3s that look like 5s, and when our network encounters such a thing we’d expect nodes 4 and 6 to have much higher values than the rest. If the network really has no idea, then we’d expect all the output nodes to have similar values.
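Reading off the network’s decision is then just a matter of picking the largest output node. A toy example with made-up output values for one of those scruffy 3s:

```python
import numpy as np

# Hypothetical output layer values for a hastily scrawled 3
output = np.array([0.01, 0.02, 0.05, 0.55, 0.03, 0.25, 0.03, 0.02, 0.02, 0.02])
print("Network's best guess:", np.argmax(output))   # index 3, i.e. the digit 3
print("Runner-up:", np.argsort(output)[-2])          # index 5 - that 3 looked a bit 5-ish
```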

So far so good. But we need to address what happens between the input and output layers. Appropriately enough, the layers in between are referred to as “hidden layers”. It’s tempting to confabulate what successive layers might do by considering how a human would try to identify, or describe, a symbol. Edge detection is an amazingly useful tool, so we might imagine somewhere in our network a layer that detects edges, be they circular, curved or straight. We’re also quite fond of shapes (which are particular configurations of edges), and in numerals we find a few species: straight lines, curves and loops. So we might imagine another layer that looks for these features.

Note that this isn’t really anything like what actually happens in any given digit-recognising neural network, but it’s still useful to think in these terms. In real-world examples, it’s hard to make any claims about which qualia the layers in a neural network home in on, and this does raise some philosophical questions. But thanks to how they learn, it doesn’t really matter.

Algebraic weapons of choice

Those of a linear-algebraic bent will note that each layer of nodes can be represented by something like a matrix, with the output of the previous layer being represented as something like a vector. That makes these things straightforward to code. It’s not quite the linear algebra taught to first-year undergrads, since each node’s output needs to be scaled and normalised. The weapons of choice here are known in mathematical circles as ReLU (Rectified Linear Unit) operators, which have a lot in common with how a real neuron works: its output is zero unless some activation threshold is crossed, and then it increases linearly.
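Stripped to the bone, one layer of such a network is just a matrix multiplication followed by that ReLU squashing step. A tiny sketch with made-up weights:

```python
import numpy as np

def relu(z):
    # Zero below the threshold, then increasing linearly
    return np.maximum(0.0, z)

x = np.array([0.2, 0.8, 0.5])              # outputs from the previous layer
W = np.array([[ 0.4, -0.6,  0.1],          # one row of weights per neuron in this layer
              [-0.3,  0.9,  0.7]])
b = np.array([0.05, -0.1])                 # each neuron's bias

layer_output = relu(W @ x + b)             # this layer's two outputs
print(layer_output)
```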

While we’re discussing jargon we may as well point out that there exist generalisations of matrices called tensors, and this is where the popular TensorFlow package gets its name. The grown-up word for a function that slides a small filter across image data is a convolution, and networks built from layers of these are known, in the grown-up world, as convolutional neural networks – close cousins of the simpler, fully connected network we’ve just handwaved our way through.

The MNIST data is labelled, so the numerical value of each image can easily be checked. Thus, it’s easy to determine whether our network is performing well or not: just look at how different the output layer is from the pattern representing the inputted digit. This gives us a cost function, and by minimising this function over all our training data, we make our network as useful as it can be. Unfortunately, minimising this function is, in general, obscenely hard. Finding local minima of functions of one or two variables (graphs and surfaces respectively) is pretty elementary calculus. But the smallest workable digit-recognising neural network has tens of thousands of variables (the coefficients and biases from the matrices representing each layer), so training it involves finding the minimum of a surface in tens of thousands of dimensions. It’s hard to figure out what that even looks like, let alone how to solve it.
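Before worrying about the minimisation, here’s a toy illustration of the cost itself for a single labelled example (the numbers are made up, and real tools usually prefer a cross-entropy cost, but the idea is the same):

```python
import numpy as np

# The 'right answer' pattern for an image labelled as a 7...
target = np.zeros(10)
target[7] = 1.0

# ...and what our half-trained network actually output for it
output = np.array([0.05, 0.02, 0.03, 0.10, 0.05, 0.05, 0.05, 0.50, 0.10, 0.05])

# The cost measures how far apart the two are; training tries to make
# the average of this over the whole training set as small as possible
cost = np.sum((output - target) ** 2)
print(f"Cost for this example: {cost:.3f}")
```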

Fortunately, there are numerical iterative techniques (specifically, the method of stochastic gradient descent) for doing this efficiently. Each iteration gives us hints about how the previous layer should be tweaked, so we can run this recursively back through the network (this is known as back propagation), then retest with all the training data. The value of our cost function should now be smaller, but the journey isn’t over – in general, this whole process will need to be repeated many times before the cost stops falling meaningfully.
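Here’s the ‘nudge it downhill and repeat’ idea in one dimension, where we can actually picture what’s going on (a toy, not real backpropagation, but the same principle scaled right down):

```python
def cost(w):
    return (w - 3.0) ** 2        # a simple cost whose minimum sits at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)       # slope of the cost at w

w = 10.0                         # arbitrary starting configuration
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # step downhill, a little at a time

print(f"w = {w:.4f}, cost = {cost(w):.6f}")   # w has crept towards 3
```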

This computation-heavy requirement put neural networks out of fashion for many years. Today, you can do digit recognition on a home computer in a couple of hours (try out the Keras example on the LXFDVD), although it relies on some tricksy optimisations (minibatching of the training data). Once, this would have been unthinkable, and many thought it would never be possible. The method of back propagation has also fallen in and out of fashion over the decades (it’s currently in). For some time there was concern when people realised that the back propagation method, at least as we’ve described it, isn’t possible with actual brain neurons. But that doesn’t stop it being useful.
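Putting all of the above together, a minimal training run in the spirit of that Keras example might look like this (a sketch, not a copy of the LXFDVD code; the layer sizes, optimiser and epoch count are all illustrative, and the model definition is repeated so the snippet stands alone):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load the labelled MNIST data and flatten each 28x28 image into 784 values
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# The layered network sketched earlier: 784 inputs, one hidden layer, 10 outputs
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# A cross-entropy cost, minimised by stochastic gradient descent on minibatches
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=5)

# The real test: data the network has never seen before
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Accuracy on unseen digits: {accuracy:.3f}")
```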

The car’s the autonomous star

State-of-the-art ML techniques today require huge amounts of computation in the training phase. But once the model is trained, running new data through it is much less taxing. For example, following machine-learning buzzwords will quickly get you to the field of autonomous vehicles. But driving a car relies on making potentially life or death decisions at a moment’s notice. It’s not the time to start running training phases with messy collateral. We need to invest those thousands of hours, and gather those thousands of terabytes, training systems in a controlled environment first.
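That split between expensive training and cheap inference shows up directly in the tools: a trained model can be saved once and then loaded wherever it’s needed. A skeletal sketch of the mechanics (using an untrained stand-in model and a random ‘image’, purely to show the save-and-reload step):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for a network that has already been through the slow training phase
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy")
model.save("digits.h5")                  # the expensive part happens once, elsewhere...

reloaded = keras.models.load_model("digits.h5")        # ...then inference is quick anywhere
fake_image = np.random.rand(1, 784).astype("float32")  # stand-in for a real scanned digit
print("Best guess:", np.argmax(reloaded.predict(fake_image)))
```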

As well as being swift, the models are also portable, so you can take Google’s Inception image-classifying model for TensorFlow, or Snips’ voice recognition (see http://bit.ly/snips-weather), and run it on a Raspberry Pi (possibly one at the heart of a robot). You can even run simple training examples on it, but this takes time. Getting TensorFlow ported to the Pi (2 and above) was a phenomenal achievement: it uses the heavyweight Bazel build system, so getting that ported was the first step. Then any calls to 64-bit libraries needed to be stripped from the TensorFlow code, and various dependencies needed some Band-Aids applied. Even on the Pi 3, compilation takes many hours – and this work was done in 2016, before the Pi 3 was released. So a doff of the cap to Sam Abrahams for his patience.

Unless you have specific requirements, there’s no need to repeat his efforts. The Python TensorFlow module is available via pip, so it can be installed without batting an eyelid.
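Something like the following is all it takes to confirm the module is present and working (the exact version number will, of course, vary):

```python
# After a quick 'pip install tensorflow' (pip3 on the Pi), check it imports cleanly
import tensorflow as tf

print("TensorFlow", tf.__version__, "is ready to go")
```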


Finding the minima of 1- or 2D functions is simple. It’s a little harder in several hundred thousand dimensions.

A machine learning algorithm would probably mock you for not selling your BTC back in December, but what does it know?
