Python machine learning
Mihalis Tsoukalos reveals his tried-and-tested approach for processing reams of data using machine-learning techniques and Python 3.
These days, machine learning (ML) is not only important as a research area of computer science, but it’s also started playing a key role in our everyday lives. As if this wasn’t enough, an exponential growth in the usage of ML is expected in the next few years. The biggest advantage of ML is that it offers new and unique ways of thinking about the problems we want to study and solve.
ML is all about extracting knowledge from your data using a computer – in this case, all the examples in this tutorial are written in Python 3. However, the topic is huge and can’t be covered in a single article, so the main purpose of this tutorial is to get you started with the various Python 3 libraries and their functions, show you what each library supports, and give you a good reason to explore some or all of them.
So, without further ado, let’s get started with the next section, which is a quick introduction to ML.
Machine learning in a nutshell
ML can help you discover hidden patterns and information that would be difficult to recognise otherwise. Statistics are more or less the foundation of ML in many ways, so it would be helpful if you’re familiar with some basic statistical definitions such as mean value, median, standard deviation, percentile and outlier. The three main areas of ML are Supervised Learning, Unsupervised Learning and Reinforcement Learning.
Supervised Learning is about identifying the relationships between a set of labelled variables – variables whose meaning we already know – and a target variable, and includes areas such as Classification and Regression. The techniques of Unsupervised Learning attempt to find hidden patterns in the data without knowing anything about it in advance, including its type. Clustering is the most popular category of Unsupervised Learning.
Reinforcement Learning enables you to learn the behaviour of a system based on the feedback you obtain, using techniques such as Markov decision processes and Monte Carlo methods.
Now it’s time for some definitions. An Artificial Neural Network models the relationships of the input signal set and the output signal set in a way that’s inspired by a brain. Put simply, an Artificial Neural Network uses interconnecting nodes to solve problems such as signal processing or pattern recognition using ML. Deep Learning is the subfield of ML that deals with very large Artificial Neural Networks.
A Generalised Linear Model is a statistical method that, in simplified terms, uses linear regression models for predicting the behaviour of the data. A clustering technique attempts to group the data in sets, in such a way that objects of the same group are similar in some sense. This mainly depends on the type of data you have to process. Finally, a Classification technique – which is an example of pattern recognition – uses training data to establish some categories and then puts new observations into these categories.
Curved examples
The first example of this tutorial will be relatively simple: we’ll attempt to find a mathematical function that best fits the data points of the input. This is called curve fitting and is one of the simplest kinds of ML. It’s closely related to mathematics.
The Python 3 code of simple.py is the following:

#!/usr/bin/env python3
import numpy as np
import warnings
warnings.simplefilter('ignore', np.RankWarning)
points = np.array([(2, 4), (3, 1), (9, 3), (5, 1)])
x = points[:,0]
y = points[:,1]
z = np.polyfit(x, y, 4)
f = np.poly1d(z)
print(f)

If you run simple.py, you’ll obtain this output:

$ ./simple.py
            4           3         2
-0.01075 x + 0.07323 x + 1.009 x - 8.739 x + 17.03
Because we’re trying to fit our data using a fourth-degree polynomial, simple.py prints the polynomial with the calculated coefficients. The calculated curve can help you predict how the small data sample behaves and what to expect from it in the future.
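The poly1d object returned by simple.py is callable, so once you have fitted the curve you can evaluate it at new x values to make predictions. The following short sketch repeats the fit from simple.py and then queries the curve (the specific prediction points are chosen here just for illustration):

```python
import warnings
import numpy as np

warnings.simplefilter('ignore', np.RankWarning)

points = np.array([(2, 4), (3, 1), (9, 3), (5, 1)])
x, y = points[:, 0], points[:, 1]
f = np.poly1d(np.polyfit(x, y, 4))  # same fit as in simple.py

# Evaluate the fitted curve at a new point and at several points at once
print(f(4))
print(f([6, 7, 8]))
```

Calling f with an array gives vectorised predictions, which is handy when you want to plot the fitted curve over a range of x values.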
As you’ll see later, the general structure of an ML Python program is as follows. First, you load the desired Python libraries and import your data before you start processing it. After that, you begin doing calculations to carry out your data processing and training phases. This leads to a predictive model for your data, which is often presented as a graphical image. Sometimes you might need to try many different algorithms until you end up with a predictive model that best describes your data. Indeed, choosing the right algorithm and method is the most difficult part of ML.
Classification
The scikit-learn module is a machine learning library that offers many ML techniques. It includes classification, which attempts to discover the category that each data element belongs to.
The simplest way to install scikit-learn is with the help of pip3, by executing the sudo pip3 install scikit-learn command.

The use of scikit-learn is illustrated in the classify.py Python 3 script, which can be seen in the screenshot (left). The script uses a data set that comes with the sklearn module for reasons of simplicity, and calculates many properties of the data set.

The script accepts one command line argument: the percentage of the data set that will be used for testing – the remaining data is automatically used for training. That command line argument should be a float between 0.02 and 0.98.
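The classify.py script itself isn’t reproduced here, but a minimal sketch of the same workflow might look like the following. It’s a hypothetical reconstruction that assumes the built-in iris data set and a k-nearest-neighbours classifier; the real script may use a different data set or algorithm:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# The real classify.py reads this from the command line (0.02 to 0.98)
test_size = 0.5

iris = load_iris()
print('Test Size:', test_size)
print('Labels:', np.unique(iris.target))

# Split the data into training and testing portions
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=test_size, random_state=0)

# Train on the training portion, then predict the held-out test portion
clf = KNeighborsClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Misclassified samples:', (y_test != y_pred).sum())
acc = accuracy_score(y_test, y_pred)
print('Accuracy:', acc)
```

The key idea is the train/test split: the model never sees the test rows during training, so the accuracy figure is an honest estimate of how it handles new data.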
Executing classify.py will create the following type of output:

$ ./classify.py 0.90
Test Size: 0.9
Labels: [0 1 2]
Misclassified samples: 34
Accuracy: 0.75
Accuracy: 0.75
$ ./classify.py 0.5
Test Size: 0.5
Labels: [0 1 2]
Misclassified samples: 2
Accuracy: 0.97
Accuracy: 0.97
These results show that using half the data for testing and half for training gives the best accuracy results. Generally speaking, it’s always good to be able to check the accuracy of your models!
Notice that the most difficult part of the process is choosing a suitable algorithm for the given problem – this kind of expertise comes from experience and learning from your mistakes. However, the classifyPlot.py script presents one possible solution to that problem: processing your data using multiple algorithms and techniques. Most of the Python 3 code in classifyPlot.py is about plotting the results, because the train_test_split() function of the scikit-learn library does most of the real job. For the plotting part, the famous and powerful matplotlib library is used.
What’s interesting is the graphical image generated by classifyPlot.py ( above). At a glance you can understand what’s going on with your data as well as the results of various classification algorithms. This is how useful ML can be when used correctly.
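The core loop of a multiple-algorithm comparison, in the spirit of classifyPlot.py but without the plotting, can be sketched as follows. The choice of these three classifiers is illustrative; the real script may compare a different set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit several classifiers on the same split and score each on the test set
scores = {}
for name, clf in [('k-NN', KNeighborsClassifier()),
                  ('Decision tree', DecisionTreeClassifier(random_state=0)),
                  ('SVM', SVC())]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
    print(name, scores[name])
```

Because every classifier in scikit-learn exposes the same fit()/score() interface, swapping algorithms in and out really is this cheap, which is what makes the try-many-algorithms approach practical.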
Clustering
Like people hovering around the discount shelf in your local supermarket, clustering is a way of grouping similar objects into sets. This section will use the K-means clustering algorithm. This is a popular unsupervised ML algorithm that divides a set of data into the desired number of clusters, which is given as a command line argument to the program.

The most important and interesting Python 3 code of clustering.py is the following:

kmeans = KMeans(n_clusters=CLUSTERS)
kmeans.fit(DATA)
The preceding two statements apply the K-means algorithm to the data to create the desired number of clusters, as specified by the CLUSTERS variable. After obtaining the data the way you want, you can apply the K-means clustering algorithm with just two Python 3 statements, which is certainly impressive! The clustering.py script reads its data from an external text file, which makes it much more versatile. The only thing to remember is that the text file must contain two numbers on each line, separated by a comma.
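Putting those two statements into a complete, runnable context looks like this. The data points here are made up to stand in for the contents of data.txt (the real script would build DATA by reading the file, for example with np.loadtxt(filename, delimiter=',')):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data standing in for data.txt: one x,y pair per line
DATA = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                 [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
CLUSTERS = 2  # in clustering.py this comes from the command line

kmeans = KMeans(n_clusters=CLUSTERS, n_init=10)
kmeans.fit(DATA)

# One cluster label per input point, plus the centre of each cluster
print('Labels:', kmeans.labels_)
print('Centres:', kmeans.cluster_centers_)
```

The labels_ array has one entry per input row, which is exactly the output format that clustering.py prints.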
Executing clustering.py will generate the following kind of text output, as well as an image file:

$ ./clustering.py data.txt 2
Labels: [1 1 1 1 1 0 1 1 0 0 0 1 1 0 0]
$ ./clustering.py data.txt 3
Labels: [2 2 2 2 2 1 0 0 0 0 0 0 2 1 1]
$ ./clustering.py data.txt 10
Labels: [6 6 8 8 0 1 5 5 2 7 9 4 8 3 3]
Each number in the output signifies the cluster that the data point at that position belongs to. The graphs (below) show two image files generated by clustering.py when processing a text file with 15 points. Note that if you don’t have lots of data, you shouldn’t use a large number of clusters.
Say hello to TensorFlow
“Ooh I’ve heard of that!” you cry. “TensorFlow is an open-source software library for Machine Intelligence”. That’s correct, and so the first thing you should do is install the tensorflow Python 3 package. The simplest method of installing it is with the pip3 Python 3 package manager, by executing sudo pip3 install tensorflow. Notice that pip3 will most likely install many more packages than are strictly required by the tensorflow Python 3 package.
You can find the version of TensorFlow you’re using, which is updated frequently, as follows:

$ python3 -c 'import tensorflow as tf; print(tf.__version__)'
1.5.0
Notice that in scikit-learn, you first declare an object of the desired algorithm, then train your model, before obtaining predictions using the test set. With TensorFlow, things are a little different. You first define a computational graph, constructed by combining some of the mathematical operations that TensorFlow supports, and then you initialise some variables. You use placeholders to feed your data into the graph through those variables. After that, you create a session and pass it the graph you created earlier, which triggers its execution. Finally, you close the session.
Not so tensor
So, it’s now time to present a naive Python 3 example that uses the tensorflow package. Below is a small part of the tFlow.py Python 3 script that uses TensorFlow:

a = tf.placeholder(tf.int16)
b = tf.placeholder(tf.int16)
addition = tf.add(a, b)
mul = tf.multiply(a, b)
init = tf.global_variables_initializer()
The preceding code declares two placeholders called a and b for two int16 variables, and declares two more variables named addition and mul, for adding and multiplying two numbers, respectively.
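Nothing is computed until the graph is run inside a session, as described earlier. A sketch of that missing step might look like the following; note that the session API belongs to TensorFlow 1.x, so on TensorFlow 2.x you have to go through the compat.v1 layer as shown:

```python
# On TensorFlow 1.x you can simply `import tensorflow as tf` instead
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()  # restore the 1.x graph/session behaviour

a = tf.placeholder(tf.int16)
b = tf.placeholder(tf.int16)
addition = tf.add(a, b)
mul = tf.multiply(a, b)

# Running the graph: feed concrete values into the placeholders
with tf.Session() as sess:
    total = sess.run(addition, feed_dict={a: 20, b: 22})
    product = sess.run(mul, feed_dict={a: 6, b: 7})
    print('Addition:', total)
    print('Multiplication:', product)
```

The feed_dict argument is how data enters the graph: each placeholder is bound to a concrete value for that particular sess.run() call.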
Executing tFlow.py will generate the kind of output that can be seen in the screenshot (above right) – the presented script simply adds integer numbers.
Go large with Theano
Although graphical images are impressive, there are times when you have to perform mathematical computations. Theano is a powerful library for working with mathematics, which can be handy when you have to deal with multi-dimensional arrays that contain large or huge amounts of data. This section will briefly illustrate the use of Theano. As you might expect, the first step is installing Theano with a command such as sudo pip3 install Theano.

The Python 3 code of this section, which will be saved in theanoUse.py, is as follows:

#!/usr/bin/env python3
import numpy
import theano.tensor as T
from theano import function
x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
f = function([x, y], z)
i = f([[-1, 1], [-2, 2]], [[8, -8], [12, -12]])
print(i)
print(type(i))

After creating two matrices called x and y, the next step is to create a function called f for adding them. Please notice that Theano requires all of its symbols to be typed. If you execute theanoUse.py, you’ll obtain the following kind of output:

$ ./theanoUse.py
[[  7.  -7.]
 [ 10. -10.]]
<class 'numpy.ndarray'>

The last line confirms that the i variable is a NumPy array.

Now to dive into the Keras library – of course, you should first install Keras using sudo pip3 install keras or a similar method. The name of the Python 3 script that showcases Keras is kerasUse.py. If you execute kerasUse.py, you’ll obtain the output shown in the screenshot.
As you can see from the output of kerasUse.py, the script uses TensorFlow and the MNIST database. You can find more about the MNIST database at https://en.wikipedia.org/wiki/MNIST_database.
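The kerasUse.py script isn’t reproduced here, but the basic Keras workflow it follows – define a model, compile it, train it, predict – can be sketched as below. This is a hypothetical miniature that uses random data in place of MNIST, so it runs instantly without downloading anything:

```python
import numpy as np
from tensorflow import keras

# Hypothetical stand-in for MNIST: 100 random "images" flattened to 64
# values each, with random labels from 10 classes
X = np.random.rand(100, 64).astype('float32')
y = np.random.randint(0, 10, size=100)

# A tiny fully-connected network with a 10-way softmax output
model = keras.Sequential([
    keras.layers.Input(shape=(64,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# One quick training pass, then predict class probabilities for one sample
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:1], verbose=0).shape)
```

With real MNIST data you would load it via keras.datasets.mnist.load_data(), scale the pixel values, and train for more epochs; the surrounding structure stays the same.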
All the generated images were manually combined into a single image, which can be seen in the image (right). Please notice that in computer vision it’s important to check the data by plotting it first, to avoid silly mistakes later.
What to do next…
This article is simply an introduction to machine learning with Python 3. Should you wish to learn more about ML and statistics then we recommend that you visit the web site of Springer at www.springer.com and Packt Publishing at www.packtpub.com. These publishers have plenty of books on the subject that are well worth a read.
Additionally, you can learn more about TensorFlow, which is a Google open source project, at www.tensorflow.org and at https://github.com/tensorflow, and about scikit-learn at http://scikit-learn.org. The presented code also shows that standard Python 3 libraries such as Matplotlib, NumPy and SciPy are extensively used by the ML Python 3 libraries, so it would also be a good idea to study them.
However, what really matters is how much you experiment and how many different approaches you apply on your data, to discover the information you want or be surprised by the results you get! ML can change your life and your business, so start using it!