Linux Format

Python machine learning

Mihalis Tsoukalos reveals his tried-and-tested approach for processing reams of data using machine-learning techniques and Python 3.

- Mihalis Tsoukalos is a UNIX administrator, a programmer, a DBA and a mathematician. You can reach him at www.mtsoukalos.eu.


These days, machine learning (ML) is not only important as a research area of computer science, but it’s also started playing a key role in our everyday lives. As if this wasn’t enough, exponential growth in the usage of ML is expected in the next few years. The biggest advantage of ML is that it offers new and unique ways of thinking about the problems we want to study and solve.

ML is all about extracting knowledge from your data using a computer – in this case, all the examples in this tutorial are going to be written in Python 3. However, the topic is huge and can’t be covered in a single article, so the main purpose of this tutorial is to get you started with the various Python 3 ML libraries and their functions, show what each library supports, and give you a good reason to explore some or all of them.

So, without further ado, let’s get started with the next section, which is a quick introduction to ML.

Machine learning in a nutshell

ML can help you discover hidden patterns and information that would be difficult to recognise otherwise. Statistics are more or less the foundation of ML in many ways, so it would be helpful if you’re familiar with some basic statistical definitions such as mean value, median, standard deviation, percentile and outlier. The three main areas of ML are Supervised Learning, Unsupervised Learning and Reinforcement Learning.
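If you want to refresh those definitions, NumPy covers them all in one-liners. Here’s a quick illustration using a made-up sample; the 1.5×IQR rule used below is one common convention for flagging outliers, not the only one:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 5, 6, 7, 8, 100])  # 100 is an obvious outlier

print(np.mean(data))            # mean value: 13.8
print(np.median(data))          # median: 4.5
print(np.std(data))             # standard deviation
print(np.percentile(data, 90))  # 90th percentile

# A common outlier rule of thumb: points more than 1.5 IQRs beyond the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)                 # only the value 100 is flagged
```

Note how the single extreme value drags the mean far above the median – exactly the kind of thing these statistics help you spot before training a model.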

Supervised Learning is about identifying the relationships between labelled variables (variables whose meaning we already know) and a target variable, and includes areas such as Classification and Regression. The techniques of Unsupervised Learning attempt to find hidden patterns in the data without knowing anything about it in advance, including its structure. Clustering is the most popular category of Unsupervised Learning.

Reinforcement Learning enables you to learn the behaviour of a system based on the feedback you obtain, using techniques such as Markov decision processes and Monte Carlo methods.

Now it’s time for some definitions. An Artificial Neural Network models the relationships of the input signal set and the output signal set in a way that’s inspired by a brain. Put simply, an Artificial Neural Network uses interconnecting nodes to solve problems such as signal processing or pattern recognition using ML. Deep Learning is the subfield of ML that deals with very large Artificial Neural Networks.

A Generalised Linear Model is a statistical method that, in simplified terms, uses linear regression models for predicting the behaviour of the data. A clustering technique attempts to group the data into sets, in such a way that objects of the same group are similar in some sense. This mainly depends on the type of data you have to process. Finally, a Classification technique – which is an example of pattern recognition – uses training data to establish some categories and then puts new observations into these categories.

Curved examples

The first example of this tutorial will be relatively simple: we’ll attempt to find a mathematical function that best fits the data points of the input. This is called curve fitting and is one of the simplest kinds of ML. It’s closely related to mathematics.

The Python 3 code of simple.py is the following:

#!/usr/bin/env python3
import numpy as np
import warnings

warnings.simplefilter('ignore', np.RankWarning)

points = np.array([(2, 4), (3, 1), (9, 3), (5, 1)])
x = points[:,0]
y = points[:,1]
z = np.polyfit(x, y, 4)
f = np.poly1d(z)
print(f)

If you run simple.py, you’ll obtain this output:

$ ./simple.py
          4           3         2
-0.01075 x + 0.07323 x + 1.009 x - 8.739 x + 17.03

Because we’re trying to fit our data using a fourth-degree polynomial, simple.py prints the polynomial with the calculated coefficients. The calculated curve can help you predict how the small data sample behaves and what to expect from it in the future.
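Once np.poly1d has built the polynomial object f, it can be called like an ordinary function, so you can evaluate the fitted curve at points that weren’t in the original sample:

```python
import numpy as np
import warnings

warnings.simplefilter('ignore', np.RankWarning)

points = np.array([(2, 4), (3, 1), (9, 3), (5, 1)])
x, y = points[:, 0], points[:, 1]
f = np.poly1d(np.polyfit(x, y, 4))

# The fitted curve passes (almost) exactly through the original points...
print(f(2))   # close to 4, the y value of the first input point
# ...and can be evaluated between and beyond them to make predictions
print(f(4))
print(f(10))
```

Be careful with values far outside the original x range, though: high-degree polynomials can swing wildly once you extrapolate beyond the data.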

As you’ll see later, the general structure of an ML Python program is as follows. First you load the desired Python libraries and import your data before you start processing it. After that, you begin doing calculations for your data processing and training phases. This will lead to a predictive model for your data, which is often presented as a graphical image. Sometimes, you might need to try many different algorithms until you end up with a predictive model that best describes your data. Indeed, choosing the right algorithm and method is the most difficult part of ML.

Classification

The scikit-learn module is a machine learning library that offers many ML techniques. It includes classification, which attempts to discover the category that each data element belongs to.

The simplest way to install scikit-learn is with the help of pip3, which requires the execution of the sudo pip3 install scikit-learn command.

The use of scikit-learn will be illustrated in the classify.py Python 3 script, which can be seen in the screenshot (left). The script uses a data set that comes with the sklearn module for reasons of simplicity and calculates many properties of the data set.

The script accepts one command line argument, which is the percentage of the data set that will be used for testing – the remaining data will be automatically used for training. That command line argument should be a float between 0.02 and 0.98.

Executing classify.py will create the following type of output:

$ ./classify.py 0.90
Test Size: 0.9
Labels: [0 1 2]
Misclassified samples: 34
Accuracy: 0.75
Accuracy: 0.75
$ ./classify.py 0.5
Test Size: 0.5
Labels: [0 1 2]
Misclassified samples: 2
Accuracy: 0.97
Accuracy: 0.97

These results show that, of the two runs, using half the data for testing and half for training gives by far the better accuracy. Generally speaking, it’s always good to be able to check the accuracy of your models!

Notice that the most difficult part of the process is choosing a suitable algorithm for the given problem – this kind of expertise comes from experience and learning from your mistakes. However, the classifyPlot.py script presents one possible solution to that problem, which is processing your data using multiple algorithms and techniques. Most of the Python 3 code in classifyPlot.py is about plotting the results, because the train_test_split() function of the scikit-learn library does most of the real job. For the plotting part, the famous and powerful matplotlib library is used.

What’s interesting is the graphical image generated by classifyPlot.py (above). At a glance you can understand what’s going on with your data as well as the results of various classification algorithms. This is how useful ML can be when used correctly.

Clustering

Like people hovering around the discount shelf in your local supermarket, clustering is a way of grouping similar objects into sets. This section will use the K-means clustering algorithm. This is a popular unsupervised ML algorithm that’s able to divide a set of data into the desired number of clusters, which is given as a command line argument to the program.

The most important and interesting Python 3 code of clustering.py is the following:

kmeans = KMeans(n_clusters=CLUSTERS)
kmeans.fit(DATA)

The preceding two statements apply the K-means algorithm to the data to create the desired number of clusters as specified by the CLUSTERS variable. After obtaining the data the way you want, you can apply the K-means clustering algorithm with just two Python 3 statements, which is certainly impressive! clustering.py reads the data from an external text file, which makes the script much more versatile. The only thing to remember is that the text file must contain two numbers in each line, separated by a comma.

Executing clustering.py will generate the following kind of text output as well as an image file:

$ ./clustering.py data.txt 2
Labels: [1 1 1 1 1 0 1 1 0 0 0 1 1 0 0]
$ ./clustering.py data.txt 3
Labels: [2 2 2 2 2 1 0 0 0 0 0 0 2 1 1]
$ ./clustering.py data.txt 10
Labels: [6 6 8 8 0 1 5 5 2 7 9 4 8 3 3]

The numbers in the output signify the cluster that each data point belongs to. The graphs (below) show two image files generated by clustering.py when processing a text file with 15 points. Note that if you don’t have lots of data, you shouldn’t use a large number of clusters.

Say hello to TensorFlow

“Ooh I’ve heard of that!” you cry. “TensorFlow is an open-source software library for Machine Intelligence”. That’s correct, and so the first thing you should do is to install the tensorflow Python 3 package. The simplest method of installing it is by using the pip3 Python 3 package manager and executing sudo pip3 install tensorflow. Notice that pip3 will most likely install many more packages than are required by the tensorflow Python 3 package.

You can find the version of TensorFlow you’re using, which is updated frequently, as follows:

$ python3 -c 'import tensorflow as tf; print(tf.__version__)'
1.5.0

Notice that in scikit-learn, you first declare an object of the desired algorithm and you then train your model before obtaining predictions using the test set. However, with TensorFlow things are a little different. You first define a computational graph, constructed by combining some of the mathematical operations that TensorFlow supports, and then you initialise some variables. You use placeholders to feed your data into the graph. After that, you create a session and pass the graph you created earlier to the session, which triggers the execution of the session. Finally, you close the session.
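If this deferred style of execution feels unusual, the following plain-Python analogy (no TensorFlow required, and purely illustrative) mimics it: a ‘graph’ of operations is built first, and nothing is computed until concrete values are fed in, just as with placeholders and a session:

```python
# Rough analogy only: each 'node' is a function awaiting a feed dictionary,
# mirroring TensorFlow's placeholder/session flow described above.
def placeholder(name):
    return lambda feed: feed[name]

def add(u, v):
    return lambda feed: u(feed) + v(feed)

def multiply(u, v):
    return lambda feed: u(feed) * v(feed)

# Building the graph computes nothing yet...
a = placeholder('a')
b = placeholder('b')
addition = add(a, b)
mul = multiply(a, b)

# ...'running the session' means supplying actual values for the placeholders
print(addition({'a': 2, 'b': 3}))  # 5
print(mul({'a': 2, 'b': 3}))       # 6
```

The key idea carries over: defining operations and executing them are two separate phases.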

Not so tensor

So, it’s now time to present a naive Python 3 example that uses the tensorflow package. Below is a small part of the tFlow.py Python 3 script, which uses TensorFlow to build a simple computational graph:

a = tf.placeholder(tf.int16)
b = tf.placeholder(tf.int16)
addition = tf.add(a, b)
mul = tf.multiply(a, b)
init = tf.global_variables_initializer()

The preceding code declares two placeholders called a and b for two int16 values, and declares two more variables named addition and mul, for adding and multiplying two numbers, respectively.

Executing tFlow.py will generate the kind of output that can be seen in the screenshot (above right) – the presented script simply adds and multiplies integer numbers.

Go large with Theano

Although graphical images are impressive, there are times when you have to perform mathematical computations. Theano is a powerful library for working with mathematics, which can be handy when you have to deal with multi-dimensional arrays that contain large or huge amounts of data. This section will briefly illustrate the use of Theano. As you might expect, the first step is installing Theano with a command such as sudo pip3 install Theano.

The Python 3 code of this section, which will be saved in theanoUse.py, is as follows:

#!/usr/bin/env python3
import numpy
import theano.tensor as T
from theano import function

x = T.dmatrix('x')
y = T.dmatrix('y')
z = x + y
f = function([x, y], z)

i = f([[-1, 1], [-2, 2]], [[8, -8], [12, -12]])
print(i)
print(type(i))

After creating two matrices called x and y, the next step is to create a function called f for adding them. Please notice that Theano requires that all of its symbols are typed. If you execute theanoUse.py, you’ll obtain the following kind of output:

$ ./theanoUse.py
[[  7.  -7.]
 [ 10. -10.]]
<class 'numpy.ndarray'>

The last line gives you the confirmation that the i variable is a NumPy array.

Now to dive into the Keras library – of course, you should first install Keras using sudo pip3 install keras or a similar method. The name of the Python 3 script that showcases Keras is kerasUse.py. If you execute kerasUse.py, you’ll obtain the output shown in the screenshot (right).

As you can see from the output of kerasUse.py, the script uses TensorFlow and the MNIST database. You can find more about the MNIST database at https://en.wikipedia.org/wiki/MNIST_database.

All the generated images were manually combined into a single image that can be seen (right). Please notice that in computer vision it’s important to check the data by plotting it first, to avoid silly mistakes later.

What to do next…

This article is simply an introduction to machine learning with Python 3. Should you wish to learn more about ML and statistics then we recommend that you visit the websites of Springer at www.springer.com and Packt Publishing at www.packtpub.com. These publishers have plenty of books on the subject that are well worth a read.

Additionally, you can learn more about TensorFlow, which is a Google open source project, at www.tensorflow.org and at https://github.com/tensorflow, and about scikit-learn at http://scikit-learn.org. The presented code also shows that standard Python 3 libraries such as Matplotlib, NumPy and SciPy are used extensively by the ML Python 3 libraries, so it would also be a good idea to study them.

However, what really matters is how much you experiment and how many different approaches you apply to your data, to discover the information you want or be surprised by the results you get! ML can change your life and your business, so start using it!

This was generated by the classifyPlot.py Python 3 script and shows how the data was classified using the scikit-learn module.
This shows the Python 3 code of classify.py that illustrates the use of the scikit-learn module and the iris data set for classification.
Here’s the output of the kerasUse.py script when used without any command line arguments. The example shown uses the MNIST dataset as well as deep learning and computer vision for the purposes of character recognition.
This screenshot shows the output of the tFlow.py Python 3 script that uses the TensorFlow Python 3 library to add and multiply integer numbers. Despite its simplicity, tFlow.py illustrates the complete flow of a TensorFlow program.
