APC Australia

Machine Learning 101: Decision trees and forests

They’re the workhorses of classification in machine learning. Darren Yates explains how decision trees and forests make predictions and discover knowledge.


Machine learning might seem like one of those tech fields that has its hype knob wound up to 11 at the moment, but that’s only because there’s seemingly no end to what it’s being applied to. Everything from property investment to saving indigenous languages is on the receiving end of this melting pot of computer science, maths and statistics we call ‘machine learning’. Last time, we introduced the concept of classification, how machine learning can find hidden patterns within data and predict the class of future ‘unseen’ instances. We introduced the learning method called ‘decision trees’ and how they are a classic form of classification. This month, we delve deeper and look not just at how a decision tree works, but also at its popular offshoot called ‘ensemble learning’.

QUICK RECAP

Classification is a form of machine learning that analyses a training dataset, which often is just a spreadsheet, where each row is a separate but similar event or ‘instance’ and each column represents a feature or ‘attribute’ of that instance. The last attribute is called the ‘class attribute’, for it indicates the category or ‘class’ the instance belongs to. For example, a car dealership may keep track of customers who want to test drive a new car. Each customer is an instance/row in the spreadsheet and the columns represent attributes of the customers — for example, their address (just suburb or postcode), their salary range (under $50,000, $50k to $75k and so on), their age and sex; finally, the class attribute would be whether or not they purchased a car (yes or no). Using classification in general and a decision tree algorithm in particular, the aim would be to learn from the data the factors that determine which customers are more likely to buy a car after a test drive.
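To make that shape concrete, here’s a purely illustrative sketch in Python. The postcodes, salary bands and outcomes below are invented for the example, not taken from any real dealership data.

# Purely illustrative: a tiny made-up training dataset in the shape described
# above. Each dict is one instance (row); 'purchased' is the class attribute.
training_data = [
    {"postcode": "3000", "salary": "under $50k", "age": 34, "sex": "F", "purchased": "yes"},
    {"postcode": "2150", "salary": "$50k-$75k",  "age": 52, "sex": "M", "purchased": "no"},
    {"postcode": "4000", "salary": "over $75k",  "age": 41, "sex": "F", "purchased": "yes"},
]
# A classifier learns a model from rows like these, then predicts the
# 'purchased' value for new customers it hasn't seen before.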

HELLO, DATA

It’s an unwritten law that the first thing you must code in any new language is to print “hello world” to the console or output screen. If machine learning has an equivalent, it’s hacking around with the ‘Iris’ dataset. This dataset dates back to the mid-1930s and has been popular ever since. It contains 150 samples or instances of Iris flowers, broken into three groups of 50 representing three species: Iris virginica, Iris versicolor and Iris setosa. Each instance contains five attributes — petal length, petal width, sepal length, sepal width and flower type or ‘class’. The petal is the main colourful part of the flower and the sepal the outer husk or wrapping. The four measurement attributes are numeric and the class attribute has one of three values to match the flower type — virginica, versicolor or setosa. What we want to do is see if there is a way to accurately predict the flower type just from the petal and sepal dimensions of observed examples.
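If you’d like to poke at the same data in code rather than in Weka, scikit-learn ships its own copy of the Iris dataset. A minimal sketch, assuming a Python environment with scikit-learn installed:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 instances, four numeric attributes
print(iris.feature_names)    # sepal/petal length and width, in centimetres
print(iris.target_names)     # ['setosa' 'versicolor' 'virginica']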

THE DECISION TREE

Last issue, we talked about classification being a two-step process — first, learning the pattern or ‘model’ from the training dataset, then second, using the model to predict the class attribute value of a new, previously unseen instance.

Grab the latest version of the Weka data-mining suite for your PC platform from www.cs.waikato.ac.nz/~ml/weka/downloading.html. If you don’t have Java, we’d suggest the ‘weka-3-8-2jre-x64.exe’ version for Windows users. Install it, launch the ‘Weka 3.8’ shortcut and when the ‘Weka GUI Chooser’ appears, select ‘Explorer’ from the right-side button menu list.

After the Weka Explorer panel appears, ensure the ‘Preprocess’ tab is selected and press the ‘Open file...’ button. Navigate to the ‘data’ subfolder in your Weka install folder (typically under ‘/Program Files/Weka-3-8’), select ‘iris.arff’ and click the ‘Open’ button.

LEARN THE MODEL

Now click on the Classify tab, select the Choose button and from the drop-down context menu, select the Trees option and choose J48. Ensure the radio button next to ‘Cross-validation’ in the ‘Test options’ group area is selected and the Folds entry reads ‘10’. Press the Start button. As we mentioned previously, J48 is Weka’s version of the C4.5 algorithm and within a millisecond or two, the cross-validated accuracy result of the learned model appears in the ‘Classifier output’ window. You should see the learned model correctly classifies 144 out of the 150 instances, for a 96% accuracy rating. That tells us how good the model is, but not what it actually looks like. In the ‘Result list’ panel on the bottom-left, right-click on the ‘trees.J48’ entry and select ‘Visualize tree’ — what you now see is the actual decision tree based on the learned model.
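For readers who prefer code to the Weka GUI, the same experiment can be approximated in Python with scikit-learn. Note that scikit-learn’s DecisionTreeClassifier implements CART rather than C4.5/J48, so the tree and its accuracy won’t exactly match Weka’s output; treat this as a sketch of the idea, not a reproduction.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import cross_val_score

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation, matching the 'Folds = 10' setting in Weka
scores = cross_val_score(tree, iris.data, iris.target, cv=10)
print(f"cross-validated accuracy: {scores.mean():.1%}")

# Fit on the full dataset and print a text rendering of the learned tree
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))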

HOW THE DECISION TREE WORKS

There are many types of decision tree algorithm — we’re looking at C4.5 here, but others such as CART are also incredibly popular. Nevertheless, the tree structure from each algorithm works in the same way and although most experts like to consider them as a flowchart, it’s just as easy to think of decision trees as a roadmap. If you drive to work, your journey is actually a series of decisions. Each road junction is a decision on which path you take — turn left or right, or go straight ahead. What’s more, each decision leads to a new decision that progresses you further along a specific path to your destination. A decision tree is very much like this — but instead of turning left or right, the ‘directions’ for our decision tree here are set by the instance’s attribute values.

Looking at the decision tree model diagram, it’s made up of three parts — each oval or ‘node’ represents an attribute test, lines dropping down from the node are the possible results from each test and the boxes are the ‘leaf nodes’ that hold the possible class value options we want to obtain.

Imagine we receive a new Iris record or instance that has no class attribute value — in other words, we don’t know what type of Iris flower it is. The record looks like this:

{sepallength=6.5; sepalwidth=3.4; petallength=5.9; petalwidth=1.7; class=?}

Now that we have our decision tree model, we can use it to predict the class value for this record. Looking at the tree, we start at the top with the ‘petal width’ attribute, which in our new record has a value of 1.7. This is greater than 0.6, so we take the right branch. This takes us to the next node, which is again the ‘petal width’ attribute; however, this time the test is whether it’s less than or equal to 1.7, or greater. Since our record value of 1.7 equals the test value, we take the left branch and head to ‘petal length’. Our record value here is 5.9, which is greater than 4.9, so again we take the right branch and come up against another ‘petal width’ attribute test. Since the record’s value of 1.7 is greater than the 1.5 test value, we take the right branch again, but this time end up at the leaf node ‘iris-versicolor’.

So based on our decision tree roadmap, we travel down the tree and end up at the leaf ‘iris-versicolor’, which becomes the predicted class value. Given the 96% classification accuracy noted before, we’d be reasonably confident this is the correct result.
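In code, that prediction step is a single call once a tree has been fitted. Again, this is a scikit-learn (CART) sketch rather than the J48 tree pictured, so its internal splits — and possibly its answer for a borderline record like this one — may differ.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Attribute order matches iris.feature_names:
# sepal length, sepal width, petal length, petal width
record = np.array([[6.5, 3.4, 5.9, 1.7]])
print(iris.target_names[tree.predict(record)[0]])   # prints the predicted species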

DECISION TREE ADVANTAGES

One reason why decision trees are popular is that provided you can follow a map, you can easily find your way through a decision tree. What’s more — and this is important — you don’t need to be a domain expert to decipher the learning. I can tell a gum tree from grass, but that’s about it, yet I can now tell you with 96% accuracy how to determine the type of Iris flower from its petal dimensions alone thanks to our decision tree model.

DECISION TREE DRAWBACKS

However, a decision tree can only ever reproduce one map — it may be a great map that accurately predicts the class value of a large percentage of instances in your training dataset under cross-validation testing, but it’s still just one map. Chances are, it won’t catch every instance. This is where the idea of ‘ensemble learning’ and decision forests came about — if one tree can’t cover every instance, why not have many trees? Conveniently, multiple decision trees become a decision ‘forest’, part of a field known as ‘ensemble learning’.

ENSEMBLE LEARNING

Again, just as you’ll find many decision tree algorithms, there are as many, if not more, ensemble learning methods. To keep things simple, there are three key points to remember about ensemble learning and decision forests in particular.

First, these methods build multiple trees or ‘base classifiers’. The technique for finding the class of a new instance is pretty similar to what we did before with the decision tree, except this time, we test the new instance against every tree ‘map’ and record the predicted class value from every tree. Let’s say we used a particular ensemble method to generate 100 trees and we put our new instance through all 100 trees. We tally up the results and 73 trees say the class is ‘iris-versicolor’, 21 trees say ‘iris-setosa’ and six ‘iris-virginica’.

The second point is that the overall class is selected by ‘voting’. In this case, since 73 of 100 trees (73%) say ‘iris-versicolor’, we can decide this becomes the class attribute value for that instance. Not surprisingly, this method is called ‘majority voting’. Again, there are multiple voting methods — in this one, every tree has equal weighting, but other voting techniques may weight the votes from trees, giving more credence to the vote of some trees over others. These are collectively called ‘weighted voting’.
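As a quick sketch, here’s the tally from the text reproduced in a few lines of Python; a weighted scheme would simply multiply each tree’s vote by a per-tree weight (for example, its accuracy on held-out data) before counting.

from collections import Counter

# The (made-up) votes from the 100 trees described above
votes = ["iris-versicolor"] * 73 + ["iris-setosa"] * 21 + ["iris-virginica"] * 6

winner, count = Counter(votes).most_common(1)[0]
print(winner, f"({count} of {len(votes)} trees)")   # iris-versicolor (73 of 100 trees)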

Third, it’s important the ensemble trees or classifiers are diverse, that they vary in the attributes and values forming their tests. Why is this important? On a simple level, if the trees are all the same, you’re going to get the same answer from every tree. It’d be like employing ten people who only ever give you the same answer. Why not just employ one and save the wages? Seriously, it’s through tree diversity that you are more likely to discover other models that may tell you about a subgroup of instances the first decision tree model doesn’t pick up. Ultimately, the aim is greater classification accuracy. Through multiple diverse trees, you’re likely to learn more patterns in your data.
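One common way to get that diversity is ‘bagging’: train each tree on a different bootstrap sample (a random sample, drawn with replacement) of the training data. The sketch below shows just that part; Random Forest adds a further trick of considering only a random subset of attributes at each split, which isn’t shown here.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
rng = np.random.default_rng(0)

# Train ten trees, each on its own bootstrap sample of the 150 instances
forest = []
for _ in range(10):
    idx = rng.integers(0, len(iris.data), size=len(iris.data))
    forest.append(DecisionTreeClassifier().fit(iris.data[idx], iris.target[idx]))

# Each tree votes on the mystery record; the majority wins
record = [[6.5, 3.4, 5.9, 1.7]]
votes = [int(tree.predict(record)[0]) for tree in forest]
print(iris.target_names[max(set(votes), key=votes.count)])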

EXAMPLE

Click on the Preprocess tab on Weka Explorer, press the ‘Open file...’ button, choose the ‘credit-g.arff’ file from the ‘data’ subfolder and click the Open button. We’ve mentioned a couple of times the example task of determining whether or not a customer loan application represents a good or bad risk; now we’re going to try it ourselves. This ‘credit-g’ training dataset contains 1,000 records based on data from German loan customers. Each record has 20 attributes covering the customer’s details, from existing credit history and current savings to the purpose of the loan and so on.
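If you’d rather read Weka’s credit-g.arff from Python, scipy can parse ARFF files. A sketch, assuming the default Windows install path mentioned earlier (adjust the path to suit your system):

import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("C:/Program Files/Weka-3-8/data/credit-g.arff")
df = pd.DataFrame(data)

# Nominal values come back as bytes; decode them to ordinary strings
df = df.apply(lambda col: col.str.decode("utf-8") if col.dtype == object else col)

print(df.shape)                       # (1000, 21): 1000 records, 20 attributes plus the class
print(df.iloc[:, -1].value_counts())  # the class attribute: 700 good, 300 bad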

Click on the Classify tab, then the Choose button, select the Trees entry from the context menu and choose J48 as before. Tap Start, and within a second or so, it’ll come back with a model displaying a cross-validated classification accuracy of 70.5%. It doesn’t seem too bad, but look at the ‘confusion matrix’ and you’ll see the model correctly picks 117 bad loans as ‘bad’, but decides 183 bad loans are ‘good’. Imagine an antivirus app deciding 183 pieces of malware weren’t malware at all — that won’t end well.
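A rough code equivalent of this step uses scikit-learn’s CART trees and the same German credit data fetched from OpenML (where it’s published under the name ‘credit-g’), so the exact counts in the confusion matrix will differ a little from Weka’s J48 output.

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

credit = fetch_openml(name="credit-g", version=1, as_frame=True)
X = pd.get_dummies(credit.data)       # one-hot encode the nominal attributes
y = credit.target                     # 'good' or 'bad'

pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
# Rows are the true classes, columns the predicted ones
print(confusion_matrix(y, pred, labels=["good", "bad"]))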

Go back to the Choose button and this time select ‘Random Forest’. Press the Start button. Random Forest is a type of ensemble learning that, in its default form here, creates 100 trees with varying levels of diversity. Using those 100 trees, the overall classification accuracy has now risen to 76.4%, an improvement of nearly 6%.

However, look back at the ‘confusion matrix’ and of those 300 bad loans, the ‘recall’ — the proportion of bad loans predicted as ‘bad’ — has risen only slightly from 0.39 to 0.407 (122/300). Overall, both models learned are good at picking good loans, but struggle at predicting bad loans, an issue that could cost the financial institution providing the loans. Nevertheless, Random Forest does improve the result here, thanks to its multiple trees.
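The Random Forest version of the same sketch swaps in 100 voting trees and reports the recall on the ‘bad’ class alongside overall accuracy; as before, the exact figures won’t line up precisely with Weka’s.

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, recall_score

credit = fetch_openml(name="credit-g", version=1, as_frame=True)
X = pd.get_dummies(credit.data)
y = credit.target

forest = RandomForestClassifier(n_estimators=100, random_state=0)
pred = cross_val_predict(forest, X, y, cv=10)
print("accuracy:", accuracy_score(y, pred))
print("recall on 'bad':", recall_score(y, pred, pos_label="bad"))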

ENSEMBLE METHOD DRAWBACKS

Ensemble methods are incredibly useful and as popular as decision trees. However, having multiple trees — sometimes several hundred or more — means the algorithm complexity is greatly increased and the knowledge learned is harder to grasp.

If you take nothing else away from reading this, know that there’s no such thing as the perfect machine-learning algorithm. If you’re interested in machine learning, chances are you’ve heard the term ‘deep learning’, which essentially involves different forms of ‘neural networks’, a type of machine learning that mimics the brain. In the right context, neural networks are great but, because of their structure, it can be extremely difficult to glean what a neural network has actually learned. In the context of the ‘supervised learning’ we have here, where your training dataset instances are already classified, established techniques like decision trees and ensemble learning methods are often more appropriate.

Next time, we shift up a gear into the world of ‘R’. See you then.

IMAGE CAPTIONS

Iris setosa
The Iris dataset has 150 records, equally distributed into three class values.
The German credit dataset has 700 loans rated good, 300 not-so-good.
RandomForest improves the credit dataset model accuracy by nearly 6%.
Iris versicolor (D. Gordon E. Robertson, CC BY-SA 3.0)
The J48 model of the Iris dataset delivers 96% classification accuracy.
J48 and Random Forest appear in the ‘trees’ context menu.
Iris virginica (Frank Mayfield, CC BY-SA 2.0)
Decision trees use a record’s attribute values as a road map to find its class.
Right-click on the J48 result list box entry to see the actual decision tree.
