Linux Format

Machine learning bots

Dan Frost sets up an alert system using machine learning methods.

- Dan Frost leads R&D projects in edtech, and writes and podcasts about emerging tech and business at blog.thebaseline.co.

Discover how you can identify when your system’s coming under attack from malicious bots, with the help of Dan Frost’s machine-learning expertise.

The world is digital, which means that communication is digital. Thus it follows that your digital products communicate with humans and other digital systems, creating millions of events every day.

Unhelpful, nasty people can exploit this to create malicious bots that either pretend to be humans or exploit holes in your systems. If you’ve ever had to deal with a hack and done forensic analysis of what happened, you’ll know that hack attempts and malicious code stand out as suspicious in the logs.

In this article, we’ll see how to use a combination of your own system’s logs and some accessible machine learning to create a system that alerts you when something suspicious happens. In this context, that means URLs which look like an attacker is trying to exploit common security flaws.

As you work through the examples, bear in mind that we’re using machine learning to classify a stream of data and so any similar stream of data could be classified in the same way. Anything that you can turn into streams of words − be they human readable words or other specific tokens − can benefit: tweets, code activity, money transfers, database calls or any other stream.

As with any machine-learning model, success is a combination of the data and the tuning, both of which we’ll look at in this tutorial.

The data we need

The approach in the example in this article is to generate training and test data using publicly available URLs from the BBC’s RSS feeds. That provides the ‘good’ URLs, the ones that look like users browsing the site normally. We’ll also need ‘bad’ URLs, so we manipulate the good ones to look like hack attempts, appending some simple examples of common attack strings to the end of the URLs, because this is what attackers often do.

To make this useful for your site, you need to collect lots of logs from Apache or whichever web server you use, and then categorise them into ‘good’ and ‘bad’. What we look at here illustrates the process, but doesn’t provide enough data for accuracy.

In the example, we’re only using URLs and not more detailed data such as IP addresses, date, time and other details, but if you want to create a more robust and usable system, these are definitely features to add in.

Creating training data

One approach for generating training data is to categorise it yourself, but this can take some time. Fear not – we’ve created a handy script that will inject some typical nasty examples on top of some common URLs. Put the following code in a file called url_classify.py (you’ll need the feedparser module, which you can install with pip install feedparser):

import feedparser

def gen_urls():
    feeds = [
        'https://feeds.bbci.co.uk/news/uk/rss.xml',
        'http://feeds.bbci.co.uk/news/world/rss.xml',
        'http://feeds.bbci.co.uk/news/business/rss.xml',
        'http://feeds.bbci.co.uk/news/politics/rss.xml',
        'http://feeds.bbci.co.uk/news/technology/rss.xml',
        'http://feeds.bbci.co.uk/news/health/rss.xml',
        'http://feeds.bbci.co.uk/news/england/rss.xml',
        'http://feeds.bbci.co.uk/news/northern_ireland/rss.xml',
        'http://feeds.bbci.co.uk/news/scotland/rss.xml',
    ]
    urls = []
    for feed in feeds:
        d = feedparser.parse(feed)
        print("Feed loaded")
        for item in d['entries']:
            urls.append(item['links'][0]['url'])
    print("Loaded {} urls".format(len(urls)))
    return urls

def gen_hack_urls():
    bad_bits = [
        '?badbad?attr=1of98wef09e&username=wefwef',
        '../../etc/passwd',
        '../this/../does/not/look/great',
    ]
    urls = gen_urls()
    ret = []
    for u in urls[0:100]:
        ret.append(u + bad_bits[0])
        ret.append(u + bad_bits[1])
        ret.append(u + bad_bits[2])
    return ret

You can play with this by running python -i url_classify.py on the command line.

Preparing the data: tokenizing URLs

To use machine learning on words, we have to turn the words into a numerical representation. Methods for doing this include ‘bag of words’, which swaps each unique word for a number, and embeddings, which seek to find some underlying meaning in the words so that “cat” and “dog” would be similar, for example.

However, many of the libraries for turning raw word data into training data strip out the punctuation. This turns out to be a problem in our case, but an interesting problem nevertheless.
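You can see the problem with a one-line sketch: a typical word tokeniser keeps only the word characters, so every symbol that makes a URL suspicious simply vanishes.

import re

# Split on word characters only, as many text-processing libraries
# do by default - the slashes, ? and = all disappear.
print(re.findall(r'\w+', 'http://example.com/login?attr=123'))
# ['http', 'example', 'com', 'login', 'attr', '123']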

You could think of a URL as a sentence, because it reads a lot like a description of a location. For example, http://example.com/images/panda.jpg reads like, “in example.com, in images, the panda picture”. As URLs grow more complex, the symbols become as meaningful as the words, such as in http://example.com/login?redirect=dashboard/account&sess=abc123. Here, the question mark, the equals sign and the ampersand all have meaning; without them, the series of other words is ambiguous.

When looking at many hack attempts that come from scripts, the URLs are full of similarly meaningful symbols. We have to find a way to turn what would otherwise be stripped out into something that’s used. To do this, we simply swap each symbol for some kind of token. For example:

http://example.com/login?attr=123

becomes:

TKNHTTP example com TKNSLASH login TKNQM attr TKNEQ 123

The tokens – those words that start “TKN” – are now part of the sentence and can be used to train the model. To apply this to our data, we can use the following script:

def tokenize(url):
    tokens = {
        'http://': '',
        'https://': '',
        '/': 'TKNSLASH',
        'www.bbc.co.uk': 'BBCCOUK',
        'example.com': 'EXAMPLECOM',
        '?': 'TKNQM',
        '=': 'TKNEQ',
        '..': 'TKNDBLDOT',
        'uk-northern-ireland': 'uknorthernireland',
    }
    for original, replacement in tokens.items():
        url = url.replace(original, "{} ".format(replacement))
    return url

Add this to url_classify.py and fire up Python interactively with python -i url_classify.py:

tokenize('https://example.com/this/is?a=url')

which will return:

'EXAMPLECOM TKNSLASH this TKNSLASH is TKNQM a TKNEQ url'

You can add more tokens, such as the square brackets [ and ] or quotes, because that’s how people attack databases: by exploiting the way more complex systems pass information in URLs. A sketch of such an extension follows below. For our purposes here, we’ll leave the tokenisation at that, but in your application you’ll want to extract as much meaning as possible, because machine learning algorithms rely on the features (meaning) in the original data.
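As a hedged sketch – the TKN names here are our own invention, not any standard – you could add entries like these to the tokens dictionary inside tokenize():

# Hypothetical extra tokens for symbols common in injection attempts.
tokens.update({
    '[': 'TKNLSQB',
    ']': 'TKNRSQB',
    "'": 'TKNSQUOT',   # single quotes turn up in SQL injection
    '"': 'TKNDQUOT',
    ';': 'TKNSEMI',
})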

Next, in order to train the model, we need to merge the good and bad URLs and create a list of labels (0 for good, 1 for bad). This is achieved with the following method:

def gen_training_data():
    good_urls = gen_urls()
    bad_urls = gen_hack_urls()
    all_urls = []
    all_labels = []
    for u in good_urls:
        all_urls.append(u)
        all_labels.append(0)
    for u in bad_urls:
        all_urls.append(u)
        all_labels.append(1)
    return list(map(tokenize, all_urls)), all_labels

all_urls_tokenized, all_labels = gen_training_data()

This nicely brings all the data together, and makes use of the tokenize method at the end of the gen_training_data method.

Vectorize

Machine learning works on numbers, not words, and so we have to transform our tokenised URL words somehow. This is done by vectorising them, which involves transforming each line of words into a vector based on the words in that line. If you haven’t come across vectorising before, imagine a 2D chart with ‘dog’ on the X axis and ‘bone’ on the Y. Consider the sentence “My dog prefers this bone to that bone” and plot how many times each word appears in the sentence. You have vectorised part of the sentence to represent it in what’s called vector space. To vectorise documents with many hundreds or thousands of words, we do the same with many dimensions, not just the two in our simple dog and bone example.
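To make that concrete, here’s a minimal sketch using CountVectorizer from sklearn, which we install in the next section. The counts line up with the alphabetically sorted vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
counts = v.fit_transform(['My dog prefers this bone to that bone'])
print(v.get_feature_names())   # newer sklearn: get_feature_names_out()
# ['bone', 'dog', 'my', 'prefers', 'that', 'this', 'to']
print(counts.toarray())
# [[2 1 1 1 1 1 1]] - 'bone' appears twice, everything else once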

Get vectorisin­g

There are many libraries for doing this. We’re going to use one from sklearn, the Python machine learning library. If you haven’t installed it, do so with pip install scikit-learn or pip3 install scikit-learn, depending which Python version you’re using.

Next, we add the vectorisation step:

from sklearn.feature_extraction.text import CountVectorizer

all_urls_tokenized, all_labels = gen_training_data()
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
vectorizer.fit_transform(all_urls_tokenized)
train_data_features = vectorizer.transform(all_urls_tokenized)

If you aren’t familiar with vectorisation, it’s interesting to look at the features here to see how our tokenising has been interpreted, since from here on it’s all numbers. To see this, you need to run Python in interactive mode:

$ python -i url_classify.py

vectorizer.get_feature_names()

You can explore the vectorised URLs as shown in the screen grabs on the previous pages.
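As a small sketch of that exploration (assuming the vectorizer and train_data_features variables from above), you can line the feature names up against a single vectorised URL to see which tokens it contains:

feature_names = vectorizer.get_feature_names()
row = train_data_features[0].toarray()[0]   # first URL as a dense vector
for name, count in zip(feature_names, row):
    if count > 0:
        print(name, count)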

Training the model

Having prepared the data and transformed it, we’re ready to train the model. This step involves the most complex work; however, from our point of view it’s pretty opaque:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(train_data_features, all_labels)

This creates a RandomForestClassifier, a machine-learning algorithm that uses an ensemble of decision trees – a ‘forest’ – to classify data. Under the hood, each training example improves the structure of the forest so that it accurately reflects the nature of the data.

The purpose of splitting the data is to keep some of the data for training and some for testing, to see how well the newly trained model performs. Because our sample is pretty small in size and narrow in examples, it doesn’t perform that well, but we can get a sense of how to use it with:

bad_urls = gen_hack_urls()
bad_urls_tkn = list(map(tokenize, bad_urls))
bad_urls_tkn_vctr = vectorizer.transform(bad_urls_tkn)
p = model.predict(bad_urls_tkn_vctr)

What you’re looking for here is for the model to predict that these URLs are all bad (an array of 1s). Sometimes the model works, which is great. But as we see below, to get this working for your context you’re going to need to tune things and improve the data.
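The split itself is easy to do with sklearn’s train_test_split. Here’s a minimal sketch, where the 25 per cent test size and fixed random_state are arbitrary choices:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Hold back a quarter of the data so the model is scored on URLs
# it has never seen during training.
X_train, X_test, y_train, y_test = train_test_split(
    train_data_features, all_labels, test_size=0.25, random_state=42)

model = RandomForestClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # fraction of test URLs classified correctly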

So far, you’ve built the main parts of a machine learning system that could potentially spot malicious traffic. This could be used for identifying hack attempts, locking down access to services or identifying when systems were hacked in retrospect.

However, it doesn’t perform all that well because it’s a toy example, so let’s look at how you could take this basic structure and improve it.

Tuning the model

The biggest gap here is the available data. We have around 400 good URLs and 300 very similar bad URLs, which isn’t a lot to go on. Many people who run websites are sitting on a lot of web server logs, and so simply finding those logs, extracting the URLs with a script and then using those is a great place to start.

You can find hack attempts either by trawling through the logs manually or using existing open source tools to find them. You would then split the logs into two lists: good URLs and bad URLs. If you wanted to go further then you could actively try to hack your own site or use penetration testing services to generate hack attempts.
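How you pull the URLs out depends on your log format, but here’s a rough sketch for an Apache combined-format access log – the log path and regex are assumptions to adapt for your setup:

import re

# Matches the request path in a typical Apache access-log line, e.g.
# ... "GET /news/uk-12345 HTTP/1.1" 200 ...
LOG_LINE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

def urls_from_log(path):
    urls = []
    with open(path) as f:
        for line in f:
            m = LOG_LINE.search(line)
            if m:
                urls.append(m.group(1))
    return urls

urls = urls_from_log('/var/log/apache2/access.log')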

If you use open source software, you might be able to team up with other people who use the software to pool all the training data.

The URLs are just one feature of a web server log, so you can get more out of the raw log data. For example, IP addresses, time of day (which is relevant if hacks are related to time zones) and requests per minute might reveal more detail. You could add a feature to each row that counts the number of requests that came from the same IP in the last few seconds or minutes. At some point, this number might indicate that the activity is malicious, which isn’t obvious just from the URL that’s being requested.
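As a sketch of that request-rate feature, assuming you’ve parsed each log line into a (timestamp, ip) pair and sorted the events by time:

from collections import defaultdict

def recent_request_counts(events, window=60):
    # For each (timestamp, ip) event, count how many requests the
    # same IP made in the preceding `window` seconds.
    seen = defaultdict(list)   # ip -> recent timestamps
    counts = []
    for ts, ip in events:
        seen[ip] = [t for t in seen[ip] if ts - t <= window]
        counts.append(len(seen[ip]))
        seen[ip].append(ts)
    return counts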

When you have more and better data, you can also experiment with the model training hyperparameters.

For example, changing max_features may alter the accuracy, and testing over a number of values might reveal an accuracy/speed trade-off:

RandomForestClassifier(max_features=100)

The max_depth hyperparameter determines the depth of each tree, which can have a huge effect on accuracy. Good practice, although outside of what we have room for here, is to plot a hyperparameter on the X axis and the accuracy on the Y to find the optimal value for each.
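A rough sketch of that sweep, reusing the train/test split from the earlier sketch (the candidate depths are arbitrary):

for depth in [2, 4, 8, 16, 32, None]:   # None means unlimited depth
    model = RandomForestClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    print(depth, model.score(X_test, y_test))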

Put your knowledge into practice

The most important thing to do when learning machine learning is to use what you’ve built, rather than having lots of toy examples sitting around. This could simply be running a script on your server each night so that you can see first-hand what works and what doesn’t. Over time, you’ll hone the features and model parameters so you end up with something useful.
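As a closing sketch of such a nightly script – it assumes the urls_from_log helper and log path from the earlier sketch, plus the tokenize, vectorizer and model built above:

# Score today's URLs and print anything the model flags as bad.
todays_urls = urls_from_log('/var/log/apache2/access.log')
vec = vectorizer.transform(list(map(tokenize, todays_urls)))
for url, label in zip(todays_urls, model.predict(vec)):
    if label == 1:   # 1 is the 'bad' label from gen_training_data
        print('Suspicious:', url)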

The vectorizer.get_feature_names() output shows how the words are used to describe each URL.

Our two data-generation methods give us some starting training data.

Like many libraries, sklearn has a heap of ensemble methods. It’s easy to play with these once you’ve got the training data and model-building steps in place.
