OpenSource For You

Fast Text: Incredibly Fast Text Classification

FastText is a state-of-the-art, dedicated tool for very fast text classification, with accuracy that is often on par with deep learning tools. It is a library designed to help build scalable solutions for text representation and classification.

- By: Krishna Modi. The author has a B.Tech degree in computer engineering from NMIMS University, Mumbai, and an M.Tech in cloud computing from VIT University, Chennai. He has rich and varied experience at various reputed IT organisations in India. He can b

As the volume of online data keeps growing, so does the need to make sense of it, and machine learning tools are used for that purpose. A great deal of effort has gone into classifying data using deep learning tools, but unfortunately, these are often complicated procedures that consume vast CPU resources and time to produce results. fastText is among the best available text classification libraries, offering blazing fast model training and fairly accurate classification results.

Text classification is a significant task in natural language processing (NLP) as it can help us solve essential problems like filtering spam, searching the Web, page ranking, document classification, tagging and even something like sentiment analysis. Let us explore fastText in detail.

Why fastText?

fastText is an open source tool developed by the Facebook AI Research (FAIR) lab. It is a library dedicated to representing and classifying text in a scalable environment and, according to its developers, it trains and predicts much faster than comparable tools. It is written in C++ but also has interfaces for other languages like Python and Node.js.

According to Facebook, “We can train fastText on more than one billion words in less than 10 minutes using a standard multi-core CPU, and classify half a million sentences among 312K classes in less than a minute.” That kind of CPU-intensive classification would generally take hours with many other machine learning tools.

Deep learning tools perform well on small data sets, but tend to be very slow on large data sets, which limits their use in production environments.

At its core, fastText uses the ‘bag of words’ approach, which disregards the order of words. It can also use a hierarchical classifier instead of a flat one, which reduces the cost of scoring the classes from linear to logarithmic in the number of classes. This makes it much more efficient on large data sets with a high category count.
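The two ideas in the paragraph above can be illustrated with a minimal Python sketch. It is not fastText's own code: the function names are hypothetical, and the balanced binary tree is an assumption made for simplicity (a real implementation may shape the tree by class frequency, so the depth varies by class).

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count each lowercased token; word order is discarded."""
    return Counter(text.lower().split())


def flat_steps(num_classes: int) -> int:
    """A flat classifier scores every class once per prediction."""
    return num_classes


def tree_steps(num_classes: int) -> int:
    """A balanced binary tree (assumed) needs one decision per level."""
    return math.ceil(math.log2(num_classes)) if num_classes > 1 else 0


# Same words in a different order produce identical features.
same = bag_of_words("fastText is fast") == bag_of_words("fast is fastText")
print(same)

# Compare the work per prediction for a small and a very large label set.
for k in (8, 312_000):
    print(k, flat_steps(k), tree_steps(k))
```

Under these assumptions, 312,000 classes need about 19 binary decisions per prediction instead of 312,000 class scores, which is where the gain on data sets with many categories comes from.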

Comparison and statistics

To test the fastText predictions, we used an already trained model built from 9000 Web articles of more than 300 words each, with eight class labels. We plugged this model into a Python API created using the Asyncio framework, which works in an asynchronous fashion similar to Node.js. We then used the Apache benchmarking tool to measure the response time. The input was lorem ipsum text of about 500 lines, sent as a single document for text classification. No caching was used in any of the modules, so that the test results would not be skewed. We performed 1000 requests, with 10 concurrent requests each time, and got the results shown in Figure 1.
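The shape of this test (1000 requests, 10 in flight at a time) can be sketched in Python with asyncio. This is not the harness used for the article: classify_stub is a hypothetical stand-in for a call to the classification API, so the timings it reports say nothing about fastText itself.

```python
import asyncio
import time
from statistics import mean


async def classify_stub(text: str) -> str:
    """Hypothetical stand-in for one request to the classification API."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return "label"


async def run_benchmark(total: int = 1000, concurrency: int = 10) -> dict:
    """Issue `total` requests with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies_ms = []

    async def one_request() -> None:
        async with semaphore:
            start = time.perf_counter()
            await classify_stub("lorem ipsum dolor sit amet")
            latencies_ms.append((time.perf_counter() - start) * 1000)

    await asyncio.gather(*(one_request() for _ in range(total)))
    return {
        "requests": len(latencies_ms),
        "mean_ms": mean(latencies_ms),
        "max_ms": max(latencies_ms),
    }


stats = asyncio.run(run_benchmark())
print(stats)
```

Replacing the stub with a real HTTP call to the API would give figures comparable to the average and maximum response times reported by a benchmarking tool.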

The results show that the average response time was 8 milliseconds and the maximum response time was 11 milliseconds. Table 1 shows the training time required and accuracy achieved by fastText when compared to other popular deep learning tools, as per the data presented by Facebook in one of its case studies.

With a new update to the fastText library, FAIR has introduced compressed text classification models, which enable us to use the library even on low-memory devices like mobile phones and the Raspberry Pi. This technique allows models that use gigabytes of memory to come down to only a few hundred kilobytes, while maintaining comparable performance and accuracy levels.
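The article does not name the compression technique, so the following is only a back-of-the-envelope sketch that assumes product quantisation, in which each small slice of an embedding vector is replaced by a one-byte index into a shared codebook. All sizes (one million rows, 100 dimensions, slices of two values, 256 centroids) are illustrative assumptions, not figures from fastText.

```python
def dense_bytes(rows: int, dim: int, bytes_per_float: int = 4) -> int:
    """Memory for an uncompressed float32 embedding matrix."""
    return rows * dim * bytes_per_float


def pq_bytes(rows: int, dim: int, sub_dim: int = 2,
             centroids: int = 256, bytes_per_float: int = 4) -> int:
    """Memory under the assumed product quantisation scheme."""
    sub_vectors = dim // sub_dim
    # One byte per slice, since 256 centroids fit in a single byte.
    codes = rows * sub_vectors
    # One codebook per slice position, each holding `centroids` small vectors.
    codebooks = sub_vectors * centroids * sub_dim * bytes_per_float
    return codes + codebooks


rows, dim = 1_000_000, 100  # hypothetical model size
before = dense_bytes(rows, dim)
after = pq_bytes(rows, dim)
print(before, after, round(before / after, 1))
```

Under these assumptions the matrix shrinks by roughly a factor of eight. Reaching a few hundred kilobytes would need further steps, such as discarding rarely useful features, which this sketch does not model.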

Now that we know how well fastText can perform, let’s set it up.

Configuration and usage

It is quite simple to set up fastText. There are two ways to do this – either get the source and build it yourself, or install the Python interface for it and get started. Let’s look at both methods.

Building from the source code: You just need to get the source code from the Git repository, https://github.com/facebookresearch/fastText.git. Then go to the directory and run make, which should compile the code and generate the fastText executable for you. The output should be as shown in Figure 2.

Installation using the Python interface: This is the recommended method, as you can use it later for training and prediction purposes in the same Python script.

The Python module for fastText requires Cython to be installed. Execute the following commands to install Cython and fastText:

pip install cython
pip install fasttext

And you are done! Just import the fasttext module and use a pretrained model to start predicting the classes.

import fasttext

# Load a pretrained classification model from disk
model = fasttext.load_model('model.bin')
texts = ['fastText is really amazing', 'I love fastText']
# Predict the most likely class label for each text
labels = model.predict(texts)
print(labels)
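A pretrained model like the one loaded above has to be trained on labelled text first. fastText's supervised mode reads plain-text files with one example per line, where each label carries a prefix (`__label__` by default). Here is a minimal sketch of preparing such lines; the helper name and the sample data are hypothetical.

```python
def to_fasttext_line(label: str, text: str, prefix: str = "__label__") -> str:
    """Format one training example as '<prefix><label> <text>' on one line."""
    clean_label = label.strip().replace(" ", "_")  # a label must be a single token
    clean_text = " ".join(text.split())            # collapse newlines and extra spaces
    return f"{prefix}{clean_label} {clean_text}"


# Hypothetical labelled examples
examples = [
    ("positive", "fastText is really amazing"),
    ("positive", "I love\nfastText"),
    ("negative", "Training took far too long"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

Writing these lines to a file gives an input for the module's supervised training function. The exact call differs between releases of the Python module, so check the documentation of the version you have installed.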

You can also refer to the Python fastText module documentation at https://pypi.python.org/pypi/fasttext for more details.

With the latest machine learning tools like fastText for text classification, you can certainly expect amazing products that utilise these capabilities, particularly in the field of artificial intelligence.

Figure 1: Benchmarking with fastText

Figure 2: Output
