MapReduce: the ‘Big Data’ idea inside your Android phone
It’s a common buzzword in ‘big data’, but what is MapReduce, how does it work and why is it in your Android phone?
It’s a few years old now, but IBM’s oft-quoted statement that 90% of the world’s data was created in the previous two years is mind-boggling when you think about it. Basically, we’re swimming in a sea of data that’s rising at a breathtaking rate. As PC users, we’re used to the idea of multi-core CPUs and multi-threaded apps. However, when it comes to applying machine learning to this ‘big data’, new processing ideas are needed. What might surprise you is that some of those ideas have made their way into your Android phone.
WHAT IS MAPREDUCE?
Many machine-learning algorithms, like most ‘decision tree’ and ‘forest’ (collections of trees) algorithms, require all of the data to fit into a computer’s system memory. Before cloud computing, that wasn’t always possible with big data, so ‘distributed computing’ was born. Here, groups or ‘clusters’ of computers each handle part of the data and the results are recombined at the end. The benefits are speed and cost – by spreading the data over multiple computers, the work is completed faster and it can be done on cheaper hardware. If you’ve tried SETI@home (setiathome.berkeley.edu) or similar experiments, you’ve already seen this ‘distributed computing’ idea in action.
MapReduce has been a bit of a ‘big data’ machine-learning buzzword over the last decade or so. It takes its name from its two main functions, ‘map’ and ‘reduce’, which are used to process data in distributed environments.
This will be a bit of a simple ‘drawn with crayons’ view of MapReduce, but imagine your data exists as a typical spreadsheet with rows and columns and a single point of data in each cell. The Map function essentially allows a processing task or algorithm to be executed once on each cell, with each result transferred or ‘mapped’ to the matching cell of a new spreadsheet. So for example, if you start with a spreadsheet with 100 rows and 200 columns, you end up with a second spreadsheet of the same size holding the processed values. The Reduce function enables another processing task to combine or ‘reduce’ groups of cells to a single result at the end. A simple example often used to describe this is counting the frequency of words in a document. The Map function splits the document into separate words, while the Reduce function counts up the occurrences of each word.
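The word-count example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular MapReduce framework is written; the function names and the sample sentence are made up for the purpose.

```python
from collections import defaultdict

def map_words(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_counts(pairs):
    """Reduce step: add up the 1s emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical sample input, invented for this sketch.
document = "the quick brown fox jumps over the lazy dog the end"
word_counts = reduce_counts(map_words(document))
print(word_counts["the"])  # prints 3
```

Note that the map step knows nothing about totals and the reduce step knows nothing about splitting text, which is what lets the two halves be scaled separately.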
However, the key thing about the MapReduce framework is that it can be processed in parallel – you can throw as many processor cores at it as you have to speed things up.
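To see why the work parallelises so easily, here is a rough Python sketch that splits the input into independent chunks, maps each chunk on its own worker and then reduces the partial results into one. The function names and sample lines are invented for illustration. It uses a thread pool purely to show the structure; in standard Python, threads won’t actually speed up CPU-bound work like this, so treat it as a diagram in code rather than a benchmark.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_chunk(lines):
    """Map step for one chunk: count the words in this chunk only."""
    return Counter(word.lower() for line in lines for word in line.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

def parallel_word_count(lines, workers=4):
    # Deal the lines out into `workers` independent chunks.
    chunks = [lines[i::workers] for i in range(workers)]
    # Each chunk is mapped without reference to any other chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_chunk, chunks))
    # The partial results are folded together at the end.
    return reduce(merge_counts, partial_counts, Counter())

# Hypothetical sample input, invented for this sketch.
lines = ["big data big ideas", "data in your pocket", "big phone"]
totals = parallel_word_count(lines)
print(totals["big"])  # prints 3
```

Because no chunk depends on another, the number of workers changes only how the work is shared out, not the final answer.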
SO, WHAT ABOUT ANDROID?
Take a look at either the Samsung Galaxy S10 or Google’s new Pixel 4 XL phone and each has a variant that includes Qualcomm’s new Adreno 640 graphics processing unit (GPU), with its whopping 384 numeric pipelines or ‘arithmetic logic units’ (ALUs). By contrast, the old Galaxy S5 uses an Adreno 330 GPU with only 128 ALUs. That’s still 128 processing units able to work in parallel, yet they’re often employed only when rendering 3D images for your favourite games.
The idea of ‘general-purpose computing on graphics processing units’ or ‘GPGPU’ came about to take advantage of the GPU’s ability to process simple mathematical tasks in bulk and to apply it to areas other than gaming. The most obvious example in the last five years has been the boom in Bitcoin mining, where multiple graphics cards are often crammed into PC boxes to process blockchain sequences and make new Bitcoins.
However, this highly parallelised processing of relatively simple mathematical tasks isn’t limited to PCs – it’s also available in Android devices, thanks to a little-known framework called ‘Renderscript’. It’s supported in Android versions going back to Android 2.3/Gingerbread. What’s clever about Renderscript is that it allows you to develop code that can run on a phone’s CPU or GPU cores without you worrying about which core runs it, when or how. Android takes care of those details, which also means it’s Android, not you, that decides whether a CPU core or a GPU core runs your code.
To use the Renderscript framework, you write an algorithm or ‘kernel’ function that the Android device executes on your data. But here’s the thing: Renderscript supports two standard types of kernel – a ‘mapping’ kernel and a ‘reduction’ kernel (sound familiar?).
HOW ANDROID USES MAPREDUCE
In Renderscript, a mapping kernel is a transformation function that is executed once for each element in a block of memory Google calls an ‘allocation’. You can think of an allocation as a data array, whether a one-dimensional list or a two-dimensional array like a spreadsheet.
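As a rough mental model only, the two kernel types can be mimicked in plain Python. To be clear, this is not Renderscript code and not its API: the names `for_each` and `reduce_all`, the `double` kernel and the sample data are all invented here, and a nested list stands in for a two-dimensional allocation.

```python
from functools import reduce

def for_each(kernel, allocation):
    """Mapping: run `kernel` once on every element, keeping the 2D shape."""
    return [[kernel(value) for value in row] for row in allocation]

def reduce_all(combine, allocation, initial):
    """Reduction: fold every element of the allocation into a single result."""
    flat_values = (value for row in allocation for value in row)
    return reduce(combine, flat_values, initial)

def double(value):
    """Hypothetical kernel: a simple per-element transformation."""
    return value * 2

# A nested list stands in for a 2D allocation (an assumption of this sketch).
allocation = [[1, 2, 3], [4, 5, 6]]
doubled = for_each(double, allocation)
total = reduce_all(lambda a, b: a + b, doubled, 0)
print(doubled)  # prints [[2, 4, 6], [8, 10, 12]]
print(total)    # prints 42
```

The mapping kernel never looks at neighbouring elements, which is what leaves the system free to run it on many cores at once.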
Here’s a quick quiz – what data structure does your phone often generate that appears for all the world like a big spreadsheet? If you said ‘photos’, give yourself a prize.
A digital photo is essentially a large two-dimensional spreadsheet where each ‘cell’ is a pixel holding a 24-bit number, combining three eight-bit blocks identifying the red, green and blue colour components.
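Here’s a small Python sketch of how three eight-bit values fit into one 24-bit pixel, plus a simple per-pixel transformation of the kind a mapping kernel might apply. The byte order (red in the top eight bits, blue in the bottom) is an assumption made for this example, and the function names are invented; real image formats vary and often add a fourth ‘alpha’ byte.

```python
def pack_rgb(red, green, blue):
    """Combine three 8-bit components into one 24-bit pixel (assumed RGB order)."""
    return (red << 16) | (green << 8) | blue

def unpack_rgb(pixel):
    """Split a 24-bit pixel back into its (red, green, blue) components."""
    return (pixel >> 16) & 0xFF, (pixel >> 8) & 0xFF, pixel & 0xFF

def invert(pixel):
    """Hypothetical per-pixel kernel: flip all 24 bits to get the negative."""
    return pixel ^ 0xFFFFFF

orange = pack_rgb(255, 165, 0)
print(hex(orange))                 # prints 0xffa500
print(unpack_rgb(orange))          # prints (255, 165, 0)
print(unpack_rgb(invert(orange)))  # prints (0, 90, 255)
```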
In fact, Google uses digital images in its programming examples for Renderscript, with mapping kernels applying real-time transformations to them.
If you’re interested in trying them out, you’ll need an Android device, plus the latest version of Android Studio, which you’ll find at https://developer.android.com/studio.
Google provides these examples at https://github.com/android/renderscript-samples.
The ‘BasicRenderScript’ example allows you to change the colour or ‘hue’ of an image in real time using a slider control, while the ‘RenderScriptIntrinsic’ example similarly allows you to apply various visual effects to an image, including blur, emboss and hue – again, all in real time with parallel processing using your phone’s GPU and/or CPU cores.
If you want to find out more about Renderscript, head to the Google Developers’ website at developer.android.com/guide/topics/renderscript/compute. Renderscript has a setup overhead, meaning it takes a certain amount of time to set up before the parallel processing takes place. That also means it’s not ideal for every application, particularly where only small amounts of data are to be processed (here, normal code running on the CPU would likely be more efficient). Still, GPGPU capability on a phone is pretty cool.
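The setup-overhead trade-off can be put into a back-of-the-envelope Python model. Every number below is hypothetical and chosen only to make the arithmetic easy to follow; none of them is a measurement of Renderscript or of any real phone. The model also assumes the work divides perfectly across the parallel units, which real hardware rarely manages.

```python
import math

def serial_time(items, per_item):
    """Time to process every item one after another on a single core."""
    return items * per_item

def parallel_time(items, per_item, units, setup):
    """Fixed setup cost, then the items shared evenly across `units` (idealised)."""
    return setup + items * per_item / units

def break_even_items(per_item, units, setup):
    """Workload size at which the parallel path stops being slower."""
    # Solve setup + n * per_item / units = n * per_item for n.
    return setup / (per_item * (1 - 1 / units))

# Hypothetical figures: 1 microsecond per item, 128 parallel units,
# 1000 microseconds of setup.
PER_ITEM, UNITS, SETUP = 1, 128, 1000

print(math.ceil(break_even_items(PER_ITEM, UNITS, SETUP)))  # prints 1008
print(parallel_time(500, PER_ITEM, UNITS, SETUP) > serial_time(500, PER_ITEM))    # prints True
print(parallel_time(2000, PER_ITEM, UNITS, SETUP) < serial_time(2000, PER_ITEM))  # prints True
```

With these made-up figures, a job of 500 items is quicker on a single core, while one of 2,000 items comes out ahead on the parallel path.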
MORE THAN ONE USE
MapReduce scored its fame as a buzzword thanks largely to Hadoop, the open-source Java-based big-data distributed computing environment. These days, Hadoop seems to be on the decline, due to the combination of cloud computing and faster alternatives, principally Apache Spark. However, the MapReduce framework is still incredibly useful – and the fact is, thanks to Renderscript, you’re likely carrying it around in your pocket.