FUTURE OF GPUS
Why it’s all change for graphics cards
GRAPHICS CARDS have come a long way since the days of 3dfx Voodoo cards. No one expects fixed pipelines with fixed functions in this day and age, but back at the dawn of 3D acceleration, they were a necessity. Simpler, happier days. Modern graphics cards are a brave new world; programmable capabilities have opened up fresh avenues that go beyond rendering polygons. Entire server farms now exist solely to power banks of graphics cards and their beating GPU hearts.
In this feature, we delve into the world of modern GPUs, and explore how their flexible and super-powerful new designs can vastly outstrip standard processors and power new applications, from render farms and ray tracing to cryptomining and deep-learning applications.
It’s no wonder Nvidia has seen its server division’s revenue increase tenfold over the last few years, from around $70 million in 2016 to $700 million today. That’s still less than half its gaming revenue, but the increase is an indication of just how vital this area has become to the company as a business, and how much demand there is. And underpinning all the hardware are software technologies that enable developers to leverage the abilities of GPUs in a flexible manner, so they can be treated more like general-purpose processors, coded in high-level languages, rather than as specialized low-level hardware that requires assembly language to get working.
Over the next few pages, we’re going to take a look at all of this—how the hardware works so fantastically fast, how the software has developed alongside it, and how it’s all being utilized in amazing new ways.
LET’S START WITH the hardware. What turned those single-use, fixed-pipeline, fixed-function graphics cards of the past into the GPGPU (general-purpose graphics processing unit) compute beasts of today? The simple answer is shaders. It’s not the only answer, of course—alongside the development of shader hardware, process technology and therefore the complexity of graphics card architecture continued to increase exponentially. It also helped that there was a massive demand for the new technology, and in new areas.
As a rough outline, the GeForce 256 (1999) had 17 million transistors on a 220nm process, and the GeForce 2 (2000) had 25 million transistors on a 180nm process; both were non-shader DirectX 7.0 parts. The DirectX 8.0 GeForce 3 (2001) doubled this transistor count to 57 million, despite being slower than the GeForce 2 Ultra at times. We’ll skip the GeForce 4, as it actually reduced its transistor count, ruining our narrative. The DirectX 9.0 GeForce FX (2003, so historically GeForce 5) jumped the transistor count massively to 135 million, and this path continues to today’s multibillion-transistor cards—sheesh. Moore’s Law: no real surprise, right?
So, what are all those transistors doing? As you’ll probably be aware, a shader is effectively a tiny, limited program that can process certain types of data stored by the graphics card. We’re not here to talk about how the graphics pipeline works, or indeed graphics card architecture, so we’re going to gloss over this to a large degree, but in order to understand the capabilities and limitations of a GPGPU, it’s going to be handy to know a little about shaders.
DirectX 8.0 (2000) is when shader technology landed for consumers—cast your mind back to the GeForce 3 and ATI Radeon 8500. Pixel and vertex shaders were implemented as separate units and were super-limited. The v1.1 pixel shader was limited to a program length of 12 instructions, with only 13 address and 8 color operations to choose from. The vertex shader had no branching, and a maximum 128-instruction program length. By the time we reached DirectX 9.0c (2005, Radeon X1x00, and GeForce 6), the specification allowed for 65,536 executed instructions, dynamic branch control, and plenty more besides with Shader Model 3.0.
Even at this stage, the GPU could still be thought of as a fixed pipeline; the clever stuff was starting to happen inside those standalone pixel and vertex shaders, but they could still only really be used for one job. However, the potential was shining through. As an aside, just consider that the original ARM instruction set—which uses a Reduced Instruction Set Computer architecture—had around 22 machine code instructions (plus eight pseudo-operations), which was enough to power a full desktop computer. The potential of GPGPU, even if the instruction set is limited to, say, just 12 instructions, lies in its vast parallelization, execution speeds, and fast local memory store.
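To make that concrete, here’s a minimal sketch in CUDA C (our own illustration, not tied to any particular product): a kernel whose body compiles down to just a handful of instructions, yet which the GPU stamps out across as many threads as there are data elements.

```cuda
#include <cuda_runtime.h>

// SAXPY: y = a*x + y. The body is only a few machine instructions,
// but the GPU runs one copy per element, thousands at a time.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) y[i] = a * x[i] + y[i];
}

// Launch enough 256-thread blocks to cover a million elements:
// saxpy<<<(1000000 + 255) / 256, 256>>>(1000000, 2.0f, d_x, d_y);
```

A CPU would chew through those million elements a few at a time; the GPU dispatches thousands of copies of that tiny program simultaneously, which is where the raw throughput comes from.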
With DirectX 10 (2007), unified shaders became a reality. Technology had moved on enough that it was possible to implement a single, sufficiently flexible compute architecture that could handle pixel, vertex, and geometry shading in one. For example, the GeForce 8800 GTX, launched at the end of 2006, had almost 700 million transistors, and supported DirectX 10.
By this point, research into utilizing GPGPU capabilities was starting to return practical solutions. The first stabs at utilizing the matrix capabilities of a GPU had begun back in 2001, with the first hardware implementations of shaders available at the time. Around 2005, the first practical use that ran faster on a GPU than a CPU was widely available—it was a matrix operation called LU factorization.
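To see why LU factorization was such a natural early win, consider its inner loop: at each step, a rank-1 update is applied to the entire trailing submatrix, and every element of that update is independent. A minimal CUDA sketch (our own, with no pivoting, so strictly illustrative) might look like this:

```cuda
// In-place LU factorization of an n x n matrix, no pivoting (illustrative only).
// Afterward, A holds U in its upper triangle and L (unit diagonal) below it.

__global__ void scale_column(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (i < n) A[i * n + k] /= A[k * n + k];  // compute the L multipliers
}

__global__ void trailing_update(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + k + 1;
    if (i < n && j < n)
        A[i * n + j] -= A[i * n + k] * A[k * n + j];  // rank-1 update, in parallel
}

void lu_factorize(float* d_A, int n) {  // d_A is a device pointer
    for (int k = 0; k < n - 1; ++k) {
        scale_column<<<(n + 255) / 256, 256>>>(d_A, n, k);
        dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
        trailing_update<<<grid, block>>>(d_A, n, k);
    }
}
```

That trailing update is O(n²) work per step with no dependencies between elements—exactly the shape of problem that swamps a serial CPU but maps neatly onto thousands of GPU threads.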
DirectX 11 was released as part of Windows 7 in 2009, although technical previews were available from mid-2008. From our point of view, the interesting inclusion was compute shaders: a new specification designed to let non-graphical applications access and use GPU resources.
As we’ve alluded to, when shaders first appeared in hardware, they were simple enough that developers were willing to hand-write routines in assembly code for them. Even so, that’s far from ideal, and as the complexity of the hardware shader implementations increased, so did the need for software abstraction, which is the fancy way of saying that we needed high-level language support.
First to the API party was Nvidia, with its canny CUDA, released to the public in June 2007. It’s possible that your only encounter with CUDA is in connection with that other Nvidia technology, PhysX—remember when you could buy a separate add-in card to speed up your game’s physics? While PhysX started as its own technology, GPGPU capabilities swallowed it up—that’s just one example of a task that’s become GPGPU-accelerated. CUDA largely started with one smart chap called Ian Buck, whom Nvidia went on to employ as director of its GPGPU division. At Stanford University, he’d been developing a PhD project called BrookGPU, which was defining a language to program GPUs—this later became the basis of OpenCL.
At Nvidia, he took the chance to revisit Brook and tackle its limitations, the key one being restricted memory access patterns. The Nvidia C for CUDA extensions threw away those limitations, and provided a huge pool of threads that could access the GPU memory any way the coder wanted, with a full implementation of C’s language semantics. CUDA provides a full development toolkit for programming Nvidia’s CUDA cores, including a compiler, debugger, and libraries. It supports C, Fortran for science projects, Java, and Python. It also offers APIs for other GPGPU frameworks, such as OpenCL and Microsoft’s DirectCompute.
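As a flavor of what that freedom buys you, here’s a complete toy CUDA program (our own sketch, not from Nvidia’s documentation) that performs a gather—each thread reads from an arbitrary, data-dependent address, exactly the kind of access pattern the original stream model restricted:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Gather: out[i] = in[idx[i]]. Each thread follows its own pointer into
// memory -- an access pattern stream languages like Brook couldn't express.
__global__ void gather(const float* in, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

int main() {
    const int n = 8;
    float in[n]  = {10, 11, 12, 13, 14, 15, 16, 17};
    int   idx[n] = {7, 0, 3, 3, 1, 6, 2, 5};   // arbitrary, data-dependent indices
    float out[n];

    float *d_in, *d_out; int *d_idx;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemcpy(d_in,  in,  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, idx, n * sizeof(int),   cudaMemcpyHostToDevice);

    gather<<<1, n>>>(d_in, d_idx, d_out, n);

    cudaMemcpy(out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%.0f ", out[i]);  // 17 10 13 13 11 16 12 15
    printf("\n");
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_idx);
}
```

Save it as gather.cu and build it with the toolkit’s compiler: nvcc gather.cu -o gather.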
Nvidia really powered ahead in the market, and was the go-to technology in the industry when it came to GPGPU compute power. Being first to market with its own top-flight implementation of a parallel computing platform really gave Nvidia a head start over everyone else—it’s a strategy at which Nvidia excels.
In contrast, there’s OpenCL (Open Computing Language). This is the open industry standard for parallel computing, originally set up by Apple—which, oddly enough, announced at its 2018 developer conference that it will be dropping support for OpenGL and OpenCL on all Apple OSes—and it is supported by everyone (bar Apple now), including Nvidia. Of more direct interest to Maximum PC readers is the fact that it’s AMD’s choice of platform for its stream processors.
OpenCL was released a little over two years after CUDA, in August 2009. So, it came late to the party, and had the lofty goal of creating programs that would run on all devices, including standard CPUs, GPUs, DSPs, FPGAs, and more. It’s based around an implementation of C/C++, and APIs for other languages—including Python, Java, and .NET—are available. This flexibility means OpenCL is even available for certain Android 7.1 devices, and developers are able to balance loads over multiple available GPU and CPU threads.
As you might imagine, OpenCL’s later appearance and more open design goals did impact its early performance against CUDA. A study at Delft University from 2011 showed that OpenCL suffered a 30 percent performance hit compared to CUDA implementations, although hand-tuning could mitigate the difference to a degree; other studies from the same period put the hit as high as 67 percent. However, things have progressed, and more recent implementations running on AMD cards have managed to close this performance gap—a 2017 report (www.blendernation.com/2017/04/12/blendercycles-opencl-now-par-cuda) for Blender 2.79 showed OpenCL on a par with CUDA performance, although CUDA is still the fastest option on Nvidia hardware.
AROUND THE BLEND
At this point, you might be wondering what GPGPU has got to do with you—beyond the same hardware painting pretty 3D scenes on your monitor at 120fps, that is. It’s certainly taken a good decade or so, but coders are now well aware of GPGPU capabilities, and the technologies to leverage them. You’ll find a range of applications out there that can take full advantage of not just Nvidia cards with their CUDA technology, but also AMD and even Intel GPUs (though the latter are often no faster than the CPU itself), utilizing OpenCL.
Because of the head start Nvidia built for itself, you do tend to find more CUDA software options available, but you’ll discover that anything developed in the last five years should be offering the same capabilities for OpenCL, too. So, let’s take a quick look at a few interesting software options you can play with now.
To kick off, we’re going to suggest the hugely impressive open-source project Blender. It’s a prime example of a key GPGPU use case—ray tracing—while the open-source nature of Blender has meant there is now a burgeoning market in low-cost cloud render farms, which is one of the businesses driving Nvidia’s data center division.
Blender is a monstrous package to master, but you can fire off a test render easily enough. A key strategy of the Blender Foundation is to provide all art resources on an open license for everyone to enjoy—grab and install it from www.blender.org/download, then scroll down the page, and head to “Demo Files,” where you can also grab the famous BMW Benchmark file. If Blender is installed, you can just extract and run the appropriate CPU or GPU file—the main difference is the optimized tile size used. In Blender, press F12 to start a render. If you want to fiddle with the right-hand “Sampling” palette options, we’d reduce “Samples” from 35 to 15—this trades quality for speed. Under “Performance,” you’re able to adjust tile size, too; 32–64 is ideal for CPUs, while 256 is better for GPUs.
For reference, we found our CPU was almost four times slower than the GPU here (CPU 4:39, GPU 1:22), but your results will inevitably vary.
FRACTAL FUN
The long-standing OpenCL program Mandelbulber (http://mandelbulber.com) is a 3D fractal explorer that supports OpenCL on AMD and Nvidia hardware. Download the latest 64-bit build from SourceForge, and run it. Click the “Render” button to see how slow things are on your processor. Use “File > Program Preferences > OpenCL” to select a suitable platform (AMD or Nvidia) and your GPU device. On Nvidia hardware, you might initially see an error, as it defaults to AMD. Hit “Render” again and things should be far quicker—if the colors go an odd purple, change the OpenCL mode to “Medium” or “Full,” rather than “Fast,” over in the right-hand Navigation palette. We’re not going to try explaining any of this program—it’s just a bit of fun flying around abstract 3D fractals—but if you’re interested, there’s a whole community around it at http://fractalforums.org.
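Why are fractals such a natural fit? Every pixel is computed independently, so the GPU can work on thousands at once. As a hedged sketch, here’s a classic 2D Mandelbrot kernel in CUDA (our own toy, far simpler than Mandelbulber’s ray-marched 3D fractals, but the same embarrassingly parallel shape):

```cuda
// One thread per pixel: iterate z = z^2 + c and record how fast it escapes.
__global__ void mandelbrot(unsigned char* out, int w, int h, int max_iter) {
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;

    float cr = -2.5f + 3.5f * px / w;    // map pixel to the complex plane
    float ci = -1.25f + 2.5f * py / h;
    float zr = 0.0f, zi = 0.0f;
    int it = 0;
    while (zr * zr + zi * zi <= 4.0f && it < max_iter) {
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++it;
    }
    out[py * w + px] = (unsigned char)(255 * it / max_iter);  // grayscale value
}
```

Launch it with a 2D grid of 16×16 blocks covering the image and every pixel gets its own thread; no pixel ever needs to know about its neighbors.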
As we’ve mentioned, GPGPU is fantastic at tackling repetitive, math-intensive problems. One area that fits the bill perfectly is the field of encryption. Generally speaking, there’s no need for GPGPU acceleration of encryption itself, as that’s taken care of by CPU-based AES hardware acceleration, but one area that can take full advantage is hacking encrypted files.
The most well-known software in this area is the open-source program John the Ripper (www.openwall.com/john), but being contrarians, we’re going to look at the Windows (and Android) based Hash Suite (http://hashsuite.openwall.net), developed by a John the Ripper contributor. Both are designed to brute-force crack password-protected files—that is, to guess a file’s password, the problem being that the permutations for any non-trivial-length password are vast. If you fire up Hash Suite and open its top-left menu, there’s a benchmark feature; the program automatically takes advantage of AMD or Nvidia hardware. You’ll see a general 20-fold increase in speed over using your processor alone, at least on our ancient Core i5-2500K versus GTX 950 system.
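To see why cracking parallelizes so well, consider this toy CUDA sketch (our own illustration using a deliberately weak djb2 hash, nothing like Hash Suite’s real algorithms): every candidate password gets its own thread, and each thread hashes and compares independently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately weak toy hash (djb2) -- real crackers target MD5, SHA, NTLM, etc.
__host__ __device__ unsigned int djb2(const char* s, int len) {
    unsigned int h = 5381;
    for (int i = 0; i < len; ++i) h = h * 33u + (unsigned char)s[i];
    return h;
}

// One thread per candidate: decode index -> "aaaa".."zzzz", hash, compare.
__global__ void crack(unsigned int target, int len, int total, int* found) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= total) return;
    char cand[8];
    int x = idx;
    for (int i = 0; i < len; ++i) { cand[i] = 'a' + x % 26; x /= 26; }
    if (djb2(cand, len) == target) *found = idx;  // benign race: any hit is a hit
}

int main() {
    const int len = 4, total = 26 * 26 * 26 * 26;   // 456,976 candidates
    unsigned int target = djb2("mole", len);        // hash of the "unknown" password
    int *d_found, found = -1;
    cudaMalloc(&d_found, sizeof(int));
    cudaMemcpy(d_found, &found, sizeof(int), cudaMemcpyHostToDevice);

    crack<<<(total + 255) / 256, 256>>>(target, len, total, d_found);

    cudaMemcpy(&found, d_found, sizeof(int), cudaMemcpyDeviceToHost);
    if (found >= 0) {
        char s[8] = {0};
        for (int i = 0; i < len; ++i) { s[i] = 'a' + found % 26; found /= 26; }
        printf("cracked: %s\n", s);
    }
    cudaFree(d_found);
}
```

Scale the alphabet up and the length out and the search space explodes—which is exactly why a 20-fold GPU speedup matters so much here.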
GPGPU tech is currently exploding, perhaps not so much in homes—where its main focus remains powering 3D graphics within PCs, phones, and consoles—but in data centers, high-performance computing, research labs, and beyond. That’s going to benefit all of us, as Nvidia, Intel, and AMD pour even more research effort into enhancing these already powerful processors, and developers get to leverage them with CUDA and OpenCL.
Boring, repetitive math is the bread and butter of a GPGPU.
Generating fractals is a cakewalk for GPGPUs.
Ray tracing is ripe for a bit of GPGPU acceleration, and Blender is available for all.
Math did that!
In a small warehouse live 3 exaops of computing power. Unbelievable. This is what the inside of a modern deep-learning supercomputer node looks like. Spot the Tesla V100s!
Nvidia’s dedicated data center GPU, the Tesla V100, is capable of 125 teraflops.
Intel hopes to crack the deep-learning market in 2019, with its Nervana processor.