FUTURE OF GPUS

Why it’s all change for graphics cards

Maximum PC - FRONT PAGE - NEIL MOHR

GRAPHICS CARDS have come a long way since the days of 3dfx Voodoo cards. No one expects fixed pipelines with fixed functions in this day and age, but back at the dawn of 3D acceleration, they were necessities. Simpler, happier days. Modern graphics cards are a whole brave new world; programmable capabilities have opened up fresh avenues that go beyond rendering polygons. Entire server farms now exist solely to power banks of graphics cards and their beating GPU hearts.

In this feature, we delve into the world of modern GPUs, and explore how their flexible and super-powerful new designs can vastly outstrip standard processors and power new applications, from render farms and ray tracing to cryptomining and deep learning.

It’s no wonder Nvidia has seen its server division’s revenue increase 10-fold over the last few years, from around $70 million in 2016 to $700 million today. That’s still less than half its gaming revenue, but the increase is an indication of just how vital this area has become to the company as a business, and how much demand there is. And underpinning all the hardware are software technologies that enable developers to leverage the abilities of GPUs in a flexible manner, so they can be treated more like a general-purpose processor, coded with high-level languages, rather than specialized low-level hardware requiring assembly language to get it working.

Over the next few pages, we’re going to take a look at all of this—how the hardware works so fantastically fast, how the software has developed alongside it, and how it’s all being utilized in amazing new ways.

LET’S START WITH the hardware. What turned those single-use, fixed-pipeline, fixed-function graphics cards of the past into the GPGPU (general-purpose graphics processing unit) compute beasts of today? The simple answer is shaders. It’s not the only answer, of course—alongside the development of shader hardware, process technology and therefore the complexity of graphics card architecture continued to increase exponentially. It also helped that there was a massive demand for the new technology, and in new areas.

As a rough outline, the GeForce 256 (1999) had 17 million transistors on a 220nm process, and the GeForce 2 (2000) had 25 million transistors on a 180nm process; both were non-shader DirectX 7.0 parts. The DirectX 8.0 GeForce 3 (2001) doubled this transistor count to 57 million, despite being slower than the GeForce 2 Ultra at times. We’ll skip the GeForce 4, as it actually reduced its transistor count, ruining our narrative. The DirectX 9.0 GeForce FX (2003, so historically GeForce 5) jumped the transistor count massively to 135 million, and this path continues to today’s multibillion-transistor cards—sheesh. Moore’s Law: No real surprise, right?

So, what are all those transistors doing? As you’ll probably be aware, a shader is effectively a tiny, limited program that can process certain types of data stored by the graphics card. We’re not here to talk about how the graphics pipeline works, or indeed graphics card architecture, so we’re going to gloss over this to a large degree, but in order to understand the capabilities and limitations of a GPGPU, it’s going to be handy to know a little about shaders.

DirectX 8.0 (2000) is when shader technology landed for consumers—cast your mind back to the GeForce 3 and ATI Radeon 8500. Pixel and vertex shaders were implemented as separate units and were super-limited. The v1.1 pixel shader was limited to a program length of 12 instructions, with only 13 address and 8 color operations to choose from. The vertex shader had no branching, and a maximum 128-instruction program length. By the time we reached DirectX 9.0c (2005, Radeon X1x00, and GeForce 6), the specification allowed for 65,536 executed instructions, dynamic branch control, and plenty more besides with Shader Model 3.0.

Even at this stage, the GPU could still be thought of as a fixed pipeline; the clever stuff was starting to happen inside those standalone pixel and vertex shaders, but they could still only really be used for one job. However, the potential was shining through. As an aside, just consider that the original ARM instruction set—which uses a Reduced Instruction Set Computer architecture—had around 22 machine code instructions (plus eight pseudo operations), which was enough to power a full desktop computer. The potential of GPGPU, even if the instruction set is limited to, say, just 12 instructions, is there in its vast parallelization, execution speeds, and fast local memory store.
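To make that idea concrete, here is a minimal sketch of the kind of "tiny program" a shader represents, written as a modern CUDA kernel rather than DirectX 8 shader assembly (the kernel name and image layout are our own, purely for illustration). One thread handles one pixel; the GPU’s strength is launching millions of these simultaneously.

// A per-pixel grayscale conversion: the modern analogue of a tiny, limited
// shader program. Each thread processes exactly one pixel of an RGBA image.
__global__ void to_grayscale(const uchar4 *in, unsigned char *out,
                             int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // this thread's pixel row
    if (x < width && y < height) {
        uchar4 p = in[y * width + x];                // read one RGBA pixel
        out[y * width + x] =
            (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}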

With DirectX 10 (2007), unified shaders became a reality. Technology had moved on enough that it was possible to implement a single, sufficiently flexible compute architecture that could handle pixel, vertex, and geometry shading in one. For example, the GeForce 8800 GTX, launched at the end of 2006, had almost 700 million transistors, and supported DirectX 10.

By this point, research into utilizing GPGPU capabilities was starting to return practical solutions. The first stabs at utilizing the matrix capabilities of a GPU had begun back in 2001, with the first hardware implementations of shaders available at the time. Around 2005, the first practical use that ran faster on a GPU than a CPU was widely available—it was a matrix operation called LU factorization.
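LU factorization is a good illustration of why a GPU suits this kind of work: at each step, almost all the arithmetic is one big, uniform update of a trailing submatrix, which can be spread across thousands of threads. The rough sketch below expresses that idea in modern CUDA for clarity (the early research predates CUDA and drove the graphics pipeline directly); kernel names are ours, and pivoting and error handling are omitted.

#include <cuda_runtime.h>

// Right-looking LU factorization without pivoting, on an n-by-n row-major matrix.
// Step k: scale column k below the diagonal, then apply a rank-1 update to the
// trailing submatrix. Nearly all the work is in the second, embarrassingly
// parallel kernel.
__global__ void scale_column(float *A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (i < n) A[i * n + k] /= A[k * n + k];
}

__global__ void update_trailing(float *A, int n, int k) {
    int j = blockIdx.x * blockDim.x + threadIdx.x + k + 1;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y + k + 1;  // row
    if (i < n && j < n) A[i * n + j] -= A[i * n + k] * A[k * n + j];
}

// Host-side loop driving the two kernels (d_A already lives on the GPU).
void lu_factorize(float *d_A, int n) {
    for (int k = 0; k < n - 1; ++k) {
        scale_column<<<(n + 255) / 256, 256>>>(d_A, n, k);
        dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
        update_trailing<<<grid, block>>>(d_A, n, k);
    }
    cudaDeviceSynchronize();
}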

DirectX 11 was released as part of Windows 7 in 2009, although technical previews were available from mid-2008. From our point of view, the interesting inclusion was compute shaders, a new specification designed to let non-graphical applications access and use the GPU’s resources.

As we’ve alluded to, when shaders first appeared in hardware, they were simple enough that developers were willing to hand-write routines in assembly code for them. Even so, that’s far from ideal, and as the complexity of the hardware shader implementations increased, so did the need for software abstraction, which is the fancy way of saying that we needed high-level language support.

First to the API party was Nvidia, with its canny CUDA, released to the public in June 2007. It’s possible that your only encounter with CUDA is in connection with the other Nvidia technology, PhysX—remember when you could buy a separate add-in card to speed up your game’s physics? While PhysX started as its own technology, GPGPU capabilities swallowed it up—that’s just one example of a task that’s become GPGPU-accelerated. CUDA largely started with one smart chap called Ian Buck, whom Nvidia went on to employ as director of its GPGPU division. At Stanford University, he’d been developing a PhD project called BrookGPU, which was defining a language to program GPUs—oddly enough, this became the basis of OpenCL.

At Nvidia, he took the chance to revisit Brook and tackle its limitations, the key one being restricted memory access patterns. The Nvidia C for CUDA extensions threw away those limitations, and provided a huge pool of threads that could access the GPU memory any way the coder wanted, with a full implementation of C’s language semantics. CUDA provides a full development toolkit for programming Nvidia’s CUDA cores, including a compiler, debugger, and libraries. It supports C, Fortran for science projects, Java, and Python. It also offers APIs for other GPGPU frameworks, such as OpenCL and Microsoft’s DirectCompute.
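To give a flavor of what that looks like in practice, here is a minimal, self-contained CUDA C sketch: a kernel that adds two arrays, plus the host-side code that allocates GPU memory, copies data over, launches a grid of threads, and copies the result back. It builds with Nvidia’s nvcc compiler (for example, nvcc add.cu -o add); the names and sizes are ours, chosen purely for illustration.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// The kernel: ordinary C-style code, but launched across a grid of threads,
// each of which picks out its own element to work on.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                 // one million floats
    size_t bytes = n * sizeof(float);

    // Plain host allocations.
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Device allocations and copies, via the CUDA runtime API.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // Launch: enough 256-thread blocks to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);           // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}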

Nvidia really powered ahead in the market, and was the go-to technology in the industry when it came to GPGPU compute power. Being first to market with its own top-flight implementation of a parallel computing platform really gave Nvidia a head start over everyone else—it’s a strategy at which Nvidia excels.

In contrast, there’s OpenCL (Open Computing Language). This is the open industry standard for parallel computing that was set up by Apple—oddly enough, at Apple’s 2018 developer conference, the company announced it will be dropping support for OpenGL and OpenCL on all Apple OSes—and is supported by everyone (bar Apple now), including Nvidia. Of more direct interest to Maximum PC readers is the fact that it’s AMD’s choice of platform for its stream processors.

OpenCL was released a little over two years after CUDA, in August 2009. So, it came late to the party, but had the lofty goal of creating programs that would run on all devices, including standard CPUs, GPUs, DSPs, FPGAs, and more. It’s based around an implementation of C/C++, with APIs available for other languages, including Python, Java, and .Net. This flexibility means OpenCL is even available for certain Android 7.1 devices, and developers are able to balance loads over multiple available GPU and CPU threads.

As you might imagine, the later appearance of OpenCL and more open design goals did impact its early performance against CUDA. A study at Delft University from 2011 showed that OpenCL suffered a 30 percent performance hit over CUDA implementations, although hand-tuning could mitigate the difference to a degree; other studies from the same period put the hit as high as 67 percent. However, things have progressed, and it appears that more recent implementations running on AMD cards have managed to close down this performance gap—a 2017 report (www.blendernation.com/2017/04/12/blendercycles-opencl-now-par-cuda) for Blender 2.79 showed OpenCL on a par with CUDA performance, although CUDA is still the fastest option for Nvidia hardware.

AROUND THE BLEND

At this point, you might be wondering what GPGPU has got to do with you—beyond the same hardware painting pretty 3D scenes on your monitor at 120fps, that is. It’s certainly taken a good decade or so, but coders are now well aware of GPGPU capabilities, and the technologies to leverage them. You’ll find a range of applications out there that can take full advantage of not just Nvidia cards with their CUDA technology, but also AMD and even potentially Intel GPUs (though often these are no faster than the CPU itself) utilizing OpenCL.

Because of the head start Nvidia built for itself, you do tend to find more CUDA software options available, but you’ll discover that anything developed in the last five years should be offering the same capabilities for OpenCL, too. So, let’s take a quick look at a few interesting software options you can play with now.

To kick off, we’re going to suggest the hugely impressive open-source project Blender. It’s a prime example of a key use for GPGPU in terms of ray tracing, while the open-source nature of Blender has meant there is now a burgeoning market in low-cost cloud render farms, which is one of the businesses driving Nvidia’s data center division.

Blender is a monstrous package to master, but you can fire off a test render easily enough. A key strategy of the Blender Foundation is to provide all art resources on an open license for everyone to enjoy—grab and install Blender from www.blender.org/download, then scroll down the page, and head to “Demo Files,” where you can also grab the famous BMW Benchmark file. If Blender is installed, you can just extract and run the appropriate CPU or GPU file—the main difference is the optimized tile size used. In Blender, press F12 to start a render. If you want to fiddle with the right-hand “Sampling” palette options, we’d reduce “Samples” from 35 to 15—this reduces quality for speed. Under “Performance,” you’re able to adjust tile size, too; 32–64 is ideal for CPUs, while 256 is better for GPUs.

For reference, we found our CPU was almost four times slower than the GPU here (CPU 4:39, GPU 1:22), but your results will inevitably vary.

FRACTAL FUN

The long-standing OpenCL program Mandelbulber (http://mandelbulber.com) is a 3D fractal explorer that supports OpenCL on AMD and Nvidia hardware. Download the latest 64-bit build from SourceForge, and run it. Click the “Render” button to see how slow things are on your processor. Use “File > Program Preferences > OpenCL” to select a suitable platform (AMD or Nvidia) and your GPU device. On Nvidia hardware, you might initially see an error, as it defaults to AMD. Hit “Render” again and things should be far quicker—if the colors go an odd purple, change the OpenCL mode to “Medium” or “Full,” rather than “Fast,” over in the right-hand Navigation palette. We’re not going to try explaining any of this program—it’s just a bit of fun flying around abstract 3D fractals—but if you’re interested, there’s a whole community around it at http://fractalforums.org.

As we’ve mentioned, GPGPU is fantastic at tackling repetitive math-intensive problems. One area that fits the bill perfectly is the field of encryption. Generally speaking, there’s no need for GPGPU acceleration for encryption, as that’s taken care of by CPU-based AES hardware acceleration, but one area that can take full advantage is hacking encrypted files.

The most well-known software in this area is the open-source program John the Ripper (www.openwall.com/john), but being contrarians, we’re going to look at the Windows (and Android) based Hash Suite (http://hashsuite.openwall.net), developed by a John the Ripper contributor. Both are designed to brute-force crack password-protected files by guessing the password (the problem being that the permutations for any non-trivial-length password are vast). If you fire up Hash Suite and open its top-left menu, there’s a benchmark feature; the program automatically takes advantage of AMD or Nvidia hardware. You’ll see a general 20-fold increase in speed over using your processor, at least on our ancient Core i5-2500K versus GTX 950 system.
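Why does this kind of work map so well onto a GPU? Because every candidate password can be hashed and checked independently, one per thread. The toy sketch below shows the shape of that search in CUDA; it uses an FNV-1a hash over four lowercase letters purely as a stand-in (real crackers implement the actual NTLM, MD5, and similar formats and search vastly larger keyspaces), and all names here are ours.

#include <cstdio>
#include <cuda_runtime.h>

// Toy stand-in hash; real tools implement NTLM, MD5, SHA, and so on.
__host__ __device__ unsigned int fnv1a(const char *s, int len) {
    unsigned int h = 2166136261u;
    for (int i = 0; i < len; ++i) { h ^= (unsigned char)s[i]; h *= 16777619u; }
    return h;
}

// One thread per candidate: decode this thread's index into a four-letter
// lowercase password, hash it, and report a match.
__global__ void crack(unsigned int target, int *found) {
    const int keyspace = 26 * 26 * 26 * 26;            // 456,976 candidates
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= keyspace) return;

    char c[4];
    int t = idx;
    for (int i = 0; i < 4; ++i) { c[i] = 'a' + (t % 26); t /= 26; }
    if (fnv1a(c, 4) == target) *found = idx;
}

int main() {
    unsigned int target = fnv1a("maxi", 4);  // pretend this came from a captured hash
    int *d_found, found = -1;
    cudaMalloc(&d_found, sizeof(int));
    cudaMemcpy(d_found, &found, sizeof(int), cudaMemcpyHostToDevice);

    crack<<<(26 * 26 * 26 * 26 + 255) / 256, 256>>>(target, d_found);
    cudaMemcpy(&found, d_found, sizeof(int), cudaMemcpyDeviceToHost);

    if (found >= 0) {                        // decode the winning index back to text
        char pw[5] = {0};
        for (int i = 0; i < 4; ++i) { pw[i] = 'a' + (found % 26); found /= 26; }
        printf("Cracked: %s\n", pw);
    }
    cudaFree(d_found);
    return 0;
}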

GPGPU tech is currently exploding, perhaps not so much in homes—where its main focus remains powering 3D graphics within PCs, phones, and consoles—but in data centers, high-performance computing, research labs, and beyond. That’s going to benefit all of us, as Nvidia, Intel, and AMD pour even more research effort into enhancing these already powerful processors, and developers get to leverage them with CUDA and OpenCL.

Boring, repetitive math is the bread and butter of a GPGPU.

Generating fractals is a cakewalk for GPGPUs.

Ray tracing is ripe for a bit of GPGPU acceleration, and Blender is available for all.

Math did that!

In a small warehouse lives 3 exaops of computing power. Unbelievable. This is what the inside of a modern deep-learning supercomputer node looks like. Spot the Tesla V100s!

Nvidia’s dedicated data center GPU, the Tesla V100, is capable of 125 teraflops.

Intel hopes to crack the deep-learning market in 2019, with its Nervana processor.
