Start sav­ing: Nvidia re­de­fines the high end

Maximum PC - FRONT PAGE - By Chris Lloyd

YOU EX­PECT a press re­lease to make am­bi­tious claims, but the one from Nvidia for its Tur­ing GPU ar­chi­tec­ture is par­tic­u­larly strik­ing. The head­line tells us that it will “fun­da­men­tally change com­puter graph­ics,” and the com­pany’s CEO, Jensen Huang, says that it is “Nvidia’s most im­por­tant in­no­va­tion in com­puter graph­ics in more than a decade.” Bold talk, but there’s sub­stance be­hind it. Tur­ing is more than a gen­er­a­tional step: It brings gen­uinely fresh tech­nol­ogy to the graph­ics card, with real-time ray trac­ing as star billing.

Tur­ing fol­lows on from Pas­cal, which has pow­ered ev­ery­thing from the top-tier GTX 1080 Ti down to en­try-level GTX 1050s. Nvidia also has its Volta ar­chi­tec­ture, re­leased last year, with its ma­chine-learn­ing cores, but this has been kept to spe­cial­ist mar­kets, apart from the eye-wa­ter­ing $2,999 Ti­tan V. Tur­ing is aimed at the mass mar­ket, al­though it’s start­ing at the top. It’ll run along­side Pas­cal-pow­ered cards for the fore­see­able fu­ture.

The Turing architecture has similar clock speeds to Pascal, but increases the number of CUDA (Compute Unified Device Architecture) cores, the GPU’s small parallel processing units, by 15–20 percent. There are improvements to the SMs (Streaming Multiprocessors) and to memory bandwidth, plus a handful of new graphics functions, but the cherry on top is the addition of dedicated RT (ray-tracing) and AI hardware.

One no­tice­able change is that Nvidia has launched three Tur­ing GPUs. Pas­cal cards are built on the same dies, with sec­tions en­abled and dis­abled to suit. Tur­ing cards have launched with three sep­a­rate dies: the TU102, TU104, and TU106. The smaller TU106 is in the RTX 2070 card, the TU104 in the RTX 2080, and the big TU102 in the halo RTX 2080 Ti, al­though not yet in a fully func­tional form. Right, let’s start dig­ging around in the tran­sis­tors to see what there is to elicit such a buzz. Be warned: There are co­pi­ous acronyms to come.

The 754mm² TU102 is 60 percent larger than the biggest Pascal GPU, with 55 percent more transistors. It’s built around six GPCs (Graphics Processing Clusters), each of which has six TPCs (Texture Processing Clusters), along with a dedicated rasterization engine. Each TPC has a PolyMorph Engine (a fixed-function geometry pipeline) and two SMs. Inside each SM are the cores: 64 CUDA, eight Tensor, and one RT. All clear?

There are 12 32-bit GDDR6 memory controllers, giving a 384-bit bus in all. Each contains a cluster of eight ROPs (Render Outputs). In the RTX 2080 Ti card, one of these controllers is disabled, leaving 11 controllers and 88 ROPs. The layout is not uniform across the dies: the TU104 has eight rather than twelve SMs per GPC, for example, and both it and the TU106 have only eight memory controllers.
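The unit counts quoted above multiply out neatly, so here is a quick back-of-the-envelope tally in Python. It is only a sketch built from the figures in the text; the constant and function names are our own, and the SM total is for a fully enabled TU102, not for any shipping card.

```python
# Tally of a fully enabled TU102 die, using only the counts quoted in the text.
# Names are illustrative; they do not come from any Nvidia tool or API.
GPCS = 6
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
CORES_PER_SM = {"cuda": 64, "tensor": 8, "rt": 1}

CONTROLLER_WIDTH_BITS = 32
ROPS_PER_CONTROLLER = 8


def full_die_totals():
    """Return SM and core counts for a die with every unit enabled."""
    sms = GPCS * TPCS_PER_GPC * SMS_PER_TPC
    totals = {"sms": sms}
    for kind, per_sm in CORES_PER_SM.items():
        totals[kind] = sms * per_sm
    return totals


def memory_config(active_controllers):
    """Return bus width and ROP count for a number of active controllers."""
    return {
        "bus_width_bits": active_controllers * CONTROLLER_WIDTH_BITS,
        "rops": active_controllers * ROPS_PER_CONTROLLER,
    }


if __name__ == "__main__":
    print(full_die_totals())   # every unit enabled
    print(memory_config(12))   # full die
    print(memory_config(11))   # RTX 2080 Ti: one controller disabled
```

With one controller switched off, the same arithmetic gives the 88 ROPs quoted for the RTX 2080 Ti, on a 352-bit bus.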

At the heart of every GPU is a fundamental building block, called a Streaming Multiprocessor by Nvidia, and a Compute Unit by AMD. It’s in here that the majority of the hard work is done. The Turing architecture SM contains schedulers, graphics cores, cache, texturing units, and more.

Starting at the top are the CUDA cores; Turing has 64 per SM rather than Pascal’s 128. Over the past decade, Nvidia has had anywhere from 32 to 192 per SM. The company claims that the new architecture is more efficient with 64. Turing adds FP16 (16-bit floating point) support, typically twice as fast as FP32 operations, although not as commonly used in games. We also get two 64-bit CUDA cores per SM, for compatibility support. Volta has 32 of these in each SM, but unless you’re modeling a galaxy, that die space is better used elsewhere.

New for Turing is a dedicated integer pipeline that can run concurrently with the floating-point cores. Nvidia estimates that games have about 35 integer instructions for every 100 floating-point instructions. Previously, all went through the same pipeline, so that’s a theoretical improvement of over a third. It makes the GPU core more like a CPU core, which can execute two instructions per clock. Turing has one integer unit for every CUDA core.
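To see where the “over a third” comes from, here is a deliberately idealized model in Python. It assumes one instruction per issue slot, perfect overlap between the two pipelines, and no dependencies between instructions, so real shaders will fall short of it. The function name is ours.

```python
def issue_slots(fp_instructions, int_instructions, concurrent):
    """Issue slots needed for a mix of floating-point and integer work.

    Idealized: a shared pipeline handles the two kinds one after another,
    while concurrent pipelines overlap them completely.
    """
    if concurrent:
        return max(fp_instructions, int_instructions)
    return fp_instructions + int_instructions


# Nvidia's estimate from the text: 35 integer per 100 floating-point instructions.
shared = issue_slots(100, 35, concurrent=False)
dual = issue_slots(100, 35, concurrent=True)
speedup = shared / dual

if __name__ == "__main__":
    print(shared, dual, speedup)
```

The shared pipeline needs 135 slots where the split one needs 100, which is the 35 percent theoretical gain.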

Much of what we’ve cov­ered so far is present in Pas­cal, or a sub­tle evo­lu­tion of what we al­ready have, but now we get to the gen­uinely in­no­va­tive stuff. In­side each SM we have eight Ten­sor cores and one RT core. The RT core works on the head­line abil­ity of Tur­ing cards: real-time ray trac­ing. This may well pro­duce jaw-slack­en­ing re­sults, but the math is fiendish.

The most commonly used ray-tracing algorithm relies on a BVH (Bounding Volume Hierarchy). This encapsulates objects in large, simple volumes. At its simplest, an entire 3D object may be enclosed in a single box. If a ray doesn’t intersect with this box, no more work is required on that ray for that object. If it does, we delve into a hierarchical tree of increasing detail, as the object is broken down into smaller and smaller parts. Whenever the ray intersects a volume, that branch of the tree is followed. Thus we can get to the intersection point by checking only a fraction of the points on an object.

These calculations can be done elsewhere, on a CPU or GPU, but even with BVH optimization, it’s a mammoth task that clogs whatever is assigned to it. Enter our RT cores. Dedicated to BVH calculations, these can run through the task about 10 times faster than the CUDA cores, and leave them free to work on other things while doing it.
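For the curious, the box-testing idea behind BVH can be sketched in a few lines of Python. This is a toy software version, not a description of how the RT cores work internally. It assumes axis-aligned boxes, uses the standard “slab” test for ray-box intersection, and treats a leaf box as a stand-in for real triangles. The `Node` class, the function names, and the tiny scene are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3                      # minimum corner of the bounding box
    hi: Vec3                      # maximum corner of the bounding box
    children: List["Node"] = field(default_factory=list)
    primitive: Optional[str] = None   # set on leaves; stands in for geometry


def ray_hits_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: does the ray (t >= 0) pass through the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this pair of planes: inside or outside for good.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def traverse(node: Node, origin: Vec3, direction: Vec3, hits: list, stats: dict):
    """Walk the tree, skipping every subtree whose box the ray misses."""
    stats["box_tests"] += 1
    if not ray_hits_box(origin, direction, node.lo, node.hi):
        return                    # whole branch pruned with a single test
    if node.primitive is not None:
        hits.append(node.primitive)
        return
    for child in node.children:
        traverse(child, origin, direction, hits, stats)


def leaf(name, lo, hi):
    return Node(lo, hi, primitive=name)


# A seven-node toy scene: one root box, two halves, four leaves.
left = Node((0, 0, 0), (2, 2, 2), [leaf("A", (0, 0, 0), (1, 2, 2)),
                                   leaf("B", (1, 0, 0), (2, 2, 2))])
right = Node((2, 0, 0), (4, 2, 2), [leaf("C", (2, 0, 0), (3, 2, 2)),
                                    leaf("D", (3, 0, 0), (4, 2, 2))])
root = Node((0, 0, 0), (4, 2, 2), [left, right])


def cast(origin, direction):
    hits, stats = [], {"box_tests": 0}
    traverse(root, origin, direction, hits, stats)
    return hits, stats["box_tests"]
```

A ray fired straight down the z axis through leaf A, `cast((0.5, 1.0, -5.0), (0.0, 0.0, 1.0))`, never looks inside the right half of the scene, and a ray that misses the root box costs exactly one test. A real renderer would follow each leaf hit with ray-triangle tests and keep only the nearest intersection.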

It is not possible to say precisely how many rays the RT cores can calculate per second, as it depends on the BVH structure, so performance figures are an approximation. Nvidia quotes the RT cores as capable of over 10 Giga Rays per second (GR/s) on the RTX 2080 Ti, and says that each GR/s is equivalent to about 10 TFLOPS of computation.

Us­ing a sin­gle ray per pixel can re­sult in dozens or hun­dreds of cal­cu­la­tions, and bet­ter re­sults are achieved with more rays. If a scene has mul­ti­ple re­flec­tive sur­faces, things get ex­tremely com­pli­cated. Tra­di­tional ras­ter­i­za­tion has got pretty good over the last 20 years, but cer­tain things still present prob­lems, such as re­al­is­tic light­ing, shad­ows, and re­flec­tions.

Reflections can be faked, but mirrors take serious work, as they require a complete new projection into the game world. Shadow maps can produce decent results, but need a lot of memory to get right, as well as careful placement of lights. Another problem is ambient occlusion, which describes how exposed an object is to ambient lighting. Ray tracing masters all these.

Pow­er­ful as Tur­ing is, we’re still a long way off the re­quired horse­power for true real-time ray trac­ing. So, like any good GPU tech, we cheat. Hy­brid ren­der­ing uses tra­di­tional ras­ter­i­za­tion tech­nol­ogy to ren­der all the poly­gons in a frame, then com­bines the re­sult with se­lected ray-traced shad­ows, re­flec­tions, and/or re­frac­tions. The ray trac­ing ends up be­ing much less com­plex, al­low­ing for higher frame rates. There’s a bal­anc­ing act be­tween qual­ity and per­for­mance. Cast­ing more rays for a scene can im­prove the re­sult at the cost of frame rate, and vice versa.

Tak­ing the GeForce RTX 2080 Ti and its 10GR/s as a base­line, if we’re ren­der­ing a game at 1080p, that’s roughly two mil­lion pix­els, and run­ning at 60fps means 120 mil­lion pix­els a se­cond. Do­ing the math, a game could man­age 80 rays per pixel if the GPU is do­ing noth­ing else. Move to 4K, and we drop to 20 rays per pixel.
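The ray budget above is easy to check for other resolutions and frame rates. Here is a minimal Python sketch of the same arithmetic; it assumes the full 10 GR/s is available every second and that the GPU is doing nothing else, which no real game will manage. The function name is ours.

```python
def rays_per_pixel(rays_per_second, width, height, fps):
    """Rays available per pixel per frame if the whole budget goes on one image."""
    pixels_per_second = width * height * fps
    return rays_per_second / pixels_per_second


RTX_2080_TI_RAYS_PER_SECOND = 10e9   # Nvidia's quoted 10 GR/s

budget_1080p = rays_per_pixel(RTX_2080_TI_RAYS_PER_SECOND, 1920, 1080, 60)
budget_4k = rays_per_pixel(RTX_2080_TI_RAYS_PER_SECOND, 3840, 2160, 60)

if __name__ == "__main__":
    print(round(budget_1080p), round(budget_4k))
```

At 60fps this gives roughly 80 rays per pixel at 1080p and 20 at 4K, matching the figures above.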

Here is where those machine-learning Tensor cores can help. Nvidia’s DLSS (Deep Learning Super Sampling) is trained on a batch of game screenshots on Nvidia’s own supercomputer until it learns to upscale and antialias them effectively. The results are packaged into a neat file for the Tensor cores to run locally.

This enables games to render at lower resolutions without antialiasing, then use the Tensor cores to upscale and antialias the image. Or it can simply serve as a faster alternative to traditional TAA (Temporal Anti-Aliasing) without upscaling. The results are better, too, lacking the transparency and blurring artifacts that TAA exhibits.

Nvidia has shown some in­ter­est­ing de­mos of games ren­dered at 1080p, then up­scaled to 4K, with­out the per­for­mance hit you’d ex­pect. DLSS has gath­ered more sup­port than ray trac­ing, too, with 25 games in the pipe­line.

Denois­ing is an­other po­tent tool for ray trac­ing. Many path-trac­ing sys­tems are good at cre­at­ing quick but coarse scenes. Ma­chine learn­ing can re­duce the coarse, speck­led ef­fect, with­out the huge com­pu­ta­tional over­head of run­ning a more com­plete ray-traced image. Pixar re­port­edly used such a sys­tem to rad­i­cally re­duce ren­der­ing times on its an­i­ma­tions.

The Tensor cores don’t work concurrently with the rest of the GPU, so while they’re busy, the rest of the silicon is fairly idle. This limits their use. Nvidia suggests that DLSS and denoising could run at 20 percent of the total frame time.

Help­ing to boost graph­ics per­for­mance is pushed as the Ten­sor cores’ main ben­e­fit, but there’s more they can do. Nvidia and Mi­crosoft have cre­ated the DirectX Ma­chine Learn­ing API. Fu­ture games could use all sorts of AI tricks, such as voice con­trol and more de­vi­ous AI op­po­nents.

Th­ese changes have meant that com­par­ing per­for­mance be­tween dif­fer­ent gen­er­a­tions of GPU has be­come com­pli­cated. For ex­ist­ing games that don’t use the new fea­tures, the old FP32 TFLOPS fig­ure is a rea­son­able mea­sure, but once you add hy­brid ray trac­ing, it doesn’t fit the job, so Nvidia has de­vised a new met­ric: RTX-OPS.

Obviously, this measurement will favor RTX GPUs. It assumes a game that makes full use of ray-tracing effects, with the workload over a frame distributed as follows: FP32 shading (what games currently spend their time on) runs for 80 percent of the time, with concurrent INT32 shading for 35 percent of that, or 28 percent of the frame. Ray tracing is used for 40 percent of the time, and the output is processed by the Tensor cores for 20 percent of the time. This final polish is the only thing running at that point, as the Tensor cores need the GPU’s full attention.

Using this formula, the Founders Edition of the RTX 2080 Ti scores 78 RTX-OPS. How would a GTX 1080 Ti fare? It lacks the RT and Tensor cores, and can’t run integer and floating-point work together, so its RTX-OPS score would simply equal its TFLOPS score of 11.3. This looks lackluster, but it’s apples and oranges: The RTX-OPS value is only of interest if you are running a hybrid ray-tracing engine.
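The RTX-OPS sum can be reproduced in a few lines of Python. Treat this as a sketch of the weighting described above, not Nvidia’s own tool. The weights and the ray-tracing figure (10 GR/s at about 10 TFLOPS each) come from the text. The FP32, INT32, and Tensor peak rates are an assumption on our part, taken from Nvidia’s published Founders Edition specifications, not from this article.

```python
# Share of the frame each unit is assumed to be busy, as described in the text.
WEIGHTS = {
    "fp32": 0.80,
    "int32": 0.35 * 0.80,   # 35 percent of the FP32 time
    "rt": 0.40,
    "tensor": 0.20,
}

# Peak rates in tera-operations per second for the RTX 2080 Ti Founders Edition.
# ASSUMPTION: fp32, int32, and tensor are Nvidia's published spec figures,
# not numbers given in the article.
RTX_2080_TI_FE = {
    "fp32": 14.2,
    "int32": 14.2,          # one integer unit per CUDA core
    "rt": 10 * 10,          # 10 GR/s x ~10 TFLOPS per GR/s (from the text)
    "tensor": 113.8,
}


def rtx_ops(peaks):
    """Weighted sum of peak rates, using the frame-time shares above."""
    return sum(WEIGHTS[unit] * peaks[unit] for unit in WEIGHTS)


if __name__ == "__main__":
    print(round(rtx_ops(RTX_2080_TI_FE)))
```

With those inputs, the weighted sum rounds to the quoted 78. More than half of it comes from the ray-tracing term alone, which is why the metric says little about games that never touch the RT cores.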

To keep its new GPUs fed with data, Nvidia has moved its Turing cards to GDDR6 memory, and reworked the cache and memory subsystem. The L1 cache’s bandwidth has been doubled, and it can now be configured as either 32K of L1 and 64K of shared memory, or 64K of L1 and 32K of shared memory. The L2 cache has been doubled in size, too.

The faster clock speeds of GDDR6 over GDDR5 help, but Tur­ing goes fur­ther. Pas­cal al­ready used loss­less me­mory com­pres­sion, and Tur­ing has im­proved on this. Nvidia hasn’t pro­vided de­tails, but claims that the larger caches and im­proved com­pres­sion have in­creased the ef­fec­tive band­width by 20–35 per­cent over Pas­cal.

Tur­ing in­tro­duces a num­ber of new func­tions, too. How much th­ese are used is an­other mat­ter. Nvidia’s VXAO (Voxel Am­bi­ent Oc­clu­sion) has, as far as we know, only been used in two games in two years.

Foremost of the new features is Mesh Shading, the next iteration of vertex, geometry, and tessellation shaders. The idea is to move LOD (Level of Detail; we weren’t kidding about the number of acronyms) calculations from the CPU to the GPU. There are some impressive demos of this. It needs to be implemented at the API level, so must be built into DirectX/Vulkan before it can become widespread.

Next is VRS (Variable Rate Shading), which is about moving processing power to where it can have a tangible effect. It devotes more shading work to important areas, and less where the results are negligible. Nvidia suggests a 15 percent performance boost when used effectively. It can be used as part of MAS (Motion Adaptive Shading), where fast-moving objects need less work as they look blurred anyway, and CAS (Content Adaptive Shading), where effort is focused on primary content, such as the car in a driving game.

Nvidia talked about two further features of Turing: MVR (Multi-View Rendering), an enhanced version of the SMP (Simultaneous Multi-Projection) that was already in Pascal, and TSS (Texture Space Shading). Where SMP focused on two views and VR applications, MVR can do four views per pass, and removes some view-dependent attributes. It should help improve VR applications, especially with some of the newer VR headsets that have a wider field of view. TSS is another method for reducing the shading required, and really only of interest if you are writing a game engine.

And finally, we have a boost to video encoding and decoding. Pascal wasn’t bad at this, but the results weren’t always as good as the x264 Fast profile running on a CPU. Turing aims to deliver better than x264 Fast, with almost no CPU load. If you’re streaming at 1080p, it won’t matter too much, as Pascal or a good CPU can cope, but move to 4K, and you risk dropped frames. Turing aims to handle 4K encoding as well, again with almost no CPU load.

Are we impressed? It is difficult not to be. The Turing architecture is a masterpiece from the leading graphics technology company. It’s the result of 10,000 engineering-years of effort (says Nvidia). Not only do we get a decent step up in the classic GPU architecture, we get features that change the way graphics are realized. The hybrid approach, fusing rasterization with real-time ray tracing, and adding AI-powered processing, enables something special. The road to full real-time ray tracing is here, years before most expected it.

We have the hardware; now we need the support. The current list of games is small, a dozen or so, but emerging tech generally starts small. Ray tracing is a better way to render scenes, and the prospect of using it in games, however partially, is too tantalizing to resist. Big names have lined up behind Turing, including Epic, with its Unreal Engine, and EA’s Frostbite. Microsoft has plumbed ray tracing into DirectX, giving the all-important route to standardization.

At some point in the next few months, or maybe more, you are go­ing to see a game that pushes the graph­ics to such a point that you’re go­ing to want a card that can run it, and want it very badly.

The sil­i­con heart of Tur­ing in all its naked glory. Am­bi­tious, but this is go­ing to change ev­ery­thing, even­tu­ally.

Global ray-traced lighting generates natural lighting and pixel-perfect shadows, with proper umbra and penumbra; here at work in Metro Exodus.

Ansel RTX is a neat screen grab util­ity that en­ables you to crank up the ray-trac­ing horse­power when you take a screen­shot. It’ll also do clever AI up­scal­ing in non-RTX games, push­ing screen­shots to 8K.

Who doesn’t like a block diagram of new silicon? Here’s the TU102 core laid out and color-coded, showing its tightly integrated design.

The GeForce RTX 2080 Ti Founders Edi­tion—yours for $1,199, and lim­ited to two per cus­tomer. The most ex­pen­sive GeForce card ever, and the fastest.
