PC PowerPlay

Nvidia Ampere Architecture Deep Dive

Put on your swimsuit, because we’re about to get wet


IT’S BEEN A FEW YEARS since Nvidia unveiled the Turing architecture, ushering in the new era of ray-traced graphics for games. OK, let’s be frank: Ray-tracing adoption in games has been sluggish and often underwhelming. More accurate reflections, slightly improved shadows? Yawn. We want it all: Reflections, refractions, shadows, global illumination, caustics, ambient occlusion! The problem is that each of those effects increases the burden placed on the ray-tracing hardware and your GPU, so developers often picked one or a few effects at most. But that all changes with Nvidia’s Ampere architecture.

Take everything that was great about Turing and basically double down, and you get an idea of what Nvidia has planned for Ampere. We have a full review of the GeForce RTX 3080 Founders Edition, the fastest graphics card ever to grace the inside of your PC. Well, sort of – we’ll be looking at the GeForce RTX 3090 Founders Edition next month. But for most gamers, even those with deep pockets, the RTX 3080 makes more sense. It’s perhaps the largest generational improvement in performance ever from Nvidia, and it makes yesterday’s RTX 2080 Ti look like an overpriced has-been. And it all comes down to the new, superior Ampere architecture.

Let’s start from the top, and cover all the major changes. We can’t possibly cover every item that’s changed, though if you’re really interested in learning more, Google “Nvidia Ampere Whitepaper” and you can get the raw, undistilled version. We’re also going to confine our discussion to just the GA102/GA104 “gaming” GPUs – there’s also a new GA100 chip used for supercomputers and deep learning that’s quite different from the consumer version of Ampere.

LITHOGRAPHY: THE FOUNDATION OF EVERY CHIP

Every chip design starts by choosing how the part will eventually be made. Most of Nvidia’s GPUs of the past two decades have come from TSMC (Taiwan Semiconductor Manufacturing Company), but Nvidia has also used Samsung for some parts. The GA102 and GA104 chips destined for the RTX 30-series cards use Samsung 8N, an “Nvidia-optimized” 8nm process technology that’s basically Samsung’s 10LP++++. That’s almost as many “+” revisions as Intel’s 14nm node! This brings some good improvements compared to Turing (TSMC 12FFN), allowing for more transistors in a smaller space.

However, let’s be clear: TSMC’s N7 lithography is undoubtedly better overall, if you’re just looking at performance. It’s also in much higher demand, which is almost certainly why Nvidia opted for Samsung. TSMC is currently busy taking orders from AMD for Zen 2, the upcoming Zen 3, RDNA 1, the upcoming RDNA 2, and Nvidia’s own GA100 chips – plus Apple and various other customers. Word is that TSMC is in such high demand that it can charge a premium, and it does.

Just from a high level, however, Samsung 8N is still a healthy jump for Nvidia. Consider the previous-generation TU102 GPU used in the RTX 2080 Ti. It has 18.6 billion transistors crammed into a massive 754mm² chip. The Ampere GA102 by comparison has 28.3 billion transistors in a 628mm² chip. It’s not entirely apples to apples, since the various types of logic used on a chip – GPU cores, cache, memory controllers, video decoders, etc. – have different densities, but at a high level Nvidia is putting 52 percent more transistors into 17 percent less space. That’s a relative density improvement of over 1.8x.
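If you want to check that math yourself, here’s the back-of-the-envelope version in a few lines of Python, using only the die sizes and transistor counts quoted above:

```python
# Back-of-the-envelope density comparison using the figures quoted above.
tu102 = {"transistors_bn": 18.6, "die_mm2": 754}   # Turing, TSMC 12FFN
ga102 = {"transistors_bn": 28.3, "die_mm2": 628}   # Ampere, Samsung 8N

# Millions of transistors per square millimetre.
density_tu102 = tu102["transistors_bn"] * 1000 / tu102["die_mm2"]
density_ga102 = ga102["transistors_bn"] * 1000 / ga102["die_mm2"]

print(f"TU102: {density_tu102:.1f} M transistors/mm^2")            # ~24.7
print(f"GA102: {density_ga102:.1f} M transistors/mm^2")            # ~45.1
print(f"Relative density: {density_ga102 / density_tu102:.2f}x")   # ~1.83x
```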

As another example, the GA104 chip (which will be used in the RTX 3070) packs 17.4 billion transistors into 392.5mm². Again, that’s about 1.8x the transistor density of last gen. Alternatively, Nvidia’s GA100 packs 54 billion transistors into 826mm² using TSMC N7, which is nearly 50 percent more transistors per square mm than GA102. GA100 has plenty of other differences, but it’s clear Nvidia used TSMC N7 for GA100 because it was the best “money is no object” choice for a data center class GPU.

This “simplified” block diagram illustrates just how complex modern GPUs have become.

There are some downsides to using Samsung 8N, however. The most obvious is power requirements. The GeForce RTX 3080 has a TGP (Total Graphics Power) rating of 320W, which is the highest single-GPU graphics card power level we’ve seen. The RTX 3090, meanwhile, bumps the TGP to 350W. Nvidia claims that the Ampere architecture offers up to 1.9x the performance per watt of Turing, but how it derives that number is a bit tricky.

If you take an unspecified Turing GPU and an unspecified Ampere GPU and run both at the same performance level, Turing would require 1.9x the power. However, the retail Ampere GPUs push the design to its limits, and as you move to the right on the voltage/frequency curve, efficiency drops sharply. In our testing, looking at real-world fps/watt, the RTX 3080 FE is about 33 percent faster than the RTX 2080 Ti and uses about 24 percent more power. That’s a net improvement, but nowhere near 90-percent higher efficiency. The high TGP also means that Nvidia and its partners have to put more effort into designing high-performance cooling solutions, which explains the far larger cooler designs in general.
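The gap between the marketing figure and reality is easy to quantify. A quick sketch using our measured numbers from above:

```python
# Efficiency math from our measured numbers: ~33% faster at ~24% more power.
perf_ratio = 1.33   # RTX 3080 FE fps relative to RTX 2080 Ti
power_ratio = 1.24  # RTX 3080 FE power draw relative to RTX 2080 Ti

fps_per_watt_gain = perf_ratio / power_ratio
print(f"Real-world efficiency gain: {fps_per_watt_gain:.2f}x")  # ~1.07x, not 1.9x
```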

CORE BENEFITS: FP32 TIMES TWO

With the lithography appetizer out of the way, let’s dig into the real meat of Ampere. Nvidia’s basic design structure remains similar to Turing: each GPU has clusters of GPU cores called SMs, which are grouped in pairs into TPCs (Texture Processing Clusters), which are in turn grouped into GPCs. Then there are memory controllers, a video decoding block, caches, and other miscellaneous parts. Let’s run through each of those.

The SM (Streaming Multiprocessor) is the main workhorse, and it’s gone through various configurations over the years. Each SM has four processing blocks, each of which can dispatch 32 threads (one warp) per clock. Turing had 64 FP32 (32-bit floating-point) CUDA cores per SM, plus 64 more cores for INT32 (32-bit integer) calculations, and both could be utilised concurrently. Pascal had 128 FP32/INT32 cores only – the cores could only do one or the other data type at a time. With Ampere, Nvidia builds off the Turing design while at the same time shifting a bunch of things around.

First, Nvidia sort of doubled the number of FP32 CUDA cores per SM, but it did this by adding FP32 functionality to the previous INT32 datapath. So now, Ampere has 64 dedicated FP32 cores in an SM, and then 64 FP32/INT32 cores. It can still do concurrent FP32 + INT32, just like Turing, but alternatively it can do FP32 + FP32. That means for the right workloads, Ampere has more than double the theoretical performance of Turing. (See boxout on TFLOPS.)

This has interesting ramifications for different workloads. Take a game where the instruction mix is roughly 65 percent FP32 and 35 percent INT32 – according to Nvidia, this is roughly how the average game behaves, with INT32 used for memory address calculations, texture lookups, and other less complex math. On Turing, the FP32 datapath would have ended up fully utilised, while the INT32 datapath would have only been about 50 percent utilised. With Ampere, the dedicated FP32 cores are fully utilised, but now the FP32/INT32 cores are split roughly 30/70 on FP32 vs. INT32 work.
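You can model that split with a few lines of Python. This is a deliberately simplified throughput model, assuming perfect scheduling and nothing but the 65/35 instruction mix:

```python
# Simplified throughput model for a 65% FP32 / 35% INT32 instruction mix.
# Assumes perfect scheduling; real workloads are messier.
fp32_share, int32_share = 0.65, 0.35

# Turing: one FP32 datapath + one INT32 datapath of equal width.
# The FP32 path is the bottleneck, so normalise to it running flat out.
turing_int32_util = int32_share / fp32_share
print(f"Turing INT32 path utilisation: {turing_int32_util:.0%}")  # ~54%

# Ampere: one dedicated FP32 path + one shared FP32/INT32 path.
# Solve for the shared path's FP32 fraction f so the overall mix is 65/35:
# (1 + f) / (1 - f) = 65/35  ->  f = (65 - 35) / (65 + 35) = 0.30
f = (fp32_share - int32_share) / (fp32_share + int32_share)
print(f"Shared path split: {f:.0%} FP32 / {1 - f:.0%} INT32")  # 30% / 70%
```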

Best-case, the RTX 3080 is theoretically 2.95x faster than the RTX 2080 at FP32 workloads. In practice, however, it is more like up to twice as fast in actual gaming performance.
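That 2.95x figure falls straight out of the TFLOPS math. A quick sketch, using Nvidia’s reference CUDA core counts and boost clocks (which aren’t quoted above):

```python
# Theoretical FP32 throughput: cores x boost clock x 2 ops (an FMA counts as two).
def tflops(cuda_cores, boost_ghz):
    return cuda_cores * boost_ghz * 2 / 1000

rtx_3080 = tflops(8704, 1.71)  # all 8704 cores doing FP32 in the best case
rtx_2080 = tflops(2944, 1.71)  # reference boost clock

print(f"RTX 3080: {rtx_3080:.2f} TFLOPS")                # ~29.77
print(f"RTX 2080: {rtx_2080:.2f} TFLOPS")                # ~10.07
print(f"Theoretical ratio: {rtx_3080 / rtx_2080:.2f}x")  # ~2.96x
```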

ARE YOU FEELING A LITTLE… TENSOR?

Ampere packs in four 3rd-generation Tensor cores per SM, while Turing included eight 2nd-generation Tensor cores per SM. (Incidentally, 1st-gen Tensor cores were only present in the Volta architecture, which showed up in the Titan V as well as supercomputer V100 GPUs.) That means the RTX 3080 has 272 Tensor cores, compared to 368 Tensor cores in the RTX 2080. As noted in the TFLOPS discussion, however, the Tensor operations are now done on 8x4x4 matrices instead of 4x4x4 matrices, which means that each Ampere Tensor core is twice as fast as a Turing Tensor core. But that’s not all.
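The totals fall out of the SM counts, and the doubled per-core throughput explains why fewer cores still means more muscle. A quick sketch, assuming equal clocks (the SM counts come from Nvidia’s specs):

```python
# Tensor core counts and relative throughput (at equal clocks).
rtx_3080_sms, rtx_2080_sms = 68, 46

ampere_tensor_cores = rtx_3080_sms * 4   # 272: four 3rd-gen cores per SM
turing_tensor_cores = rtx_2080_sms * 8   # 368: eight 2nd-gen cores per SM

# Each Ampere Tensor core does twice the work per clock of a Turing core.
dense_ratio = (ampere_tensor_cores * 2) / turing_tensor_cores
print(f"Dense tensor throughput: {dense_ratio:.2f}x the RTX 2080")  # ~1.48x
print(f"With the 2x sparsity speedup: {dense_ratio * 2:.2f}x")      # ~2.96x
```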

Tensor cores are useful for deep-learning applications, where various matrices are frequently multiplied to determine new weights over time. The most important elements end up with higher weights, while elements that are deemed unimportant often end up with a weight of zero. What’s zero times, well, anything? Zero! We like easy math. But if a Tensor core is doing lots of zero multiplications, that’s pretty much wasted effort.

Ampere introduces fine-grained sparsity support, which allows tensor operations to basically skip all those zeroes and get to the important stuff. Nvidia claims that with sparsity enabled, on algorithms that benefit from the feature (which isn’t everything), performance is twice as fast. So, with sparsity enabled, each Tensor core in Ampere is potentially four times as powerful as in Turing.
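Conceptually, the hardware skips zero-weight work rather than computing it. Here’s a toy illustration of the principle in Python; this isn’t Nvidia’s actual implementation (the real hardware uses a structured format where two of every four weights are zero), just the idea:

```python
# Toy illustration of why sparsity helps: skip multiply-accumulates where the
# weight is zero instead of wasting effort computing 0 * x.
weights = [0.8, 0.0, 0.0, 0.3, 0.0, 0.5, 0.0, 0.0]
inputs  = [1.2, 3.4, 0.7, 2.2, 5.1, 0.9, 1.8, 4.4]

dense_macs  = len(weights)                          # 8 multiply-accumulates
sparse_macs = sum(1 for w in weights if w != 0.0)   # only 3 actually needed

result = sum(w * x for w, x in zip(weights, inputs) if w != 0.0)
print(f"Skipped {dense_macs - sparse_macs} of {dense_macs} MACs; result = {result:.2f}")
```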

Beyond just boosting FP16 performance, the 3rd-gen Tensor cores also gain support for new data types: BF16, TF32, INT8, and INT4. These can be useful in certain deep-learning workloads, for different purposes. BF16 is an alternative floating-point format with a 7-bit mantissa and 8-bit exponent (compared to FP16’s 5-bit exponent and 10-bit mantissa). TF32, meanwhile, is an FP32 alternative that has the same 8-bit exponent but only a 10-bit mantissa (compared to a 23-bit mantissa for FP32). BF16 performance is equal to FP16 performance, while TF32 performance is half the FP16 rate.
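Laid out side by side, the trade-off is mantissa precision versus exponent range. The bit counts below are the standard format definitions:

```python
# Floating-point format layouts: (sign bit, exponent bits, mantissa bits).
formats = {
    "FP32": (1, 8, 23),   # full single precision
    "TF32": (1, 8, 10),   # FP32's range with FP16's precision (19 bits total)
    "FP16": (1, 5, 10),   # half precision
    "BF16": (1, 8, 7),    # FP32's range, even less precision
}

for name, (sign, exp, man) in formats.items():
    print(f"{name}: {sign}+{exp}+{man} = {sign + exp + man} bits")
```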

The INT8 and INT4 data types (8-bit and 4-bit integers) are more for inference. INT8 values are half the size of FP16, so they get twice the throughput – up to 238 TOPS with the RTX 3080, and 476 TOPS with sparsity. INT4 doubles those values yet again. These may not end up mattering much for gaming workloads, but deep learning is one of the fastest-growing areas of research right now, so there are a lot of potentially groundbreaking ideas that could benefit in the future.
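The scaling is simple doubling all the way down. A sketch, starting from the RTX 3080’s dense FP16 tensor rate of 119 TFLOPS (Nvidia’s published figure, not quoted above):

```python
# Tensor throughput doubles each time the data type is halved in size,
# and sparsity doubles it again on top.
fp16_tflops = 119  # RTX 3080 dense FP16 tensor rate (Nvidia's figure)

int8_tops = fp16_tflops * 2   # half-size data type -> 2x the rate: 238
int4_tops = int8_tops * 2     # 476

print(f"INT8: {int8_tops} TOPS dense, {int8_tops * 2} TOPS with sparsity")
print(f"INT4: {int4_tops} TOPS dense, {int4_tops * 2} TOPS with sparsity")
```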

EVEN FASTER RAY TRACING

Last but not least, Nvidia also boosted the ray-tracing performance of Ampere. It wasn’t willing to go into a lot of details – AMD and Intel are still trying to figure out the optimal way to do ray tracing in GPUs, and Nvidia doesn’t want to provide any hints – but Ampere’s RT cores are up to twice as fast on ray-triangle intersection calculations, and overall Nvidia claims that the 2nd-gen RT cores are 1.7x faster than Turing’s 1st-gen RT implementation. As before, there’s one RT core per SM.

The new RT cores have also learned a few tricks. The biggest involves adding a “time” element to ray-tracing calculations, which can dramatically improve performance in motion-blur effects. Not everyone likes motion blur in games, but it’s quite important in films, where rendered elements without it can look choppy next to traditional film techniques that inherently capture motion blur. With the new “time stamp” addition, Ampere can be even more of an improvement over Turing in rendering accurate motion blur.

The RT cores can also run concurrently with the shader cores on Ampere, which means running a ray-tracing workload doesn’t inherently cause other work to stall out for a few cycles.

GDDR6X: THANKS FOR THE MEMORIES

All of the Turing GPUs used GDDR6 memory, mostly clocked at 14Gbps (with a few budget and mainstream GPUs clocked at 12Gbps, and the RTX 2080 Super clocked at 15.5Gbps). For Ampere, Nvidia has partnered with Micron to create a new GDDR6X memory standard. This is very similar to what we saw with GDDR5X on the previous Pascal-generation GPUs.

Micron has GDDR6X memory rated at 19-21Gbps, with the 3080 using 19Gbps memory and the 3090 using 19.5Gbps memory. The RTX 3070, meanwhile, sticks with 14Gbps GDDR6 memory, though GA104 can support GDDR6X as well (i.e., for a future RTX 3070 Ti, maybe). Memory configurations have also changed. The 3090 takes over the Titan RTX slot and gets 24GB of memory and a 384-bit bus, for a total bandwidth of 936GB/s. The RTX 3080 has 10GB on a 320-bit bus, for 760GB/s of bandwidth. RTX 3070 is the less impressive solution, matching the previous-generation 2080 down to 2060 Super with “only” 448GB/s. Nvidia has changed other aspects of the architecture to help make better use of the memory bandwidth, however, so raw bandwidth alone doesn’t tell the full story.
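All of those bandwidth figures come from one formula: per-pin data rate times bus width, divided by eight bits per byte. (The 3070’s 256-bit bus isn’t stated above, but it’s what the 448GB/s figure implies.)

```python
# Memory bandwidth = data rate (Gbps per pin) x bus width (bits) / 8 bits per byte.
def bandwidth_gbs(gbps_per_pin, bus_bits):
    return gbps_per_pin * bus_bits / 8

print(f"RTX 3090: {bandwidth_gbs(19.5, 384):.0f} GB/s")  # 936
print(f"RTX 3080: {bandwidth_gbs(19.0, 320):.0f} GB/s")  # 760
print(f"RTX 3070: {bandwidth_gbs(14.0, 256):.0f} GB/s")  # 448
```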

One new addition is EDR: Error Detection and Replay. This is a cheaper alternative to ECC (Error Correcting Code) that enables the GPU to recover from errors in data transmission. If the memory subsystem detects a transmission error, it simply retries until it succeeds. Push the memory clock far enough and those retries mean higher clocks can actually deliver lower performance, but it does mean that running close to the limit of the memory won’t be as likely to crash on infrequent errors.
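In concept, EDR behaves like a retry loop around every transfer, so errors cost time rather than stability. A conceptual sketch, not how the hardware actually works:

```python
import random

# Conceptual model of Error Detection and Replay: a failed transfer is
# simply retried, so errors cost cycles instead of crashing the GPU.
def transfer_with_edr(error_rate):
    attempts = 1
    while random.random() < error_rate:  # CRC check failed -> replay
        attempts += 1
    return attempts

# Higher memory clocks raise the error rate, and the replays eat into the
# bandwidth gained, which is why overclocking past a point can reduce
# real throughput even though nothing crashes.
for clock, err in [("stock", 0.0001), ("overclocked", 0.05)]:
    total = sum(transfer_with_edr(err) for _ in range(100_000))
    print(f"{clock}: {total / 100_000:.4f} attempts per transfer")
```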

While we’re on the subject of memory, let’s also note that the ROPs (Render Outputs) have been shifted out of the memory controllers and into the GPCs (Graphics Processing Clusters). This provides more flexibility, enabling Nvidia to have 96 ROPs on both the 3070 and 3080, even though the latter has two additional memory chips. That’s because the GA102 has up to seven GPCs of 12 SMs (Streaming Multiprocessors) each, and the 3080 has six GPCs enabled. The GA104, meanwhile, has six GPCs of only 8 SMs each.
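Here’s how the SM and ROP arithmetic works out, assuming 16 ROPs per GPC (which is what the 96-ROP totals imply); the enabled SM counts come from the cards’ published CUDA core counts:

```python
# ROPs now live in the GPCs: 16 per GPC, per the 96-ROP totals above.
ROPS_PER_GPC = 16

# (chip or card, GPCs enabled, SMs actually enabled)
configs = [
    ("GA102 full die", 7, 7 * 12),  # 84 SMs possible
    ("RTX 3080",       6, 68),      # six GPCs, with four SMs fused off
    ("GA104 full die", 6, 6 * 8),   # 48 SMs possible
    ("RTX 3070",       6, 46),      # two SMs fused off
]

for name, gpcs, sms in configs:
    print(f"{name}: {gpcs} GPCs -> {gpcs * ROPS_PER_GPC} ROPs, {sms} SMs")
```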

Something else to keep an eye out for is future higher-capacity RTX 3080 and 3070 configurations. The scuttlebutt is that Nvidia is waiting for AMD to reveal its RDNA 2 / RX 6000 lineup, which is expected to have 16GB on the top RX 6900 XT model. Then Nvidia will announce an RTX 3080 20GB card that costs $100 more than the RTX 3080. That may end up being just a rumor, but at least one manufacturer part list leaked with 20GB and 16GB RTX 3080 and 3070 cards listed.

CACHE, SHARED MEMORY, PCIE GEN4, AND A FAREWELL TO SLI

Wrapping up the Ampere architecture, Nvidia also increased cache sizes and added more flexibility to the shared memory, so that it can be configured as varying amounts of L1 cache or shared memory. The L2 cache on the 3080 is 25 percent larger than on the 2080, and the L1 cache and shared memory capacity is 33 percent larger. Both of these changes improve overall memory throughput.
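For the curious, the percentages map onto these sizes; the raw figures come from Nvidia’s whitepapers rather than the text above:

```python
# Cache size comparison in KB (figures from Nvidia's specs, not quoted above).
caches = {
    "L2 (whole GPU)":     {"RTX 2080": 4096, "RTX 3080": 5120},
    "L1/shared (per SM)": {"RTX 2080": 96,   "RTX 3080": 128},
}

for level, sizes in caches.items():
    gain = sizes["RTX 3080"] / sizes["RTX 2080"] - 1
    print(f"{level}: {sizes['RTX 2080']}KB -> {sizes['RTX 3080']}KB (+{gain:.0%})")
```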

All of the Ampere GPUs are also fully PCIe Gen4 compliant. In practice, it doesn’t currently appear to make much difference, especially since the only PCIe Gen4 consumer platforms come from AMD. Intel CPUs are still generally faster for gaming, so until Intel catches up (with Rocket Lake and Alder Lake on desktops), most gamers will still be better off pairing Ampere with a PCIe Gen3 solution.

Finally, Nvidia has all but killed off SLI with Ampere. The RTX 3090 – yes, the $1,500 GPU – is the only consumer card that will support NVLink and SLI this round. The NVLink bandwidth has been doubled, however, meaning your Turing NVLink connectors are now outdated. So, $3,100 will get you a pair of 3090 cards with the new NVLink… and then you’re still dependent on game developers to support SLI.

That’s because Nvidia has explicitly made SLI support a developer choice, and as we’ve seen over the past few years, that means SLI is basically dead. Note that multi-GPU isn’t affected, so GPU compute workloads like Folding@Home will be fine.

PUMP UP THE AMPERES

As you can tell from this short – if you can believe that – overview, Ampere is a tour de force for Nvidia. Check out the specs tables, and any enthusiast is likely to start drooling. With Nvidia’s cards on the table, we now get to see if AMD can follow suit or maybe even snag a pot or two. We should know more about AMD’s plans by next month.

Ultimately, all of these architectural changes only matter so much. Eventually, we need to run games and benchmarks to see how Ampere stacks up in the real world. We’ve done just that with our RTX 3080 review. Spoiler alert: It’s damn fast.

We’re left wondering where Nvidia will go next. We’ve known about the Ampere codename since before Turing launched – and in fact, many thought we were getting Ampere two years ago instead of Turing. But looking forward, our crystal ball is very cloudy. We don’t know the codename for Nvidia’s post-Ampere GPUs. We also don’t know what process technology Nvidia will use.

TSMC’s N7 might be better than Samsung 8N, but better still are N7P, N7+, N6 (with EUV), and the new N5 (which just started cranking out Apple’s A14 silicon). Will Ampere stick around for two years on 8N, or could it be a shorter-lived architecture, with a die shrink to a more advanced 5nm node? We don’t know, but hopefully we don’t have to wait two years and add another 100W to find out.

The GeForce RTX 3090 packs 24GB of GDDR6X memory onto its tiny PCB.
The Ampere architecture doubles down on FP32, Tensor, and RT core performance.
