Nvidia Ampere Architecture Deep Dive
Put on your swimsuit, because we’re about to get wet
IT’S BEEN A FEW YEARS since Nvidia unveiled the Turing architecture, ushering in the new era of ray-traced graphics for games. OK, let’s be frank: Ray-tracing adoption in games has been sluggish and often underwhelming. More accurate reflections, slightly improved shadows? Yawn. We want it all: Reflections, refractions, shadows, global illumination, caustics, ambient occlusion! The problem is that each of those effects increases the burden placed on the ray-tracing hardware and your GPU, so developers often picked one or a few effects at most. But that all changes with Nvidia’s Ampere architecture.
Take everything that was great about Turing and basically double down, and you get an idea of what Nvidia has planned for Ampere. We have a full review of the GeForce RTX 3080 Founders Edition, the fastest graphics card ever to grace the inside of your PC. Well, sort of – we’ll be looking at the GeForce RTX 3090 Founders Edition next month. But for most gamers, even those with deep pockets, the RTX 3080 makes more sense. It’s perhaps the largest generational improvement in performance ever from Nvidia, and it makes yesterday’s RTX 2080 Ti look like an overpriced has-been. And it all comes down to the new, superior Ampere architecture.
Let’s start from the top, and cover all the major changes. We can’t possibly cover every item that’s changed, though if you’re really interested in learning more, Google “Nvidia Ampere Whitepaper” and you can get the raw, undistilled version. We’re also going to confine our discussion to just the GA102/GA104 “gaming” GPUs – there’s also a new GA100 chip used for supercomputers and deep learning that’s quite different from the consumer version of Ampere.
LITHOGRAPHY: THE FOUNDATION OF EVERY CHIP
Every chip design starts by choosing how the part will eventually be made. Most of Nvidia’s GPUs of the past two decades have come from TSMC (Taiwan Semiconductor Manufacturing Company), but Nvidia has also used Samsung for some parts. The GA102 and GA104 chips destined for the RTX 30-series cards will use Samsung 8N, an “Nvidia-optimized” 8nm process technology that’s basically Samsung’s 10LP++++. That’s almost as many “+” revisions as Intel’s 14nm node! This brings some good improvements compared to Turing (TSMC 12FFN), allowing for more transistors in a smaller space.
However, let’s be clear: TSMC’s N7 lithography is undoubtedly better overall, if you’re just looking at performance. It’s also in much higher demand, which is almost certainly why Nvidia opted for Samsung. TSMC is currently busy taking orders from AMD for Zen 2, the upcoming Zen 3, RDNA 1, the upcoming RDNA 2, and Nvidia’s own GA100 chips – plus Apple and various other customers. Word is that TSMC is in such high demand that it can charge a premium, and it does.
Just from a high level, however, Samsung 8N is still a healthy jump for Nvidia. Consider the previous-generation TU102 GPU used in the RTX 2080 Ti. It has 18.6 billion transistors crammed into a massive 754mm² chip. The Ampere GA102 by comparison has 28.3 billion transistors in a 628mm² chip. It’s not entirely apples to apples, since the various types of logic used on a chip – GPU cores, cache, memory controllers, video decoders, etc. – have different densities, but at a high level Nvidia is putting 52 percent more transistors into 17 percent less space. That’s a relative density improvement of over 1.8x.
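As a sanity check, the density math above reproduces in a few lines (a rough calculation that ignores the mixed-density caveat just mentioned):

```python
# Back-of-the-envelope density check using the die figures quoted above
# (transistor counts in billions, die areas in mm^2).
def density(transistors_b, area_mm2):
    """Millions of transistors per square millimetre."""
    return transistors_b * 1000 / area_mm2

tu102 = density(18.6, 754)   # Turing, TSMC 12FFN
ga102 = density(28.3, 628)   # Ampere, Samsung 8N

print(round(tu102, 1))           # ~24.7 MTr/mm^2
print(round(ga102, 1))           # ~45.1 MTr/mm^2
print(round(ga102 / tu102, 2))   # ~1.83x relative density
```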
As another example, the GA104 chip (which will be used in the RTX 3070) packs 17.4 billion transistors into 392.5mm². Again, that’s about 1.8x the transistor density of last gen. Alternatively, Nvidia’s GA100 packs 54 billion transistors into an 826mm² die using TSMC N7, which is nearly 50 percent more transistors per square mm than GA102. GA100 has plenty of other differences, but it’s clear Nvidia used TSMC N7 for GA100 because it was the best “money is no object” choice for a data center class GPU.
There are some downsides to using Samsung 8N, however. The most obvious is power requirements. The GeForce RTX 3080 has a TGP (Total Graphics Power) rating of 320W, which is the highest
single-GPU graphics card power level we’ve seen. The RTX 3090, meanwhile, bumps the TGP to 350W. Nvidia claims that the Ampere architecture is up to 1.9x the performance per watt of Turing, but how it derives that number is a bit tricky.
If you take an unspecified Turing GPU and an unspecified Ampere GPU and run both at the same performance level, Turing would require 1.9x more power. However, the retail Ampere GPUs are pushing the design to its limits, and as you move to the right on the voltage/frequency curve, efficiency is greatly reduced. In our testing, looking at real-world fps/watt, the RTX 3080 FE is about 33 percent faster than the RTX 2080 Ti and uses about 24 percent more power. That’s a net improvement, but nowhere near 90 percent higher efficiency. The high TGP also means that Nvidia and its partners will have to put more effort into designing high-performance cooling solutions, which explains the far larger cooler designs in general.
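To put a number on that, here’s the simple fps-per-watt arithmetic behind our estimate, using the measured deltas from our testing above:

```python
# RTX 3080 vs RTX 2080 Ti, from our measured figures: ~33 percent faster,
# ~24 percent more power draw.
speedup = 1.33
power_ratio = 1.24

fps_per_watt_gain = speedup / power_ratio
print(round(fps_per_watt_gain, 2))  # ~1.07 - a real but modest efficiency gain, far from 1.9x
```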
CORE BENEFITS: FP32 TIMES TWO
With the lithography appetizer out of the way, let’s dig into the real meat of Ampere. Nvidia’s basic design structure remains similar to Turing, in that each GPU has clusters of GPU cores called SMs, which are paired up into TPCs (Texture Processing Clusters), which are in turn grouped into GPCs. Then there are memory controllers, a video decoding block, caches, and other miscellaneous parts. Let’s run through each of those.
The SM (Streaming Multiprocessor) is the main workhorse, and it’s gone through various configurations over the years. Each SM has four processing blocks, each of which can dispatch one 32-thread warp per clock. Turing had 64 FP32 (32-bit floating-point) CUDA cores per SM, plus 64 more cores for INT32 (32-bit integer) calculations, and both could be utilised concurrently. Pascal had 128 FP32/INT32 cores only—the cores could only do one or the other data type at a time. With Ampere, Nvidia builds off the Turing design while at the same time shifting a bunch of things around.
First, Nvidia sort of doubled the number of FP32 CUDA cores per SM, but it did this by adding FP32 functionality to the previous INT32 datapath. So now, Ampere has 64 dedicated FP32 cores in an SM, and then 64 FP32/INT32 cores. It can still do concurrent FP32 + INT32, just like Turing, but alternatively it can do FP32 + FP32. That means for the right workloads, Ampere has more than double the theoretical performance of Turing. (See boxout on TFLOPS.)
This has interesting ramifications for different workloads. Take a game where the instruction mix is roughly 65 percent FP32 and 35 percent INT32 – according to Nvidia, this is roughly how the average game behaves, with INT32 used for memory address calculations, texture lookups, and other less complex math. On Turing, the FP32 datapath would have ended up fully utilised, while the INT32 datapath would have only been about 50 percent utilised. With Ampere, the dedicated FP32 cores are fully utilised, but now the FP32/INT32 cores are split roughly 30/70 on FP32 vs. INT32 work.
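Here’s a minimal sketch of that load-balancing arithmetic. It assumes both datapaths have equal per-clock throughput and that the scheduler balances them perfectly—an idealisation for illustration, not how the hardware actually schedules work:

```python
# Nvidia's "average game" instruction mix: 65 percent FP32, 35 percent INT32.
fp32_share, int32_share = 0.65, 0.35

# Each of Ampere's two datapaths supplies half the total throughput. The
# shared path must absorb all the INT32 work; whatever capacity remains
# takes FP32 spillover from the dedicated path.
shared_path_capacity = 0.5
shared_fp32 = shared_path_capacity - int32_share   # 0.15 of total work
dedicated_fp32 = fp32_share - shared_fp32          # 0.50 -> dedicated path fully used

# Mix on the shared FP32/INT32 path:
print(round(shared_fp32 / shared_path_capacity, 2))   # 0.3 -> ~30 percent FP32
print(round(int32_share / shared_path_capacity, 2))   # 0.7 -> ~70 percent INT32
```

Both paths end up fully occupied, which is exactly the roughly 30/70 split described above.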
Best-case, the RTX 3080 is theoretically 2.95x faster than the RTX 2080 at FP32 workloads. In practice, it’s more like up to twice as fast in actual games.
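The paper math behind that figure is straightforward—peak FP32 throughput is core count times two operations per clock (a fused multiply-add) times clock speed. Using the Founders Edition boost clocks:

```python
# Theoretical FP32 throughput: cores x 2 ops per clock (FMA) x boost clock.
def tflops(cuda_cores, boost_ghz):
    return cuda_cores * 2 * boost_ghz / 1000

rtx_3080 = tflops(8704, 1.71)   # counting all 8704 FP32-capable cores
rtx_2080 = tflops(2944, 1.71)

print(round(rtx_3080, 1))             # ~29.8 TFLOPS
print(round(rtx_2080, 1))             # ~10.1 TFLOPS
print(round(rtx_3080 / rtx_2080, 2))  # ~2.96x on paper
```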
ARE YOU FEELING A LITTLE… TENSOR?
Ampere packs in four 3rd-generation Tensor cores per SM while Turing included eight 2nd-generation Tensor cores per SM. (Incidentally, 1st-gen Tensor cores were only present in the Volta architecture, which showed up in the Titan V as well as supercomputer
V100 GPUs.) That means the RTX 3080 has 272 Tensor cores, compared to 368 Tensor cores in the RTX 2080. As noted in the TFLOPS discussion, however, the Tensor operations are now done on 8x4x4 matrices instead of 4x4x4 matrices, which means that each Ampere Tensor core is twice as fast as a Turing Tensor core. But that’s not all.
Tensor cores are useful for deep-learning applications, where various matrices are frequently multiplied to determine new weights over time. The most important elements end up with higher weights, while elements that are deemed unimportant often end up with a weight of zero. What’s zero times, well, anything? Zero! We like easy math. But if a Tensor core is doing lots of zero multiplications, that’s pretty much wasted effort.
Ampere introduces fine-grained sparsity support, which allows tensor operations to basically skip all those zeroes and get to the important stuff. Nvidia claims that with sparsity enabled, on algorithms that benefit from the feature (which isn’t everything), the performance is twice as fast. So, with sparsity enabled, each Tensor core in Ampere is potentially four times as powerful as in Turing.
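Conceptually, the sparsity scheme works on small groups of weights. Here’s a toy Python illustration—our own simplification, not Nvidia’s actual implementation—of pruning two values out of every four so half the multiplications can be skipped:

```python
# Toy sketch of 2:4 fine-grained structured sparsity: in every group of
# four weights, zero the two smallest-magnitude values so the hardware
# can skip those multiply-accumulates entirely.
def prune_2_of_4(weights):
    pruned = []
    for i in range(0, len(weights), 4):
        group = list(weights[i:i + 4])
        # Indices of the two smallest-magnitude entries in this group.
        drop = sorted(range(len(group)), key=lambda j: abs(group[j]))[:2]
        for j in drop:
            group[j] = 0.0
        pruned.extend(group)
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.2, 0.8, -0.3, 0.01]
print(prune_2_of_4(w))  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.8, -0.3, 0.0]
```

In a real network the pruned model is typically fine-tuned afterward so accuracy doesn’t suffer from the dropped weights.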
Beyond just boosting FP16 performance, the 3rd-gen Tensor cores also gain support for new data types: BF16, TF32, INT8, and INT4. These can be useful in certain deep-learning workloads, for different purposes. BF16 is an alternative floating-point format with a 7-bit mantissa and 8-bit
exponent (compared to FP16’s 5-bit exponent and 10-bit mantissa). TF32, meanwhile, is an FP32 alternative that has the same 8-bit exponent but only a 10-bit mantissa (compared to a 23-bit mantissa for FP32). BF16 performance is equal to FP16 performance, while TF32 performance is half the FP16 rate.
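Those bit layouts are easier to compare side by side. This little snippet just encodes the (sign, exponent, mantissa) splits described above:

```python
# Bit budgets of the formats discussed above: (sign, exponent, mantissa).
formats = {
    "FP32": (1, 8, 23),
    "TF32": (1, 8, 10),   # FP32's range, FP16's precision (19 bits of data)
    "BF16": (1, 8, 7),    # FP32's range, even less precision
    "FP16": (1, 5, 10),
}
for name, (sign, exp, man) in formats.items():
    total = sign + exp + man
    print(f"{name}: {total} bits total, {exp}-bit exponent, {man}-bit mantissa")
```

The pattern is clear: the exponent width sets the dynamic range, while the mantissa width sets the precision—BF16 and TF32 both trade precision for FP32-class range.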
The INT8 and INT4 data types (8-bit and 4-bit integers) are more for inference. INT8 values are half the size of FP16 values, so they’re twice the throughput—up to 238 TOPS with the RTX 3080, and 476 TOPS with sparsity. INT4 doubles those values yet again. These may not end up mattering much for gaming workloads, but deep learning is one of the fastest-growing areas of research right now, so there are a lot of potentially groundbreaking ideas that could benefit in the future.
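The throughput scaling is simple doubling all the way down—halve the data width or turn on sparsity, and the rate doubles. Starting from the 238 INT8 TOPS quoted above:

```python
# Tensor throughput doubles with each halving of data width, and doubles
# again with sparsity enabled, per the scaling described above.
int8_dense = 238                # RTX 3080, TOPS
int8_sparse = int8_dense * 2    # 476 TOPS with sparsity
int4_dense = int8_dense * 2     # 476 TOPS - half the width, double the rate
int4_sparse = int4_dense * 2    # 952 TOPS with sparsity

print(int8_sparse, int4_dense, int4_sparse)  # 476 476 952
```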
EVEN FASTER RAY TRACING
Last but not least, Nvidia also boosted the ray-tracing performance of Ampere. It wasn’t willing to go into a lot of details – AMD and Intel are still trying to figure out the optimal way to do ray tracing in GPUs, and Nvidia doesn’t want to provide any hints – but Ampere’s RT cores are up to twice as fast on ray-triangle intersection calculations, and overall Nvidia claims that the 2nd-gen RT cores are 1.7x faster than Turing’s 1st-gen RT implementation. As before, there’s one RT core per SM.
The new RT cores have also learned a few tricks. The biggest involves adding a “time” element to ray-tracing calculations, which can dramatically improve performance in motion-blur effects. Not everyone likes motion blur in games, but it’s quite important in films, where rendered output without it can look choppy compared to traditional film techniques that inherently capture motion blur. With the new “time stamp” addition, Ampere can be even more of an improvement over Turing in rendering accurate motion blur.
The RT cores can also run concurrently with the shader cores on Ampere, which means running a ray-tracing workload doesn’t inherently cause other work to stall out for a few cycles.
GDDR6X: THANKS FOR THE MEMORIES
All of the Turing GPUs used GDDR6 memory, mostly clocked at 14Gbps (with a few budget and mainstream GPUs clocked at 12Gbps, and the RTX 2080 Super clocked at 15.5Gbps). For Ampere, Nvidia has partnered with Micron to create a new GDDR6X memory standard. This is very similar to what we saw with GDDR5X on the previous Pascal-generation GPUs.
Micron has GDDR6X memory rated at 19-21Gbps, with the 3080 using 19Gbps memory and the 3090 using 19.5Gbps memory. The RTX 3070, meanwhile, sticks with 14Gbps GDDR6 memory, though GA104 can support GDDR6X as well (i.e., for a future RTX 3070 Ti, maybe). Memory configurations have also changed. The 3090 takes over the Titan RTX slot and gets 24GB of memory and a 384-bit bus, for a total bandwidth of 936GB/s. The RTX 3080 has 10GB on a 320-bit bus, for 760GB/s of bandwidth. RTX 3070 is the less impressive solution, matching the previous-generation 2080 down to 2060 Super with “only” 448GB/s. Nvidia has changed other aspects of the architecture to help make better use of
the memory bandwidth, however, so raw bandwidth alone doesn’t tell the full story.
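All of the bandwidth figures above come from the same simple formula—bus width times per-pin data rate, divided by eight bits per byte:

```python
# Peak memory bandwidth: bus width (bits) x per-pin data rate (Gbps) / 8 = GB/s.
def bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8

print(bandwidth_gbs(384, 19.5))  # 936.0 GB/s -> RTX 3090
print(bandwidth_gbs(320, 19.0))  # 760.0 GB/s -> RTX 3080
print(bandwidth_gbs(256, 14.0))  # 448.0 GB/s -> RTX 3070
```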
One new addition is EDR: Error Detection and Replay. This is a cheaper alternative to ECC (Error Correcting Code) that enables the GPU to recover from errors in data transmission. If the memory subsystem detects a transmission error, it simply retries until it succeeds. Push the clocks far enough and the constant replays can actually lower performance, but it does mean that running close to the memory’s limit won’t be as likely to crash on infrequent errors.
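In spirit, EDR behaves like a simple retry loop. This is purely a conceptual sketch—the error rate and the loop are our illustration, not how the memory controller is actually implemented:

```python
import random

# Conceptual sketch of EDR: a transfer that fails its error check is simply
# resent, trading a little latency for crash-free operation near the limit.
def send_with_edr(error_rate, rng):
    """Return the number of attempts needed for one successful transfer."""
    attempts = 1
    while rng.random() < error_rate:  # detected transmission error
        attempts += 1                  # replay the burst
    return attempts

rng = random.Random(42)
# Even at an (illustrative) 5 percent error rate, the average cost stays low:
attempts = [send_with_edr(0.05, rng) for _ in range(10000)]
avg = sum(attempts) / len(attempts)
print(round(avg, 2))  # averages roughly 1.05 attempts per transfer
```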
While we’re on the subject of memory, let’s also note that the ROPs (Render Outputs) have been shifted out of the memory controllers and into the GPCs (Graphics Processing Clusters). This provides more flexibility, enabling Nvidia to have 96 ROPs on both the 3070 and 3080 even though the latter has two additional memory chips. That’s because the GA102 has up to seven GPCs of 12 SMs (Streaming Multiprocessors) each, and the 3080 has six GPCs enabled. The GA104, meanwhile, has six GPCs of only 8 SMs each.
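That GPC-based arrangement makes the ROP counts easy to derive—assuming 16 ROPs per GPC, which is the figure consistent with the numbers above:

```python
# With ROPs moved into the GPCs, the enabled-GPC count - not the memory-bus
# width - sets the total (assuming 16 ROPs per GPC).
def rops(enabled_gpcs, rops_per_gpc=16):
    return enabled_gpcs * rops_per_gpc

print(rops(6))  # 96 -> both the 3080 (6 of GA102's 7 GPCs) and 3070 (all 6 of GA104's)
print(rops(7))  # 112 -> a fully enabled GA102
```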
Something else to keep an eye out for is future higher capacity RTX 3080 and 3070 configurations. The scuttlebutt is that Nvidia is waiting for AMD to reveal its RDNA 2 / RX 6000 lineup, which is expected to have 16GB on the top RX 6900 XT model. Then Nvidia will announce an RTX 3080 20GB card that costs $100 more than the RTX 3080. That may end up being just a rumor, but at least one manufacturer part list leaked with 20GB and 16GB RTX 3080 and 3070 cards listed.
CACHE, SHARED MEMORY, PCIE GEN4, AND A FAREWELL TO SLI
Wrapping up the Ampere architecture, Nvidia also increased cache sizes and added more flexibility to the shared memory, so that it can be configured as varying amounts of L1 cache or shared memory. The L2 cache on the 3080 is 25 percent larger than on the 2080, and the L1 cache and shared memory capacity is 33 percent larger. Both of these changes improve overall memory throughput.
All of the Ampere GPUs are also fully PCIe Gen4 compliant. In practice, it doesn’t currently appear to make much difference, especially since the only PCIe Gen4 consumer platforms come from AMD. Intel CPUs are still generally faster for gaming, so until Intel adds Gen4 support (with Rocket Lake and Alder Lake on desktops), most gamers will still be better off pairing Ampere with a PCIe Gen3 platform.
Finally, Nvidia has all but killed off SLI with Ampere. The RTX 3090 – yes, the $1,500 GPU – is the only consumer card that will support NVLink and SLI this round. The NVLink bandwidth has been doubled, however, meaning your Turing NVLink connectors are now outdated. So, $3,100 will get you a pair of 3090 cards with the new NVLink… and then you’re still dependent on game developers to support SLI.
That’s because Nvidia has explicitly made SLI support a developer choice, and as we’ve seen over the past few years, that means SLI is basically dead. Note that multi-GPU isn’t affected, so GPU compute workloads like Folding@Home will be fine.
PUMP UP THE AMPERES
As you can tell from this short – if you can believe that – overview, Ampere is a tour de force for Nvidia. Check out the specs tables, and any enthusiast is likely to start drooling. With Nvidia’s cards on the table, we now get to see if AMD can follow suit or maybe even snag a pot or two. We should know more about AMD’s plans by next month.
Ultimately, all of these architectural changes only matter so much. Eventually, we need to run games and benchmarks to see how Ampere stacks up in the real world. We’ve done just that with our RTX 3080 review. Spoiler alert: It’s damn fast.
We’re left wondering where Nvidia will go next. We’ve known about the Ampere codename since before Turing launched – and in fact, many thought we were getting Ampere two years ago instead of Turing. But looking forward, our crystal ball is very cloudy. We don’t know the codename for Nvidia’s post-Ampere GPUs. We also don’t know what process technology Nvidia will use.
TSMC’s N7 might be better than Samsung 8N, but even better than N7 are N7P, N7+, N6 (with EUV), and the new N5 (which just started cranking out Apple’s A14 silicon). Will Ampere stick around for two years on 8N, or could it be a shorter-lived architecture, with a die shrink to a more advanced 5nm node? We don’t know, but hopefully we don’t have to wait two years and add another 100W to find out.