Maximum PC

The Nvidia A100 is the Largest GPU Ever Created

- Jarred Walton
Jarred Walton has been a PC and gaming enthusiast for over 30 years.

THE GPU WARS ARE HEATING UP. Nvidia finally took the wraps off its Ampere architecture and the GA100 GPU that will reign as the company’s new king. Or maybe emperor, because this is one monster of a chip. This is the next iteration of Nvidia’s datacenter and supercomputer ambitions, so it’s not going into a gaming graphics card. But damn if the chip still isn’t exciting!

Let’s start with the raw specs. The full GA100 chip has 128 SMs (streaming multiprocessors), which are the main building blocks of Nvidia GPUs. The previous-generation Volta GV100, for comparison, maxed out at 84 SMs (with 80 enabled), so right away we’re looking at more than a 50 percent increase in potential performance, though the initial Nvidia A100 will “only” have 108 SMs enabled. Like Volta (and Pascal GP100), each SM has 64 FP32 CUDA cores, plus 32 FP64 CUDA cores. Where it gets interesting is in the other cores.
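If you want to sanity-check those numbers, the core counts multiply out in a few lines of Python (a quick sketch; the SM and per-SM core figures are Nvidia’s published specs):

# Core counts from the published GA100/A100 specs.
FP32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 32

full_ga100_sms = 128  # full GA100 die
a100_sms = 108        # SMs enabled on the shipping A100
gv100_sms = 84        # full Volta GV100 (80 enabled on the Tesla V100)

print(a100_sms * FP32_CORES_PER_SM)    # 6912 FP32 CUDA cores
print(a100_sms * FP64_CORES_PER_SM)    # 3456 FP64 CUDA cores
print(full_ga100_sms / gv100_sms - 1)  # ~0.52, the "50 percent" uplift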

Like Volta—and unlike Turing—there are no RT cores for ray tracing. That obviously puts the kibosh on these becoming gaming chips, if the size and transistor count didn’t already clue you in. More importantly, however, the Tensor cores are getting a major upgrade. The new A100 only has half as many Tensor cores per SM—four instead of eight—but those four cores provide twice the performance of the previous-generation Tensor cores.

These 3rd-generation Tensor cores have a few new tricks, specifically FP64 support (for science!), sparsity acceleration for matrices that aren’t fully populated, and a new TF32 format that has the range of FP32 with the precision of FP16. It’s similar in spirit to Google’s bfloat16 format, using 19 bits: one for the sign, eight for the exponent (FP32’s range), and 10 for the mantissa (FP16’s precision). What that means is a massive jump in deep learning and number-crunching calculations.
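To make the format concrete, here’s a minimal Python sketch that emulates TF32 by chopping an FP32 value’s mantissa down to 10 bits (real hardware rounds rather than truncates; this is purely for illustration):

import struct

def tf32_round(x: float) -> float:
    # Keep FP32's sign bit and 8-bit exponent, but only the top 10
    # of the 23 mantissa bits. Truncation for simplicity.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= ~((1 << 13) - 1)  # zero the low 13 mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(tf32_round(3.14159265))  # 3.140625: FP32's range, FP16-like precision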

The Tesla V100 had peak throughput of 125 TFLOPS for FP16 deep-learning operations, but just 7.8 TFLOPS for FP64 calculations. The Nvidia A100 has peak throughput of 312 TFLOPS for FP16, and up to 624 TFLOPS for FP16 operations with sparse matrices. TF32 operations run at half those rates (156 TFLOPS, or 312 TFLOPS with sparsity), and the Tensor cores can do 19.5 TFLOPS of FP64. That’s 2.5 times faster for FP64, and Nvidia says in FP32 workloads (using TF32) the A100 is up to 20 times faster than the V100.
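Those headline numbers aren’t magic; they fall out of the clocks and core counts. Here’s a rough reconstruction, assuming each 3rd-gen Tensor core does 256 FP16 FMA operations per clock (the figure in Nvidia’s Ampere whitepaper), with each FMA counting as two FLOPs:

sms, tensor_cores_per_sm = 108, 4
fma_per_core_per_clock = 256  # per Nvidia's Ampere whitepaper
boost_clock_hz = 1.41e9

flops = sms * tensor_cores_per_sm * fma_per_core_per_clock * 2 * boost_clock_hz
print(f"{flops / 1e12:.0f} TFLOPS FP16")               # ~312 TFLOPS
print(f"{2 * flops / 1e12:.0f} TFLOPS with sparsity")  # ~624 TFLOPS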

What about “normal” graphics workloads, using the CUDA cores? There are 6,912 FP32 CUDA cores on the A100, 35 percent more than the V100’s 5,120, though boost clocks are slightly lower: 1,410MHz vs. 1,530MHz. That still works out to at least 24 percent more theoretical performance, and architectural updates will likely make it much more.
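The 24 percent figure is straightforward arithmetic: cores, times two FLOPs per clock (one fused multiply-add), times the boost clock:

a100_fp32 = 6912 * 2 * 1.41e9  # ~19.5 TFLOPS
v100_fp32 = 5120 * 2 * 1.53e9  # ~15.7 TFLOPS
print(a100_fp32 / v100_fp32)   # ~1.24, i.e. 24 percent faster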

For memory, the full GA100 chip has six 8GB HBM2 stacks running at 1,215MHz (up from 877.5MHz on V100), but the A100 accelerator only has five stacks enabled. That still means 40GB of memory on a 5,120-bit bus and 1.6 TB/s of bandwidth, and potentially more on a future product. Power could be the limiting factor, as even the A100 with only 85 percent of the SMs enabled still has a rated TDP of 400W.
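The bandwidth figure checks out too, assuming the usual 1,024-bit interface per HBM2 stack and double data rate:

bus_bits = 5 * 1024     # five stacks, 1,024 bits each
data_rate = 1215e6 * 2  # DDR: 2.43 Gbps per pin
print(bus_bits * data_rate / 8 / 1e12)  # ~1.56 TB/s (Nvidia rounds to 1.6)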

Above: The GA100 packs 54 billion transistors into an 826 mm² chip.

The A100 supports a new feature called multi-instance GPU (MIG) that allows a single A100 chip to be partitioned into as many as seven “separate” GPUs, each delivering the equivalent instancing power of a single Tesla V100. Through MIG, Nvidia intends to serve the scale-out (more instances) as well as scale-up (more aggregate performance) markets. MIG will be beneficial for inferencing workloads, running previously trained deep-learning networks, while the full A100 can be put to use training such networks.
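As a rough sketch of the arithmetic (Nvidia’s MIG documentation describes 5GB memory slices on the 40GB card; the exact per-slice SM split is a detail we’re glossing over here):

total_memory_gb, memory_slices = 40, 8
max_instances = 7
print(total_memory_gb // memory_slices)  # 5GB per "1g.5gb" instance
print(108 // max_instances)              # roughly a seventh of the SMs each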

Want even more performance? Stuff eight Nvidia A100 GPUs into a DGX A100 server, and you’ve got 156 TFLOPS of peak FP64 performance, up to 2.5 PFLOPS of TF32 performance, and 5 PFLOPS of FP16 performance. All for just $199,000! Or a DGX A100 SuperPOD can house 140 DGX A100 systems with 700 PFLOPS of FP16 throughput. Needless to say, supercomputing centers intend to make good use of the A100, with plans for exascale (ExaFLOPS) supercomputers well under way.
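Those server and pod figures are just the single-GPU numbers multiplied up:

gpus = 8  # per DGX A100
print(gpus * 19.5)              # 156 TFLOPS FP64
print(gpus * 312 / 1000)        # ~2.5 PFLOPS TF32 (with sparsity)
print(gpus * 624 / 1000)        # ~5 PFLOPS FP16 (with sparsity)
print(140 * gpus * 624 / 1000)  # ~699 PFLOPS for the 140-system SuperPOD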
