NVIDIA RTX 2080 Ti
Start saving: Nvidia redefines the high end
YOU EXPECT a press release to make ambitious claims, but the one from Nvidia for its Turing GPU architecture is particularly striking. The headline tells us that it will “fundamentally change computer graphics,” and the company’s CEO, Jensen Huang, says that it is “Nvidia’s most important innovation in computer graphics in more than a decade.” Bold talk, but there’s substance behind it. Turing is more than a generational step: It brings genuinely fresh technology to the graphics card, with real-time ray tracing as star billing.
Turing follows on from Pascal, which has powered everything from the top-tier GTX 1080 Ti down to entry-level GTX 1050s. Nvidia also has its Volta architecture, released last year, with its machine-learning cores, but this has been kept to specialist markets, apart from the eye-watering $2,999 Titan V. Turing is aimed at the mass market, although it’s starting at the top. It’ll run alongside Pascal-powered cards for the foreseeable future.
The Turing architecture has similar clock speeds to Pascal, but increases the CUDA (Compute Unified Device Architecture; the small parallel processing units) core count by 15–20 percent. There are improvements to the SMs (Streaming Multiprocessors), and in memory bandwidth, plus a handful of new graphics functions, but the cherry on top is the addition of dedicated RT and AI hardware.
One noticeable change is that Nvidia has launched three Turing GPUs. Pascal cards are built on the same dies, with sections enabled and disabled to suit. Turing cards have launched with three separate dies: the TU102, TU104, and TU106. The smaller TU106 is in the RTX 2070 card, the TU104 in the RTX 2080, and the big TU102 in the halo RTX 2080 Ti, although not yet in a fully functional form. Right, let’s start digging around in the transistors to see what there is to elicit such a buzz. Be warned: There are copious acronyms to come.
The 754mm² TU102 is 60 percent larger than the biggest Pascal GPU, with 55 percent more transistors. It's built around six GPCs (Graphics Processing Clusters), each of which has six TPCs (Texture Processing Clusters), along with a dedicated rasterization engine. Each TPC has a PolyMorph Engine (a fixed-function geometry pipeline), and two SMs. Inside these are the cores: 64 CUDA, eight Tensor, and one RT. All clear?
There are 12 32-bit GDDR6 memory controllers, giving 384-bit width in all. Each contains a cluster of eight ROPs (Render Output Units). In the RTX 2080 Ti card, one of these controllers is disabled, giving it 88 ROPs. The layout is not uniform across the dies; the TU104 has eight rather than twelve SMs per GPC, for example, and both it and the TU106 have only eight memory controllers.
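As a quick sanity check, the bus width and ROP count quoted above follow directly from the controller layout. A trivial sketch, using the article's figures:

```python
# Memory-subsystem arithmetic for the TU102 / RTX 2080 Ti, using the
# controller and ROP counts quoted in the article.
CONTROLLER_WIDTH_BITS = 32   # each GDDR6 controller is 32 bits wide
TU102_CONTROLLERS = 12       # on the full die
ROPS_PER_CONTROLLER = 8      # each controller carries a cluster of ROPs

print(TU102_CONTROLLERS * CONTROLLER_WIDTH_BITS)      # 384-bit bus on the full die
print((TU102_CONTROLLERS - 1) * ROPS_PER_CONTROLLER)  # 88 ROPs with one controller disabled
```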
At the heart of every GPU is a fundamental building block, called a Streaming Multiprocessor by Nvidia, and a Compute Unit by AMD. It's in here that the majority of the hard work is done. The Turing architecture SM contains schedulers, graphics cores, cache, texturing units, and more.
Starting at the top are the CUDA cores; Turing has 64 per SM rather than Pascal's 128. Over the past decade, Nvidia has had anywhere from 32 to 192 per SM. The company claims that the new architecture is more efficient with 64. Turing adds FP16 (16-bit floating point) support, typically twice as fast as FP32 operations, although not as commonly used in games. We also get two 64-bit CUDA cores per SM, for compatibility support. Volta has 32 of these in each SM, but unless you're modeling a galaxy, that die space is better used elsewhere.
New for Turing is a dedicated integer pipeline that can run concurrently with the floating-point cores. Nvidia estimates that games issue about 35 integer instructions for every 100 floating-point instructions. Previously, both types went through the same pipeline, so that's a theoretical improvement of over a third. It makes the GPU core more like a CPU core, which can retire two instructions per clock. Turing has one of these integer units for every CUDA core.
Much of what we’ve covered so far is present in Pascal, or a subtle evolution of what we already have, but now we get to the genuinely innovative stuff. Inside each SM we have eight Tensor cores and one RT core. The RT core works on the headline ability of Turing cards: real-time ray tracing. This may well produce jaw-slackening results, but the math is fiendish.
The most commonly used ray-tracing acceleration structure is the BVH (Bounding Volume Hierarchy). This encapsulates objects in large, simple volumes: At its simplest, an entire 3D object may be defined as a single box. If a ray doesn't intersect that box, no more work is required on that ray for that object. If it does, we delve into a hierarchical tree of increasing detail, following whichever branches the ray intersects as the object is broken down into smaller and smaller parts. Thus we can find the intersection point by checking only a fraction of the points on an object.
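The pruning logic described above can be sketched in a few lines of Python. This is an illustrative toy with made-up node names, nothing like Nvidia's actual implementation, but it shows why a BVH saves so much work: whole subtrees are skipped the moment a ray misses their bounding box.

```python
# Toy BVH traversal for one ray: test the big box first, and only
# descend into a node's children when the ray actually hits its volume.

def slab_hit(ray_origin, ray_inv_dir, box_min, box_max):
    """Ray vs. axis-aligned box intersection using the slab method."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        t1 = (box_min[axis] - ray_origin[axis]) * ray_inv_dir[axis]
        t2 = (box_max[axis] - ray_origin[axis]) * ray_inv_dir[axis]
        t_near = max(t_near, min(t1, t2))
        t_far = min(t_far, max(t1, t2))
    return t_near <= t_far

def traverse(node, ray_origin, ray_inv_dir):
    """Walk the hierarchy, pruning every subtree whose box the ray misses."""
    if not slab_hit(ray_origin, ray_inv_dir, node["min"], node["max"]):
        return []                   # miss the box: skip this whole branch
    if "triangles" in node:         # leaf node: the only real work happens here
        return node["triangles"]
    hits = []
    for child in node["children"]:  # inner node: recurse into surviving children
        hits += traverse(child, ray_origin, ray_inv_dir)
    return hits
```

The RT cores accelerate exactly this kind of box-test-and-descend loop in fixed-function hardware, rather than burning CUDA cores on it.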
These calculations can be done elsewhere, on a CPU or GPU, but even with BVH optimization, it's a mammoth task that clogs whatever is assigned to it. Enter our RT cores. Dedicated to BVH calculations, these can run through the task about 10 times faster than the CUDA cores, and leave them free to work on other things while doing it.
It is not possible to say precisely how many rays the RT cores can calculate per second, as it depends on the BVH structure, so performance figures are approximations. Nvidia quotes the RT cores in the RTX 2080 Ti as capable of over 10 Giga Rays per second, and says that each GR/s requires about 10 TFLOPS of conventional computation.
Even a single ray per pixel can mean dozens or hundreds of calculations, and better results are achieved with more rays. If a scene has multiple reflective surfaces, things get extremely complicated. Traditional rasterization has become remarkably good over the last 20 years, but certain things still present problems, such as realistic lighting, shadows, and reflections.
Reflections can be faked, but mirrors take serious work, as they require a complete new projection into the game world. Shadow maps can produce decent results, but need a lot of memory to get right, as well as careful placement of lights. Another problem is ambient occlusion, which describes how exposed an object is to ambient lighting. Ray tracing masters all of these.
Powerful as Turing is, we’re still a long way off the required horsepower for true real-time ray tracing. So, like any good GPU tech, we cheat. Hybrid rendering uses traditional rasterization technology to render all the polygons in a frame, then combines the result with selected ray-traced shadows, reflections, and/or refractions. The ray tracing ends up being much less complex, allowing for higher frame rates. There’s a balancing act between quality and performance. Casting more rays for a scene can improve the result at the cost of frame rate, and vice versa.
Taking the GeForce RTX 2080 Ti and its 10GR/s as a baseline, if we’re rendering a game at 1080p, that’s roughly two million pixels, and running at 60fps means 120 million pixels a second. Doing the math, a game could manage 80 rays per pixel if the GPU is doing nothing else. Move to 4K, and we drop to 20 rays per pixel.
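That back-of-the-envelope budget is easy to reproduce. The numbers below are the article's round figures, and it assumes the GPU does nothing but cast rays:

```python
# Rays-per-pixel budget for a given resolution and frame rate, taking
# the article's 10 Giga Rays/sec figure for the RTX 2080 Ti as given.
GIGA_RAYS_PER_SEC = 10e9
FPS = 60

def rays_per_pixel(width, height, fps=FPS, rays_per_sec=GIGA_RAYS_PER_SEC):
    pixels_per_sec = width * height * fps   # pixels the GPU must cover each second
    return rays_per_sec / pixels_per_sec    # ray budget left for each of them

print(round(rays_per_pixel(1920, 1080)))  # 1080p at 60fps: ~80 rays per pixel
print(round(rays_per_pixel(3840, 2160)))  # 4K at 60fps: ~20 rays per pixel
```

In practice the budget is far smaller still, because the GPU is also rasterizing, shading, and denoising the frame.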
Here is where those machine-learning Tensor cores can help. Nvidia's DLSS (Deep Learning Super Sampling) is trained on batches of game screenshots on Nvidia's own supercomputer until it learns to upscale and antialias frames effectively. The results are packaged into a neat file for the Tensor cores to run locally.
This enables games to render at lower resolutions without antialiasing, then use the Tensor cores to upscale and antialias the image. Alternatively, DLSS can act simply as a faster alternative to traditional TAA (Temporal Anti-Aliasing), without upscaling. The results are better, too, lacking the transparency and blurring artifacts TAA exhibits.
Nvidia has shown some interesting demos of games rendered at 1080p, then upscaled to 4K, without the performance hit you’d expect. DLSS has gathered more support than ray tracing, too, with 25 games in the pipeline.
Denoising is another potent tool for ray tracing. Many path-tracing systems are good at creating quick but coarse scenes. Machine learning can reduce the coarse, speckled effect, without the huge computational overhead of running a more complete ray-traced image. Pixar reportedly used such a system to radically reduce rendering times on its animations.
The Tensor cores don't work concurrently with the rest of the GPU, so while they're busy, the rest of the silicon sits fairly idle. This limits their use. Nvidia suggests that DLSS and denoising could run at 20 percent of the total frame time.
Helping to boost graphics performance is pushed as the Tensor cores’ main benefit, but there’s more they can do. Nvidia and Microsoft have created the DirectX Machine Learning API. Future games could use all sorts of AI tricks, such as voice control and more devious AI opponents.
These changes have meant that comparing performance between different generations of GPU has become complicated. For existing games that don’t use the new features, the old FP32 TFLOPS figure is a reasonable measure, but once you add hybrid ray tracing, it doesn’t fit the job, so Nvidia has devised a new metric: RTX-OPS.
Obviously, this measurement will favor RTX GPUs. In a game that makes full use of ray-tracing effects, the workload is distributed as follows: 80 percent running FP32 shading (what games currently spend time on), with 35 percent of that time on concurrent INT32 shading. Ray tracing is used for 40 percent of the time, and the output is processed by the Tensor cores for 20 percent of the time. This final polish is the only thing running at this point, as the Tensor cores need the GPU’s full attention.
Using this formula, the Founders Edition of the RTX 2080 Ti scores 78 RTX-OPS. How would a GTX 1080 Ti fare? It lacks the RT and Tensor cores, and can't run the integer and floating-point pipelines together, so its RTX-OPS score would simply equal its TFLOPS score of 11.3. This looks lackluster, but it's apples and oranges: The RTX-OPS value is only of interest if you are running a hybrid ray-tracing engine.
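How Nvidia arrives at 78 can be reconstructed from the workload split described above. The throughput constants below are the published RTX 2080 Ti Founders Edition peak figures, and the weighting is our reading of Nvidia's description rather than an official published formula:

```python
# Reconstruction of the RTX-OPS weighting for the RTX 2080 Ti FE.
# Throughputs are Nvidia's published peaks; weights come from the
# workload split quoted above (INT32 runs for 35% of the FP32 time).
FP32_TFLOPS   = 14.2      # peak FP32 shader throughput
INT32_TOPS    = 14.2      # one INT32 unit per CUDA core, same rate
RT_TFLOPS_EQ  = 10 * 10   # 10 Giga Rays/s at ~10 TFLOPS per GR/s
TENSOR_TFLOPS = 113.8     # peak FP16 Tensor throughput

rtx_ops = (FP32_TFLOPS   * 0.80         # FP32 shading, 80% of frame time
           + INT32_TOPS  * 0.80 * 0.35  # concurrent INT32, 35% of that
           + RT_TFLOPS_EQ * 0.40        # ray tracing, 40% of frame time
           + TENSOR_TFLOPS * 0.20)      # Tensor post-processing, 20%

print(round(rtx_ops))  # 78 RTX-OPS
```

Note how the Tensor and RT terms dominate the total, which is why a card without those units scores so poorly on this metric.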
To keep its new GPUs fed with data, Nvidia has moved its Turing cards to GDDR6 memory, and reworked the cache and memory subsystem. The L1 cache bandwidth has been doubled, and can now run as either 32K of L1 and 64K shared memory, or as 64K L1 cache and 32K shared. The L2 cache has been doubled, too.
The faster clock speeds of GDDR6 over GDDR5 help, but Turing goes further. Pascal already used lossless memory compression, and Turing has improved on this. Nvidia hasn’t provided details, but claims that the larger caches and improved compression have increased the effective bandwidth by 20–35 percent over Pascal.
Turing introduces a number of new functions, too. How much these are used is another matter. Nvidia’s VXAO (Voxel Ambient Occlusion) has, as far as we know, only been used in two games in two years.
Foremost of the new features is Mesh Shading, the next iteration of vertex, geometry, and tessellation shaders. The idea is to move the LOD (Level of Detail—we weren't kidding about the number of acronyms) work from the CPU to the GPU. There are some impressive demos of this. It needs to be implemented at the API level, so must be built into DirectX/Vulkan before it can become widespread.
Next is VRS (Variable Rate Shading), which is about moving processing power to where it can have a tangible effect: more shaders in important areas, and fewer where the results are negligible. Nvidia suggests a 15 percent performance boost when used effectively. It can be used as part of MAS (Motion Adaptive Shading), where fast-moving objects need less work because they look blurred anyway, and CAS (Content Adaptive Shading), where effort is focused on the primary content: on the car in a driving game, for example.
Nvidia talked about two further features of Turing: MVR (Multi-View Rendering), an enhanced version of the Simultaneous Multi-Projection that was already in Pascal, and TSS (Texture Space Shading). Where SMP focused on two views and VR applications, MVR can do four views per pass, and removes some view-dependent attributes. It should help improve VR applications, especially with some of the newer VR headsets that have a wider field of view. TSS is another method for reducing the shading required, and really only of interest if you are writing a game engine.
And finally, we have a boost to video encoding/decoding. Pascal wasn't bad at this, but the results weren't always as good as the x264 Fast profile running on a CPU. If you're streaming at 1080p, that doesn't matter too much, as Pascal or a good CPU can cope, but move to 4K, and you risk dropped frames. Turing aims to deliver better quality than x264 Fast, even at 4K, with almost no CPU load.
Are we impressed? It is difficult not to be. The Turing architecture is a masterpiece from the leading graphics technology company. It's the result of 10,000 engineering-years of effort (says Nvidia). Not only do we get a decent step up in the classic GPU architecture, we get features that change the way graphics are realized. The hybrid approach, fusing rasterization with real-time ray tracing, and adding AI-powered processing, enables something special. The road to full real-time ray tracing is here, years before most expected it.
We have the hardware, now we need the support. The current list of games is small, a dozen or so, but emerging tech generally starts small. Ray tracing is a better way to render scenes, and the prospect of using it in games, however partially, is too tantalizing to resist. Big names have lined up behind Turing, including Epic, with its Unreal Engine, and EA's Frostbite. Microsoft has plumbed ray tracing into DirectX, giving the all-important route to standardization.
At some point in the next few months, or maybe more, you are going to see a game that pushes the graphics to such a point that you’re going to want a card that can run it, and want it very badly.