AMD ZEN 3 UNWRAPPED
The upcoming Ryzen 5000 CPUs significantly improve on an already successful 7nm design.
2020 certainly has been an intriguing year. Whether it’s the global climate, the COVID pandemic, or computing, as ever, nothing stays the same. The only thing that’s seemingly permanent is impermanence. Nowhere can this premise be better observed than in the realm of processing power. From the advent of Nvidia’s RTX 3000 series to the launch of AMD’s RDNA 2 graphics cards, this year has seen some revolutionary leaps in performance.
But that’s not what we’re talking about here today. It’s all about the processors, namely AMD’s latest Zen 3 or 5000 series chips. Take a moment just to think back 10 years – how far we’ve come in this last decade has been remarkable. The potency of the humble desktop has increased exponentially. Moore’s law may be coming to an end as far as transistor density and performance are concerned, but as always, thanks to human ingenuity, we’re seeing more brilliant minds pivot to push processors further and harder than ever before. Long gone are the days of 10 percent performance increases year on year, that’s for sure.
Since the launch of its first Zen architecture back in 2017, AMD has shown time and time again, with each generational advancement of its processors, that it’s got more than enough clout to take on Intel on the grand stage of CPU dominance. And with this latest 3rd-generation architecture, it’s setting its sights squarely on Intel’s IPC crown. Are we about to see a revolution in the way modern-day computational processing progress is led? How exactly has AMD managed to sneak in and steal the crown from the giant that is Intel? And is AMD’s 7nm Zen 3 architecture that radically different from its previous iteration? What makes it all tick? Well, if you’re interested, dear reader, it’s time to turn the page and find out what the future holds for all of us. – JARROD WALTON
What your AMD Ryzen 5000 chip would look like without the IHS.
ARCHITECTURAL ADVANTAGES
AMD’s CPU team is firing on all cylinders, making Intel’s 14nm designs like Comet Lake look increasingly outdated. AMD’s Ryzen 9 3950X was already making short work of Intel CPUs in just about every discipline – except gaming. Zen 3, also called Vermeer, aims to take on Intel in its last refuge, and the architectural changes required to do so aren’t even that significant. Like Zen 2, Zen 3 uses TSMC’s 7nm N7 process for the CPU chiplets, and 12nm FinFET for the IO chiplet. However, a few smart adjustments are set to make a big difference.
Putting things into perspective, Intel now has SuperFin, the marketing name for its third-generation 10nm lithography, which is already used in the new Tiger Lake CPUs. But Tiger Lake is mobile-only and currently tops out at four-core/eight-thread designs, and it will be a while before eight-core/16-thread chips launch. Intel also has Rocket Lake coming in Q1 of 2021, which will still use 14nm lithography, but with a new architecture – the first truly new desktop architecture from Intel in over five years!
At a high level, AMD says the new Zen 3 CPUs boost IPC (Instructions Per Cycle) by 19 percent across a broad suite of test applications. That might not seem like much when new GPUs come out that improve performance by 30-50 percent, but IPC affects everything. What’s more, AMD says that these IPC gains will be realized without having to change power targets relative to Zen 2, which means the top chips will still have a 105W TDP.
You can see the full rundown of AMD’s upcoming Ryzen 5000-series CPUs in the table on page 35. AMD will initially launch with four models, replacing the most popular Zen 2 CPUs in its lineup. All of these new CPUs should drop into existing X570 and B550 motherboards after a BIOS update. At the top is the Ryzen 9 5950X, a 16-core/32-thread behemoth that can boost up to 4.9GHz—that’s 200MHz higher than the 3950X. The base clock is technically 100MHz lower, but it’s doubtful that it will matter, as the chips are almost certainly going to run in the 4.3-4.5GHz range even under full load. That same pattern of slightly higher clocks with the same core and thread counts continues down through the 12-core, eight-core, and six-core models.
Looking at the specs sheet doesn’t tell the full story, however. The biggest change is the new unified L3 cache. Previously, AMD CPUs used two CCX blocks of four cores each, with an attached 8MB (Zen, Zen+) or 16MB (Zen 2) L3 cache per CCX. With Zen 3, the CCX becomes a native eight-core block with an attached 32MB L3 cache. To understand why this matters, we need to review some of the basics of how modern CPUs work.
UNDERSTANDING THE MEMORY HIERARCHY
All of our computing infrastructure, from tiny chips in smartphones up to massive supercomputer installations, works on a principle of tiered data. With modern applications potentially using gigabytes and even terabytes of data, the difficult problem is figuring out how to best organise access to all of that data. The solution that’s used is known as the memory hierarchy, ranging from tiny amounts of capacity with effectively instantaneous access, up to massive storage clusters that can hold petabytes of data but may require seconds per access, and everything in between.
The fastest storage solutions are in the CPU registers, which are part of the ALUs (Arithmetic Logic Units) that do the actual calculations. There’s no delay for a CPU working on data stored in a register, but the total number of registers is extremely limited. Without getting too deep into the technical details, there are only 16 general-purpose registers exposed to software in modern x86-64 CPUs, with 16 MMX/SSE/FP registers and up to 32 AVX registers – the latter being used for various vector math functions. Each register is anywhere from 64 bits to as many as 512 bits in size. That works out to a maximum of 2KB of total register space for AVX-512 instructions, as an example.
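That register math is easy to verify for yourself. Here’s a quick Python sketch of the arithmetic, using the architectural register counts quoted above:

```python
# Back-of-the-envelope register file sizes, based on the figures above.
# These are architectural (software-visible) registers, not the much
# larger physical register files CPUs use internally for renaming.

def register_bytes(count, width_bits):
    """Total bytes of architectural register space."""
    return count * width_bits // 8

gpr = register_bytes(16, 64)      # x86-64 general-purpose registers
sse = register_bytes(16, 128)     # XMM registers
avx512 = register_bytes(32, 512)  # ZMM registers under AVX-512

print(f"GPRs:    {gpr} bytes")
print(f"SSE:     {sse} bytes")
print(f"AVX-512: {avx512} bytes ({avx512 // 1024}KB)")
```

Sure enough, 32 registers of 512 bits each come out to exactly 2KB.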
Because register space is so limited, software ends up spending a lot of time storing existing values from registers and loading new values into registers. It’s a constant juggling act, and often data will be kicked out of one register only to be needed a few dozen instructions later. That can be incredibly inefficient, and the solution is to add cache. CPUs typically have at least three levels of cache these days, each succeeding level being larger but slower than the lower-level cache.
Note that internal register renaming allows a CPU core to have more actual registers, but there are still normally only a few hundred total registers. Some architectures may even refer to the renamed registers as an L0 cache, but either way it’s data that can be accessed with no delay penalty.
AMD’s Ryzen 3000-series processors were the first to house the Zen 2 architecture, despite the confusing name.
For example, L1 cache is often 64KB to 96KB in total capacity. Each line of the cache holds 64 bytes of data, the idea being locality of reference: If you access data stored at memory address 0x0100 as an example, the code is more likely to also access data at 0x0108, 0x0120, and so on, in the near future. L1 cache sizes are larger than the registers, and access speed is usually around 1ns, give or take. That translates to anywhere from four to eight cycles, but because the CPU is aware of data accesses in advance, it’s often able to pre-load data from the L1 cache into a register before it’s needed. Basically, L1 cache is nearly instant access.
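Locality of reference comes down to simple address arithmetic: two addresses land in the same 64-byte line if they agree in everything above the low six bits. A few lines of Python make the 0x0100 example concrete:

```python
# Which addresses share a 64-byte cache line? Integer division by the
# line size gives the line index; matching indices mean one cache fill
# serves both accesses.

LINE_SIZE = 64

def same_line(addr_a, addr_b, line_size=LINE_SIZE):
    return addr_a // line_size == addr_b // line_size

print(same_line(0x0100, 0x0108))  # True  - 8 bytes apart, same line
print(same_line(0x0100, 0x0120))  # True  - 32 bytes into the line
print(same_line(0x0100, 0x0140))  # False - 64 bytes away, next line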
Because L1 cache is so fast, it also needs to be small, and most architectures split the L1 into a data cache and an instruction cache. Zen 2, for example, has a 32KB L1 data cache and a 32KB L1 instruction cache. Intel’s Skylake and its derivatives also had a 32K+32K L1D and L1I cache size, while more recent architectures like Ice Lake and Tiger Lake have a 32K L1I cache and a 48K L1D cache size.
L2 cache is another big jump in capacity, and it’s no longer split into separate data and instruction caches. AMD’s Zen 2 and Zen 3 architectures stick with the same 512KB per core, while Intel’s Skylake and derivatives have a 256KB L2 cache size – but the newer Ice Lake chips have 512KB of L2 cache, and Tiger Lake is up to 1.25MB of L2 cache per core. Intel’s HEDT Skylake-X and derivatives had a 1MB per-core L2 cache. Access latency on L2 cache is around 12 cycles, give or take, depending on the architecture. Some of the larger L2 caches may have slightly higher latencies as well, as there’s a balancing act between size and latency (and set associativity, but that’s another topic).
Finally, L3 cache ranges from about 2MB on the lowest tier modern CPUs (i.e. Intel Celeron) up to sizes as large as 64MB on consumer chips – the top Threadripper models even have as much as 256MB of L3 cache! Unlike the L2 and L1 caches, L3 cache is shared among all of the CPU cores. That means data accessed by core one and then subsequently needed by core eight could be in the L3 cache. With the large increase in capacity comes a similar increase in latency, and the Zen 2 architecture has about a 40-cycle latency for its L3 cache. Except when it doesn’t, which we’ll get to in just a second.
The whole idea of cache memory is to reduce access latencies on data that the CPU needs. A modern CPU running at 4GHz does four cycles every nanosecond. L1 cache may have an access time of less than 1ns, L2 is 3-4ns, and L3 is around 10-15ns (maybe more, depending on the architecture). But as bad as that sounds, it’s nothing compared to system memory. Even though fast DDR4-3200 CL14 may have a theoretical latency of 8-10ns, real-use latency is more like 60-80ns, sometimes more. That means the CPU can get stuck waiting for 250-400 cycles when pulling data from system RAM. And that’s still nothing compared to SSD or HDD storage, which can cause thousands of cycles of delay.
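Converting those latencies into cycles is simple multiplication. The sketch below uses illustrative round-number latencies of our own choosing, not measured figures for any particular chip:

```python
# Turning access latencies into CPU cycles at a given clock speed.
# A 4GHz core completes four cycles per nanosecond, so a 70ns trip to
# DRAM costs a few hundred cycles - in line with the figures above.

def cycles(latency_ns, clock_ghz=4.0):
    return latency_ns * clock_ghz

for name, ns in [("L1", 1), ("L2", 3), ("L3", 12), ("DRAM", 70)]:
    print(f"{name:>4}: {ns:>3}ns ~= {cycles(ns):.0f} cycles at 4GHz")
```

A 70ns round trip to RAM works out to 280 cycles at 4GHz, which is why the cache hierarchy matters so much.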
In short, the cache hierarchy is critical to realising the performance potential of modern CPUs. Improving the caches is thus one of the main ways of increasing CPU efficiency.
REWORKING THE L3 CACHE FOR ZEN 3
The biggest change with Zen 3 is that AMD has overhauled its L3 cache. Actually, it overhauled the whole CCX (Core Complex), which is the fundamental building block for AMD’s Ryzen CPUs. Let’s quickly talk about how Zen 2 and earlier Ryzen CPUs worked, and then we’ll move on to how exactly Zen 3 improves things.
Each CPU chip or chiplet in the previous-generation Ryzen CPUs contained two four-core CCX blocks. The L3 cache was directly tied to the CCX, and while it was shared between all of the cores, access latencies weren’t consistent. The cores in a CCX directly attached to the L3 for that CCX had a latency advantage. Cores in a different CCX had to route requests for data over the Infinity Fabric.
As an example, cores one to four on a Ryzen 9 3950X are in one CCX on the first chiplet, cores five to eight are in the second CCX on the same chiplet, cores nine to 12 are in CCX1 on the second chiplet, and finally cores 13-16 are in CCX2 on chiplet two. That’s four groups of four CPU cores, and four 8MB L3 cache blocks.
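That layout is regular enough to express in a few lines of Python. The numbering here is our own illustration (1-based, matching the text), not an OS-level core ID:

```python
# A sketch of the Zen 2 topology described above: on a 16-core 3950X,
# cores are grouped four to a CCX, with two CCXs per chiplet.

def zen2_location(core, cores_per_ccx=4, ccx_per_chiplet=2):
    """Map a 1-based core number to (chiplet, CCX-within-chiplet)."""
    idx = core - 1                  # convert to 0-based index
    ccx = idx // cores_per_ccx      # global CCX index
    chiplet = ccx // ccx_per_chiplet + 1
    local_ccx = ccx % ccx_per_chiplet + 1
    return chiplet, local_ccx

print(zen2_location(1))   # (1, 1): chiplet one, CCX1
print(zen2_location(8))   # (1, 2): chiplet one, CCX2
print(zen2_location(9))   # (2, 1): chiplet two, CCX1
print(zen2_location(16))  # (2, 2): chiplet two, CCX2
```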
AMD has had to redevelop the L3 cache for Zen 3 chips quite radically.
Here’s where things get difficult. Every L2 cache miss checks the four L3 caches to see if they contain the desired data. If the local L3 has a hit, though, things are much better than if the data is in one of the other L3 caches. That’s because on Zen 2, the L3 cache access ends up going from the requesting core over the Infinity Fabric to the cIO chiplet, then to the CCX that has the data in its L3 cache, and then back over to the cIO chiplet before it ends up at the requesting CCX. Even on a Ryzen 7 3700X, which only has a single CPU chiplet and the cIO chiplet, L3 cache requests from CCX1 to CCX2 are routed that way.
It’s messy and slow, and the solution is to move away from a four-core CCX with attached 16MB L3, to an eight-core CCX with an attached 32MB L3. That’s basically the biggest change with Zen 3 compared to Zen 2. We don’t have hard data on how memory latencies improve yet, but even though the larger L3 may be slightly slower, overall cache latencies should be much lower.
There are two reasons for this. First, in single-compute chiplet Zen processors (e.g. the eight-core 5800X and six-core 5600X), there won’t be any L3-to-cIO chiplet traffic to worry about. Second, even on the dual-compute chiplet processors, there won’t be any Infinity Fabric traffic for L3 requests from the same compute chiplet; it’s only L3 accesses from one compute chiplet to the other that route over the Infinity Fabric. That’s a net reduction in traffic, and a reduction in latency.
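To make the routing difference concrete, here’s a toy Python model of our own (not AMD’s numbers) that counts Infinity Fabric crossings for an L3 hit under each scheme:

```python
# Toy model of the routing change above. On Zen 2, an L3 hit in a
# different CCX crosses the Infinity Fabric twice (out to the cIO die,
# then back). On Zen 3, any L3 hit stays on-die within a compute
# chiplet; only chiplet-to-chiplet requests touch the fabric.

def fabric_hops_zen2(requesting_ccx, target_ccx):
    return 0 if requesting_ccx == target_ccx else 2

def fabric_hops_zen3(requesting_chiplet, target_chiplet):
    return 0 if requesting_chiplet == target_chiplet else 2

# On a one-chiplet Zen 2 part like the 3700X, CCX1 -> CCX2 still pays:
print(fabric_hops_zen2(1, 2))  # 2 fabric crossings
# The equivalent Zen 3 part (5800X) keeps everything on-die:
print(fabric_hops_zen3(1, 1))  # 0 fabric crossings
```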
OTHER CHANGES AND NON-CHANGES FOR RYZEN 5000
Interestingly, there are a lot of things that AMD isn’t changing for the Zen 3 and Ryzen 5000 launch. For starters, there won’t be a new chipset. Existing X570 and B550 boards should have updated firmware in place to support the new CPUs, though some new board designs will inevitably arrive specifically built for Zen 3. AMD previously said it wouldn’t support Ryzen 5000 CPUs on earlier AM4 chipsets and motherboards, but community feedback prompted a change of heart. X470 and B450 boards will now get updated firmware for Zen 3, with some limitations. The boards still won’t support PCIe Gen4, but there are other caveats.
First, the new firmware will drop support for older AMD CPUs. Right now, an X470 board can run Zen, Zen+, and Zen 2 CPUs and APUs. After flashing the BIOS, the board will only support Zen 3. Second, the flash will be one-way only, and AMD says it will require confirmation that a user has purchased a Zen 3 CPU before allowing the BIOS download. We’re not quite sure how that’s going to work, as it may be as simple as a message saying, “Please click to confirm you understand your board will no longer support Ryzen 3000 and earlier CPUs after flashing.” Finally, the beta firmware updates for X470 and B450 won’t begin arriving until January 2021.
Other things that remain the same include the cIO chiplet. It’s still manufactured on GlobalFoundries’ 12nm process. Considering the demand for TSMC’s N7 node, this is a smart move. Nvidia likely couldn’t come to an agreement with TSMC for enough N7 wafers to build its RTX 30-series GPUs there, opting instead for Samsung. The current cIO die has everything else that’s needed, and apparently any power savings available by moving to 7nm were outweighed by the difficulty in procuring the wafers.
TDP, as noted, remains at 105W maximum, but that’s not the real maximum. Short-term boost can go 35 percent higher than the TDP, so chips with a 105W TDP can actually run at up to 142W (and often will do so in enthusiast motherboards). The 65W chips, meanwhile, can run at up to 88W. There’s a 142W maximum power limit on socket AM4 that remains in place (though obviously overclocking can exceed that).
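The boost budget works out neatly – the socket’s package power limit sits roughly 35 percent above TDP, as a quick calculation shows:

```python
# The power headroom described above: AM4's short-term boost limit
# (often called PPT) is roughly TDP plus 35 percent.

def ppt_watts(tdp, headroom=0.35):
    return round(tdp * (1 + headroom))

print(ppt_watts(105))  # 142 - the 105W chips' real ceiling
print(ppt_watts(65))   # 88  - and the 65W chips'
```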
AMD is also choosing to keep with its Ryzen 3000 XT series philosophy of not including a box cooler with anything above the Ryzen 5 line. That means the Ryzen 5 5600X will be the only CPU that comes with a cooler. The 5800X, 5900X, and 5950X will all require an aftermarket cooler, and AMD recommends liquid-cooling solutions.
ZEN 3 PERFORMANCE PREVIEW
We’ll have the full review of the Ryzen 5000-series parts next issue. Early benchmarks look promising, however, with a single-threaded result on the Ryzen 9 5900X of 631 for Cinebench R20, compared to 524 for the 3900X. That’s right in line with AMD’s 19 percent IPC claims, and perhaps more importantly, it’s also a significant jump from the score of 544 on the Core i9-10900K. Not only is AMD faster, but it does so while using significantly less power. (Disclaimer: That’s an AMD benchmark result, and Cinebench R20 tends to favor AMD’s architectures more than some other applications.)
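You can check those claims against the quoted Cinebench R20 scores with a one-line calculation:

```python
# Sanity-checking the single-threaded Cinebench R20 numbers quoted
# above against AMD's 19 percent IPC claim.

def uplift(new, old):
    """Percentage improvement of new score over old score."""
    return (new / old - 1) * 100

print(f"5900X vs 3900X:  {uplift(631, 524):.1f}%")   # roughly 20 percent
print(f"5900X vs 10900K: {uplift(631, 544):.1f}%")   # roughly 16 percent
```

The 5900X’s gain over the 3900X lands right around AMD’s stated figure, with a healthy lead over the 10900K on top.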
Perhaps more important than 3D-rendering performance, AMD makes no apologies when it comes to gaming capabilities. In our own testing, the Core i9-10900K was around 10 percent faster across a gaming test suite at 1080p ultra using an RTX 2080 Ti. Across a test suite of 10 games, AMD shows the 5900X matching or beating the 10900K – not by a lot, but it’s better than trailing. We’re certainly eager to put both the 5900X and 10900K to the test, and see how the chips perform with Nvidia’s monster RTX 3090, or maybe even the RX 6900 XT.
Incidentally, AMD has a new feature it calls Smart Access Memory that will apparently improve gaming performance by around five percent when you pair a new Zen 3 CPU with a Big Navi GPU. It’s yet another aspect of the new CPUs and GPUs we’re looking forward to testing.
That brings up a few final interesting items of note. Intel still doesn’t have a desktop PCIe Gen4 platform, and it won’t until Rocket Lake arrives – which will probably be in March 2021.
Intel is promising up to 18 percent IPC gains for single-threaded workloads with Rocket Lake, along with PCIe Gen4 capability. The problem is that Rocket Lake will apparently top out at an eight-core/16-thread configuration. AMD currently enjoys a PCIe advantage, a process technology advantage, and a core-count advantage. And Intel’s first SuperFin desktop chips may not arrive until late 2021.
This is potentially the biggest lead AMD has had relative to Intel since the early days of Athlon 64, more than 15 years ago. Perhaps it’s no surprise that AMD is also planning to increase prices by around $50 across its suite of Ryzen 5000 CPUs. Intel, meanwhile, appears to be countering with price cuts, but even that may not be enough to stay competitive in the coming year.