PCPOWERPLAY

AMD ZEN 3 UNWRAPPED

The upcoming Ryzen 5000 CPUs significantly improve on an already successful 7nm design.

2020 certainly has been an intriguing year. Whether that’s the global climate, the COVID pandemic, or computing, as ever nothing stays the same. The only thing that’s seemingly permanent is impermanen­ce. There’s nowhere else that this premise can be better observed than in the realm of processing power. From the advent of Nvidia’s RTX 3000 series, to AMD’s RDNA 2 graphics cards launching, this year has seen some revolution­ary leaps in performanc­e.

But that’s not what we’re talking about here today. It’s all about the processors, and namely AMD’s latest Zen 3 or 5000 series chips. Take a moment to think back 10 years – how far we’ve come in this last decade has been remarkable. The potency of the humble desktop has increased exponentially. Moore’s law may be coming to an end as far as transistor density and performance are concerned, but as always, thanks to human ingenuity, we’re seeing brilliant minds pivot to push processors further and harder than ever before. Long gone are the days of 10 percent performance increases year on year, that’s for sure.

Since the launch of its first Zen architectu­re back in 2017, AMD has shown time and time again, with each generation­al advancemen­t of its processors, that it’s got more than enough clout to take on Intel on the grand stage of CPU dominance. And with this latest 3rd-generation architectu­re, it’s aiming its sights squarely on Intel’s IPC crown. Are we about to see a revolution in the way modern-day computatio­nal processing progress is led? How exactly has AMD managed to sneak in and steal the crown from the giant that is Intel? And is AMD’s 7nm Zen 3 architectu­re that radically different to its previous iteration? What makes it all tick? Well if you’re interested in that, dear reader, it’s time to turn the page and find out what the future holds for all of us. – JARROD WALTON

What your AMD Ryzen 5000 chip would look like without the IHS.

ARCHITECTURAL ADVANTAGES

AMD’s CPU team is firing on all cylinders, making Intel’s 14nm designs like Comet Lake look increasing­ly outdated. AMD’s Ryzen 9 3950X was already making short work of Intel CPUs in just about every discipline – except gaming. Zen 3, also called Vermeer, aims to take on Intel in its last refuge, and the architectu­ral changes required to do so aren’t even that significan­t. Like Zen 2, Zen 3 uses TSMC’s 7nm N7 process for the CPU chiplets, and 12nm FinFET for the IO chiplet. However, a few smart adjustment­s are set to make a big difference.

Putting things into perspectiv­e, Intel now has SuperFin, the marketing name for its third generation 10nm lithograph­y that’s already used in the new Tiger Lake CPUs. But Tiger Lake is mobile-only and currently tops out at four-core/eight-thread designs, and it will be a while before up to eight-core/16-thread chips launch. Intel also has Rocket Lake coming in Q1 of 2021, which will still use 14nm lithograph­y, but with a new architectu­re – the first truly new desktop architectu­re from Intel in over five years!

At a high level, AMD says the new Zen 3 CPUs boost IPC (Instructio­ns Per Cycle) by 19 percent across a broad suite of test applicatio­ns. That might not seem like much when new GPUs come out that improve performanc­e by 30-50 percent, but IPC affects everything. What’s more, AMD says that these IPC gains will be realized without having to change power targets relative to Zen 2, which means the top chips will still have a 105W TDP.

You can see the full rundown of AMD’s upcoming Ryzen 5000-series CPUs in the table on page 35. AMD will initially launch with four models, replacing the most popular Zen 2 CPUs in its lineup. All of these new CPUs should drop into existing X570 and B550 motherboar­ds after a BIOS update. At the top is the Ryzen 9 5950X, a 16-core/32-thread behemoth that can boost up to 4.9GHz—that’s 200MHz higher than the 3950X. The base clock is technicall­y 100MHz lower, but it’s doubtful that it will matter, as the chips are almost certainly going to run in the 4.3-4.5GHz range even under full load. That same pattern of slightly higher clocks with the same core and thread counts continues down through the 12-core, eight-core, and six-core models.

Looking at the spec sheet doesn’t tell the full story, however. The biggest change is the new unified L3 cache. Previously, AMD CPUs used two CCX blocks of four cores each, with an attached 8MB (Zen, Zen+) or 16MB (Zen 2) L3 cache per CCX. With Zen 3, the CCX becomes a native eight-core block with an attached 32MB L3 cache. To understand why this matters, we need to review some of the basics of how modern CPUs work.

UNDERSTANDING THE MEMORY HIERARCHY

All of our computing infrastruc­ture, from tiny chips in smartphone­s up to massive supercompu­ter installati­ons, works on a principle of tiered data. With modern applicatio­ns potentiall­y using gigabytes and even terabytes of data, the difficult problem is figuring out how to best organise access to all of that data. The solution that’s used is known as the memory hierarchy, ranging from tiny amounts of capacity with effectivel­y instantane­ous access, up to massive storage clusters that can hold petabytes of data but may require seconds per access, and everything in between.

The fastest storage solutions are in the CPU registers, which are part of the ALUs (Arithmetic Logic Units) that do the actual calculations. There’s no delay for a CPU working on data stored in a register, but the total number of registers is extremely limited. Without getting too deep into the technical details, there are only 16 general-purpose registers exposed to software on modern x86 CPUs running in 64-bit mode (and just eight in 32-bit mode), with 16 MMX/SSE/FP registers and up to 32 AVX registers – the latter being used for various vector math functions. Each register is anywhere from 64 bits to as many as 512 bits in size. That works out to a maximum of 2KB of total register space for AVX-512 instructions, as an example.
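The 2KB figure can be checked with a quick calculation. This is only a sketch of the arithmetic, using the register count and width quoted above; the function and constant names are made up for illustration.

```python
# Register counts and widths are the figures quoted in the text;
# the names below are illustrative, not taken from any vendor documentation.
AVX512_REGISTERS = 32      # up to 32 vector registers
BITS_PER_REGISTER = 512    # each register is 512 bits wide


def register_space_bytes(count: int, bits_each: int) -> int:
    """Total storage in bytes for `count` registers of `bits_each` bits."""
    return count * bits_each // 8


total = register_space_bytes(AVX512_REGISTERS, BITS_PER_REGISTER)
print(f"{total} bytes = {total // 1024}KB")
```

Even at its largest, that is tiny next to the caches discussed below.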

Because register space is so limited, software ends up spending a lot of time storing existing values from registers and loading new values into registers. It’s a constant juggling act, and often data will be kicked out of one register only to be needed a few dozen instructio­ns later. That can be incredibly inefficien­t, and the solution is to add cache. CPUs typically have at least three levels of cache these days, each succeeding level being larger but slower than the lower-level cache.

Note that internal register renaming allows a CPU core to have more actual registers, but there are still normally only a few hundred total registers. Some architectu­res may even refer to the renamed registers as an L0 cache, but either way it’s data that can be accessed with no delay penalty.

AMD’s Ryzen 3000-series processors were the first to house the Zen 2 architecture, despite the confusing name.

For example, L1 cache is often 64KB to 96KB in total capacity. Each line of the cache holds 64 bytes of data, the idea being locality of reference: if you access data stored at memory address 0x0100 as an example, the code is more likely to also access data at 0x0108, 0x0120, and so on, in the near future. L1 cache sizes are larger than the registers, and access speed is usually around 1ns, give or take. That translates to anywhere from four to eight cycles, but because the CPU is aware of data accesses in advance, it’s often able to pre-load data from the L1 cache into a register before it’s needed. Basically, L1 cache is nearly instant access.
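To see why those neighbouring addresses benefit, here is a minimal sketch that maps an address to its 64-byte cache line. It assumes lines are simply aligned to 64-byte boundaries; the function name is hypothetical, and the 0x0140 address is an extra value added purely for contrast.

```python
CACHE_LINE_BYTES = 64  # line size quoted in the text


def cache_line_index(address: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Index of the aligned cache line that contains `address`."""
    return address // line_bytes


# 0x0100, 0x0108 and 0x0120 come from the text; 0x0140 is an added contrast case.
for addr in (0x0100, 0x0108, 0x0120, 0x0140):
    print(f"{addr:#06x} -> line {cache_line_index(addr)}")
```

The first three addresses fall in the same line, so a single fetch covers all of them, while 0x0140 starts the next line.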

Because L1 cache is so fast, it also needs to be small, and most architectu­res split the L1 into a data cache and an instructio­n cache. Zen 2, for example, has a 32KB L1 data cache and a 32KB L1 instructio­n cache. Intel’s Skylake and its derivative­s also had a 32K+32K L1D and L1I cache size, while more recent architectu­res like Ice Lake and Tiger Lake have a 32K L1I cache and a 48K L1D cache size.

L2 cache is another big jump in capacity, and it’s no longer split into separate data and instruction caches. AMD’s Zen 2 and Zen 3 architectures stick with the same 512KB per core, while Intel’s Skylake and derivatives have a 256KB L2 cache size – but the newer Ice Lake chips have 512KB L2 cache, and Tiger Lake is up to 1.25MB L2 cache per core. Intel’s HEDT Skylake-X and derivatives had a 1MB per core L2 cache. Access latency on L2 cache is around 12 cycles, give or take, depending on the architecture. Some of the larger L2 caches may have slightly higher latencies as well, as there’s a balancing act between size and latency (and set associativity, but that’s another topic).

Finally, L3 cache ranges from about 2MB on the lowest tier modern CPUs (i.e. Intel Celeron) up to sizes as large as 64MB on consumer chips – the top Threadripp­er models even have as much as 256MB of L3 cache! Unlike the L2 and L1 caches, L3 cache is shared among all of the CPU cores. That means data accessed by core one and then subsequent­ly needed by core eight could be in the L3 cache. With the large increase in capacity comes a similar increase in latency, and the Zen 2 architectu­re has about a 40-cycle latency for its L3 cache. Except when it doesn’t, which we’ll get to in just a second.

The whole idea of cache memory is to reduce access latencies on data that the CPU needs. A modern CPU running at 4GHz does four cycles every nanosecond. L1 cache may have an access time of less than 1ns, L2 is 3-4ns, and L3 is around 10-15ns (maybe more, depending on the architecture). But as bad as that sounds, it’s nothing compared to system memory. Even though fast DDR4-3200 CL14 may have a theoretical latency of 8-10ns, real-use latency is more like 60-80ns, sometimes more. That means the CPU can get stuck waiting for 250-400 cycles when pulling data from system RAM. And that’s still nothing compared to SSD or HDD storage, where a single access can cost anywhere from tens of thousands to millions of cycles.
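Converting those latencies into clock cycles is a single multiplication. The sketch below assumes the 4GHz clock used in the text; the function name and the dictionary of latency ranges are illustrative only.

```python
def ns_to_cycles(latency_ns: float, clock_ghz: float = 4.0) -> float:
    """Cycles that elapse during `latency_ns` at `clock_ghz` (GHz = cycles per ns)."""
    return latency_ns * clock_ghz


# (low, high) latency ranges in nanoseconds, as quoted in the text.
latency_ranges_ns = {
    "L2 cache": (3, 4),
    "L3 cache": (10, 15),
    "System RAM": (60, 80),
}

for level, (low, high) in latency_ranges_ns.items():
    print(f"{level}: {ns_to_cycles(low):.0f}-{ns_to_cycles(high):.0f} cycles")
```

At 4GHz, 60-80ns works out to 240-320 cycles, and the upper end of the 250-400 range corresponds to roughly 100ns, which covers the "sometimes more" case.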

In short, the cache hierarchy is critical to realising the performanc­e potential of modern CPUs. Improving the caches is thus one of the main ways of increasing CPU efficiency.

REWORKING THE L3 CACHE FOR ZEN 3

The biggest change with Zen 3 is that AMD has overhauled its L3 cache. Actually, it overhauled the whole CCX (Core Complex), which is the fundamenta­l building block for AMD’s Ryzen CPUs. Let’s quickly talk about how Zen 2 and earlier Ryzen CPUs worked, and then we’ll move on to how exactly Zen 3 improves things.

Each CPU chip or chiplet in the previous-generation Ryzen CPUs contained two four-core CCX blocks. The L3 cache was directly tied to the CCX, and while it was shared between all of the cores, access latencies weren’t consistent. The cores in a CCX directly attached to the L3 for that CCX had a latency advantage. Cores in a different CCX had to route requests for data over the Infinity Fabric.

As an example, cores one to four on a Ryzen 9 3950X are in one CCX on the first chiplet, cores five to eight are in the second CCX on the same chiplet, cores nine to 12 are in CCX1 on the second chiplet, and finally cores 13-16 are in CCX2 on chiplet two. That’s four groups of four CPU cores, and four 16MB L3 cache blocks.
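The grouping described above can be written as a small lookup. This is a sketch that assumes cores are numbered sequentially from one, as in the example; the function name and parameters are hypothetical, and real operating systems may enumerate cores differently.

```python
def locate_core(core: int, cores_per_ccx: int, ccx_per_chiplet: int) -> tuple:
    """Return (chiplet, ccx) for a 1-based core number; both results are 1-based."""
    ccx_global = (core - 1) // cores_per_ccx       # CCX index across the whole CPU
    chiplet = ccx_global // ccx_per_chiplet + 1    # which compute chiplet
    ccx = ccx_global % ccx_per_chiplet + 1         # which CCX within that chiplet
    return chiplet, ccx


# Zen 2 style layout from the text: four cores per CCX, two CCXs per chiplet.
for core in (1, 5, 9, 13):
    print(f"Zen 2 core {core}: chiplet/CCX = {locate_core(core, 4, 2)}")

# Zen 3 style layout: eight cores per CCX, one CCX per chiplet.
for core in (1, 8, 9, 16):
    print(f"Zen 3 core {core}: chiplet/CCX = {locate_core(core, 8, 1)}")
```

With the Zen 3 parameters, all eight cores on a chiplet land in the same CCX, which is the change the rest of this section is about.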

Here’s where things get difficult. Every L2 cache miss checks the four L3 caches to see if they contain the desired data. If the local L3 has a hit, though, things are much better than if the data is in one of the other L3 caches. That’s because on Zen 2, the L3 cache access ends up going from the requesting core over the Infinity Fabric to the cIO chiplet, then to the CCX that has the data in its L3 cache, and then back over to the cIO chiplet before it ends up at the requesting CCX. Even on Ryzen 7 3700X, which only has a single CPU chiplet and the cIO chiplet, L3 cache requests from CCX1 to CCX2 are routed that way.

AMD has had to redevelop the L3 cache for Zen 3 chips quite radically.

It’s messy and slow, and the solution is to move away from a four-core CCX with attached 16MB L3, to an eight-core CCX with an attached 32MB L3. That’s basically the biggest change with Zen 3 compared to Zen 2. We don’t have hard data on how memory latencies improve yet, but even though the larger L3 may be slightly slower, overall cache latencies should be much lower.

There are two reasons for this. First, in single-compute chiplet Zen 3 processors (e.g. the eight-core 5800X and six-core 5600X), there won’t be any L3-to-cIO chiplet traffic to worry about. Second, even on the dual-compute chiplet processors, there won’t be any Infinity Fabric traffic for L3 requests from the same compute chiplet; it’s only L3 accesses from one compute chiplet to the other that route over the Infinity Fabric. That’s a net reduction in traffic, and a reduction in latency.
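One rough way to picture the reduction is to count how many pairs of cores share an L3 block under each layout. The toy model below treats every pair of cores as equally likely to exchange data, which is an assumption for illustration and not something AMD has published; the function name is hypothetical.

```python
from itertools import combinations


def local_pair_fraction(total_cores: int, cores_per_l3: int) -> float:
    """Fraction of distinct core pairs that sit behind the same L3 block."""
    pairs = list(combinations(range(total_cores), 2))
    local = sum(1 for a, b in pairs if a // cores_per_l3 == b // cores_per_l3)
    return local / len(pairs)


# Four cores per L3 models the older CCX; eight cores per L3 models Zen 3.
for cores in (8, 16):
    old = local_pair_fraction(cores, 4)
    new = local_pair_fraction(cores, 8)
    print(f"{cores} cores: {old:.0%} local pairs before, {new:.0%} after")
```

In this model an eight-core chip goes from under half of its core pairs sharing an L3 to all of them, while a 16-core chip roughly doubles its share of local pairs.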

OTHER CHANGES AND NON-CHANGES FOR RYZEN 5000

Interestin­gly, there are a lot of things that AMD isn’t changing for the Zen 3 and Ryzen 5000 launch. For starters, there won’t be a new chipset. Existing X570 and B550 boards should have updated firmware in place to support the new CPUs, though some new board designs will inevitably arrive specifical­ly built for Zen 3. AMD previously said it wouldn’t support Ryzen-5000 CPUs on earlier AM4 chipsets and motherboar­ds, but community feedback prompted a change of heart. X470 and B450 boards will now get updated firmware for Zen 3, with some limitation­s. The boards still won’t support PCIe Gen4, but there are other caveats.

First, the new firmware will drop support for older AMD CPUs. Right now, an X470 board can run Zen, Zen+, and Zen 2 CPUs and APUs. After flashing the BIOS, the board will only support Zen 3. Second, the flash will be one direction only for some reason, and AMD says it will require confirmati­on that a user has purchased a Zen 3 CPU before allowing the BIOS download. We’re not quite sure how that’s going to work, as it may be as simple as a message saying, “Please click to confirm you understand your board will no longer support Ryzen 3000 and earlier CPUs after flashing.” Finally, the beta firmware updates for X470 and B450 won’t begin arriving until January 2021.

Other things that remain the same include the cIO chiplet. It’s still manufactur­ed on GlobalFoun­dries’ 12nm process. Considerin­g the demand for TSMC’s N7 node, this is a smart move. Nvidia likely couldn’t come to an agreement with TSMC for enough N7 wafers to build its RTX 30-series GPUs there, opting instead for Samsung. The current cIO die has everything else that’s needed, and apparently any power savings available by moving to 7nm were outweighed by the difficulty in procuring the wafers.

TDP, as noted, remains at 105W maximum, but that’s not the real maximum. Short-term boost can go 35 percent higher than the TDP, so chips with a 105W TDP can actually run at up to 142W (and often will do so in enthusiast motherboar­ds). The 65W chips, meanwhile, can run at up to 88W. There’s a 142W maximum power limit on socket AM4 that remains in place (though obviously overclocki­ng can exceed that).
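The relationship between the rated TDP and the short-term limit is simple to reproduce. This sketch uses the 35 percent figure from the text and rounds to the nearest watt; the names are illustrative rather than AMD’s own terminology.

```python
BOOST_FACTOR = 1.35  # short-term boost can go 35 percent above TDP, per the text


def short_term_power_limit(tdp_watts: float) -> int:
    """Approximate short-term power limit in watts, rounded to the nearest watt."""
    return round(tdp_watts * BOOST_FACTOR)


for tdp in (65, 105):
    print(f"{tdp}W TDP -> up to {short_term_power_limit(tdp)}W")
```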

AMD is also choosing to keep with its Ryzen 3000 XT series philosophy of not including a box cooler with anything above the Ryzen 5 line. That means the Ryzen 5 5600X will be the only CPU that comes with a cooler. The 5800X, 5900X, and 5950X will all require an aftermarke­t cooler, and AMD recommends liquid-cooling solutions.

ZEN 3 PERFORMANCE PREVIEW

We’ll have the full review of the Ryzen 5000-series parts next issue. Early benchmarks look promising, however, with a single-threaded result on the Ryzen 9 5900X of 631 for Cinebench R20, compared to 524 for the 3900X. That’s right in line with AMD’s 19 percent IPC claims, and perhaps more importantl­y, it’s also a significan­t jump from the score of 544 on the Core i9-10900K. Not only is AMD faster, but it does so while using significan­tly less power. (Disclaimer: That’s an AMD benchmark result, and Cinebench R20 tends to favor AMD’s architectu­res more than some other applicatio­ns.)
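For reference, the percentage gaps implied by those Cinebench R20 scores can be worked out directly. This is just the arithmetic on the three scores quoted above, with a hypothetical helper name.

```python
def percent_gain(new_score: float, old_score: float) -> float:
    """Percentage by which `new_score` exceeds `old_score`."""
    return (new_score / old_score - 1) * 100


RYZEN_5900X = 631   # AMD-supplied single-threaded result
RYZEN_3900X = 524
CORE_10900K = 544

print(f"5900X vs 3900X:  {percent_gain(RYZEN_5900X, RYZEN_3900X):.1f}% faster")
print(f"5900X vs 10900K: {percent_gain(RYZEN_5900X, CORE_10900K):.1f}% faster")
```

That gives roughly 20 percent over the 3900X and 16 percent over the 10900K. A single-threaded score reflects clock speed as well as IPC, so it is not a pure IPC measurement.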

Perhaps more important than 3D-rendering performance, AMD makes no apologies when it comes to gaming capabilities. In our own testing, the Core i9-10900K was around 10 percent faster across a gaming test suite at 1080p ultra using an RTX 2080 Ti. Across a test suite of 10 games, AMD shows the 5900X matching or beating the 10900K – not by a lot, but it’s better than trailing. We’re certainly eager to put both the 5900X and 10900K to the test, and see how the chips perform with Nvidia’s monster RTX 3090, or maybe even the RX 6900 XT.

Incidental­ly, AMD has a new feature it calls Smart Access Memory that will apparently improve gaming performanc­e by around five percent when you pair a new Zen 3 CPU with a Big Navi GPU. It’s yet another aspect of the new CPUs and GPUs we’re looking forward to testing.

That brings up a few final interestin­g items of note. Intel still doesn’t have a desktop PCIe Gen4 platform, and it won’t until Rocket Lake arrives – which will probably be in March 2021.

Intel is promising up to 18 percent IPC gains for single-threaded workloads with Rocket Lake, along with PCIe Gen4 capability. The problem is that Rocket Lake will apparently top out at an eight-core/16-thread configurat­ion. AMD currently enjoys a PCIe advantage, a process technology advantage, and a core-count advantage. And Intel’s first SuperFin desktop chips may not arrive until late 2021.

This is potentiall­y the biggest lead AMD has had relative to Intel since the early days of Athlon 64, more than 15 years ago. Perhaps it’s no surprise that AMD is also planning to increase prices by around $50 across its suite of Ryzen 5000 CPUs. Intel, meanwhile, appears to be countering with price cuts, but even that may not be enough to stay competitiv­e in the coming year.

Thanks to a core-complex redesign, Zen 3 can now connect 32MB of cache to eight cores, reducing access latency times significantly.

A reduction in latency like this will dramatically improve IPC performance, ideal for gaining that edge in gaming, and single core tasks.

Comparing AMD’s Ryzen 9 5900X 12 core to Intel’s Core i9-10900K sees significant improvement in almost every single title at 1080p.
