APC Australia

ARMing THE WORLD

Apple is dumping Intel processors for its own ARM-based designs. Will the traditional PC be next?


No one would have believed in the last years of the 20th century that the processor world was being watched keenly and closely by intelligences greater than Intel’s. Yet in the UK, intellects vast and cool regarded the processor market with envious eyes, slowly and surely drawing their plans against Intel…

Apple is dumping Intel processors for its own design of silicon. How did this happen, and what, if any, are the ramifications for the wider PC market? How can a processor design that started life in an obscure, failed British home computer of the 1980s now challenge the entire Intel empire? We’re going to delve into the ARM microarchitecture, look at how it has advanced over the years and how those architectural advances have borne out in benchmarks, and contrast the results with those of Intel desktop parts.

As we do this we’re going to find two contrasting stories: one of maximising performance increases generation by generation, and the other of delivering fixed, incremental increases each generation. We can delve into the reasoning behind why those increases played out as they did, and argue whether they have ultimately been good decisions or not.

We can also argue about competition in the marketplace and how that’s ultimately good for us, the consumers. But an architecture running an entirely different instruction set – is that good for PC consumers? Perhaps we’re getting ahead of ourselves. NEIL MOHR

What is an Intel processor? What is a PC? When IBM was picking parts, it could have gone with its own IBM 801 RISC processor, but the budget insisted on the Intel 8088, and history was set: every compatible PC would be running an x86-compatible processor.

Technically anyone “could” design and manufacture an x86-compatible processor, but legally Intel owns the patents to the instruction set and has to license it for that to happen. If a company has ever designed or manufactured an x86 processor, it’s because Intel (or a court) allowed it to. AMD is different, as it has a complete cross-patent licensing agreement with Intel, so the two companies don’t end up suing each other into oblivion.

Over the years there has been a choice of different x86 manufacturers: IBM made a range of 386/486 processors, and AMD, Cyrix, VIA, NEC, Transmeta, and some others all had a go, with the running theme usually being low-end designs. Intel has always been the x86 top dog, with the others (apart from AMD and the IBM days) being also-rans. So you could argue there has been competition in the market, if fighting over the dregs counts as competition.

The point is that there’s almost no competition for Intel in the market – even today, with AMD doing well, AMD commands just 18 percent (Mercury Research data) of the consumer market. AMD itself said it was aiming for 10 percent of the server market in 2020, hoping to claw back to the heights of the Opteron days, when it held a heady 25 percent market share in 2006.

It’s fine to lament the lack of competition, but what could possibly change to break the status quo? The big recent announcement is that Apple will start to move away from Intel-based processors and switch all of its hardware to its own design of processor. Apple’s not talking about just laptops or low-end iMacs, but even its high-end workstation offerings that use the Intel Xeon. It’s a bold statement, but how is it going to manage it?

Simpler times

As we’ve alluded to, it’s not going to be an x86-based processor. As you know, Apple makes the iPhone, which, it turns out, is rather popular. iPhones also tend to be the fastest phones on the market, by quite some way. Apple could use whoever designs those mobile processors to come up with something for its desktop systems! Who designed those processors again? Apple, using the ARM-licensed Instruction Set Architecture (ISA).

For those that don’t know, ARM is a Reduced Instruction Set Computer (RISC) design, whereas x86 is a Complex Instruction Set Computer (CISC) design.

Just to touch upon the background, the design philosophy behind RISC is to optimise the instruction set: ensure instructions can be run in a single memory cycle, and eliminate unnecessary instructions to leave an optimised core. As time has progressed, instruction sets have grown, especially with specialised cryptography/vector/SIMD functions, so the “reduced” part is now misleading.

There are interesting consequences from those design decisions. Optimising the instruction set reduces the number of transistors being used, and that reduces the amount of power required to do anything. So x86, with its CISC design, inherently requires more transistors to do any computational job, and so more power. On the desktop this isn’t much of an issue, but with laptops and phones every watt counts.
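
To make the load/store distinction concrete, here’s a toy sketch in C. It’s illustrative only – real compiler output depends on the compiler, flags, and target – but it shows how the same one-line increment maps to a single read-modify-write instruction on a CISC ISA versus a short load/add/store sequence on a RISC ISA.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy illustration only -- not actual compiler output.
 * Incrementing a counter held in memory:
 *   - a CISC-style ISA (x86) can encode this as one read-modify-write
 *     instruction, roughly "add dword ptr [rdi], 1";
 *   - a RISC-style load/store ISA (ARM) breaks it into simple steps,
 *     each designed to complete quickly: load, add, store back.
 */
static void bump(volatile uint32_t *counter)
{
    *counter += 1;  /* x86: one instruction; AArch64: ldr / add / str */
}

int main(void)
{
    volatile uint32_t counter = 0;
    bump(&counter);
    printf("counter = %u\n", (unsigned)counter);
    return 0;
}
```

The RISC version issues more instructions, but each one is simple to decode and schedule – which is exactly the trade-off the transistor and power argument above rests on.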

This is why Intel’s half-bothered (sure, you can argue) stab at the phone market failed – which, considering Intel also does networking silicon, you’d have thought would be a no-brainer. Its x86 Atom offerings were too power-hungry and didn’t offer enough speed to differentiate the phones that used them. If Intel had backed Atom with more up-to-date process technology the story could have been different, but understandably it wanted to protect its core business.

Join the Army

We’re going to look at how both the ARM architecture and Intel desktop processors have improved over time (which is charted on the previous page). It’s notoriously hard to compare two different architectures fairly, so we’re not going to try; we’re just going to look at Geekbench results as best we can. For ARM, we’ll look at scores through the lens of Apple cores (used in just the major iPhone releases), as they’re the most performant, while for Intel we’ll focus on base Core i5 and Core i7 models in corresponding release years. We’ll keep an eye on the percentage speed increases. Alongside this we’ll delve into the major changes in the architecture – mainly on the ARM side, but it never hurts to go back over the Intel updates.
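
For reference, the gen-on-gen figures we quote are simple percentage deltas between single-core scores. Here’s a minimal sketch in C, using placeholder numbers rather than the actual chart data:

```c
#include <stdio.h>

/* Percentage gain between two successive single-core benchmark scores. */
static double percent_gain(double previous, double current)
{
    return (current / previous - 1.0) * 100.0;
}

int main(void)
{
    double last_gen = 250.0;  /* hypothetical previous-generation score */
    double this_gen = 520.0;  /* hypothetical new-generation score      */

    printf("gen-on-gen increase: %.0f%%\n", percent_gain(last_gen, this_gen));
    return 0;
}
```

Run it and you get 108 percent – in other words, a shade over double the single-core speed.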

ARM is an IP company that designs the specification for the ARM ISA and updates it with new technology, such as its big.LITTLE core design, NEON SIMD instructions, and enhanced floating-point units. Typically it gives each new family release – the overall package of features – a name, such as Cortex-Ax. For Apple this started with the 32-bit, ARMv6-based ARM11 and the ARMv7 Cortex cores, before the company moved to designing its own microarchitectures around the various updates to the 64-bit ARMv8.

In the original iPhone, Apple used a Samsung-designed SoC based on ARM11, which was actually introduced in 2003 and implemented the ARMv6 architecture. It was designed with early phone use in mind and introduced the first SIMD (Single Instruction Multiple Data) instructions for MPEG playback, improved cache (just 32K), and an eight-stage pipeline. With limited out-of-order execution and branch prediction, the performance can’t be described as anything but weak.

The iPhone 3GS was the first really usable iPhone (in terms of software features). It stuck with a Samsung-designed SoC, but this one used the updated Cortex-A8 core. Benchmarking shows a 107 percent increase in speed – put this down to the introduction of a dual-issue, superscalar 13-stage pipeline, backed by a 10-stage NEON SIMD pipeline for media acceleration. It doubled the L1 cache, introduced a 256K L2 cache, and included a floating-point unit. It’s this sort of low-hanging fruit that ARM and Apple were able to leverage easily at these early stages to drive doublings in speed.

The Apple A4 was the first in-house designed SoC, and while it debuted in the original iPad at 1GHz, it was also used later in the iPhone 4, but at 800MHz. If Apple did Intel’s Tick-Tock design, this would be a Tock release. Still based on the Cortex-A8 architecture and the same Samsung 45nm process, it largely offered speed improvements via a clock increase and a larger 512K L2 cache, but a key change was doubling the memory bus to 64-bit.

When Apple launched its iPad 2 it again introduced its all-new SoC – the Apple A5 at 1GHz – here first, and in the iPhone 4S later at 800MHz. The Apple A5 was a significant release for Apple; it switched to the updated Cortex-A9 design, and it was the first dual-core release. Using the same Samsung 45nm process, the clock wasn’t increased, but the memory speed doubled to 400MHz and the L2 cache doubled again to 1MB. The Cortex-A9 also introduced more fundamental improvements, such as an eight-stage out-of-order speculative pipeline, enhanced NEON SIMD, and a double-speed FPU.

The release of the Apple A6 was when things started to get interesting from the point of view of Apple taking charge of its own design future and using its own ARMv7 design tricks. The Apple A6 was the last 32-bit design from Apple, and while it used the same size L1 and L2 cache as the A5, a process drop to 32nm, a clock boost to 1.3GHz, and clever architectural introductions offered one of the biggest gen-on-gen increases, all while using less power. The A6 appears to be based on the Cortex-A9 but used advanced parts of the Cortex-A15 design, including two of its (then) new v4 FPUs and Advanced SIMD v2. Analysis indicates it could issue three commands and use five execution units (two ALU, two FPU/NEON, one load/store) with a 12-deep pipeline. This massively enhanced the A6’s FPU prowess, and with optimised cache and a dedicated load/store unit, memory performance increased threefold and overall speed doubled, again.

At this point Apple hit its stride, and the Apple A7 release was another mobile first: a 64-bit processor almost a year before anyone else. Using the ARMv8-A architecture on a Samsung 28nm process, Apple added a 4MB L3 cache, kept the 1MB L2, and doubled the L1 to 128KB. Apple basically doubled the width of its processor with this release: six-issue wide, with four ALUs, two load/store units, two branch units, and three FP/NEON units. With a billion transistors, that’s up 33 percent on the A6. For benchmarking, we see the 32-bit-only Geekbench 2 start to get long in the tooth, but Geekbench 3 points to the A7 Cyclone cores being twice as fast, again!

The Apple A8 remains a headscratcher in terms of speed; it feels like Apple concentrated more on the GPU side – introducing an in-house custom GPU shader – and perhaps the shift from Samsung (now an arch rival) to TSMC on a new 20nm process was another distraction. It’s a similar situation for the Apple A9 release, but utilising the TSMC 16nm and Samsung 14nm processes Apple could bump the clock to 1.8GHz and triple the L2 to 3MB.

The two big shifts for the Apple A10 were the introduction of ARM’s big.LITTLE technology, which enabled high-power and low-power cores to balance power consumption, plus a drop to TSMC’s 16nm production process. The easy win here was a jump to 2.3GHz, made easier with the introduction of the two low-power Zephyr cores, which ran at 1GHz and used just 20 percent of the power of the “big” cores. Apple also moved to the newer ARMv8.1-A architecture, though this was an incremental update. This was the last Apple SoC to get a Geekbench 2 result, and we’d put all of that increase down to the clock bump, while newer Geekbench releases also include GPU elements that continued to increase significantly in speed.

The Apple A11 introduced a two-big, four-little core layout, and it seems the small Mistral cores in the A11 were actually based on the Apple A6’s Swift cores. Unlike in the A10, these could now work independently of the big cores – previously only one cluster or the other could be active. The big Monsoon cores were a major update to the mid-core, moving from six-wide decode to seven-wide, while the back end gained two more integer ALUs, upping the count from four to six.

The Apple A12 was another advance for Apple, being the first commercial 7nm silicon. A big change was made to how the processor cache was organised, helping reduce latency and increase bandwidth. The general L3 cache was dropped in favour of an 8MB system cache, and the L1 was doubled to 256K. The configuration was a little more complicated, split differently between the big.LITTLE cores – the A12 had two large and four small cores, and the small Tempest cores were again based on the Apple A6’s Swift cores.

The big Vortex cores actually had a single-thread turbo to 2.5GHz. The A11 and A12 were very wide architectures, even by desktop-class standards. With six integer ALUs (two of them complex units), two load/store units, two branch units, and three FP/vector units, that’s potentially 13 execution units.

The latest Apple A13 sees Apple doubling down on its new cache system, now called System Level Cache, which gets a whopping 16MB to service the SoC. The little (Thunder) cores get 4MB of L2 and the big (Lightning) cores get 8MB of L2. The overall design of the A13 appears to be a similar seven-wide decode front end, with improvements to the multiplier and integer units. The seven percent boost to clock speed doesn’t account for the 14 to 20 percent speed increase Geekbench returns, even taking into account the 20 percent increase in the GPU.
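
One way to read that gap: if the overall Geekbench gain factors (roughly) into clock gain multiplied by IPC (instructions per clock) gain, you can back out how much of the improvement came from the core itself. Here’s a rough back-of-the-envelope sketch in C, using the figures above and that simplifying assumption:

```c
#include <stdio.h>

int main(void)
{
    /* Figures quoted above: ~7% clock boost, 14-20% overall gain. */
    double clock_gain   = 1.07;
    double overall_low  = 1.14;
    double overall_high = 1.20;

    /* Assuming overall speedup ~= clock speedup x IPC speedup,
     * the implied IPC uplift is the overall gain with the clock
     * contribution divided out. */
    printf("implied IPC gain: %.0f%% to %.0f%%\n",
           (overall_low / clock_gain - 1.0) * 100.0,
           (overall_high / clock_gain - 1.0) * 100.0);
    return 0;
}
```

That suggests somewhere around 7 to 12 percent of the A13’s gain comes from the core design rather than the clock.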

There’s no doubt that Apple is going to compete with Intel on the desktop; its processor design is as wide as a desktop design and its System Level Cache is as large and efficient. But it’s important to remember that Apple is unique among ARM licensees: it can afford to design such expensive silicon because it knows it’s going to sell it in premium-priced products. It’ll deliver better battery life and own another chunk of its devices’ costs, knowing it’ll recoup its investment.

For third-party processor manufacturers, that model just isn’t possible. Take AMD: it has never been able to compete with Intel on equal terms and struggles even now, when it’s doing well. So is an ARM-based processor manufacturer going to swan in and take over the desktop (or even laptop) market from Intel and AMD? No – on the desktop, where power consumption isn’t an issue and pricing is competitive, it would be hard for anyone ARM-based to get a foothold.

Where ARM systems are targeting x86 is in mobile. Take the latest Lenovo Flex 5G, which runs a Snapdragon 8cx SoC. We don’t have specifics on the SoC itself, but it uses the Cortex-A76 microarchitecture: a four-wide front-end decode and a nine-port issue stage feeding three ALUs, two FPU/SIMD units, two load/store units, and a branch unit. While certainly capable, it’s a fraction of what Apple is putting into its current-gen silicon, and that plays out in the Snapdragon’s Geekbench 5 single-core result of 716 – less than half that of the Apple A13. The Snapdragon has four performance cores but still ends up slower than the Apple A13.

With Intel stumbling over its process technology once again, Apple is at least matching Intel’s best core designs for performance, while ARM’s licensed cores are set to challenge Intel’s Core i5-level mobile cores. With AMD squeezing Intel’s workstation and performance parts, ARM is also being deployed in the lucrative HPC (high-performance computing) and server arena. There’s zero argument about it – Intel is getting squeezed from every direction.

Intel’s Skylake architecture has been pottering along since 2015.
Expect to see more Windows-running ARM systems using Qualcomm SoCs.
