APC Australia

ARMing THE WORLD

Apple is dumping Intel processors for its own ARM-based designs. Will the traditional PC be next?


No one would have believed in the last years of the 20th century that the processor world was being watched keenly and closely by intelligences greater than Intel’s. Yet in the UK, intellects vast and cool regarded the processor market with envious eyes, slowly and surely drawing their plans against Intel…

Apple is dumping Intel processors for its own design of silicon. How did this happen, and what, if any, are the ramifications for the wider PC market? How can a processor design that started life in an obscure, failed British home computer of the 1980s now challenge the entire Intel empire? We’re going to delve into the ARM microarchitecture, look at how it has advanced over the years and how those architectural advances have borne out in benchmarks, and contrast the results with those of Intel desktop parts.

As we do this we’re going to find two contrasting stories: one of maximising performance increases generation by generation, and the other of delivering fixed, incremental increases each generation. We can delve into the reasoning behind why those increases played out as they did, and argue whether they have ultimately been good decisions or not.

We can also argue about competition in the marketplace and how that’s ultimately good for us, the consumers. But an architecture running an entirely different instruction set – is that good for PC consumers? Perhaps we’re getting ahead of ourselves. NEIL MOHR

What is an Intel processor? What is a PC? When IBM was picking parts, it could have gone with its own IBM 801 RISC processor, but the budget insisted on the Intel 8088, and history was set: every compatible PC would be running an x86-compatible processor.

Technically anyone “could” design and manufacture an x86-compatible processor, but legally Intel owns the patents to the instruction set and has to license it for that to happen. If a company has ever designed or manufactured an x86 processor, it’s because Intel (or a court) allowed it to. AMD is different, as it has a complete cross-patent licensing agreement with Intel, so the two companies don’t end up suing each other into oblivion.

Over the years there has been a choice of different x86 manufacturers: IBM made a range of 386/486 processors, and AMD, Cyrix, VIA, NEC, Transmeta, and some others all had a go, with the running theme usually being low-end designs. Intel has always been the x86 top dog, with the others (apart from AMD and the IBM days) being also-rans. So you could argue there has been competition in the market, if fighting over the dregs counts as competition.

The point is that there’s almost no competition for Intel in the market – even today, with AMD doing well, AMD commands just 18 percent (Mercury Research data) of the consumer market. AMD itself said it was aiming for 10 percent of the server market in 2020, hoping to claw back to the heights of the Opteron days, when it held a heady 25 percent market share in 2006.

It’s fine to lament the lack of competition, but what could possibly change to break the status quo? The big recent announcement is that Apple will start to move away from Intel-based processors and switch all of its hardware to its own design of processor. Apple’s not talking about just laptops or low-end iMacs, but even its high-end workstation offerings that use the Intel Xeon. It’s a bold statement, but how is it going to manage it?

Simpler times

As we’ve alluded to, it’s not going to be an x86-based processor. As you know, Apple makes the iPhone, which, it turns out, is rather popular. iPhones also tend to be the fastest phones on the market, by quite some way. Apple could use whoever designs those mobile processors to come up with something for its desktop systems! Who designed those processors again? Apple, using the ARM-licensed Instruction Set Architecture (ISA).

For those that don’t know, ARM is a Reduced Instruction Set Computer (RISC) design, whereas x86 is a Complex Instruction Set Computer (CISC) design.

Just to touch upon the background, the design philosophy behind RISC is to optimise the instruction set: ensure instructions can be run in a single memory cycle, and eliminate unnecessary instructions to leave an optimised core. As time has progressed, instruction sets have grown, especially with specialised cryptography/vector/SIMD functions, so the “reduced” part is now misleading.

There are interesting consequences from those design decisions. Optimising the instruction set reduces the number of transistors being used, and that reduces the amount of power required to do anything. So x86, with its CISC design, inherently requires more transistors to do any computational job, and so more power. On the desktop this isn’t much of an issue, but with laptops and phones every watt counts.
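
To make the load/store distinction concrete, here’s a toy sketch in C. It’s illustrative only – real compiler output depends on the compiler, flags, and target – but it shows how the same one-line increment maps to a single read-modify-write instruction on a CISC ISA versus a short load/add/store sequence on a RISC ISA.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy illustration only -- not actual compiler output.
 * Incrementing a counter held in memory:
 *   - a CISC-style ISA (x86) can encode this as one read-modify-write
 *     instruction, roughly "add dword ptr [rdi], 1";
 *   - a RISC-style load/store ISA (ARM) breaks it into simple steps,
 *     each designed to complete quickly: load, add, store back.
 */
static void bump(volatile uint32_t *counter)
{
    *counter += 1;  /* x86: one instruction; AArch64: ldr / add / str */
}

int main(void)
{
    volatile uint32_t counter = 0;
    bump(&counter);
    printf("counter = %u\n", (unsigned)counter);
    return 0;
}
```

The RISC version issues more instructions, but each one is simple to decode and schedule – which is exactly the trade-off the transistor and power argument above rests on.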

This is why Intel’s half-bothered (sure, you can argue) stab at the phone market failed – which, considering Intel also does networking silicon, you’d have thought would be a no-brainer. Its x86 Atom offerings were too power-hungry and didn’t offer enough speed to differentiate the phones that used them. If Intel had backed Atom with more up-to-date process technology the story could have been different, but understandably it wanted to protect its core business.

Join the Army

We’re going to look at how both the ARM architecture and Intel desktop processors have improved over time (which is charted on the previous page). It’s notoriously hard to compare two different architectures fairly, so we’re not going to try; we’re just going to look at Geekbench results as best we can. For ARM, we’ll look at scores through the lens of Apple cores (used in just the major iPhone releases), as they’re the most performant, while for Intel we’ll focus on base Core i5 and Core i7 models in corresponding release years. We’ll keep an eye on the percentage speed increases. Alongside this we’ll delve into the major changes in the architecture – mainly on the ARM side, but it never hurts to go back over the Intel updates.
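
For reference, the gen-on-gen figures we quote are simple percentage deltas between single-core scores. Here’s a minimal sketch in C, using placeholder numbers rather than the actual chart data:

```c
#include <stdio.h>

/* Percentage gain between two successive single-core benchmark scores. */
static double percent_gain(double previous, double current)
{
    return (current / previous - 1.0) * 100.0;
}

int main(void)
{
    double last_gen = 250.0;  /* hypothetical previous-generation score */
    double this_gen = 520.0;  /* hypothetical new-generation score      */

    printf("gen-on-gen increase: %.0f%%\n", percent_gain(last_gen, this_gen));
    return 0;
}
```

Run it and you get 108 percent – in other words, a shade over double the single-core speed.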

ARM is an IP company that designs the specification for the ARM ISA and updates it with new technology, such as its big.LITTLE core design, NEON SIMD instructions, and enhanced floating-point units. Typically it gives each new family release – the overall package of features – a name, such as Cortex-Ax. For Apple this started with the 32-bit, ARMv6-based ARM11 and the ARMv7 Cortex cores, before the company moved to designing its own microarchitectures around the various updates to the 64-bit ARMv8.

In the original iPhone, Apple used a Samsung-designed SoC based on ARM11, which was actually introduced in 2003 and implemented the ARMv6 architecture. It was designed with early phone use in mind and introduced the first SIMD (Single Instruction Multiple Data) instructions for MPEG playback, improved cache (just 32K), and an eight-stage pipeline. With limited out-of-order execution and branch prediction, the performance can’t be described as anything but weak.

The iPhone 3GS was the first really usable iPhone (in terms of software features). It stuck with a Samsung-designed SoC, but this one used the updated Cortex-A8 core. Benchmarking shows a 107 percent increase in speed – put this down to the introduction of a dual-issue, superscalar 13-stage pipeline, backed by a 10-stage NEON SIMD pipeline for media acceleration. It doubled the L1 cache, introduced a 256K L2 cache, and included a floating-point unit. It’s this sort of low-hanging fruit that ARM and Apple were able to leverage easily at these early stages to drive doublings in speed.

The Apple A4 was the first in-house designed SoC, and while it debuted in the original iPad at 1GHz, it was also used later in the iPhone 4, but at 800MHz. If Apple did Intel’s Tick-Tock design, this would be a Tock release. Still based on the Cortex-A8 architecture and the same Samsung 45nm process, it largely offered speed improvements via a clock increase and a larger 512K L2 cache, but a key change was doubling the memory bus to 64-bit.

When Apple launched its iPad 2 it again introduced its all-new SoC – the Apple A5 at 1GHz – here first, and in the iPhone 4S later at 800MHz. The Apple A5 was a significant release for Apple; it switched to the updated Cortex-A9 design, and it was the first dual-core release. Using the same Samsung 45nm process, the clock wasn’t increased, but the memory speed doubled to 400MHz and the L2 cache doubled again to 1MB. The Cortex-A9 also introduced more fundamental improvements, such as an eight-stage out-of-order speculative pipeline, enhanced NEON SIMD, and a double-speed FPU.

The release of the Apple A6 was when things started to get interesting from the point of view of Apple taking charge of its own design future and using its own ARMv7 design tricks. The Apple A6 was the last 32-bit design from Apple, and while it used the same size L1 and L2 cache as the A5, a process drop to 32nm, a clock boost to 1.3GHz, and clever architectural introductions offered one of the biggest gen-on-gen increases, all while using less power. The A6 appears to be based on the Cortex-A9 but used advanced parts of the Cortex-A15 design, including two of its (then) new v4 FPUs and Advanced SIMD v2. Analysis indicates it could issue three commands and use five execution units (two ALU, two FPU/NEON, one load/store) with a 12-deep pipeline. This massively enhanced the A6’s FPU prowess, and with optimised cache and a dedicated load/store unit, memory performance increased threefold and overall speed doubled, again.

At this point Apple hit its stride, and the Apple A7 release was another mobile first: a 64-bit processor almost a year before anyone else. Using the ARMv8-A architecture on a Samsung 28nm process, Apple added a 4MB L3 cache, kept the 1MB L2, and doubled the L1 to 128KB. Apple basically doubled the width of its processor with this release: six-issue wide, with four ALUs, two load/store units, two branch units, and three FP/NEON units. With a billion transistors, that’s up 33 percent on the A6. For benchmarking, we see the 32-bit-only Geekbench 2 start to get long in the tooth, but Geekbench 3 points to the A7 Cyclone cores being twice as fast, again!

The Apple A8 remains a headscratcher in terms of speed; it feels like Apple concentrated more on the GPU side – introducing an in-house custom GPU shader – and perhaps the shift from Samsung (now an arch rival) to TSMC on a new 20nm process was another distraction. It’s a similar situation for the Apple A9 release, but utilising the TSMC 16nm and Samsung 14nm processes Apple could bump the clock to 1.8GHz and triple the L2 to 3MB.

The two big shifts for the Apple A10 were the introduction of ARM’s big.LITTLE technology, which enabled high-power and low-power cores to balance power consumption, plus a drop to TSMC’s 16nm production process. The easy win here was a jump to 2.3GHz, made easier with the introduction of the two low-power Zephyr cores, which ran at 1GHz and used just 20 percent of the power of the “big” cores. Apple also moved to the newer ARMv8.1-A architecture, though this was an incremental update. This was the last Apple SoC to get a Geekbench 2 result, and we’d put all of that increase down to the clock bump, while newer Geekbench releases also include GPU elements that continued to increase significantly in speed.

The Apple A11 introduced a two-big, four-little core layout, and it seems the small Mistral cores in the A11 were actually based on the Apple A6’s Swift cores. Unlike in the A10, these could now work independently of the big cores – previously only one cluster or the other could be active. The big Monsoon cores were a major update to the mid-core, moving from six-wide decode to seven-wide, while the back end gained two more integer ALUs, upping the count from four to six.

The Apple A12 was another advance for Apple, being the first commercial 7nm silicon. A big change was made to how the processor cache was organised, helping reduce latency and increase bandwidth. The general L3 cache was dropped in favour of an 8MB system cache, and the L1 was doubled to 256K. The configuration was a little more complicated, split differently between the big.LITTLE cores – the A12 had two large and four small cores, and the small Tempest cores were again based on the Apple A6’s Swift cores.

The big Vortex cores actually had a single-thread turbo to 2.5GHz. The A11 and A12 were very wide architectures, even by desktop-class standards. With six integer ALUs (two of them complex units), two load/store units, two branch units, and three FP/vector units, that’s potentially 13 execution units.

The latest Apple A13 sees Apple doubling down on its new cache system, now called System Level Cache, which gets a whopping 16MB to service the SoC. The little (Thunder) cores get 4MB of L2 and the big (Lightning) cores get 8MB of L2. The overall design of the A13 appears to be a similar seven-wide decode front end, with improvements to the multiplier and integer units. The seven percent boost to clock speed doesn’t account for the 14 to 20 percent speed increase Geekbench returns, even taking into account the 20 percent increase in the GPU.
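
One way to read that gap: if the overall Geekbench gain factors (roughly) into clock gain multiplied by IPC (instructions per clock) gain, you can back out how much of the improvement came from the core itself. Here’s a rough back-of-the-envelope sketch in C, using the figures above and that simplifying assumption:

```c
#include <stdio.h>

int main(void)
{
    /* Figures quoted above: ~7% clock boost, 14-20% overall gain. */
    double clock_gain   = 1.07;
    double overall_low  = 1.14;
    double overall_high = 1.20;

    /* Assuming overall speedup ~= clock speedup x IPC speedup,
     * the implied IPC uplift is the overall gain with the clock
     * contribution divided out. */
    printf("implied IPC gain: %.0f%% to %.0f%%\n",
           (overall_low / clock_gain - 1.0) * 100.0,
           (overall_high / clock_gain - 1.0) * 100.0);
    return 0;
}
```

That suggests somewhere around 7 to 12 percent of the A13’s gain comes from the core design rather than the clock.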

There’s no doubt that Apple is going to compete with Intel on the desktop; its processor design is as wide as a desktop design and its System Level Cache is as large and efficient. But it’s important to remember that Apple is unique among ARM licensees: it can afford to design such expensive silicon because it knows it’s going to sell it in premium-priced products. It’ll deliver better battery life and own another chunk of its devices’ costs, knowing it’ll recoup its investment.

For third-party processor manufacturers, that model just isn’t possible. Take AMD: it has never been able to compete with Intel on equal terms and struggles even now, when it’s doing well. So is an ARM-based processor manufacturer going to swan in and take over the desktop (or even laptop) market from Intel and AMD? No – on the desktop, where power consumption isn’t an issue and pricing is competitive, it would be hard for anyone ARM-based to get a foothold.

Where ARM systems are targeting x86 is in mobile. Take the latest Lenovo Flex 5G, which runs a Snapdragon 8cx SoC. We don’t have specifics on the SoC itself, but it uses the Cortex-A76 microarchitecture: a four-wide front-end decode and a nine-port issue stage feeding three ALUs, two FPU/SIMD units, two load/store units, and a branch unit. While certainly capable, it’s a fraction of what Apple is putting into its current-gen silicon, and that plays out in the Snapdragon’s Geekbench 5 single-core result of 716 – less than half that of the Apple A13. The Snapdragon has four performance cores but still ends up slower than the Apple A13.

With Intel stumbling over its process technology once again, Apple is at least matching Intel’s best core designs for performance, while ARM’s licensed cores are set to challenge Intel’s Core i5-level mobile cores. With AMD squeezing Intel’s workstation and performance parts, ARM is also being deployed in the lucrative HPC (high-performance computing) and server arena. There’s zero argument about it – Intel is getting squeezed from every direction.

Intel’s Skylake architecture has been pottering along since 2015.
Expect to see more Windows-running ARM systems using Qualcomm SoCs.
