- x86 instructions can't be queued up as easily because they have different lengths; Intel and AMD top out at 4 decoders, while Apple has 8 and could go higher.
- Business model does not allow this kind of integration.
It also allows extremely long instructions: the ISA permits up to 15 bytes and faults at 16 (without that artificial limit you could construct arbitrarily long x86 instructions).
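The length arithmetic can be sketched roughly like this (component sizes are illustrative, not a real decoder): an x86 instruction is a sum of optional parts, and legacy prefixes may legally repeat, so the encoding itself places no upper bound on length; the 15-byte cap is what cuts it off.

```python
# Hypothetical sketch of x86 instruction-length arithmetic.
# Component sizes below are illustrative stand-ins, not a real decoder.
MAX_LEN = 15  # bytes; real hardware faults on anything longer

def insn_length(num_prefixes, opcode=1, modrm=1, sib=1, disp=4, imm=4):
    """Illustrative total: prefixes + opcode + ModRM + SIB + disp32 + imm32."""
    return num_prefixes + opcode + modrm + sib + disp + imm

# A fully loaded instruction with no prefixes is already 11 bytes here;
# each redundant prefix adds a byte, so the length is unbounded in the
# encoding itself and only the architectural cap stops it:
for p in range(6):
    n = insn_length(p)
    print(p, n, "ok" if n <= MAX_LEN else "fault: exceeds 15 bytes")
```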
What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?
I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?
This is exactly how the hardware works and what micro-ops are. On any system with a u-op cache or trace cache, those decoded instructions are cached and reused instead of being decoded again. Unfortunately you still have to decode each instruction at least once, and that first-decode bottleneck is the one being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change; but arguably, if you were willing to take that hit, you could go further here.
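The decode-once behavior can be modeled with a toy cache (all names hypothetical; a real micro-op cache keys on fetch addresses and holds decoded micro-ops in hardware, not Python objects):

```python
# Toy model of a micro-op cache: the expensive decode happens once per
# instruction address; later fetches of the same address skip the decoder.
decode_count = 0

def slow_decode(raw):
    """Stand-in for the expensive x86 decoder; returns fake micro-ops."""
    global decode_count
    decode_count += 1
    return ("uops-for", raw)

uop_cache = {}

def fetch(addr, raw):
    if addr not in uop_cache:      # miss: pay the decode cost once
        uop_cache[addr] = slow_decode(raw)
    return uop_cache[addr]         # hit: micro-ops served from the cache

# A hot loop executes the same addresses over and over:
program = [(0x10, "add"), (0x11, "cmp"), (0x13, "jne")]
for _ in range(1000):
    for addr, raw in program:
        fetch(addr, raw)

print(decode_count)  # 3: each instruction was decoded exactly once
```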
So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?
They do something similar for loops. The CPU doesn't decode the same instructions over and over; it serves them from the decoded-instruction (micro-op) cache, which has a capacity of around 1,500 entries.
I imagine that Apple's M1 designers are using what they know about MacOS, what they know about applications in the store, and what user telemetry MacOS customers have opted into, all to build a picture of which problems are most important for them to solve for the type of customer who will be buying an M1-equipped MacOS device. They have no requirement to provide something that will work equally well for server, desktop, etc. roles, or for Windows and Linux, and they have a lot more information about what's actually running on a day-to-day basis.
They say in the article that AMD "can't" build more than 4 decoders. Is that really true? It could mean:
* we can't get a budget to sort it out
* 5 would violate information theory
* nobody wants to wrestle that pig, period
* there are 20 other options we'd rather exhaust before trying
When they've done 12 of those things and the other 8 turn out to be infeasible, will they call it quits, or will someone figure out how to either get more decoders or use them more effectively?
Their business model allowed for them to integrate GPU and video decoder. Of course it allows for this kind of integration. The author is not even in that industry, so a lot of his claims are fishy. Moore's law is not about frequency, for example.
I think what they mean is lack of coordination between software and hardware manufacturers + unwillingness of intel/amd etc to license their IP to dell etc. What is untrue about that?
On Moore's law, yes it's about transistors on a chip, but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.
The fact that they don't need to license technology. They can bring more functionality into the package that they sell to Dell, etc. like they have already done.
> but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.
That is not the point they are making. Clock frequencies have not changed since the deep-pipelined P4, but transistor count has continued to climb. Here is what the author, who clearly does not know what he is talking about, said about that:
"increasing the clock frequency is next to impossible. That is the whole 'End of Moore’s Law' that people have been harping on for over a decade now."
Seems like backward compatibility completely rules this out. Apple can provide tools and clear migration paths because the chips are only used in their systems. Intel chips are used everywhere, and Intel has no visibility.
I wonder if in 3 or 4 years there will be a new chipmaker startup which offers CPUs to the market similar to the M1. If Intel or AMD won't do it that is.
The problem is competing with the size of Apple (and Intel and AMD). The reason there are so few competing high-performance chips on the market is that it is extremely expensive to design one. And you need a fab with a current process.
For many years Intel beat all other companies, because they had the best fabs and could afford the most R&D. Now Apple has the best fab - TSMC's 5nm - and certainly gigantic developer resources they can afford to spend, as the chip design is largely shared between the iPhones, iPads, and Macs.
And of course, as mentioned in the article, any feature in the Apple Silicon will be supported by MacOS. Getting OS support for your new SOC features in other OSes is not a given.
Qualcomm could come out with an M1 class chip, they have the engineering capability, but if Microsoft or Google don’t adopt it with a custom tuned OS and dev tooling customised for that specific architecture ready from day one, they’d lose a fortune.
The same goes in the other direction. If MS does all the work on ARM Windows optimised for one vendor's experimental chip and the chip vendor messes up or pulls out, they'd be shivved. It's too much risk for a company on either side of the table to wear on their own.
Sadly Qualcomm stopped developing its own CPU cores after the ARMv8 transition, and Samsung has ended its custom core effort too. So only ARM's core designs are available for mobile.
The article answers this: x86 has variable-sized instructions, which makes decoding much more complex and hard to parallelize since you don't know the starting point of each instruction until you read it.
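The serialization problem can be sketched in a few lines (made-up byte stream and lengths, purely for illustration): with fixed-width instructions every start offset is known up front, so decoders can work on all of them in parallel; with variable widths, each start depends on the decoded length of the previous instruction.

```python
# Sketch of why fixed-width decode parallelizes trivially and naive
# variable-width decode does not. Lengths here are made up for illustration.

def starts_fixed(buf_len, width=4):
    """Fixed-width ISA (e.g. 4-byte): all start offsets known immediately,
    so N decoders can each grab their own slot in parallel."""
    return list(range(0, buf_len, width))

def starts_variable(lengths):
    """Variable-width ISA: instruction i+1's start offset depends on the
    decoded length of instruction i, forming a serial dependency chain."""
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n
    return starts

print(starts_fixed(16))               # offsets computable independently
print(starts_variable([1, 3, 7, 2]))  # offsets only known sequentially
```

Real wide x86 decoders work around this with tricks like speculatively decoding at many byte offsets and discarding the wrong guesses, which is part of why each extra decoder gets expensive.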
AMD until very recently (and Intel almost 25 years ago) used to mark instruction boundaries in the L1 I$ to speed up decode. They stopped recently for some reason though.
For those unaware, ARM (as with many RISC-like architectures) uses 4 bytes for each instruction. No more. No less. (Thumb uses 2, but it's a separate mode.) x86, OTOH, being CISC in origin, has instructions ranging from a single byte all the way up to 15 bytes[a].
[a]: It is possible to write instructions that would be 16 or more bytes, but the microarchitecture will fault (a general-protection exception, in Intel's case) if it encounters one. Intel doesn't say why there's a limit, but they do mention (in the SDM) that they may raise or remove the limit in the future.
Why 15 bytes? My guess is so the instruction decoder only needs 4 bits to encode the instruction length. A nice round 16 would need a 5th bit.
> It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit
I remember noticing on the 80286 that you could in theory have an arbitrarily long instruction, and that with the right prefixes or instructions interrupts would be disabled while the instruction was read.
I wondered what would happen if you filled an entire segment with a single repeated prefix, but never got a chance to try it. Would it wrap during decoding, treating it as an infinite length instruction and thereby lock up the system?
My guess is that implementations impose a limit to preclude any such shenanigans.
I honestly don’t know how the processor counts the instruction length, so it was only pure speculation on my part as to why the limit is 15. Maybe they naively check for the 4-bit counter overflowing to determine if they’ve reached a 16th byte? Maybe they do offset by 1 (b0000 is 1 and b1111 is 16) and check for b1111? I honestly have no idea, and I don’t think we’ll get an answer unless either (1) someone from Intel during x86’s earlier years chimes in, or (2) someone reverses the gates from die shots of older processors.