- x86 instructions can't be queued up as easily because they have different lengths; Intel and AMD top out at 4 decoders, while Apple has 8 and could go higher.
- Business model does not allow this kind of integration.
It also allows extremely long instructions: the ISA permits up to 15 bytes and faults at 16 (without that artificial limit you could construct arbitrarily long x86 instructions).
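The length arithmetic can be sketched roughly like this (component sizes are illustrative, not a real decoder): an x86 instruction is a sum of optional parts, and legacy prefixes may legally repeat, so the encoding itself places no upper bound on length; the 15-byte cap is what cuts it off.

```python
# Hypothetical sketch of x86 instruction-length arithmetic.
# Component sizes below are illustrative stand-ins, not a real decoder.
MAX_LEN = 15  # bytes; real hardware faults on anything longer

def insn_length(num_prefixes, opcode=1, modrm=1, sib=1, disp=4, imm=4):
    """Illustrative total: prefixes + opcode + ModRM + SIB + disp32 + imm32."""
    return num_prefixes + opcode + modrm + sib + disp + imm

# A fully loaded instruction with no prefixes is already 11 bytes here;
# each redundant prefix adds a byte, so the length is unbounded in the
# encoding itself and only the architectural cap stops it:
for p in range(6):
    n = insn_length(p)
    print(p, n, "ok" if n <= MAX_LEN else "fault: exceeds 15 bytes")
```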
What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?
I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?
This is exactly how the hardware works and what micro-ops are. On any system with a u-op cache or trace cache, those decoded instructions are cached and reused instead of being decoded again. Unfortunately you still have to decode each instruction at least once, and that first-decode bottleneck is the one being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change; but arguably, if you were willing to take that hit, you could go further here.
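The decode-once behavior can be modeled with a toy cache (all names hypothetical; a real micro-op cache keys on fetch addresses and holds decoded micro-ops in hardware, not Python objects):

```python
# Toy model of a micro-op cache: the expensive decode happens once per
# instruction address; later fetches of the same address skip the decoder.
decode_count = 0

def slow_decode(raw):
    """Stand-in for the expensive x86 decoder; returns fake micro-ops."""
    global decode_count
    decode_count += 1
    return ("uops-for", raw)

uop_cache = {}

def fetch(addr, raw):
    if addr not in uop_cache:      # miss: pay the decode cost once
        uop_cache[addr] = slow_decode(raw)
    return uop_cache[addr]         # hit: micro-ops served from the cache

# A hot loop executes the same addresses over and over:
program = [(0x10, "add"), (0x11, "cmp"), (0x13, "jne")]
for _ in range(1000):
    for addr, raw in program:
        fetch(addr, raw)

print(decode_count)  # 3: each instruction was decoded exactly once
```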
So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?
They do something similar for loops. The CPU doesn't decode the same instructions over and over; it serves them from the decoded-instruction (micro-op) cache, which has a capacity of around 1,500 entries.
I imagine that Apple's M1 designers are using what they know about MacOS, what they know about applications in the store, and what user telemetry MacOS customers have opted into, all to build a picture of which problems are most important for them to solve for the type of customer who will be buying an M1-equipped MacOS device. They have no requirement to provide something that will work equally well for server, desktop, etc. roles, or for Windows and Linux, and they have a lot more information about what's actually running on a day-to-day basis.
They say in the article that AMD "can't" build more than 4 decoders. Is that really true? It could mean:
* we can't get a budget to sort it out
* 5 would violate information theory
* nobody wants to wrestle that pig, period
* there are 20 other options we'd rather exhaust before trying
When they've done 12 of those things and the other 8 turn out to be infeasible, will they call it quits, or will someone figure out how to either get more decoders or use them more effectively?
Their business model allowed for them to integrate GPU and video decoder. Of course it allows for this kind of integration. The author is not even in that industry, so a lot of his claims are fishy. Moore's law is not about frequency, for example.
I think what they mean is lack of coordination between software and hardware manufacturers + unwillingness of intel/amd etc to license their IP to dell etc. What is untrue about that?
On Moore's law, yes it's about transistors on a chip, but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.
The fact that they don't need to license technology. They can bring more functionality into the package that they sell to Dell, etc. like they have already done.
> but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.
That is not the point they are making. Clock frequencies have not changed since the deep-pipelined P4, but transistor count has continued to climb. Here is what the author, who clearly does not know what he is talking about, said about that:
"increasing the clock frequency is next to impossible. That is the whole 'End of Moore’s Law' that people have been harping on for over a decade now."
Seems like backward compatibility completely rules this out. Apple can provide tools and clear migration paths because the chips are only used in their systems. Intel chips are used everywhere, and Intel has no visibility.
I wonder if in 3 or 4 years there will be a new chipmaker startup which offers CPUs to the market similar to the M1. If Intel or AMD won't do it that is.
The problem is competing with the size of Apple (and Intel and AMD). The reason there are so few competing high-performance chips on the market is that it is extremely expensive to design one. And you need a fab with a current process.
For many years Intel beat all other companies, because they had the best fabs and could afford the most R&D. Now Apple has the best fab - TSMC's 5nm - and certainly gigantic developer resources they can afford to spend, as the chip design is largely shared between the iPhones, iPads, and Macs.
And of course, as mentioned in the article, any feature in the Apple Silicon will be supported by MacOS. Getting OS support for your new SOC features in other OSes is not a given.
Qualcomm could come out with an M1 class chip, they have the engineering capability, but if Microsoft or Google don’t adopt it with a custom tuned OS and dev tooling customised for that specific architecture ready from day one, they’d lose a fortune.
The same goes in the other direction. If MS does all the work on ARM Windows optimised for one vendor's experimental chip and the chip vendor messes up or pulls out, they'd be shivved. It's too much risk for a company on either side of the table to wear on their own.
Sadly Qualcomm stopped developing its own CPU cores after the ARMv8 transition, and Samsung has ended its custom core effort too. So only ARM's core designs are available for mobile.
The article answers this: x86 has variable-sized instructions, which makes decoding much more complex and hard to parallelize since you don't know the starting point of each instruction until you read it.
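The serialization problem can be sketched in a few lines (made-up byte stream and lengths, purely for illustration): with fixed-width instructions every start offset is known up front, so decoders can work on all of them in parallel; with variable widths, each start depends on the decoded length of the previous instruction.

```python
# Sketch of why fixed-width decode parallelizes trivially and naive
# variable-width decode does not. Lengths here are made up for illustration.

def starts_fixed(buf_len, width=4):
    """Fixed-width ISA (e.g. 4-byte): all start offsets known immediately,
    so N decoders can each grab their own slot in parallel."""
    return list(range(0, buf_len, width))

def starts_variable(lengths):
    """Variable-width ISA: instruction i+1's start offset depends on the
    decoded length of instruction i, forming a serial dependency chain."""
    starts, pos = [], 0
    for n in lengths:
        starts.append(pos)
        pos += n
    return starts

print(starts_fixed(16))               # offsets computable independently
print(starts_variable([1, 3, 7, 2]))  # offsets only known sequentially
```

Real wide x86 decoders work around this with tricks like speculatively decoding at many byte offsets and discarding the wrong guesses, which is part of why each extra decoder gets expensive.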
AMD until very recently (and Intel almost 25 years ago) used to mark instruction boundaries in the L1 I$ to speed up decode. They stopped recently for some reason though.
For those unaware, ARM (as with many RISC-like architectures) uses 4 bytes for each instruction. No more. No less. (Thumb uses 2, but it's a separate mode.) x86, OTOH, being CISC in origin, has instructions ranging from a single byte all the way up to 15 bytes[a].
[a]: It is possible to write instructions that would be 16 or more bytes, but the microarchitecture will fault (a general-protection exception, in Intel's case) if it encounters one. Intel doesn't say why there's a limit, but they do mention (in the SDM) that they may raise or remove the limit in the future.
Why 15 bytes? My guess is so the instruction decoder only needs 4 bits to encode the instruction length. A nice round 16 would need a 5th bit.
> It is possible to write instructions that would be 16 or more bytes, but the micro architecture will “throw” an illegal instruction exception if it encounters one. Intel doesn’t say why there’s a limit
I remember noticing on the 80286 that you could in theory have an arbitrarily long instruction, and that with the right prefixes or instructions interrupts would be disabled while the instruction was read.
I wondered what would happen if you filled an entire segment with a single repeated prefix, but never got a chance to try it. Would it wrap during decoding, treating it as an infinite length instruction and thereby lock up the system?
My guess is that implementations impose a limit to preclude any such shenanigans.
I honestly don’t know how the processor counts the instruction length, so it was only pure speculation on my part as to why the limit is 15. Maybe they naively check for the 4-bit counter overflowing to determine if they’ve reached a 16th byte? Maybe they do offset by 1 (b0000 is 1 and b1111 is 16) and check for b1111? I honestly have no idea, and I don’t think we’ll get an answer unless either (1) someone from Intel during x86’s earlier years chimes in, or (2) someone reverses the gates from die shots of older processors.