IMO VLIW is an absurdly bad choice for a general-purpose processor. It requires baking a huge amount of low-level micro-architectural detail into the compiler / generated code, which obviously leads to problems with choosing which hardware generation to optimize for, and with generating good code for future architectures.
And the compiler doesn't even come close to having as much information as the CPU has, which basically means that most of the VLIW stuff just ends up needing to be broken up inside the CPU for good performance.
VLIW was the best implementation (20 years ago) of instruction level parallelism.
But what have we learned in these past 20 years?
* Computers will continue to become more parallel -- AMD Zen 2 has 10 execution pipelines per core, supporting 4-way decode and 6-uop-per-clock dispatch, with somewhere close to 200 registers for renaming / reordering instructions. Future processors will be bigger and more parallel; Ice Lake is rumored to have over 300 renaming registers.
* We need assembly code that scales to all different processors of different sizes. Traditional assembly code is surprisingly good (!!!) at scaling, thanks to "dependency cutting" with instructions like "xor eax, eax".
* Compilers can understand dependency chains, "cut them up" and allow code to scale. The same code optimized for Intel Sandy Bridge (2011-era chips) will continue to be well-optimized for Intel Icelake (2021 era) ten years later, thanks to these dependency-cutting compilers.
I think a future VLIW chip can be made that takes advantage of these facts. But it wouldn't look like Itanium.
----------
EDIT: I feel like "xor eax, eax" and other such instructions for "dependency cutting" are wasting bits. There might be a better way for encoding the dependency graph rather than entire instructions.
Itanium's VLIW "packages" are too static.
I've discussed NVidia's Volta elsewhere, which has 6-bit dependency bitmasks on every instruction. That's the kind of "dependency graph" information that a compiler can provide very easily, and probably save a ton on power / decoding.
I agree there is merit in the idea of encoding instruction dependencies in the ISA. There have been a number of research projects in this area, e.g. wavescalar, EDGE/TRIPS, etc.
It's not only about reducing the need for figuring out dependencies at runtime, but you could also partly reduce the need for the (power hungry and hard to scale!) register file to communicate between instructions.
Main lesson: we failed to make all software JIT-compiled or AOT-recompiled-on-boot or something that would allow retargeting the optimizations for a new generation of a VLIW CPU. Barely anyone even tried. Well, I guess in the early 2000s there was this vision that everything would be Java, which is JIT-compiled, but lol
Your point seems invalid, in the face of a large chunk of HPC (neural nets, matrix multiplication, etc. etc.) getting rewritten to support CUDA, which didn't even exist back when Itanium was announced.
VLIW is a compromise product: it's more parallel than a traditional CPU, but less parallel than SIMD/GPUs.
And modern CPUs have incredibly powerful SIMD engines: AVX2 and AVX-512 are extremely fast and parallel. There are compilers that auto-vectorize code, as well as dedicated languages (such as ispc) which target SIMD.
Encoders, decoders, raytracers, and more have been rewritten for Intel AVX2 SIMD instructions, and then re-rewritten for GPUs. The will to find faster execution has always existed, but unfortunately, Itanium failed to perform as well as its competition.
I'm not talking about rewrites and GPUs. I'm saying we do not have dynamic recompilation of everything. As in – if we had ALL binaries that run on the machine (starting with the kernel) stored in some portable representation like wasm (or not-fully-portable-but-still-reoptimizable like LLVM bitcode) and recompiled with optimizations for the current exact processor at startup. Only that would solve the "new generation of VLIW CPU needs very different compiler optimizations to perform, oops, all your binaries are for the first generation and they are slow now" problem.
GPUs do work like this – shaders recompiled all the time – so VLIW was used in GPUs (e.g. TeraScale). But on CPUs we have a world of optimized, "done" binaries.
All of this hackery with hundreds of registers just to continue to make a massively parallel computer look like an 80s processor is what something like Itanium would have prevented. Modern processors ended up becoming basically VLIW anyway, Itanium just refused to lie to you.
When standard machine code is written in a "dependency cutting" way, it scales across different reorder-buffer sizes. A system from 10+ years ago with only ~100 reorder-buffer entries will execute the code with maximum parallelism... while a system today with 200- to 300-entry reorder buffers will execute the SAME code with maximum parallelism too (and reach a higher instructions-per-clock rate).
That's why today's CPUs can have 4-way decoders and 6-way dispatch (AMD Zen and Skylake): they can "pick up more latent parallelism" that the compilers gave them many years ago.
"Classic" VLIW limits your potential parallelism to the ~3-wide bundles (in Itanium's case). Whoever makes the "next" VLIW CPU should allow a similar scaling over the years.
-----------
It was accidental: I doubt anyone actually planned the x86 instruction set to be so effectively instruction-level parallel. It's something that was discovered over the years, and proven to be effective.
Yes: somehow more parallel than the explicitly parallel VLIW architecture. It's a bit of a hack, but if it works, why change things?
I'm talking about a mythical / mystical VLIW architecture. Obviously, older VLIW designs have failed in this regards... but I don't necessarily see "future" VLIW processors making the same mistake.
Perhaps from your perspective, a VLIW architecture that fixes these problems wouldn't necessarily be VLIW anymore. Which... could be true.
> And the compiler doesn't even come close to having as much information as the CPU has.
Unless your CPU has a means for profiling where your pipeline stalls are coming from, combined with dynamic recompilation/reoptimization similar to IBM's project DAISY or HP's Dynamo.
It's not going to do as well as out-of-order CPUs that make re-optimization decisions for every instruction, but I wouldn't rule out software-controlled dynamic re-optimization getting most of the performance benefit of out-of-order execution on a much smaller power budget, by not re-doing those optimization calculations for every instruction executed. There are reasons most low-power implementations are in-order chips.
Traditional compiler techniques may have struggled with maintaining code for different architectures, but a lot has changed in the last 15 years. The rise of widely used IR languages has led to compilers that support dozens of architectures and hundreds of instruction sets. And they are getting better all the time.
The compiler has nearly all of the information that the CPU has, and orders of magnitude more besides. At best, your CPU can think a couple dozen cycles ahead of what it is currently executing. The compiler can see the whole program, can analyze it using dozens of methodologies and models, and can optimize accordingly. Something like Link Time Optimization can be done trivially with a compiler, but it would take an army of engineers decades of work to implement in hardware.
> At best, your CPU can think a couple dozen cycles ahead of what it is currently executing.
The 200-entry reorder buffer says otherwise.
Loads/stores can be reordered across 200+ different in-flight operations on modern Intel Skylake (2015 through 2020) CPUs. And it's about to get a bump to 300+-entry reorder buffers in Ice Lake.
Modern CPUs are designed to "think ahead" almost the entirety of DDR4 RAM Latency, allowing reordering of instructions to keep the CPU pipes as full as possible (at least, if the underlying assembly code has enough ILP to fill the pipelines while waiting for RAM).
> Something like Link Time Optimization can be done trivially with a compiler, but it would take an army of engineers decades of work to be able to implement in hardware.
You might be surprised at what the modern Branch predictor is doing.
If your "call rax" indirect call constantly calls the same location, the branch predictor will remember that location these days.
With proper profiling (say, reservoir sampling of instructions causing pipeline stalls), and dynamic recompilation/reoptimization like IBM's project DAISY / HP's Dynamo, you may get performance near a modern out-of-order desktop processor at the power budget of a modern in-order low-power chip.
You get instructions scheduled based on actual dynamically measured usage patterns, but you don't pay for dedicated circuits to do it, and you don't re-do those calculations in hardware for every single instruction executed.
It's not a guaranteed win, but I think it's worth exploring.
But once you do that, you hardware-optimize the interpreter, and then it's no longer called a "dynamic recompiler", but instead a "frontend to the microcode". :-)
No doubt there is still room for a power-hungry out-of-order speed demon of an implementation, but you need to leave the door open for something with approximately the TDP of a very-low-power in-order-processor with performance closer to an out-of-order machine.
Neo: What are you trying to tell me? That I can dodge "call rax"?
Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't need "call rax".
---
The compiler has access to optimizations at a higher level of abstraction than what the CPU can do. For example, the compiler can eliminate the call completely (i.e. inline the function), convert a dynamic dispatch into a static one (if it can prove that an object will always have a specific type at the call site), decide where to favor small code over fast code (via profile-guided optimization), switch from non-optimized code (with short start-up time) to optimized code mid-execution (tiered compilation in JITs), move computation outside loops (if it can prove that the result is the same in all iterations), and many other things...
There is no way a compiler can do anything for an indirect call that goes one way for a while and the other way afterwards. A branch predictor can get both right with, if not 100% accuracy, then about as close to it as you can possibly get.
My point was simply that the compiler may be in position to disprove the assumption that this call is in fact dynamic (it may actually be static) or that it has to be a call in the first place (and inline the function instead).
I'm certainly not arguing against branch predictors.