
Itanium deserved its fiery death, and its resurrection doesn't make any sense whatsoever. It's a dead-end architecture, and humanity gained (by freeing up valuable engineering power for other, more useful endeavors) when it died.


Itanium was an excellent idea that needed investment in compilers. Nobody wanted to make that investment because speculative execution got them 80% of the way there without the investment in compilers. But as it turns out, speculative execution was a phenomenally bad idea, and patching its security vulnerabilities has set back processor performance to the point where VLIW seems like a good idea again. We should have made those compiler improvements decades ago.


NVidia Volta: https://arxiv.org/pdf/1804.06826.pdf

Each machine instruction on NVidia Volta has the following information:

* Reuse Flags (4-bit)

* Wait Barrier Mask (6-bit: wait on any of six scoreboard barriers)

* Write Barrier index (3-bit: set a barrier that clears when this instruction's result is written)

* Read Dependency barrier index (3-bit: set a barrier that clears when this instruction's operands have been read)

* Stall Cycles (4-bit)

* Yield Flag (1-bit software hint: the NVidia CU will select a new warp, load-balancing the SMT resources of the compute unit)
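As a rough illustration, the per-instruction fields above can be thought of as a small bit-packed control word the compiler emits alongside each instruction. The field widths below follow the maxas-style layout described for these GPUs (4-bit stall, 1-bit yield, 3-bit write barrier, 3-bit read barrier, 6-bit wait mask, 4-bit reuse), but the packing function and bit order are my own hypothetical sketch, not NVidia's actual encoding:

```python
# Hypothetical bit-packing of Volta-style per-instruction control fields.
# Field widths follow the maxas control-code layout; the exact bit order
# here is illustrative, not NVidia's real format.

def pack_control(stall, yield_flag, write_bar, read_bar, wait_mask, reuse):
    assert 0 <= stall < 16        # 4-bit stall cycles
    assert yield_flag in (0, 1)   # 1-bit yield hint
    assert 0 <= write_bar < 8     # 3-bit write-barrier index (7 = none)
    assert 0 <= read_bar < 8      # 3-bit read-barrier index (7 = none)
    assert 0 <= wait_mask < 64    # 6-bit wait-barrier mask
    assert 0 <= reuse < 16        # 4-bit operand-reuse flags
    word = stall
    word |= yield_flag << 4
    word |= write_bar << 5
    word |= read_bar << 8
    word |= wait_mask << 11
    word |= reuse << 17
    return word

# A long-latency load that sets write barrier 0 and stalls 1 cycle:
load_ctl = pack_control(stall=1, yield_flag=0, write_bar=0,
                        read_bar=7, wait_mask=0, reuse=0)
# A dependent add that waits on barrier 0 (bit 0 of the wait mask):
add_ctl = pack_control(stall=4, yield_flag=0, write_bar=7,
                       read_bar=7, wait_mask=0b000001, reuse=0)
```

The point is that the hardware never has to discover the load-to-add dependency; the compiler already wrote it down.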

Itanium's idea of VLIW was commingled with other ideas, in particular the idea of a static compiler scheduler that minimizes hardware work at runtime.

To my eyes, the benefits of Itanium are implemented in NVidia's GPUs. The compiler that emits NVidia's scheduling flags has been built and proven effective.

Itanium itself, with its crazy "bundling" of instructions and such, seems too complex. The explicit bitmasks/barriers of NVidia Volta seem more straightforward and clearer in describing the dependency graph of the code (and therefore its potential parallelism).

----------

Clearly, static compilers marking what is, and what isn't, parallelizable are useful. NVidia's Volta+ architectures have proven this. Furthermore, compilers that can emit such information already exist. I await the day when other architectures wake up to this fact.
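To make the "compiler describes the dependency graph" point concrete, here is a toy pass that walks a straight-line instruction list, finds read-after-write hazards on long-latency ops, and marks the producer with a set-barrier annotation and the consumer with a wait-mask bit. The instruction model and the six-barrier budget are made-up assumptions for illustration, not any real ISA:

```python
# Toy static-scheduling pass: the compiler, not the hardware, discovers
# RAW hazards and encodes them as set-barrier / wait-barrier annotations.
# The instruction tuples and 6-barrier budget are illustrative assumptions.

LONG_LATENCY = {"ld", "mul"}  # ops whose results aren't ready next cycle

def assign_barriers(insts, num_barriers=6):
    """insts: list of (op, dest, sources). Returns one annotation per
    instruction: {'set_barrier': index or None, 'wait_mask': bitmask}."""
    notes = [{"set_barrier": None, "wait_mask": 0} for _ in insts]
    pending = {}   # dest register -> barrier index guarding its producer
    next_bar = 0
    for i, (op, dest, srcs) in enumerate(insts):
        for s in srcs:          # consumer: wait on any unresolved producer
            if s in pending:
                notes[i]["wait_mask"] |= 1 << pending.pop(s)
        if op in LONG_LATENCY:  # producer: guard the result with a barrier
            bar = next_bar % num_barriers
            next_bar += 1
            pending[dest] = bar
            notes[i]["set_barrier"] = bar
    return notes

prog = [("ld", "r0", ["r9"]),         # long-latency load: sets barrier 0
        ("mov", "r1", ["r8"]),        # independent, no waiting
        ("add", "r2", ["r0", "r1"])]  # consumes r0: waits on barrier 0
notes = assign_barriers(prog)
```

The hardware then only has to honor the barriers, not rediscover the graph, which is exactly the work a scoreboarding pipeline spends area and power on.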


GPUs aren't general-purpose compute. EPIC did fairly well with HPC-style applications as well; it was everything else that was problematic. So yes, there are a fair number of workload and microarchitectural similarities. But right now, those workloads tend to be better handled with a GPU-style offload engine (or, as the industry slowly seems to be moving toward, a lot of fat vector units attached to a normal core).


I'm not talking about the SIMD portion of Volta.

I'm talking about Volta's ability to detect dependencies, which is nil: the core itself probably can't detect dependencies at all. It's entirely left up to the compiler (or at least, that seems to be the case).

AMD's GCN and RDNA architectures are still scanning for read/write hazards like any ol' pipelined architecture you learned about in college. The NVidia Volta approach is new and probably should be studied from an architectural point of view.

Yeah, it's a GPU feature on NVidia Volta. But it's pretty obvious to me that this explicit dependency-barrier scheme could be part of a future ISA, even one for traditional CPUs.


FWIW, this article suggests the static software scheduling you are describing was introduced in Kepler, so it's probably at least not entirely new in Volta:

https://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-r...

> NVIDIA has replaced Fermi’s complex scheduler with a far simpler scheduler that still uses scoreboarding and other methods for inter-warp scheduling, but moves the scheduling of instructions in a warp into NVIDIA’s compiler. In essence it’s a return to static scheduling

and I think this is describing more or less the same thing in Maxwell: https://github.com/NervanaSystems/maxas/wiki/Control-Codes


I appreciate the info. Apparently NVidia has been doing this for more years than I expected.


I think you're conflating OoO and speculative execution. It was OoO which the Itanium architects (apparently) didn't think would work as well as it did. OoO, and being able to build wide superscalar machines that can dynamically determine instruction dependency chains, is what killed EPIC.

Speculative execution is something you would want to do with Itanium as well; otherwise the machine is going to be stalling all the time waiting for branches/etc. Similarly, later Itaniums went OoO (dynamically scheduled) because, as it turns out, the compiler can't know runtime state.

https://www.realworldtech.com/poulson/

Also while googling for that, ran across this:

https://news.ycombinator.com/item?id=21410976

PS: speculative execution is here to stay. It might be wrapped in more security domains, and/or it's going to be just one more nail in the business model of selling shared compute (something that was questionable from the beginning).


   questionable from the beginning
Agreed. If you look at the majority of compute loads (e.g. Instagram, Snap, Netflix, HPC), then that's (a) not particularly security-critical, and (b) so big that the vendors can split their workload into security-critical / not security-critical, and rent secure machines for the former and fast machines for the latter.

I wonder which cloud provider is the first to offer this in a coherent way.


I dimly recall reading an interview with one of Intel's Sr. Managers on the Itanium project where he explained his thoughts on why Itanium failed.

His explanation centred on the fact that Intel decided early on that Itanium would only ever be an ultra-high-end niche product, and only built devices for which Intel could demand very high prices. This in turn meant that almost no one outside the few companies supporting Itanium development, and certainly not most of the people working on other compilers and developer tools at the time, had any interest in working on Itanium, because they simply could not justify the expense of obtaining the hardware.

So all the organic open-source activity that goes on for all the other platforms, which are easily obtainable by pedestrian users, simply did not happen for Itanium. Intel did not plan for that up front (though in hindsight it seems obvious), and by the time it was widely recognised within the management team, no one was willing to devote the scale of resources required for serious development of developer tools on a floundering project.


> Itanium was an excellent idea that needed investment in compilers.

ISTR that Intel & HP spent well over a billion dollars on VLIW compiler R&D, with crickets to show for it all.

How much are you suggesting should be spent this time for a markedly different result?


By the late 2000s, instruction-scheduling research was largely considered done and dusted, with papers like:

https://dl.acm.org/doi/book/10.5555/923366

https://dl.acm.org/doi/10.1145/349299.349318

and many, many others (it produced so many PhDs in the 90s). And, needless to say, HP and Intel hired many excellent researchers during the heyday of Itanium. So I don't know on what basis you think there wasn't enough investment; I have no choice but to assume you're ignorant of the actual history here, both in academia and industry.

It turns out instruction scheduling cannot overcome the challenges of variable memory and cache latency and of branch misprediction, because all of those are dynamic and unpredictable for "integer" applications (i.e. the bulk of the code running on the CPUs of your laptop and cell phone). And predication, which was one of the "solutions" meant to overcome branch-misprediction penalties, turns out to be not very efficient and limited in its application.
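For readers unfamiliar with predication: the idea is if-conversion, turning a short branch into code where both sides execute unconditionally and a predicate selects the surviving result, trading wasted work for the absence of a misprediction. A toy Python analogue (Itanium did this in hardware with predicate registers; this sketch only mimics the dataflow):

```python
# If-conversion: replace a short branch with predicate-guarded execution.
# Both "paths" run; the predicate selects which result survives. This is
# a Python analogue of what Itanium's predicate registers did in hardware.

def abs_branchy(x):
    if x < 0:          # a real branch: a mispredict means a pipeline flush
        return -x
    return x

def abs_predicated(x):
    p = int(x < 0)                  # predicate: 1 if negative, else 0
    return p * (-x) + (1 - p) * x   # both sides computed, one kept

# The cost: when most inputs are positive, the -x work is wasted, which
# is why predication only pays off on genuinely hard-to-predict branches.
```

That cost model is exactly the inefficiency the paragraph above is pointing at.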

For integer applications, it turns out instruction-level parallelism isn't really the issue. It's about how to generate and maintain as many outstanding cache misses as possible at a time. VLIW turns out to be insufficient and inefficient for that. The minor attempts at addressing it through prefetches and more elaborate markings around loads/stores all failed to give good results.
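A back-of-the-envelope way to see the outstanding-misses point is Little's law: sustained memory bandwidth is bounded by (outstanding cache lines x line size) / miss latency. The numbers below are round illustrative assumptions, not measurements of any particular chip:

```python
# Little's law applied to memory-level parallelism: sustained bandwidth
# is limited by (outstanding cache lines * line size) / miss latency.
# All parameters are round illustrative numbers, not real-chip specs.

LINE_BYTES = 64        # assumed cache-line size
MISS_LATENCY_NS = 100  # assumed DRAM round-trip latency

def sustained_gb_per_s(outstanding_misses):
    bytes_per_ns = outstanding_misses * LINE_BYTES / MISS_LATENCY_NS
    return bytes_per_ns  # 1 byte/ns is approximately 1 GB/s

# One miss at a time (a fully stalled in-order core): ~0.64 GB/s.
single = sustained_gb_per_s(1)
# Ten overlapped misses (an OoO window keeping ten loads in flight): ~6.4 GB/s.
overlapped = sustained_gb_per_s(10)
```

A 10x difference from miss overlap alone, with zero change in ILP, is why dynamic machines that keep many loads in flight beat statically scheduled ones on pointer-chasing integer code.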

For HPC-type workloads, it turns out data parallelism and thread-level parallelism are much more efficient ways to improve performance, and they also make ILP on a single instruction stream play only a very minor role. GPUs and ML accelerators demonstrate this very clearly.

As for security and speculative execution: speculative execution is not going anywhere. Naturally, there is plenty of research around this, like:

https://ieeexplore.ieee.org/abstract/document/9138997

https://dl.acm.org/doi/abs/10.1145/3352460.3358306

It will take a while before real pipelines implement ideas like the above, so we may continue to see smaller and smaller vulnerabilities as the industry collectively plays its whack-a-mole game. But I don't see a world where a top-of-the-line general-purpose microprocessor gives up on speculative execution; the performance gain is simply too big.

I have yet to meet any academic, industry processor architect, or compiler engineer who thinks VLIW / Itanium is the way to move forward.

This is not to say that putting more work on the compiler is a bad idea, as NVidia has demonstrated. But what they are doing is not VLIW.



