OK, I am one of the developers on the onnxruntime team. I previously worked on the ROCm EP and have now been transferred to the QNN EP. The following is purely a dev rant and the opinions are mine.
So ROCm already sucks, but QNN sucks even harder!
The conclusion here is NVIDIA knows how to make software that just works.
AMD makes software that might work.
Qualcomm, however, doesn't know shit about making useful software.
The dev experience is another level of disaster with Qualcomm. Their tools and APIs return absolutely zero useful information about what error you are getting, just an error code that you can grep for in the SDK's include headers. To debug an error code, you need strace to get the internal error string on the device. Their profiler merely gives you a trace that cannot be associated back to the original compute logic, and the reported runtimes have very high stddev. Their docs website is not indexed by the MF search engines, let alone LLMs, so if you have any question, good luck!
So if you don't have a reason to use QNN, just don't use it (or any other NPU you can name, for that matter).
Back to the benchmark script: there are quite a few flaws as far as I can see.
1. The session is not warmed up and the iteration count is too small.
2. The onnx graph is too small; I suspect the onnxruntime overhead cannot be ignored in this case. Try stacking more gemms in the graph instead of naively increasing the iteration count.
3. "htp_performance_mode": "sustained_high_performance" might give lower perf compared to "burst" mode (see the sketch below).
A more reliable way to benchmark might be to dump the context binary[1] and the context inputs[2], then run them with qnn-net-run to get rid of the onnxruntime overhead.
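If you stay on the onnxruntime side, here is a minimal sketch of what I mean by points 1 and 3, assuming the C++ API, a pre-built stacked-gemm model ("stacked_gemm.onnx" with a single [1, 512] float input "X" and output "Y" — the model path, names, shapes, and the backend_path value are made up for illustration), and the HTP backend:

    // Sketch only: warm up the session, switch the HTP to burst mode, and
    // average over many iterations. Model path, tensor names and shapes are
    // placeholders, not taken from the original script.
    #include <onnxruntime_cxx_api.h>
    #include <chrono>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    int main() {
      Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "qnn-bench");
      Ort::SessionOptions so;
      // QNN EP provider options; "burst" instead of "sustained_high_performance".
      std::unordered_map<std::string, std::string> qnn_opts{
          {"backend_path", "libQnnHtp.so"},   // assumption: Linux/Android HTP backend
          {"htp_performance_mode", "burst"}};
      so.AppendExecutionProvider("QNN", qnn_opts);
      Ort::Session session(env, "stacked_gemm.onnx", so);  // hypothetical model

      std::vector<float> data(512, 1.0f);
      std::vector<int64_t> shape{1, 512};
      auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
      Ort::Value input = Ort::Value::CreateTensor<float>(
          mem, data.data(), data.size(), shape.data(), shape.size());
      const char* in_names[] = {"X"};   // assumption: graph input name
      const char* out_names[] = {"Y"};  // assumption: graph output name

      // Warm-up, so the first-inference graph preparation on the NPU does not
      // pollute the measurement.
      for (int i = 0; i < 100; ++i)
        session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);

      const int iters = 1000;
      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < iters; ++i)
        session.Run(Ort::RunOptions{nullptr}, in_names, &input, 1, out_names, 1);
      auto t1 = std::chrono::steady_clock::now();
      std::printf("avg latency: %.1f us\n",
                  std::chrono::duration<double, std::micro>(t1 - t0).count() / iters);
    }

I'd also report min and stddev rather than just the mean, given how noisy these runs are.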
The symbolic library (the type of each activation) requires a branch at the very core of the kernel. The GPU will need to serialize these operations warp-wise.
To optimize, you might want to do a scan operation beforehand and dispatch to the activation functions in a warp-specialized way; this, however, makes the global memory reads/writes non-coalesced.
You may then sort the input by activation type and store it in that order; this makes the gmem IO coalesced but requires a gather and a scatter as pre- and post-processing.
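To make the trade-off concrete, here is a minimal CUDA sketch of the naive version, assuming a per-element activation tag (the enum, names, and shapes are illustrative, not from any real codebase):

    // A fused elementwise kernel where every element carries its own activation
    // tag. The switch below is the divergent branch: lanes of the same warp that
    // hit different cases execute them serially.
    #include <cuda_runtime.h>

    enum ActKind : int { kRelu = 0, kTanh = 1, kSigmoid = 2 };

    __global__ void apply_activations(const float* __restrict__ x,
                                      const int* __restrict__ kind,  // per-element tag
                                      float* __restrict__ y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float v = x[i];
      switch (kind[i]) {  // the branch at the very core of the kernel
        case kRelu:    y[i] = fmaxf(v, 0.0f);           break;
        case kTanh:    y[i] = tanhf(v);                 break;
        case kSigmoid: y[i] = 1.0f / (1.0f + expf(-v)); break;
      }
    }

    int main() {
      const int n = 1 << 20;
      float *x, *y; int *kind;
      cudaMallocManaged((void**)&x, n * sizeof(float));
      cudaMallocManaged((void**)&y, n * sizeof(float));
      cudaMallocManaged((void**)&kind, n * sizeof(int));
      // Worst case for divergence: the tags interleave inside every warp.
      for (int i = 0; i < n; ++i) { x[i] = (i % 7) - 3.0f; kind[i] = i % 3; }
      apply_activations<<<(n + 255) / 256, 256>>>(x, kind, y, n);
      cudaDeviceSynchronize();
      cudaFree(x); cudaFree(y); cudaFree(kind);
      return 0;
    }

The two fixes then trade against each other exactly as described: warp-specializing on a pre-scanned kind loses coalescing, while sorting by kind restores coalescing but needs the gather before and the scatter after this kernel.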
I have long sought a CUDA or HIP compiler that targets SPIR-V or DXIL, so that we can compile all those neural network kernels for almost all compute devices.
The requirements are:
1. Extends C++17, so template metaprogramming works and cub or cutlass/cute can be used (see the sketch after this list).
2. AOT compilation.
3. No shitshow!
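To make requirement 1 concrete, this is roughly what using cub looks like: the usual two-phase temp-storage query plus template-based dispatch (a generic sketch of cub's DeviceScan, not code from any particular project), which is the kind of C++ a purely-C kernel language simply can't host:

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
      const int n = 1 << 20;
      std::vector<int> h_in(n, 1);           // all-ones input, so scan[i] = i + 1
      int *d_in = nullptr, *d_out = nullptr;
      cudaMalloc((void**)&d_in, n * sizeof(int));
      cudaMalloc((void**)&d_out, n * sizeof(int));
      cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

      // Two-phase pattern: the first call only computes the temp storage size.
      void* d_temp = nullptr;
      size_t temp_bytes = 0;
      cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n);
      cudaMalloc(&d_temp, temp_bytes);
      // The second call runs the scan; the implementation is selected at
      // compile time from the iterator/value types via template dispatch.
      cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n);
      cudaDeviceSynchronize();

      int last = 0;
      cudaMemcpy(&last, d_out + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
      std::printf("scan[n-1] = %d (expect %d)\n", last, n);

      cudaFree(d_temp); cudaFree(d_in); cudaFree(d_out);
      return 0;
    }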
The only thing that comes close is circle[1].
- OpenCL is a no-go, as it is purely C. It is also a complete shitshow, especially on Android devices; the vendor drivers are the main source of it, and JIT compilation adds the rest.
- Vulkan+GLSL is a no-go. The degree of shittiness is on par with OpenCL, due to the drivers and the JIT compilers.
- slang[2] has potential, but its metaprogramming is not as strong as C++'s, so existing libraries cannot be used.
The above conclusion is drawn from my work on the OpenCL EP for onnxruntime. It was purely a nightmare to work with those drivers and JIT compilers. Hopefully Vcc can take compute shaders more seriously.
At a Vulkanised 2023 discussion round, Khronos admitted that they aren't going to improve GLSL any further and, ironically, will rely on Microsoft's HLSL work as the main shader language to go alongside Vulkan.
Maybe something else was discussed at Vulkanised 2024, but I doubt it.
There was some SYCL work to target Vulkan, but it seems to have been a paper attempt that fizzled out.
At the time I was developing the EP, the tooling was not as good as it is now. I had imagined a pipeline where HLSL compiles down to DXIL, goes through SPIRV-Cross, and then targets a wide variety of mobile devices via the OpenCL runtime. But those tools are more focused on the graphics side and cannot handle the kernel execution model, not to mention the structured control flow issues, so it was definitely not going to work. OpenGL does not work with the imagined pipeline either, because IIRC it cannot consume SPIR-V bytecode. Vulkan was so niche that it was discarded very early. The final result with the OpenCL runtime and the CL language worked, but the drivers are a mess [facepalm]
> At a Vulkanised 2023 discussion round, Khronos admitted that they aren't going to improve GLSL any further and, ironically, will rely on Microsoft's HLSL work as the main shader language to go alongside Vulkan.
That sounds intriguing, but I haven't been able to find any references to it (I guess it was discussed in the panel, but the video of it is private). Do you have any reference or more information on it?
It's still early days for Vcc; I outline the caveats on the landing page. While I'm confident the control-flow bits and whatnot will work robustly, there's a big open question when it comes to the fate of standard libraries; the likes of libstdc++ were not designed for this use case.
We'll be working hard on it all the way to Vulkanised; if you have some applications you can get up and running by then, feel free to get in touch.
I think the driver ecosystem for Vulkan is rather high-quality, but that's more my (biased!) opinion than something I have hard data on. The Mesa/NIR-based drivers in particular are very nice to work with!
Thoes "existing libraries" does not necessary mean stdc++, but some parallel primitive, and are essential to performance portability. For example, cub for scan and reduction, cutlass for dense linear algebra[1].
> I think the driver ecosystem for Vulkan is rather high-quality
Sorry, I meant OpenGL. At the time of evaluation, the market share of Vulkan on Android devices was too small, so it was ruled out at a very early stage. I'd assume the situation has changed a lot since then.
It is really good to see more projects take a shot at compiling C++ to GPUs natively.
[1] cutlass itself is not portable, but the recently added cute is quite portable, as far as I evaluated. It provides a unified abstraction for hierarchical layout decomposition along with copy and gemm primitives.
Edit: Never mind, I think I misunderstood the purpose of this project. I thought it was a CUDA competitor, but it seems to be just a shading-language compiler for graphics.
See https://github.com/google/clspv for an OpenCL implementation on Vulkan Compute. There are plenty of quirks involved because the two standards use different varieties of SPIR-V ("kernels" vs. "shaders") and provide different guarantees (Vulkan Compute doesn't care much about numerical accuracy). The Mesa folks are also looking into this as part of their RustiCL (a modern OpenCL implementation) and Zink (implementing OpenGL and perhaps OpenCL itself on Vulkan) projects.
It compiles CUDA/HIP C++ to SPIR-V that can run on top of OpenCL or Level Zero. (It does require OpenCL's compute-flavored SPIR-V, instead of the graphics-flavored SPIR-V seen in OpenGL or Vulkan. I also think it requires some OpenCL extensions that are currently exclusive to Intel NEO, but which should on paper be coming to Mesa's rusticl implementation too.)
Not exactly; both CL and GLSL can be AOT-compiled, but then the runtime will be limited to some newer version and the market coverage will be niche. Those vendors are so lazy about updating the drivers and fixing the compiler bugs...
You might also want to keep an eye on Helix: if PR 8675 is merged, Scheme will be the extension language, and then we get the best of both worlds. Modal editing and Scheme =)
There's also Lem, which has a good vim mode and is scriptable in Common Lisp (since it's built in CL) :D https://github.com/lem-project/lem/ It has: LSP support, a treeview, project-related commands, a directory mode, a POC git mode… with ncurses and SDL2 UIs.
the text rendering on the SDL2 frontend is much better than the last time I tried it, and performance is much improved. the install process used to be quite painful before but is very easy now. maybe sometime in the next 5 years i'll build enough willpower to start trying to port over all my stupid little emacs customizations. but it would be a bummer to lose tramp.
Helix is great, I have a plugin in mind, and I aspire to get more acquainted with the Lisp perspective (though I find Emacs a bit much). It's gonna be fun.
Very good taste for an American.