Hacker News | cloudhan's comments

> The official project language is American English with ISO 8601 dates and metric units.

Very good taste for an American.


OK, I am one of the developers on the onnxruntime team. I previously worked on the ROCm EP and have now been transferred to the QNN EP. The following is purely a dev rant and the opinions are mine.

So ROCm already sucks, but QNN sucks even harder!

The conclusion here is that NVIDIA knows how to make software that just works. AMD makes software that might work. Qualcomm, however, knows jack shit about making useful software.

The dev experience is just another level of disaster with Qualcomm. Their tools and APIs return absolutely zero useful information about what error you are getting, just an error code that you can grep for in the SDK's include headers. To debug an error code, you need strace to get the internal error string on the device. Their profiler merely gives you a trace that cannot be associated back to the original computing logic, with very high stddev on the runtime. Their docs website is not indexed by the MF search engines, let alone LLMs, so if you have any questions, good luck!

So if you don't have a reason to use QNN, just don't use it (and the same goes for any other NPU you can name).

Back to the benchmark script. There are a lot of flaws, as far as I can see.

1. The session is not warmed up and the iteration count is too small.

2. The onnx graph is too small; I suspect the onnxruntime overhead cannot be ignored in this case. Try stacking more GEMMs in the graph instead of naively increasing the iteration count.

3. The "htp_performance_mode": "sustained_high_performance" setting might give lower perf compared to "burst" mode.
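A minimal sketch of all three fixes (the model file stacked_gemm.onnx, its input shape, and the exact provider options here are my assumptions, not the original script's):

  import time
  import numpy as np
  import onnxruntime as ort

  sess = ort.InferenceSession(
      "stacked_gemm.onnx",  # hypothetical graph with many stacked GEMMs
      providers=[("QNNExecutionProvider", {
          "backend_path": "QnnHtp.dll",      # HTP backend
          "htp_performance_mode": "burst",   # instead of sustained_high_performance
      })],
  )
  feed = {sess.get_inputs()[0].name:
          np.random.rand(1024, 1024).astype(np.float32)}

  for _ in range(50):        # warm up first, excluded from timing
      sess.run(None, feed)

  lat = []
  for _ in range(1000):      # a much larger iteration count
      t0 = time.perf_counter()
      sess.run(None, feed)
      lat.append(time.perf_counter() - t0)
  lat = np.array(lat) * 1e3
  print(f"median {np.median(lat):.3f} ms, p99 {np.percentile(lat, 99):.3f} ms")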

A more reliable way to benchmark might be to dump the context binary[1] and the context inputs[2], then run them with qnn-net-run to get rid of the onnxruntime overhead.

[1]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn... [2]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn...


NPU folks oftentimes say

> it's not enough time to get new silicon designs specifically for <blahblah>

where <blahblah> stands for a model architecture that has caused a paradigm shift.

When you need new silicon for a new model, you are already losing.


Yes.


This reminds me of Weight Agnostic Neural Networks https://weightagnostic.github.io/


Very GPU-unfriendly.

The symbolic library (the set of activation types) requires branching at the very core of the kernel. The GPU will need to serialize these operations warp-wise.

To optimize, you might want to do a scan operation beforehand and dispatch to the activation funcs in a warp-specialized way; this, however, makes the global memory reads/writes non-coalesced.

You may then sort the input based on activation type and store it in that order; this makes the gmem IO coalesced but requires a gather and a scatter as pre- and post-processing.
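In numpy terms (with a placeholder three-function library standing in for the real symbolic set), the sort/gather/scatter scheme looks roughly like this:

  import numpy as np

  acts = [np.sin, np.tanh, np.abs]             # placeholder activation library
  x = np.random.randn(1 << 20)
  t = np.random.randint(0, len(acts), x.size)  # per-element activation type

  order = np.argsort(t, kind="stable")         # gather: group elements by type
  xs, ts = x[order], t[order]
  ys = np.empty_like(xs)
  for k, f in enumerate(acts):                 # each group is now contiguous,
      sel = ts == k                            # i.e. coalesced gmem IO on a GPU
      ys[sel] = f(xs[sel])

  y = np.empty_like(x)
  y[order] = ys                                # scatter back to the original order

On a GPU, the per-type loop would become one branch-free pass per contiguous group rather than a per-element switch.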


Wouldn't it be faster to calculate every function type and then just multiply them by 0s or 1s to keep the active ones?


That's pretty much how branching on GPUs already works: under divergence, a warp executes both sides of the branch and inactive lanes are simply masked off.
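A tiny numpy sketch of that compute-everything-then-mask idea (same placeholder activation library as above):

  import numpy as np

  acts = [np.sin, np.tanh, np.abs]             # placeholder library
  x = np.random.randn(8)
  t = np.random.randint(0, len(acts), x.size)  # per-element activation type

  all_y = np.stack([f(x) for f in acts])       # evaluate every function
  mask = np.arange(len(acts))[:, None] == t    # 0/1 selector per element
  y = (all_y * mask).sum(axis=0)               # keep only the active result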


Couldn't you implement these as a texture lookup, where x is the input and the various functions are stacked in y? That should be quite fast on GPUs.
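For what it's worth, a numpy approximation of that lookup idea (nearest-neighbour sampling over an assumed [-4, 4] domain; a real texture fetch would also give you linear interpolation for free):

  import numpy as np

  acts = [np.sin, np.tanh, np.abs]             # placeholder library
  xs = np.linspace(-4.0, 4.0, 256)             # sampled input domain
  tex = np.stack([f(xs) for f in acts])        # "texture": functions stacked in y

  x = np.random.uniform(-4.0, 4.0, 8)
  t = np.random.randint(0, len(acts), x.size)  # which row to sample
  col = np.clip(np.round((x + 4.0) / 8.0 * 255).astype(int), 0, 255)
  y = tex[t, col]                              # one lookup per element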


You run or your code runs, choose one and choose it wisely ;)


The memory is too small to be useful nowadays.


I have long sought a CUDA or HIP compiler that targets SPIR-V or DXIL, so that we can compile all those neural network kernels for almost all compute devices.

The requirements are:

1. Extends C++17, which means template metaprogramming works, so that cub or cutlass/cute works.

2. AOT compilation.

3. No shitshow!

The only thing that comes close is circle[1].

- OpenCL is a no-go, as the kernel language is purely C. And it is a shitshow, especially on Android devices: the vendor drivers are the main source of pain, and JIT compilation adds the rest.

- Vulkan+GLSL is a no-go. The degree of shittiness is on par with OpenCL, due to the drivers and the JIT compilers.

- slang[2] has potential, but its metaprogramming is not as strong as C++'s, so existing libraries cannot be used.

The above conclusions are drawn from my work on the OpenCL EP for onnxruntime[3]. It is purely a nightmare to work with those drivers and JIT compilers. Hopefully Vcc can take compute shaders more seriously.

[1]: https://www.circle-lang.org/

[2]: https://shader-slang.com/

[3]: https://github.com/microsoft/onnxruntime/tree/dev/opencl


What about HLSL, especially since it is a kind of C++ flavour, particularly after the HLSL 2021 improvements?

https://devblogs.microsoft.com/directx/opening-hlsl-planning...

https://devblogs.microsoft.com/directx/announcing-hlsl-2021/

At a Vulkanised 2023 discussion round, Khronos admitted that they aren't going to improve GLSL any further and, ironically, rely on Microsoft's HLSL work as the main shader language to go alongside Vulkan.

Maybe something else was discussed at Vulkanised 2024, but I doubt it.

There was some SYCL work to target Vulkan, but it seems to have been a paper attempt that fizzled out.

https://dl.acm.org/doi/fullHtml/10.1145/3456669.3456683


At the time of the development of the EP, the tooling was not as good as it is now. I had imagined a pipeline where HLSL compiles down to DXIL, then goes through SPIRV-Cross, and then targets a wide variety of mobile devices with an OpenCL runtime. But those tools are more focused on the graphics part and cannot work with the kernel execution model, not to mention the structured control flow issues; it was definitely not going to work. OpenGL does not work with the imagined pipeline because, IIRC, it cannot consume SPIR-V bytecode. Vulkan was so niche that it was discarded very early. The final result, with the OpenCL runtime and the CL language, worked, but the drivers are a mess [facepalm]


> OpenGL does not work with the imagined pipeline, because IIRC it cannot consume SPIRV bytecode.

What gave you that idea?

  $ eglinfo -a gl -p wayland | grep spirv
  GL_ARB_get_program_binary, GL_ARB_get_texture_sub_image, GL_ARB_gl_spirv, 
  GL_ARB_sparse_texture_clamp, GL_ARB_spirv_extensions, 
  GL_ARB_get_program_binary, GL_ARB_get_texture_sub_image, GL_ARB_gl_spirv, 
  GL_ARB_spirv_extensions, GL_ARB_stencil_texturing, GL_ARB_sync,
There it is: <https://registry.khronos.org/OpenGL/extensions/ARB/ARB_gl_sp...>. Not in OpenGL ES, though.


> At a Vulkanised 2023 discussion round, Khronos admitted that they aren't going to improve GLSL any further and, ironically, rely on Microsoft's HLSL work as the main shader language to go alongside Vulkan.

That sounds intriguing, but I haven't been able to find any references to it (I guess it was discussed in the panel, but the video of it is private). Do you have any references or more information on it?

Is it related to adding HLSL support to clang?



> Khronos admitted that they [...] ironically rely on Microsoft's HLSL work as the main shader language to go alongside Vulkan.

So Cg ultimately prevailed over GLSL. Can't say that disappoints me.


It's still early days for Vcc; I outline the caveats on the landing page. While I'm confident the control-flow bits and whatnot will work robustly, there's a big open question when it comes to the fate of standard libraries: the likes of libstdc++ were not designed for this use case.

We'll be working hard on it all the way to Vulkanised; if you have some applications you can get up and running by then, feel free to get in touch.

I think the driver ecosystem for Vulkan is rather high-quality, but that's more my (biased!) opinion than something I have hard data on. The Mesa/NIR-based drivers in particular are very nice to work with!


Thoes "existing libraries" does not necessary mean stdc++, but some parallel primitive, and are essential to performance portability. For example, cub for scan and reduction, cutlass for dense linear algebra[1].

> I think the driver ecosystem for Vulkan is rather high-quality

Sorry, I meant OpenGL. At the time of evaluation, the market share of Vulkan on Android devices was too small, so it was ruled out at a very early stage. I'd assume the situation has changed a lot since then.

It is really good to see more projects take a shot at compiling C++ to GPUs natively.

[1] cutlass itself is not portable, but the recently added cute is quite portable, as I evaluated. It provides a unified abstraction for hierarchical layout decomposition, along with copy and GEMM primitives.


Will C++17 parallel algorithms be supported?

https://on-demand.gputechconf.com/supercomputing/2019/pdf/sc...

Edit: Never mind, I think I misunderstood the purpose of this project. I thought it was a CUDA competitor, but it seems to be just a shading-language compiler for graphics.


SYCL/DPC++ are the only viable CUDA competitors I would say, assuming that the tooling gets feature parity.


circle lang is also very much worth checking out.


See https://github.com/google/clspv for an OpenCL implementation on Vulkan Compute. There are plenty of quirks involved because the two standards use different varieties of SPIR-V ("kernels" vs. "shaders") and provide different guarantees (Vulkan Compute doesn't care much about numerical accuracy). The Mesa folks are also looking into this as part of their RustiCL (a modern OpenCL implementation) and Zink (implementing OpenGL and perhaps OpenCL itself on Vulkan) projects.


chipStar (formerly CHIP-SPV) might also be worth checking out: https://github.com/CHIP-SPV/chipStar

It compiles CUDA/HIP C++ to SPIR-V that can run on top of OpenCL or Level Zero. (It does require OpenCL's compute-flavored SPIR-V, instead of the graphics-flavored SPIR-V seen in OpenGL or Vulkan. I also think it requires some OpenCL extensions that are currently exclusive to Intel NEO, but should, on paper, be coming to Mesa's rusticl implementation too.)


GCC supports NVPTX and AMD GCN via OpenMP offloading and OpenACC. I have no idea how well it works.


All of these are JIT-compiled by the driver in the end, though?


Not exactly; both CL and GLSL can be AOT-compiled, but then the runtime will be limited to newer versions and the market coverage will be niche. Those vendors are so lazy about updating the drivers and fixing the compiler bugs...


Having worked for a few of those lazy vendors: the AOT output usually just ends up being bitcode which is fully JIT-compiled later.


Might be the training code related to the model: https://github.com/mistralai/megablocks-public/tree/pstock/m...


Mixtral-8x7B support --> Support new model

https://github.com/stanford-futuredata/megablocks/pull/45


You might also want to keep an eye on Helix: if PR 8675 is merged, then Scheme will be the extension language, and we get the best of both worlds: modal editing and Scheme =)

PR 8675: https://github.com/helix-editor/helix/pull/8675


There's also Lem, which has a good vim mode and is scriptable in Common Lisp (since it's built in CL) :D https://github.com/lem-project/lem/ It has: LSP support, a treeview, project-related commands, a directory mode, a POC git mode… with ncurses and SDL2 UIs.


The text rendering on the SDL2 frontend is much better than the last time I tried it, and performance is much improved. The install process used to be quite painful but is very easy now. Maybe sometime in the next 5 years I'll build up enough willpower to start porting over all my stupid little Emacs customizations. But it would be a bummer to lose TRAMP.


I'm pretty excited about that.

Helix is great, I have a plugin in mind, and I aspire to get more acquainted with the Lisp perspective (though I find Emacs a bit much). It's gonna be fun.


Man, this will be so great! Getting a plug-in system in place will be huge for Helix. Gimme that sweet, sweet Copilot.

