
It makes a lot more sense when programming it as a vector machine. The divergence/convergence stuff is the CUDA/SIMT model; there's no need to use that on amdgpu. Branches are cheap(ish) when they're done on the scalar unit.

Coroutines aren't currently a thing on amdgpu but I think they should be.



> The divergence/convergence stuff is the CUDA/SIMT model

Even on NVidia, you're "allowed" to diverge and converge. But it's not efficient.

Optimal NVidia coding will force more convergence than divergence. That's innate to GPU architecture. It's more efficient to run 32-at-a-time per NVidia warp than a diverged 8-at-a-time warp.

Yes, NVidia _CAN_ diverge and properly execute a subwarp of 8-at-a-time per clock tick... including with complex atomics and all that. But running a full 32-at-a-time warp is 400% the speed, because it's ALWAYS better to do more per clock tick than less per clock tick.
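The 400% figure falls out of a simple back-of-the-envelope model. This is a toy sketch (my own function names, not any real API) of the assumption above: a warp issues one instruction per clock tick no matter how many lanes are enabled, so useful work per tick scales with the active-lane count.

```python
# Toy model of SIMT warp throughput (illustrative only, not real
# hardware timing). Assumption: the warp consumes one clock tick per
# instruction regardless of how many of its 32 lanes are active.

WARP_SIZE = 32

def lane_results(active_lanes: int, ticks: int) -> int:
    """Per-lane results produced over `ticks` clock ticks.

    A diverged warp with fewer active lanes still burns the same
    number of ticks, so it simply produces fewer results.
    """
    assert 1 <= active_lanes <= WARP_SIZE
    return active_lanes * ticks

converged = lane_results(32, 100)  # full warp, all lanes enabled
diverged = lane_results(8, 100)    # 8-lane subwarp after divergence

print(converged // diverged)  # -> 4, i.e. the "400%" above
```

Under this model a fully converged warp does 4x the work of an 8-lane diverged one in the same number of ticks, which is the "400% the speed" claim stated arithmetically.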



