I personally find the experience of writing GPU compute code pretty nice on graphics APIs. The interface is pretty much the same everywhere: dispatch a 1-3D grid of work groups, where each work group is itself a 1-3D block of invocations.
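Concretely, in OpenGL 4.3 compute that whole model is a shader with a fixed local size plus one dispatch call. Rough sketch below, assuming a live GL 4.3+ context and a loader like glad; the doubling kernel and the function name are just placeholders for illustration, and error checking is omitted:

```c
#include <glad/glad.h>  /* or whatever GL loader you use */

static const char *kernel_src =
    "#version 430\n"
    /* each work group is a 1-3D block of invocations, here 64x1x1 */
    "layout(local_size_x = 64) in;\n"
    "layout(std430, binding = 0) buffer Data { float v[]; };\n"
    "void main() {\n"
    "    uint i = gl_GlobalInvocationID.x;\n"
    "    v[i] = v[i] * 2.0;\n"
    "}\n";

void dispatch_example(GLuint n_elems)
{
    /* compile and link the compute shader */
    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource(shader, 1, &kernel_src, NULL);
    glCompileShader(shader);

    GLuint prog = glCreateProgram();
    glAttachShader(prog, shader);
    glLinkProgram(prog);
    glUseProgram(prog);

    /* the dispatch itself: a 1-3D grid of work groups,
     * directly analogous to CUDA's <<<grid, block>>> */
    glDispatchCompute((n_elems + 63) / 64, 1, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}
```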
The main pain points vs dedicated compute stacks like CUDA are the library ecosystem and the boilerplate needed to manage memory and launch kernels.
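To make the boilerplate point concrete, here's roughly what getting data in and out looks like with a GL shader storage buffer. The CUDA equivalents in the comments are the point: each one is a single call. (Sketch only; assumes the same GL 4.3 context and the linked program from the snippet above, and `run_on_buffer` is just an illustrative name.)

```c
#include <glad/glad.h>

void run_on_buffer(GLuint prog, float *host, GLuint n_elems)
{
    GLuint ssbo;
    glGenBuffers(1, &ssbo);                                    /* ~ cudaMalloc */
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER,
                 n_elems * sizeof(float), host,
                 GL_DYNAMIC_COPY);                             /* ~ cudaMemcpy H2D */
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);       /* match binding = 0 */

    glUseProgram(prog);
    glDispatchCompute((n_elems + 63) / 64, 1, 1);              /* ~ kernel<<<grid, 64>>> */
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0,
                       n_elems * sizeof(float), host);         /* ~ cudaMemcpy D2H */
    glDeleteBuffers(1, &ssbo);                                 /* ~ cudaFree */
}
```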
And then there's the question of how to make the kernel and memory-allocation code work with TensorFlow/PyTorch. As far as ML is concerned, GPGPU is really just a handful of libraries that TensorFlow and PyTorch invoke, and that's as true of CUDA as of anything else.