Hacker News

Just to qualify that "math execution" part: the beauty of Ray is that you get threadpool-like features to speed up arbitrary Python code. So not just parallelism, but also state/variable sharing for relatively small data. That makes it great for some optimizers and definitely for RL (where your "math" is some really complicated simulation / loss logic), but Ray wouldn't make much sense for BLAS-style number crunching. Am I missing something here?

Ray shows expertise in multi-machine execution that's lacking in frameworks like JAX, TensorFlow, and PyTorch. Horovod nailed down a lot of the performance issues for SGD in particular, but it's missing the sort of rapid deployment / distribution story Ray has. If only they could all work together ...



