Conceptually it's simpler to think about and optimize. But you can also write it using einsum to do the sum-product reductions (I've updated some comments to show how), which uses less memory but is more intimidating.
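As a minimal sketch of what I mean (the shapes and the `q`/`k` names here are hypothetical, not the actual layer), einsum fuses the product and the sum so the big broadcast intermediate never exists:

```python
import torch

# Hypothetical sum-product reduction: out[b,i,j] = sum_d q[b,i,d] * k[b,j,d].
# The broadcast version materializes a (B, I, J, D) intermediate tensor;
# einsum fuses the multiply and the sum over d, so memory stays at (B, I, J).
B, I, J, D = 8, 128, 128, 64
q = torch.randn(B, I, D)
k = torch.randn(B, J, D)

out_broadcast = (q[:, :, None, :] * k[:, None, :, :]).sum(-1)  # O(B*I*J*D) memory
out_einsum = torch.einsum('bid,bjd->bij', q, k)                # O(B*I*J) memory

assert torch.allclose(out_broadcast, out_einsum, atol=1e-5)
```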
You can probably use the KeOps library to fuse it further (though einsum would get in the way there).
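A sketch of the KeOps pattern, using the library's standard Gaussian-kernel example rather than the layer in question: `LazyTensor` builds the sum-product expression symbolically and compiles the whole reduction into one fused CUDA kernel, so the (N, M) intermediate is never materialized.

```python
import torch
from pykeops.torch import LazyTensor

x = torch.randn(10000, 3)
y = torch.randn(10000, 3)

x_i = LazyTensor(x[:, None, :])       # symbolic (N, 1, 3)
y_j = LazyTensor(y[None, :, :])       # symbolic (1, M, 3)
D_ij = ((x_i - y_j) ** 2).sum(-1)     # symbolic (N, M) squared distances
K_ij = (-D_ij).exp()                  # still symbolic, nothing computed yet
out = K_ij.sum(dim=1)                 # reduction runs as one fused kernel -> (N, 1)
```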
But the best is probably a custom kernel. Once you have written it as sums and products, it's just iterating. The core is maybe 5 lines, but you have to add roughly 500 lines of low-level wrapping code: CUDA parallelisation, C++-to-Python bindings, support for various dtypes, manual derivatives. Then you have to add checks so there are no buffer overflows. And then you can optimize for special hardware operations like tensor cores, making sure along the way that no numerical errors were introduced.
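To make the "core is 5 lines" concrete, here is that core as plain Python loops over the same hypothetical reduction as above; a CUDA kernel is essentially this with (b, i, j) mapped onto the thread grid, and the other ~500 lines are the launch config, bindings, dtype dispatch, bounds checks, and hand-written backward pass.

```python
# Naive loop form of out[b,i,j] = sum_d q[b,i,d] * k[b,j,d].
# Illustrative only; in a real kernel each (b, i, j) triple is one thread.
def core(q, k, out):
    for b in range(q.shape[0]):
        for i in range(q.shape[1]):
            for j in range(k.shape[1]):
                acc = 0.0
                for d in range(q.shape[2]):
                    acc += q[b, i, d] * k[b, j, d]
                out[b, i, j] = acc
```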
So there's a lot more effort involved, and it's usually only worth it if the layer is promising; hopefully AI will be able to autocomplete these soon.
Sums and products can get you surprisingly far.