If you want to really dive into the math and motivation for this entirely new class of models (state space models), I highly recommend reading Albert Gu’s thesis.
I tried to read the Mamba paper and was lost, but after reading his thesis it was a lot easier to understand.
The part I’m still struggling with is the section on how Mamba trains so efficiently. It’s surprising, since the model is autoregressive and updates its state step by step, and I thought that was exactly why LSTMs were dropped.
Due to the particular form of the recurrent update of the hidden state, there's actually a parallel algorithm for computing the recurrence over a length-N sequence in O(log N) time via dynamic programming. Note, you don't save FLOPs, you just save "sequential depth" in the computation through clever parallelization. It's sort of an extension of the fast parallel scan described here: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co....
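To make that concrete, here's a minimal toy sketch (my own illustrative code, not the actual Mamba kernel) of how a diagonal recurrence h_t = a_t * h_{t-1} + b_t can be treated as a scan over pairs (a_t, b_t) with an associative combine, so a divide-and-conquer scan reaches every h_t in O(log N) sequential depth while doing roughly the same total work. The names combine, sequential_scan, and parallel_scan are just placeholders I made up:

    import numpy as np

    # Each step h_t = a_t * h_{t-1} + b_t is an affine map h -> a_t * h + b_t.
    # Composing two such maps is associative, which is what makes a
    # parallel prefix scan possible.
    def combine(left, right):
        a1, b1 = left
        a2, b2 = right
        return a1 * a2, a2 * b1 + b2

    def sequential_scan(a, b):
        # Reference implementation: N strictly sequential steps.
        h, out = 0.0, []
        for a_t, b_t in zip(a, b):
            h = a_t * h + b_t
            out.append(h)
        return np.array(out)

    def parallel_scan(pairs):
        # Recursive inclusive scan: O(log N) levels, each level fully
        # parallelizable. Total work is still O(N) combines, so no FLOPs saved.
        n = len(pairs)
        if n == 1:
            return list(pairs)
        reduced = [combine(pairs[2 * j], pairs[2 * j + 1]) for j in range(n // 2)]
        scanned = parallel_scan(reduced)   # scanned[j] = prefix of pairs[0 .. 2j+1]
        out = [None] * n
        out[0] = pairs[0]
        for j in range(n // 2):
            out[2 * j + 1] = scanned[j]
            if 2 * j + 2 < n:
                out[2 * j + 2] = combine(scanned[j], pairs[2 * j + 2])
        return out

    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.0, size=7)      # per-step decay coefficients
    b = rng.normal(size=7)                 # per-step inputs (e.g. B_t * x_t)
    h_seq = sequential_scan(a, b)
    h_par = np.array([h for _, h in parallel_scan(list(zip(a, b)))])
    print(np.allclose(h_seq, h_par))       # True

The real implementation fuses this into a single GPU pass over the sequence, but the structure is the same: the recurrence is recovered from prefix compositions of per-step affine maps, and those compositions can be tree-reduced in parallel.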
https://searchworks.stanford.edu/view/14784021