Nvidia doesn't have the software stack to do a TPU. They could make a systolic a...

zzzoom · 2025-11-28T00:22:22 1764289342

> They could make a systolic array TPU and software, perhaps. But it would mean abandoning 18 years of CUDA.

Tensor cores are specialized and have CUDA support.

jauntywundrkind · 2025-11-28T02:04:47 1764295487

Tensor cores can help a lot for matrix maths, sure, definitely. They made a big splash in 2017 & have been essential. https://developer.nvidia.com/blog/programming-tensor-cores-c...

But it's still something grafted onto the existing architecture of many grids with many blocks with many warps, with lots and lots of coordination and passing of intermediate results around. It's only a 4x4x4 unit, afaik. There's still a lot of main memory being used to combine data, and a lot of orchestration among the different warps and blocks and grids, to get big matrices crunched.
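A rough way to see the orchestration cost, as a minimal Python sketch. It assumes the 4x4x4 figure above, and `tile_mma` is a hypothetical stand-in for one tensor-core operation, not a real CUDA call. It ignores warps, shared memory, and scheduling entirely and only counts how many small tile operations a large matrix product decomposes into.

```python
TILE = 4  # tile edge, assumed from the "4x4x4" figure quoted above


def tile_mma(a, b, c):
    """Hypothetical single tile op: returns c + a @ b for 4x4 nested lists."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)] for i in range(TILE)]


def block(m, r, c):
    """Copy the TILE x TILE block of m whose top-left corner is (r, c)."""
    return [row[c:c + TILE] for row in m[r:r + TILE]]


def tiled_matmul(A, B):
    """Multiply square matrices (size a multiple of TILE) using only tile ops."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    tile_ops = 0
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]
            for k in range(0, n, TILE):
                # Every step re-fetches two blocks and updates an accumulator.
                acc = tile_mma(block(A, i, k), block(B, k, j), acc)
                tile_ops += 1
            for r in range(TILE):
                C[i + r][j:j + TILE] = acc[r]
    return C, tile_ops
```

The count is (n / 4)^3, so a single 128 x 128 product is 32 * 32 * 32 = 32768 tile operations under these assumptions, each needing its operands staged and its partial result combined somewhere.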

The systolic array is designed to allow much more fire-and-forget operation. Its inputs are 128 x 128 and each cell is basically its own compute node, shuffling data through and across to its neighbors (without transiting a far-off memory).
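To make the cell-to-cell flow concrete, here is a toy cycle-by-cycle simulation in Python. It is a sketch under assumptions, not the TPU's actual design: it models a weight-stationary array where cell (k, j) holds `W[k][j]`, activations move one cell to the right per cycle, and partial sums move one cell down per cycle. The function name and the skewed input schedule are illustrative choices.

```python
def systolic_matmul(X, W):
    """Compute Y = X @ W on a simulated K x N grid of multiply-accumulate cells."""
    M, K, N = len(X), len(W), len(W[0])
    a = [[0] * N for _ in range(K)]  # activation register in each cell
    p = [[0] * N for _ in range(K)]  # partial-sum register in each cell
    Y = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2           # cycles until the last result drains out
    for t in range(cycles):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    # Left edge: row k is fed X[m][k], delayed (skewed) by k cycles.
                    m = t - k
                    new_a[k][j] = X[m][k] if 0 <= m < M else 0
                else:
                    # Interior: take the activation from the left neighbor.
                    new_a[k][j] = a[k][j - 1]
                # Take the partial sum from the neighbor above, add this cell's product.
                above = p[k - 1][j] if k > 0 else 0
                new_p[k][j] = above + new_a[k][j] * W[k][j]
        a, p = new_a, new_p
        # Finished dot products fall out of the bottom row, one column at a time.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                Y[m][j] = p[K - 1][j]
    return Y, cycles
```

Each cell only ever reads its left and upper neighbors' registers. In this model the only memory traffic is feeding `X` in at the left edge and collecting `Y` at the bottom, which is the data locality being described.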

TPU architecture has plenty of limitations. It's not great at everything. But if you can design work to flow from cell to neighboring cell, you can crunch very sizable chunks of data with amazing data locality. The efficiency there is unparalleled.

Nvidia would need a radical change to their architecture to get anything like the massive data locality wins a systolic array can deliver. It would come with massively more constraints too.

Would love if anyone else has recommended reading. I have this piece earmarked. https://henryhmko.github.io/posts/tpu/tpu.html https://news.ycombinator.com/item?id=44342977