Hacker News

I tried that with an RTX 4090 as the primary card and a 3090 as an eGPU over Thunderbolt. It works, but inference is very slow, presumably because it has to pump all that data back and forth between the two (Thunderbolt isn't fast enough to keep up with even the 3090 by itself in games). In fact, even running a 30B model across two GPUs in 8-bit mode like that was slower than running it on one GPU in 4-bit.
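A quick back-of-the-envelope sketch (weights only, ignoring activations, KV cache, and framework overhead) shows why the 4-bit single-GPU setup can win here: at 8 bits, a 30B model's weights alone exceed one 24 GB card, forcing the slow split over Thunderbolt, while at 4 bits they fit on a single card:

```python
# Rough weight-memory estimate; ignores activations, KV cache,
# and framework overhead.
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

gb_8bit = weight_gb(30, 8)  # 30.0 GB -> exceeds one 24 GB card, must split
gb_4bit = weight_gb(30, 4)  # 15.0 GB -> fits entirely on one 24 GB card
print(gb_8bit, gb_4bit)
```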

My takeaway is that if you actually want to use multiple GPUs, you need hardware designed to accommodate that, and most consumer-grade stuff, even high-end, is not built with two physically large GPUs in mind.



I've got two RTX 4090s on an EATX motherboard. I've been using them to run the full 13B model unquantized with a good deal of success, getting about 20 tokens/s.


What is your setup for cooling? I don't think I'd want to stick another 4090-size card in mine with just air cooling...


I have AIO liquid cooling for both cards. The radiators are annoying though, I might convert it to a custom loop if I ever add a third card.


Looking at the top-end H100 80GB systems with NVLink from HPC vendors, it occurred to me that we are about to swing back to massive, almost mainframe-like form-factor systems built around a giant bus, like the old expandable Q-bus of the '80s, but this time for GPUs.

What I mean is that they currently offer systems with 8x cards, but given the compute requirements of these huge LLMs, systems with 32+ cards all on a dedicated memory bus (NVLink) are probably what will be needed as weight sizes expand. This is all just for inference, not even training; the same probably holds for training too, which wants the best possible interconnect between these same monster systems.
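As a rough illustration of why 8 cards stops being enough as weights grow (a sketch counting weights only, no KV cache or activations; the 1T parameter count below is a hypothetical, not any specific model):

```python
import math

def gpus_needed(params_billions: float, bits_per_weight: int,
                vram_gb: float = 80) -> int:
    """Minimum GPU count to hold just the model weights."""
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return math.ceil(weights_gb / vram_gb)

# A hypothetical 1T-parameter model in fp16 on 80 GB H100s:
print(gpus_needed(1000, 16))  # 25 cards for the weights alone
```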

I'm dreaming that there might eventually be a distributed, eventually-consistent partial-training algorithm that would democratize the creation of these models.

As for smaller-scale individual systems for inference: if one has the resources, is fairly technical, and can actually utilize such technology, then perhaps in 5-10 years the wealthy might buy $50K+ units that get installed in their homes.

Really incredible developments, and very quickly. Apologies for the potentially inappropriately long rant in reply to the previous comment.


The other possibility is that we'll get cards designed very specifically for LLMs, basically ditching everything that isn't strictly necessary for the sake of squeezing in more compute/VRAM, and perhaps optimizing around int4/int8 (the latter is apparently "good enough" for training?).
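For a sense of what int8 buys, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization (an illustration only, not any particular card's or library's scheme): storage drops 4x versus fp32, at the cost of a rounding error bounded by half the quantization step.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ~= q * scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
# int8 storage is 4x smaller than fp32; max error <= scale / 2
```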



