
M1 with 32 GB RAM. I can just about fit the 4-bit quantized 33 GB Code Llama model (and its fine-tunes, e.g. WizardCoder) into memory. It's somewhat slow, but good enough for my purposes.

Edit: when I bought my MacBook in 2021, I was like "OK, I'll just take the base model and add another 16 GB of RAM. That should future-proof it for at least another half-decade." Famous last words.



This is why my rule for laptops with non-upgradable memory has been to max out the RAM at purchase -- and that has been my rule since 2012/2013 or whenever that trend really started.

(written from a 64GB M1 Pro Max)


I wish they offered that much in the Air.


34B Q4 will use around 20GB of memory.
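A quick back-of-envelope check of that figure, assuming roughly 0.5 bytes per parameter for a 4-bit quant (the exact size varies by quant format):

```python
# Rough estimate: weight memory for a 34B-parameter model at 4-bit quantization.
# Assumes ~0.5 bytes/param; real Q4 formats carry some extra scale/metadata.
params = 34e9
weights_gib = params * 0.5 / 2**30
print(round(weights_gib, 1))  # -> 15.8 (weights only)
```

The weights alone land around 16 GiB; the KV cache, activations, and runtime overhead account for the rest of the ~20 GB.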

If it's running slow, make sure Metal is actually being used[0]. You can get as much as a 50-100% boost in tokens/s if by chance it's not enabled.

I'm averaging 7 to 8 tokens/s on an M1 Max (10 CPU cores, 24 GPU cores).
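To put 7-8 tokens/s in perspective, here's the wall-clock time for a typical 512-token completion at those rates:

```python
# How long a 512-token completion takes at 7 and 8 tokens/s.
tokens = 512
for rate in (7.0, 8.0):
    print(f"{rate} tok/s -> {round(tokens / rate, 1)} s")  # 73.1 s and 64.0 s
```

Usable for interactive coding help, if not exactly snappy.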

[0] if using llama-cpp-python (or text-generation-webui, ollama, etc) try:

`pip uninstall llama-cpp-python && CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python`


Thank you. I had to reduce the context length to get this to work without crashing (from 16k to 8k)—and I'm seeing the ~100% speed up you mentioned.
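Halving the context makes sense as a memory fix: the fp16 KV cache grows linearly with context length. A rough sketch of the math — the layer/head numbers below are assumptions for a 34B-class model with grouped-query attention, not verified specs:

```python
# Hedged estimate of fp16 KV-cache size vs. context length.
# Assumed shapes for a 34B-class GQA model (illustrative, not official specs):
n_layers, n_kv_heads, head_dim = 48, 8, 128
bytes_per_elem = 2  # fp16

# Per token we store both K and V for every layer and KV head.
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (16384, 8192):
    print(f"{n_ctx}: {per_token * n_ctx / 2**30} GiB")  # 3.0 GiB vs 1.5 GiB
```

So going from 16k to 8k context frees on the order of a GiB and a half under these assumptions — enough to matter when you're already at the edge of 32 GB.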

However, when I run the LLM, macOS becomes sluggish. I assume this is because the GPU is utilized to the point where hardware-based rendering slows down due to insufficient resources.

I wonder if there's a way to avoid that slowdown?


I haven't noticed any slowdowns. Maybe check that threads/n_threads is set correctly for your machine (total cores minus 2; e.g. 10 cores -> 8, 8 cores -> 6).
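That rule of thumb is easy to compute at startup rather than hardcoding (`suggested_threads` is my own helper name, not part of any library):

```python
import os

def suggested_threads(total_cores: int) -> int:
    # Leave two cores free for the OS and UI so the desktop stays responsive.
    return max(1, total_cores - 2)

print(suggested_threads(10))  # 8
print(suggested_threads(8))   # 6

# Derive it from the running machine:
n_threads = suggested_threads(os.cpu_count() or 1)
```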

n_gpu_layers should also be set to anything other than 0 (default). I don't think the exact number matters for Metal, but I use 128.



