M1 with 32 GB RAM. I can just about fit the 4-bit quantized 33B Code Llama model (and its finetunes, e.g. WizardCoder) into memory. It's somewhat slow, but good enough for my purposes.
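For anyone who wants to try the same setup, here's a minimal sketch using llama-cpp-python against a 4-bit GGUF quant. The thread doesn't say which runner is in use, so the file name and prompt format below are placeholders:

```python
# Minimal sketch: load a 4-bit quantized Code Llama GGUF with llama-cpp-python.
# The model path is a hypothetical filename; any Q4_K_M-style quant works the same way.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-34b-instruct.Q4_K_M.gguf",  # placeholder file name
    n_ctx=8192,        # context window; larger values need more RAM for the KV cache
    n_gpu_layers=-1,   # offload all layers to the GPU via Metal
)

out = llm("Write a binary search function in Python.\n", max_tokens=256)
print(out["choices"][0]["text"])
```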
Edit: when I bought my MacBook in 2021, I was like "Ok, I'll just take the base model and add another 16 GB of RAM. That should future-proof it for at least another half-decade." Famous last words.
This is why my rule for laptops with non-upgradable memory has been to max out the RAM at purchase -- and that has been my rule since 2012/2013 or whenever that trend really started.
Thank you. I had to reduce the context length from 16k to 8k to get this to work without crashing, and I'm seeing the ~100% speedup you mentioned.
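If it helps anyone reason about why the context cut matters: a rough back-of-the-envelope for the KV-cache size. The architecture numbers are assumptions (Code Llama 34B is reported as 48 layers, 8 KV heads, head dim 128, fp16 cache); the scaling with context length is the point, not the exact figures:

```python
# Rough KV-cache estimate; the architecture numbers are assumptions
# (Code Llama 34B reportedly has 48 layers, 8 KV heads, head dim 128).
def kv_cache_bytes(n_ctx, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # 2x for keys + values, one entry per layer, per KV head, per token position
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_val

for ctx in (16_384, 8_192):
    print(f"{ctx:>6} tokens: ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# 16k -> ~3.0 GiB, 8k -> ~1.5 GiB: halving the context halves the cache,
# which can be the difference between fitting in 32 GB and swapping.
```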
However, when I run the LLM, macOS becomes sluggish. I assume this is because the GPU is utilized to the point where hardware-accelerated rendering slows down due to insufficient resources.
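One thing that might be worth trying (a sketch, not something from the thread): keep a few layers on the CPU so the GPU has headroom left for the window server. With llama-cpp-python that's just a smaller n_gpu_layers; the exact split is a guess you'd tune against generation speed:

```python
# Sketch: partial Metal offload so inference doesn't saturate the GPU.
# The layer count is a guess to tune; fewer offloaded layers means slower
# generation but more GPU headroom for the desktop compositor.
from llama_cpp import Llama

llm = Llama(
    model_path="codellama-34b-instruct.Q4_K_M.gguf",  # same placeholder file as above
    n_ctx=8192,
    n_gpu_layers=40,   # offload most, but not all, of the model's layers
)
```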