Can anyone share how the computer's performance is impacted by running the model...

npsomaratna · on Aug 28, 2023

M1 with 32 GB RAM. I can just about fit the 4-bit quantized 34B Code Llama model (and its finetunes, e.g. WizardCoder) into memory. It's somewhat slow, but good enough for my purposes.

Edit: when I bought my MacBook in 2021, I was like "Ok, I'll just take the base model and add another 16 GB of RAM. That should future-proof it for at least another half-decade." Famous last words.

filmgirlcw · on Aug 28, 2023

This is why my rule for laptops with non-upgradable memory has been to max out the RAM at purchase -- and that has been my rule since 2012/2013 or whenever that trend really started.

(written from a 64GB M1 Pro Max)

joombaga · on Aug 28, 2023

I wish they offered that much in the Air.

SparkyMcUnicorn · on Aug 28, 2023

34B Q4 will use around 20GB of memory.
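For anyone wondering where a figure like that comes from, here is a rough back-of-the-envelope sketch. It only counts the weights, and the 4.5 bits-per-weight value is an assumption meant to stand in for quantization scales and other per-block overhead; actual usage also depends on context length and the runtime.

```python
def quantized_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return n_params_billion * bits_per_weight / 8


# 34B parameters at exactly 4 bits per weight
print(quantized_weight_gb(34, 4.0))

# Assumed ~4.5 effective bits per weight (hypothetical allowance for overhead)
print(quantized_weight_gb(34, 4.5))
```

The first call gives 17 GB and the second a little over 19 GB, which is in the same ballpark as the ~20GB quoted above once the context cache is added on top.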

If it's running slow, make sure Metal is actually being used[0]. You can get as much as a 50-100% boost in tokens/s if it turns out it wasn't enabled.

I'm averaging 7 to 8 tokens/s on an M1 Max 10 core (24 GPU cores).

[0] if using llama-cpp-python (or text-generation-webui, ollama, etc) try:

`pip uninstall llama-cpp-python && CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python`

npsomaratna · on Aug 29, 2023

Thank you. I had to reduce the context length (from 16k to 8k) to get this to work without crashing, and I'm seeing the ~100% speed up you mentioned.
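A rough sketch of why halving the context can help with memory: the key/value cache grows linearly with context length. The architecture numbers below (48 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions for a 34B Code Llama style model, and this ignores compute buffers, so treat the result as an estimate rather than an explanation of the crash.

```python
def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / 2**30


# Assumed architecture values, not read from any model file.
ASSUMED_ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for n_ctx in (16384, 8192):
    print(n_ctx, kv_cache_gib(n_ctx, **ASSUMED_ARCH))
```

Under those assumptions the cache is about 3 GiB at 16k and 1.5 GiB at 8k, which matters when the weights already fill most of a 32 GB machine.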

However, when I run the LLM, macOS becomes sluggish. I assume this is because the GPU is utilized to the point where hardware-accelerated rendering slows down due to insufficient resources.

I wonder if there's a way to avoid that slowdown?

SparkyMcUnicorn · on Aug 29, 2023

I haven't noticed any slowdowns. Maybe check that threads/n_threads is set correctly for your machine (total cores minus 2, so 10 cores = 8 and 8 cores = 6).

n_gpu_layers should also be set to anything other than 0 (default). I don't think the exact number matters for Metal, but I use 128.
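In case it helps, here is a minimal sketch of those settings with llama-cpp-python. The helper names and the model path are hypothetical; `model_path`, `n_ctx`, `n_threads` and `n_gpu_layers` are the `Llama` constructor parameters. The 8192 context is just the value mentioned earlier in the thread.

```python
import os
from typing import Optional


def suggested_threads(total_cores: int) -> int:
    """Heuristic from this thread: total cores minus 2, but never below 1."""
    return max(1, total_cores - 2)


def build_llama_kwargs(model_path: str, total_cores: Optional[int] = None) -> dict:
    """Collect constructor arguments for llama_cpp.Llama."""
    cores = total_cores or os.cpu_count() or 1
    return {
        "model_path": model_path,
        "n_ctx": 8192,                       # context length used in this thread
        "n_threads": suggested_threads(cores),
        "n_gpu_layers": 128,                 # any non-zero value enables offload
    }


def load_model(model_path: str):
    """Load the model; needs llama-cpp-python installed (built with Metal on a Mac)."""
    from llama_cpp import Llama  # imported lazily so the helpers work without it
    return Llama(**build_llama_kwargs(model_path))
```

Usage would be something like `llm = load_model("models/your-model.gguf")` followed by `llm("def fib(n):", max_tokens=32)`, with the path swapped for wherever your GGUF file actually lives.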

nico · on Aug 28, 2023

Are there any LLMs that run on regular (AMD/Intel) CPUs? Or does everything require at least an M1 or a decent GPU?

loudmax · on Aug 28, 2023

You can absolutely run LLMs without a GPU, but you need to set expectations for performance. Some projects to look into are

  * llama.cpp - https://github.com/ggerganov/llama.cpp
  * KoboldCpp - https://github.com/LostRuins/koboldcpp
  * GPT4All - https://gpt4all.io/index.html

llama.cpp will run LLMs that have been ported to the GGUF format. If you have enough RAM, you can even run the big 70 billion parameter models. If you have a CUDA GPU, you can offload part of the model onto the GPU and have the CPU do the rest, which gets you a partial performance benefit.
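As a rough illustration of partial offload, this sketch estimates how many layers might fit on a GPU, which is the number you would pass to llama.cpp's `--n-gpu-layers` option (or `n_gpu_layers` in llama-cpp-python). The function name is hypothetical, and it assumes all layers are the same size and that some VRAM is reserved for the context cache and scratch buffers; the example figures (40 GB file, 80 layers, 1.5 GB reserve) are assumptions, not measurements.

```python
def layers_to_offload(model_gb: float, n_layers: int,
                      vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many equally sized layers fit in the available VRAM."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)  # keep headroom for cache/buffers
    return min(n_layers, int(usable_gb / per_layer_gb))


# Hypothetical 70B-class model: ~40 GB quantized file, 80 layers.
for vram in (12, 24, 48):
    print(vram, layers_to_offload(model_gb=40, n_layers=80, vram_gb=vram))
```

In practice you would start near the estimate and lower it if the runtime reports out-of-memory errors.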

The issue is that the big models run too slowly on a CPU to feel interactive. Without a GPU, you'll get much more reasonable performance running a smaller 7 billion parameter model instead. The responses won't be as good as the larger models, but they may still be good enough to be worthwhile.

Also, development in this space is still moving extremely rapidly, especially for specialized models like ones tuned for coding.

sp332 · on Aug 28, 2023

They do run, just slowly. Still better than nothing if you want to run something larger than would fit in your VRAM though. The llama.cpp project is the most popular runtime, but I think all the major ones have a flag like "--cpu".