Extremely cool and Justine Tunney / jart does incredible portability work [0], but I'm kind of struggling with the use-cases for this one.
I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download from the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quickly when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter; I'd love to hear where people think this could be useful.
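For anyone curious, the decoupled flow is roughly this. A minimal sketch, not FreeChat's actual code; the URL, directory, and file names below are placeholders: ship the app without weights, and only fetch the default GGUF if nothing is on disk yet.

    import Foundation

    // Hypothetical default-model URL and install location; the real values would differ.
    let defaultModelURL = URL(string: "https://example.com/models/default-7b-q4.gguf")!
    let supportDir = FileManager.default
        .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("FreeChat", isDirectory: true)
    let modelPath = supportDir.appendingPathComponent("default-7b-q4.gguf")

    /// Returns a usable model path, downloading the default weights only when they are missing.
    /// App updates stay small because the multi-GB model lives outside the app bundle.
    func ensureDefaultModel() async throws -> URL {
        if FileManager.default.fileExists(atPath: modelPath.path) {
            return modelPath // weights survive app updates, so nothing to re-download
        }
        try FileManager.default.createDirectory(at: supportDir, withIntermediateDirectories: true)
        // URLSession downloads to a temporary file; move it into place once complete.
        let (tmpFile, _) = try await URLSession.shared.download(from: defaultModelURL)
        try FileManager.default.moveItem(at: tmpFile, to: modelPath)
        return modelPath
    }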
I don't get this obsession with 0-click everything. It is really annoying when you don't want to install everything to your main hard drive. I have all my models downloaded, organized, and ready to go, but apps won't even ask for that; instead they presume I'm an idiot and download everything (again!) for me.
At least Makeayo asks where my models are now. It's obnoxious that I have to use symlinks for comfy/automatic...
All they need to do is ask me where my stuff is on first run and give me a setting in the config to change it later. Not so hard!
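The mechanics are tiny, too. Something like this (a sketch in Swift with a made-up settings key, not any particular app's code) covers both the first-run prompt result and the later config change:

    import Foundation

    // Hypothetical settings key for a user-chosen models directory.
    let modelsDirKey = "modelsDirectory"

    /// Returns the saved models directory, or nil on first run
    /// (in which case the app should ask, e.g. via NSOpenPanel, then call remember()).
    func savedModelsDirectory() -> URL? {
        guard let path = UserDefaults.standard.string(forKey: modelsDirKey) else { return nil }
        return URL(fileURLWithPath: path, isDirectory: true)
    }

    /// Persists the user's choice so it can also be changed later from a settings screen.
    func remember(modelsDirectory dir: URL) {
        UserDefaults.standard.set(dir.path, forKey: modelsDirKey)
    }

    /// Lists the .gguf files that are already there instead of downloading anything again.
    func availableModels(in dir: URL) -> [URL] {
        let entries = (try? FileManager.default.contentsOfDirectory(
            at: dir, includingPropertiesForKeys: nil)) ?? []
        return entries.filter { $0.pathExtension.lowercased() == "gguf" }
    }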
If I'm understanding (and agreeing with) your gripe correctly, isn't it two solutions to the same perceived problem?
My experience is that the world of Python dependency management is a mess which sometimes works, and sometimes forces you to spend hours to days searching for obscure error messages and trying maybe-fixes posted in GitHub issues for some other package, just in case one helps. This sometimes extends further - e.g. hours to days spent trying to install just the right version of CUDA on Linux...
Anyway, the (somewhat annoying but understandable) solution that some developers take is to make their utility/app/whatever as self-contained as possible with a fresh install of everything from Python downwards inside a venv - which results in (for example) multiple copies of PyTorch spread around your HDD. This is great for less technical users who just need a minimal-difficulty install (as IME it works maybe 80-90% of the time), good for people who don't want to spend their time debugging incompatibilities between different library versions, but frustrating for the more technically-inclined user.
This is just another approach to the same problem, which presumably also presents an even-lower level of work for the maintainers, since it avoids Python installs and packages altogether?
I get that, my issue is when the model is coupled with the app, or the app just presumes I don't have it downloaded and doesn't ask me otherwise. This is like basic configuration stuff...
What I suspect is happening is that people are cargo-culting zero-click installations. It seems rather fashionable right now.
I don’t think making it easy to install is cargo-culting. In my case it’s an accessibility thing. I wanted a private alternative that I could give to nontechnical people in my life who had started using ChatGPT. Some don’t understand local vs cloud and definitely don’t know about ggufs or LLMs but they all install apps from the App Store.
In the README of the project (the TFA of this whole thread) there is the option to download the app without the model:
"You can also also download just the llamafile software (without any weights included) from our releases page, or directly in your terminal or command prompt"
There is no cargo-culting going on. Some of us do legitimately appreciate it.
Which has been followed. This comment was not a response to this specific app, but rather to a general trend I've noticed, one that was mentioned at the start of this thread.
Is having everything normalized in your system really worth it? I would say having (some) duplicates in your system is mostly fine, better than having some spooky-action-at-a-distance break things when you don't expect it.
I expect the future is something like Windows's WinSxS, NixOS's /nix/store, or pnpm's .pnpm-store, where the deduping isn't "online" but is still somewhat automated and hidden from you.
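As a rough illustration of that kind of "offline" dedup (assumptions: a content-addressed store directory under the home folder, SHA-256 as the key, and models small enough to hash in memory; a real tool would hash in a streaming fashion):

    import Foundation
    import CryptoKit

    // Hypothetical content-addressed store, in the spirit of /nix/store or pnpm's store.
    let storeDir = FileManager.default.homeDirectoryForCurrentUser
        .appendingPathComponent(".model-store", isDirectory: true)

    /// Moves a model into the store (keyed by its SHA-256) and hard-links it back to its
    /// original path, so several apps can "have" the file while the bytes exist once on disk.
    func dedup(_ model: URL) throws -> URL {
        // Note: Data(contentsOf:) loads the whole file; fine for a sketch, not for multi-GB weights.
        let digest = SHA256.hash(data: try Data(contentsOf: model))
        let hex = digest.map { String(format: "%02x", $0) }.joined()
        let canonical = storeDir.appendingPathComponent(hex + ".gguf")

        try FileManager.default.createDirectory(at: storeDir, withIntermediateDirectories: true)
        if FileManager.default.fileExists(atPath: canonical.path) {
            try FileManager.default.removeItem(at: model)      // duplicate: drop the extra copy
        } else {
            try FileManager.default.moveItem(at: model, to: canonical)
        }
        try FileManager.default.linkItem(at: canonical, to: model) // hard link back into place
        return canonical
    }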
And if that's the future, then the future sucks. We can teach people to be smarter, but no, instead our software has to bend over backwards to blow smoke up our ass because grandma.
But also: It might not be for a developer like you, but it is for a developer like me.
I enjoy writing software, but I don't particularly enjoy futzing with building things outside my day-to-day work, and on systems I don't write myself. If it was up to me everything would be one click.
Things like this are like accessibility: it benefits me even though I don't particularly need it.
fwiw FreeChat does this now. It prompts you to download or select a model to use (and you can add as many as you want). No copying or forced downloads.
Cool, this is more convenient than my workflow for building the binaries myself. I currently use make to build the llama.cpp server on my Intel iMac and my M1 MacBook, then lipo the two binaries together.
>I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :).
Well, that's on the MAS/TestFlight for not doing delta updates.
Yes, though it does seem to be working for them. They have a special feature for lazy-loading large assets, but I opted for an option that was simpler for me (giving users a button to download a model if they don't already have one locally that they want to use).
It's just a zip file, so updating it in place should be doable while it's running on any non-Windows platform; you just need to swap out the one file you changed. When it's running in server mode you could possibly even hot-reload the executable without the user having any downtime.
You could also change your code so that, as early as possible at startup, it checks for a file with a well-known name (say ~/.freechat.run) and switches to reading the assets that can change from there instead.
You could support multiple updates by using, say, an ISO timestamp suffix and sorting, so that ~/.freechat.run.20231127120000 would be overridden by ~/.freechat.run.20231129160000 without making the user delete anything.
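A sketch of that lookup, assuming the naming convention you describe (nothing like this exists in FreeChat today): ISO timestamps sort correctly as plain strings, so picking the newest override is just a sort.

    import Foundation

    /// Finds the newest ~/.freechat.run.<ISO timestamp> override, if any.
    /// Because ISO timestamps sort lexicographically, string order equals chronological order.
    func latestOverride() -> URL? {
        let home = FileManager.default.homeDirectoryForCurrentUser
        let entries = (try? FileManager.default.contentsOfDirectory(
            at: home, includingPropertiesForKeys: nil)) ?? []
        return entries
            .filter { $0.lastPathComponent.hasPrefix(".freechat.run") }
            .sorted { $0.lastPathComponent < $1.lastPathComponent }
            .last
    }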
> I'm kind of struggling with the use-cases for this one.
IMO cosmopolitan libc is a "really neat trick". And it deserves praise and it probably does have some real use cases. But it's not practical for most purposes. If we had a format like ELF that was so fat as to support as many architectures and OSs as desired, would we be using that? I have a feeling that we would not.
Then again -- after having used "zig cc" for a while, maybe it would be reasonable to have something like "one build" that produces a mega-fat binary.
And the microarch-specific dispatch is a nice touch.
...maybe I'm convincing myself of the alternative....
Perhaps another unpopular opinion that will get this comment downvoted outright, but still... While jart's work is very interesting in nature and execution, commendable stuff indeed from a person with very high IQ and discipline, I still wonder whether Justine simply can't get over the fact that they got kicked out of the llama.cpp project (yes, I understand jart frequents HN, and let's also agree llama.cpp is at least as cool as jart's projects). No, I'm not going into the details of said dismissal, as both sides seem to have had their proper arguments, but still.
And of course, I can imagine where the whole cosmopolitan thing comes from... even as a manifesto of sorts for the idea of system-neutrality and potentially gender fluidity. But I really wonder whether GGUF actually needs this, since llama.cpp already compiles and runs pretty much everywhere.
Why introduce one more container? Who benefits from binary distribution of this sort?
I read the GitHub repository README and the comments here, and I found absolutely nothing that suggests the need for the first two paragraphs you wrote. It seems they stem from a misconception on your side about the purpose of this project.
About your question in the third paragraph: this is totally orthogonal to GGUF, and a cursory reading of the README shows that it does use GGUF. This is not about a new universal LLM format; this is about packing it into a universal executable that runs everywhere, using Cosmopolitan.
Some examples do pack the executable and GGUF weights together in a single file, but that's not dissimilar from a self-executing zip; the only difference is that this executable is not OS-specific, so you can use the exact same binary on macOS or Linux, for example.
> llama.cpp already compiles and runs pretty much everywhere.
Well, it simplifies things when you don't need to compile anything.
Also, you literally can't download or compile the wrong binary by mistake; it's the same binary for the entire Cartesian product of supported processors and OSes.
> Why introduce one more container?
It makes stuff more convenient.
`application/zip` is also a ubiquitous standard. I doubt anyone is being "introduced to it".
I also appreciate the fact that tooling for handling `application/zip` is very widespread, so you don't need totally bespoke tooling to retrieve the models from inside a `llamafile`.
> Who benefits from binary distribution of this sort?
Anyone that doesn't have a compiler SDK on their computer.
What are you on about? There was no stealing and there was no plagiarism.
They made a PR that was built on top of another PR. The authorship information was preserved in the git history, and there was no attempt at deception. They also supposedly collaborated with the author of the original PR (which was never denied by either of them). All of this is totally normal working practice.
Those allegations of "stealing" just stem from a GH user piling onto the drama from the breaking change by pointing out where the initials from the new file format come from (which wasn't called into question on the original PR).
They were also not banned over those stealing allegations. Both they and the author of the reversal PR were banned because the maintainer deemed the resulting "drama" from the breaking change a distraction from the project's goals. The maintainer accepted the PR, and the nature of the breaking changes was clearly stated, so that drama wasn't entirely on jart.
You obviously didn't read the post, which shows the code, the words of the original author, the link to the original PR, and the user jart taking credit. It also shows her not understanding what she took and ultimately being fundamentally wrong about mmap.
It's not so clear cut. The author of the original PR had serious gripes about jart's handling of the situation, especially how hard they pushed their PR, practically forcing the merge before legitimate concerns were addressed.
[0]: https://justine.lol/cosmopolitan/
[1]: https://www.freechat.run
[2]: https://github.com/ggerganov/llama.cpp