Extremely cool and Justine Tunney / jart does incredible portability work [0], but I'm kind of struggling with the use-cases for this one.
I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :). My first change after that was decoupling the default model download from the UI so that I can ship app updates that are about 5MB. It feels like someone using this tool is going to hit the same problem pretty quickly when they want to get the latest llama.cpp updates (ggerganov SHIIIIPS [2]). Maybe there are cases where that doesn't matter; I'd love to hear where people think this could be useful.
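For anyone curious, the decoupled flow is roughly this. A minimal sketch, not FreeChat's actual code; the URL, directory, and file names below are placeholders: ship the app without weights, and only fetch the default GGUF if nothing is on disk yet.

    import Foundation

    // Hypothetical default-model URL and install location; the real values would differ.
    let defaultModelURL = URL(string: "https://example.com/models/default-7b-q4.gguf")!
    let supportDir = FileManager.default
        .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("FreeChat", isDirectory: true)
    let modelPath = supportDir.appendingPathComponent("default-7b-q4.gguf")

    /// Returns a usable model path, downloading the default weights only when they are missing.
    /// App updates stay small because the multi-GB model lives outside the app bundle.
    func ensureDefaultModel() async throws -> URL {
        if FileManager.default.fileExists(atPath: modelPath.path) {
            return modelPath // weights survive app updates, so nothing to re-download
        }
        try FileManager.default.createDirectory(at: supportDir, withIntermediateDirectories: true)
        // URLSession downloads to a temporary file; move it into place once complete.
        let (tmpFile, _) = try await URLSession.shared.download(from: defaultModelURL)
        try FileManager.default.moveItem(at: tmpFile, to: modelPath)
        return modelPath
    }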
I don't get this obsession with 0-click everything. It is really annoying when you don't want to install everything to your main hard drive. I have all my models downloaded, organized, and ready to go, but apps won't even ask for that; instead they presume I'm an idiot and download everything (again!) for me.
At least Makeayo asks where my models are now. It's obnoxious that I have to use symlinks for comfy/automatic...
All they need to do is ask me where my stuff is on first run and give me a setting in the config to change it later. Not so hard!
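The mechanics are tiny, too. Something like this (a sketch in Swift with a made-up settings key, not any particular app's code) covers both the first-run prompt result and the later config change:

    import Foundation

    // Hypothetical settings key for a user-chosen models directory.
    let modelsDirKey = "modelsDirectory"

    /// Returns the saved models directory, or nil on first run
    /// (in which case the app should ask, e.g. via NSOpenPanel, then call remember()).
    func savedModelsDirectory() -> URL? {
        guard let path = UserDefaults.standard.string(forKey: modelsDirKey) else { return nil }
        return URL(fileURLWithPath: path, isDirectory: true)
    }

    /// Persists the user's choice so it can also be changed later from a settings screen.
    func remember(modelsDirectory dir: URL) {
        UserDefaults.standard.set(dir.path, forKey: modelsDirKey)
    }

    /// Lists the .gguf files that are already there instead of downloading anything again.
    func availableModels(in dir: URL) -> [URL] {
        let entries = (try? FileManager.default.contentsOfDirectory(
            at: dir, includingPropertiesForKeys: nil)) ?? []
        return entries.filter { $0.pathExtension.lowercased() == "gguf" }
    }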
If I'm understanding (and agreeing with) your gripe correctly, isn't it two solutions to the same perceived problem?
My experience is that the world of Python dependency management is a mess which sometimes works, and sometimes forces you to spend hours to days searching for obscure error messages and trying maybe-fixes posted in GitHub issues for some other package, just in case one helps. This sometimes extends further - e.g. hours to days spent trying to install just the right version of CUDA on Linux...
Anyway, the (somewhat annoying but understandable) solution that some developers take is to make their utility/app/whatever as self-contained as possible with a fresh install of everything from Python downwards inside a venv - which results in (for example) multiple copies of PyTorch spread around your HDD. This is great for less technical users who just need a minimal-difficulty install (as IME it works maybe 80-90% of the time), good for people who don't want to spend their time debugging incompatibilities between different library versions, but frustrating for the more technically-inclined user.
This is just another approach to the same problem, which presumably also presents an even-lower level of work for the maintainers, since it avoids Python installs and packages altogether?
I get that, my issue is when the model is coupled with the app, or the app just presumes I don't have it downloaded and doesn't ask me otherwise. This is like basic configuration stuff...
What I suspect is happening is that people are cargo-culting zero-click installations. It seems rather fashionable right now.
I don’t think making it easy to install is cargo-culting. In my case it’s an accessibility thing. I wanted a private alternative that I could give to nontechnical people in my life who had started using ChatGPT. Some don’t understand local vs cloud and definitely don’t know about ggufs or LLMs but they all install apps from the App Store.
In the README of the project (the TFA of this whole thread) there is the option to download the app without the model:
"You can also also download just the llamafile software (without any weights included) from our releases page, or directly in your terminal or command prompt"
There is no cargo-culting going on. Some of us do legitimately appreciate it.
Which has been followed. This comment was not a response to this specific app, but rather to a general trend I've noticed, one that was mentioned at the start of this thread.
Is having everything normalized in your system really worth it? I would say having (some) duplicates in your system is mostly fine, better than having some spooky-action-at-a-distance break things when you don't expect it.
I expect the future is something like Windows's WinSxS, NixOS's /nix/store, or pnpm's .pnpm-store, where the deduping isn't "online" but is still somewhat automated and hidden from you.
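As a rough illustration of that kind of "offline" dedup (assumptions: a content-addressed store directory under the home folder, SHA-256 as the key, and models small enough to hash in memory; a real tool would hash in a streaming fashion):

    import Foundation
    import CryptoKit

    // Hypothetical content-addressed store, in the spirit of /nix/store or pnpm's store.
    let storeDir = FileManager.default.homeDirectoryForCurrentUser
        .appendingPathComponent(".model-store", isDirectory: true)

    /// Moves a model into the store (keyed by its SHA-256) and hard-links it back to its
    /// original path, so several apps can "have" the file while the bytes exist once on disk.
    func dedup(_ model: URL) throws -> URL {
        // Note: Data(contentsOf:) loads the whole file; fine for a sketch, not for multi-GB weights.
        let digest = SHA256.hash(data: try Data(contentsOf: model))
        let hex = digest.map { String(format: "%02x", $0) }.joined()
        let canonical = storeDir.appendingPathComponent(hex + ".gguf")

        try FileManager.default.createDirectory(at: storeDir, withIntermediateDirectories: true)
        if FileManager.default.fileExists(atPath: canonical.path) {
            try FileManager.default.removeItem(at: model)      // duplicate: drop the extra copy
        } else {
            try FileManager.default.moveItem(at: model, to: canonical)
        }
        try FileManager.default.linkItem(at: canonical, to: model) // hard link back into place
        return canonical
    }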
And if that's the future, then the future sucks. We can teach people to be smarter, but no, instead our software has to bend over backwards to blow smoke up our ass because grandma.
But also: It might not be for a developer like you, but it is for a developer like me.
I enjoy writing software, but I don't particularly enjoy futzing with building things outside my day-to-day work, and on systems I don't write myself. If it was up to me everything would be one click.
Things like this are like accessibility: it benefits me even though I don't particularly need it.
fwiw FreeChat does this now. It prompts you to download or select a model to use (and you can add as many as you want). No copying or forced downloads.
Cool, this is more convenient than my workflow for building the binaries myself. I currently use make to build the llama.cpp server on my Intel iMac and my M1 MacBook, then lipo the two binaries together.
>I make a small macOS app [1] which runs llama.cpp with a SwiftUI front-end. For the first version of the app I was obsessed with the single download -> chat flow and making 0 network connections. I bundled a model with the app and you could just download, open, and start using it. Easy! But as soon as I wanted to release a UI update to my TestFlight beta testers, I was causing them to download another 3GB. All 3 users complained :).
Well, that's on the MAS/TestFlight for not doing delta updates.
Yes, though it does seem to be working for them. They have a special feature for lazy-loading large assets, but I opted for an option that was simpler for me (giving users a button to download a model if they don't already have one locally that they want to use).
It's just a zip file, so updating it in place should be doable while it's running on any non-Windows platform; you just need to swap out the one file you changed. When it's running in server mode you could possibly even hot-reload the executable without the user having any downtime.
You could also change your code so that, as early as possible at startup, it checks for a file with a well-known name (say ~/.freechat.run) and switches to reading the assets that can change from there instead.
You could support multiple updates by using, say, an ISO timestamp suffix and sorting, so that ~/.freechat.run.20231127120000 would be overridden by ~/.freechat.run.20231129160000 without making the user delete anything.
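A sketch of that lookup, assuming the naming convention you describe (nothing like this exists in FreeChat today): ISO timestamps sort correctly as plain strings, so picking the newest override is just a sort.

    import Foundation

    /// Finds the newest ~/.freechat.run.<ISO timestamp> override, if any.
    /// Because ISO timestamps sort lexicographically, string order equals chronological order.
    func latestOverride() -> URL? {
        let home = FileManager.default.homeDirectoryForCurrentUser
        let entries = (try? FileManager.default.contentsOfDirectory(
            at: home, includingPropertiesForKeys: nil)) ?? []
        return entries
            .filter { $0.lastPathComponent.hasPrefix(".freechat.run") }
            .sorted { $0.lastPathComponent < $1.lastPathComponent }
            .last
    }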
> I'm kind of struggling with the use-cases for this one.
IMO cosmopolitan libc is a "really neat trick". And it deserves praise and it probably does have some real use cases. But it's not practical for most purposes. If we had a format like ELF that was so fat as to support as many architectures and OSs as desired, would we be using that? I have a feeling that we would not.
Then again -- after having used "zig cc" for a while, maybe it would be reasonable to have something like "one build" that produces a mega-fat binary.
And the microarch-specific dispatch is a nice touch.
...maybe I'm convincing myself of the alternative....
Perhaps another unpopular opinion that will get this comment downvoted outright, but still... While jart's work is very interesting in nature and execution, commendable stuff indeed from a person with very high IQ and discipline, I still wonder whether Justine simply can't get over the fact that they got kicked out of the llama.cpp project (yes, I understand jart frequents HN, and let's also agree llama.cpp is at least as cool as jart's projects). No, I'm not going into the details of said dismissal, as both sides seem to have had their proper arguments, but still.
And of course, I can imagine where the whole cosmopolitan thing comes from... even as a manifesto of sorts for the idea of system-neutrality and potentially gender fluidity. But I really wonder whether GGUF actually needs this, since llama.cpp already compiles and runs pretty much everywhere.
Why introduce one more container? Who benefits from binary distribution of this sort?
I read the GitHub repository README and the comments here, and I found absolutely nothing that suggests the need for the first two paragraphs you wrote. It seems they stem from a misconception on your side about the purpose of this project.
About your question in the third paragraph: this is totally orthogonal to GGUF, and a cursory reading of the README shows that it does use GGUF. This is not about a new universal LLM format; this is about packing it into a universal executable that runs everywhere, using Cosmopolitan.
Some examples do pack the executable and GGUF weights together in a single file, but that's not dissimilar from a self-executing zip; the only difference is that this executable is not OS-specific, so you can use the exact same binary on macOS or Linux, for example.
> llama.cpp already compiles and runs pretty much everywhere.
Well, it simplifies things when you don't need to compile anything.
Also, you literally can't download or compile the wrong binary by mistake; it's the same binary for the entire Cartesian product of supported processors and OSes.
> Why introduce one more container?
It makes stuff more convenient.
`application/zip` is also a ubiquitous standard. I doubt anyone is being "introduced to it".
I also appreciate the fact that tooling for handling `application/zip` is very widespread, so you don't need totally bespoke tooling to retrieve the models from inside a `llamafile`.
> Who benefits from binary distribution of this sort?
Anyone that doesn't have a compiler SDK on their computer.
What are you on about? There was no stealing and there was no plagiarism.
They made a PR that was built on top of another PR. The authorship information was preserved in the git history, and there was no attempt at deception. They also supposedly collaborated with the author of the original PR (which was never denied by either of them). All of this is totally normal working practice.
Those allegations of "stealing" just stem from a GH user piling onto the drama from the breaking change by pointing out where the initials from the new file format come from (which wasn't called into question on the original PR).
They were also not banned over those stealing allegations. Both they and the author of the reversal PR were banned because the maintainer deemed the resulting "drama" from the breaking change a distraction from the project's goals. The maintainer accepted the PR, and the nature of the breaking changes was clearly stated, so that drama wasn't entirely on jart.
You obviously didn't read the post, which shows the code, the words of the original author, the link to the original PR, and the user jart taking credit. It also shows her not understanding what she took and ultimately being fundamentally wrong about mmap.
It's not so clear cut. The author of the original PR had serious gripes about jart's handling of the situation, especially how hard they pushed their PR, practically forcing the merge before legitimate concerns were addressed.
[0]: https://justine.lol/cosmopolitan/
[1]: https://www.freechat.run
[2]: https://github.com/ggerganov/llama.cpp