Thank you for creating this demo. This was the point I was trying to make when the Gemini launch happened. All that hoopla for no reason.
Yes - GPT-4V is a beast. I’d even encourage anyone who cares about vision or multi-modality to give LLaVA a serious shot (https://github.com/haotian-liu/LLaVA). I have been playing with the 7B q5_k variant for the last couple of days and I am seriously impressed with it. Impressed enough to build a demo app/proof-of-concept for my employer (will have to check the license first, or I might only use it for the internal demo to drive a point).
It's so great. I've been using this vision model to rename all the files in my Pictures folder. For example, the one-liner:
llamafile --temp 0 \
--image ~/Pictures/lemurs.jpg \
-m llava-v1.5-7b-Q4_K.gguf \
--mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
--grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
-p $'### User: What do you see?\n### Assistant: ' \
--silent-prompt 2>/dev/null |
sed -e's/ /_/g' -e's/$/.jpg/'
Prints to standard output:
a_baby_monkey_on_the_back_of_a_mother.jpg
This is something that's coming up in the next llamafile release. Right now you have to build from source to be able to use grammars and --silent-prompt with a vision model.
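To actually rename the whole folder rather than print one name, the one-liner above can be wrapped in a loop. This is only a sketch: it assumes the same llamafile flags and GGUF paths shown above, and that every caption comes back matching the grammar.

```shell
#!/bin/sh
# Rename every JPEG in ~/Pictures using the caption LLaVA generates.
# Assumes llamafile and the two GGUF files from the one-liner above
# are in the current directory / on PATH.
for img in ~/Pictures/*.jpg; do
  name=$(llamafile --temp 0 \
    --image "$img" \
    -m llava-v1.5-7b-Q4_K.gguf \
    --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
    --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
    -p $'### User: What do you see?\n### Assistant: ' \
    --silent-prompt 2>/dev/null |
    sed -e's/ /_/g' -e's/$/.jpg/')
  # Skip empty captions; -n refuses to clobber an existing file.
  [ -n "$name" ] && mv -n -- "$img" "$(dirname "$img")/$name"
done
```

The `mv -n` guard matters because two similar photos can get the same caption at --temp 0.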
Truly grateful for your work on cosmopolitan, cosmo libc, redbean, nudging POSIX towards realizing the unachieved dream and also for contributing to llama.cpp. It’s like wherever I look, you’ve already left your mark there!
To me, you exemplify and embody the spirit of OSS, and to top that - you seem to be just an amazing human. You are an inspiration for me and many others. And even though I know I’ll never ever get close, you make me want to try. Thank you. :)
That's cool! I've been a fan of your projects here since redbean was released, and if I understood C I would be more excited about the underlying tech that runs all these tools, but I'm more of an algorithm designer and back-end data processing system programmer (I use Python), so watching the progression of your technology is very impressive but I barely understand how it works :)