nkaz123's comments | Hacker News

Yeah, essentially this. The irony of wanting to obscure information by submitting it to a model API isn't lost on me, but it was the easiest way I could think of. I wanted some way of making the key content in my picture the only thing left unblurred.
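
Roughly, the masking step looks like this (a minimal Python sketch; the bounding box here stands in for whatever region the model API actually returns):

    # Minimal sketch, assuming the model API returns a bounding box
    # (x0, y0, x1, y1) for the key content -- the box values are made up.
    from PIL import Image, ImageFilter

    def blur_except(path, box):
        """Blur the whole image, then paste the original key region back."""
        original = Image.open(path)
        blurred = original.filter(ImageFilter.GaussianBlur(radius=12))
        blurred.paste(original.crop(box), box[:2])
        return blurred

    blur_except("photo.png", (120, 80, 430, 210)).save("photo_blurred.png")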


I accepted the bugginess in the browser game as unavoidable, and probably had too much faith in the LLM implementations, but I did a bit more troubleshooting than I mentioned. The progressive improvement over episodes (and, intuitively, that PPO > the others) gave me some confidence, and I've since used a similar setup on 2048 with more results showing improvement over episodes: https://wandb.ai/noahpunintended/2048-raspberry-pi?nw=nwuser...
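
For the curious, the training harness is conceptually like the sketch below, with CartPole standing in for my actual 2048/browser environment (which needs a custom wrapper) and assuming stable-baselines3, gymnasium, and wandb:

    # Sketch of the loop: PPO plus per-episode reward logging to wandb,
    # which is what makes the improvement-over-episodes curve visible.
    import gymnasium as gym
    import wandb
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import BaseCallback

    class EpisodeRewardLogger(BaseCallback):
        def _on_step(self) -> bool:
            for info in self.locals.get("infos", []):
                if "episode" in info:  # added by the Monitor wrapper at episode end
                    wandb.log({"episode_reward": info["episode"]["r"]})
            return True

    wandb.init(project="2048-raspberry-pi")
    model = PPO("MlpPolicy", gym.make("CartPole-v1"), verbose=0)
    model.learn(total_timesteps=50_000, callback=EpisodeRewardLogger())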


I recently updated my .github.io to route to a domain name I purchased so that could be why it's getting blocked right now.

As a disclaimer: this comment alone has more tamagotchi lore than my post, in case that saves you a read haha.


I'm sorry for the clickbait


Inspired by Bill and Ted’s Excellent Adventure (https://www.youtube.com/watch?v=GiynF8NQzgo), I created an alternative to AI agents: to do a task, you simply provide a future worker with some details like description, time, and place. Then, assuming time travel is invented, someone will come back and do it, and be paid based on the interest accrued on the original payment.


There are two models in use here: whisper tiny for transcribing audio, and llama 3 for responding.

Whisper tiny is multilingual (though I am using the English-specific variant), and I believe llama 3 is technically capable of multilingual output, but I'm not sure of any benchmarks.

I think it could be made better, but for now the focus is English. I'll add this to the readme though. Thanks!
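
The pipeline itself is conceptually just this (a rough sketch assuming openai-whisper and the ollama Python client; the real code also handles the microphone and audio capture):

    # Two-model pipeline sketch: whisper tiny.en transcribes, llama 3 replies.
    # Assumes an ollama server is running with the llama3 model pulled.
    import whisper
    import ollama

    stt = whisper.load_model("tiny.en")  # the English-specific tiny variant
    text = stt.transcribe("question.wav")["text"]

    reply = ollama.chat(model="llama3",
                        messages=[{"role": "user", "content": text}])
    print(reply["message"]["content"])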


I'm really inspired reading this and hope it can help! I'm planning to put more work into this. I have a few rough demos of it in action on youtube (https://www.youtube.com/watch?v=OryGVbh5JZE) which should give you an idea of the quality of it at the moment.


I think so. I plan to update the README in the coming days with more info, but realistically this is something you could run on your laptop too, barring some changes to how it accesses the microphone and camera. I assume the same is true for other boards.

The only thing that might pose an issue is the total RAM needed for whatever LLM is responsible for responding to you, but there's a wide variety of models available on ollama, Hugging Face, etc. that can work.
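
A back-of-the-envelope way to estimate that (a rough heuristic, not an exact figure):

    # Approximate RAM for a quantized model: params x bytes per weight,
    # plus ~20% overhead for KV cache and runtime (a rough rule of thumb).
    def approx_ram_gb(params_billion, bits_per_weight, overhead=1.2):
        return params_billion * (bits_per_weight / 8) * overhead

    print(approx_ram_gb(8, 4))   # Llama 3 8B at ~4-bit: ~4.8 GB
    print(approx_ram_gb(8, 16))  # the same model at fp16: ~19 GB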


Thanks. My ideal use case would probably involve a mini PC; I have a few spares, and most of them are faster than the Pi 5 and can be upgraded to more RAM, which would be especially welcome in this context.


I watched the demo. To be honest, if I had seen it sooner I probably would have tried to start this as a fork from there. Any idea what the issue was?


No, but the core concept changed so much between the firmware version when I bought it and what it is now that I'm not surprised the upgrade is a buggy process.

It's a shame, because I only wanted a tenth of what it could do: I just wanted it to send text to a REST server. That's all! And I had it working! But I saw there was a major firmware update, and to make a long story short, KABOOM.


I'm the founder of Willow.

Early versions sent requests directly from devices and we found that to be problematic/inflexible for a variety of reasons.

Our new architecture uses the Willow Application Server (WAS), which speaks a standard generic protocol to devices and handles the subtleties of the various command endpoints (Home Assistant, REST, openHAB, etc.). This can be activated in WAS with the "WAS Command Endpoint" configuration option.
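
To illustrate the pattern (this is not our actual protocol or code, just a Python sketch of the idea; the Home Assistant REST call is real, everything else is illustrative):

    # Devices emit one generic command shape; the server translates it
    # per endpoint. Illustrative only -- not Willow's implementation.
    import requests

    def dispatch(command, endpoint, base_url, token=None):
        if endpoint == "home_assistant":
            # e.g. command = {"domain": "light", "service": "turn_on",
            #                 "entity_id": "light.kitchen"}
            requests.post(
                f"{base_url}/api/services/{command['domain']}/{command['service']}",
                headers={"Authorization": f"Bearer {token}"},
                json={"entity_id": command["entity_id"]},
                timeout=5,
            )
        elif endpoint == "rest":
            # plain REST endpoints just get the command forwarded as JSON
            requests.post(base_url, json=command, timeout=5)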

This approach also enables a feature we call Willow Auto Correct (WAC) that our users have found to be a game-changer in the open source voice assistant space.

I don't want to seem like I'm selling or shilling Willow; I just have a lot of experience with this application and many hard-learned lessons over the past two decades.


I'm the founder of Willow.

Generally speaking, the approach of "run this on your Pi with distro X" has a surprising number of issues that make it, umm, "rough" for this application. As anyone who has tried to get reliable low-latency audio on Linux can attest, the hoops to jump through to get a standard distro (audio drivers, ALSA, PulseAudio/PipeWire, etc.) to work well are many. This is why all commercial solutions using similar architectures/devices build the "firmware" from the kernel up with stuff like Yocto.

More broadly, many people have tried this approach (run everything on the Pi), and frankly it just doesn't have a chance of being even remotely competitive with commercial solutions in terms of quality and responsiveness. Decent-quality speech recognition with Whisper (as one example) itself introduces significant latency.

We (and our users) have determined over the past year that in practice Whisper medium is pretty much the minimum for reliable transcription for typical commands under typical real-world conditions. If you watch a lot of these demo videos you'll notice the human is very close to the device and speaking very carefully and intently. That's just not the real world.

We haven't benchmarked Whisper on the Pi 5, but the HA community has, and they find it to be about 2x the performance of the Pi 4 for Whisper. If you look at our benchmarks[0], that means you'll be waiting at least 25 seconds just for transcription of typical voice commands with Whisper medium.

At that point you can find your phone, unlock it, open an app, and just do it yourself faster with a fraction of the frustration. If you have to repeat yourself even once, you're looking at a minute for an action to complete.

The hardware just isn't there. Maybe on the Raspberry Pi 10 ;).

We have a generic software-only Willow client for "bring your distro and device" and it does not work anywhere close to as well as our main target of ESP32-S3 BOX-based devices (acoustic engineering, targeted audio drivers, real-time operating system, known hardware). That's why we haven't released it.

Many people also want multiple devices, and this approach becomes even more cumbersome when you're trying to coordinate wake activation and functionality between all of them. You then also end up with a fleet of devices running full-blown Linux distros, which becomes difficult to manage.

Using random/generic audio hardware also has significant issues of its own. Many people in the Home Assistant community who have tried the "just bring a mic and speaker!" approach end up buying > $50 USB speakerphone devices with the acoustic engineering necessary for reliable far-field audio. These are often powered by XMOS chips that themselves cost a multiple of the ESP32-S3 (as one example). So at this point you're in the range of > $150 per location for a solution that is still nowhere near the experience of Echo, Google Home, or Willow :).

I don't want to seem critical of this project, but I think it's important to set very clear expectations before people go out and spend money thinking it's going to resemble anything close to a commercial voice assistant experience.

[0] - https://heywillow.io/components/willow-inference-server/#com...


It fully depends on the model and how much conversational context you provide, but if you keep things to a bare minimum, it's under ~5 seconds from message received to starting the response with Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds, so the next idea is to take a more basic image captioning model, insert its output into context, and try to cut that time down even more.
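
The rough shape of that next idea, with BLIP standing in for "a more basic image captioning model" (a sketch assuming transformers and the ollama client):

    # Caption the frame with a small model, then splice the caption into
    # the LLM context instead of running a full vision-language model.
    from transformers import BlipProcessor, BlipForConditionalGeneration
    from PIL import Image
    import ollama

    name = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(name)
    captioner = BlipForConditionalGeneration.from_pretrained(name)

    inputs = processor(images=Image.open("frame.jpg"), return_tensors="pt")
    caption = processor.decode(captioner.generate(**inputs)[0],
                               skip_special_tokens=True)

    reply = ollama.chat(model="llama3", messages=[{
        "role": "user",
        "content": f"The camera sees: {caption}. Respond to the user.",
    }])
    print(reply["message"]["content"])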

I also tried using Vulkan, which is supposedly faster, but the times were a bit slower than plain CPU with llama.cpp.

