A New Vision for Voice Assistants

Pi-card, a DIY voice assistant with eyes, uses a Raspberry Pi 5, speaker, mic, and camera to run a vision language model entirely locally.

Nick Bild
Pi-card is an LLM-based voice assistant with vision (📷: Noah Kasmanoff)

Ever since large language models (LLMs) rose to prominence, it has been clear that they are the perfect technology to power voice assistants. Given their understanding of natural language, vast knowledge of the world, and human-like conversational abilities, everyone knew that this combination would be the best thing since peanut butter and jelly first met on the same slice of bread. Unfortunately, commercial products have been slow to catch up with what consumers want, and most still rely on older technologies.

In all fairness, LLMs may be slow to roll out to voice assistants due to the massive amount of processing power required to run them, which makes the business model more than a little bit muddy. Hardware hackers do not have these same concerns, so they have turned their impatience into action. Many DIY LLM-powered voice assistants have been created in the past couple of years, and we have covered lots of them here at Hackster News (see here and here). Now that reasonably powerful LLMs can run on even constrained platforms like the Raspberry Pi, the pace at which these new voice assistants are being cranked out is only picking up.

A voice assistant with eyes

The latest entry into the field, created by a data scientist named Noah Kasmanoff, has some interesting features that make it stand out. Called Pi-card (for Raspberry Pi - Camera Audio Recognition Device, and also a forced Star Trek reference), this voice assistant runs 100 percent locally on a Raspberry Pi 5 single board computer. As expected, the usual equipment for a voice assistant is also there — a speaker and a microphone. But interestingly, Pi-card also comes equipped with a camera.

The assistant waits for a configurable wake word (“hey assistant” by default), then starts recording the user’s voice. The recording is transcribed to text, which is passed to a locally-running LLM as a prompt. The response is fed into text-to-speech software and played over the speaker to provide an audible reply. A nice feature is that interactions are not one-and-done. Rather, a conversation can build up over time, and earlier parts of the exchange can be referenced. The conversation continues until a keyword, such as “goodbye,” is spoken to end it.
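To make that flow concrete, here is a minimal sketch of the interaction loop in Python. It is not Pi-card's actual code; the callables passed in are placeholders standing in for mic capture, whisper.cpp transcription, the local LLM, and text-to-speech.

```python
# Minimal sketch of the wake-word conversation loop described above (not
# Pi-card's actual implementation). The callables passed in are placeholders
# for mic capture, speech-to-text, the local LLM, and text-to-speech.
def conversation_loop(record_audio, transcribe, generate_reply, speak,
                      wake_word="hey assistant", stop_word="goodbye"):
    history = []   # earlier turns are kept so follow-up questions can refer back
    awake = False
    while True:
        text = transcribe(record_audio()).lower()
        if not awake:
            awake = wake_word in text    # ignore everything until the wake word
            continue
        if stop_word in text:
            break                        # "goodbye" ends the conversation
        history.append({"role": "user", "content": text})
        reply = generate_reply(history)  # the LLM sees the full conversation so far
        history.append({"role": "assistant", "content": reply})
        speak(reply)                     # audible response over the speaker
```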

The LLM selected by Kasmanoff is actually a vision language model, which is where the camera comes in. With Pi-card, it is possible to ask the assistant “what do you see” to trigger an image capture, which the vision language model then describes. Not bad at all for a local setup.
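One way to wire that up is a simple routing step in front of the model. The sketch below is a hypothetical illustration, with capture_image() and describe_image() standing in for the Raspberry Pi camera capture and the vision language model (Moondream2 in Pi-card's case).

```python
# Hypothetical routing step for the "what do you see" feature. capture_image()
# and describe_image() are placeholders for the camera capture and the vision
# language model; generate_reply() is the text-only path.
def handle_query(text, history, generate_reply, capture_image, describe_image):
    if "what do you see" in text.lower():
        image = capture_image()                   # grab a frame from the camera
        return describe_image(image, prompt=text) # VLM explains what is in view
    return generate_reply(history + [{"role": "user", "content": text}])
```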

Want a better option? Make your own!

The glue logic is written in Python, which calls whisper.cpp for transcription and llama.cpp to run the LLM. The Moondream2 vision language model was used in this case, but there is room to swap in a model of each user's preference. Using the C++ implementations of these tools helps keep execution speed as high as possible on the Pi's modest hardware.
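For readers curious what that glue might look like, below is a rough sketch of shelling out to the two tools from Python. The binary names, flags, and model paths are assumptions that depend on how the tools are built and configured, so treat this as a pattern rather than Pi-card's exact commands.

```python
import subprocess

# Assumed paths; adjust to wherever the binaries and models live on your system.
WHISPER_BIN = "./whisper.cpp/main"
WHISPER_MODEL = "models/ggml-base.en.bin"
LLAMA_BIN = "./llama.cpp/main"
LLAMA_MODEL = "models/moondream2-text-model.gguf"

def transcribe(wav_path):
    """Transcribe a WAV file to text by invoking the whisper.cpp CLI."""
    result = subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

def generate_reply(prompt, max_tokens=128):
    """Generate a response by invoking the llama.cpp CLI on a local model."""
    result = subprocess.run(
        [LLAMA_BIN, "-m", LLAMA_MODEL, "-p", prompt, "-n", str(max_tokens)],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()
```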

Setup is very simple on the hardware side — just a few wires to plug in. As for the software, the code is available in a GitHub repository, and there are instructions as well that should make it pretty easy to get things up and running quickly. Kasmanoff admits that the assistant is only somewhat helpful, and that it is not especially fast, but improvements are in the works, so be sure to bookmark this one to check back later.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.