Overview
This project combines a number of off-the-shelf components into a private, secure, and simple AI companion that answers any question I throw at it.
My AI assistant takes a similar approach to other successful voice recognition systems (namely Rhasspy). In my case, however, a Raspberry Pi 4B runs voice recognition software (VOSK) locally and interfaces with a large language model hosted on one of my PCs through an OpenAI-compatible API endpoint (Ollama). Thanks to NordVPN's Meshnet, I can do that from anywhere in the world (and actually do).
Here's a short video that shows my AI assistant hard at work, answering random questions about Paris' most famous building.
Getting started
This guide will show how I built my own AI assistant. I already mentioned most of the things I used for this project. However, there is still a lot to cover.
The project’s repository is available here: https://github.com/RoseyWasTaken/ASR-AI
The project’s structure looks something like this:
.
├── README.md
├── assistant.py
├── gui.py
├── images
│   ├── sleeping
│   │   ├── Sleeping1.png
│   │   └── Sleeping2.png
│   ├── speaking
│   │   ├── speaking1.png
│   │   └── speaking2.png
│   └── thinking
│       ├── thinking1.png
│       └── thinking2.png
└── requirements.txt
Setting up Ollama
Ollama lets you set up language models with a couple of clicks. It's unbelievable how streamlined the process has become. There are even installers for macOS and Windows, ready to be downloaded and put to use.
With a fairly new and powerful graphics card, anyone can take advantage of GPU acceleration, making the model’s response times significantly quicker. However, even on a midrange CPU (like the Ryzen 5 5600G APU I’ve got in my home lab server) the responses are nearly instant, especially with smaller language models.
I went with a Docker container for my installation method, as I am fairly familiar with it and use it for other containerized services.
For inference running on CPU only:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
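If you have a supported NVIDIA GPU, the Ollama docs provide a GPU-enabled variant of the same command (this assumes the NVIDIA Container Toolkit is already installed):
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama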
Next, download and interact with the model:
docker exec -it ollama ollama run llama3
You can now chat with the model directly in this interactive shell.
There are other models available. Llama3 is great but runs a lot better on GPUs. Consider running Phi3, as it’s super lightweight and runs great on a CPU.
See the official Ollama repository — ollama/ollama
To exit the interactive shell:
/bye
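Before moving on, it's worth a quick sanity check that the HTTP API responds from outside the interactive shell. Here's a minimal sketch using Python's requests library, assuming the container listens on the default port 11434 and you've pulled llama3 (swap in phi3 if that's what you're running):

import requests

# Assumes Ollama's default port; replace localhost with the server's
# address when testing from another machine.
url = "http://localhost:11434/api/generate"
body = {
    "model": "llama3",  # or "phi3", whichever model you pulled
    "prompt": "Reply with one word: pong",
    "stream": False,    # wait for the complete response
}

r = requests.post(url, json=body, timeout=120)
r.raise_for_status()
print(r.json().get("response"))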
That's about it for setting up Ollama.
Working with VOSK
VOSK has bindings for all popular languages, and even though I'm more familiar with JavaScript, I chose to go with Python, as it seemed easier to set up.
The repository has many great examples that are helpful to start working with the library.
I used the ‘test_microphone.py’ example available here: https://github.com/alphacep/vosk-api/blob/master/python/example/test_microphone.py
#!/usr/bin/env python3

# prerequisites: as described in https://alphacephei.com/vosk/install and also python module `sounddevice` (simply run command `pip install sounddevice`)
# Example usage using Dutch (nl) recognition model: `python test_microphone.py -m nl`
# For more help run: `python test_microphone.py -h`

import argparse
import queue
import sys

import sounddevice as sd
from vosk import Model, KaldiRecognizer

q = queue.Queue()

def int_or_str(text):
    """Helper function for argument parsing."""
    try:
        return int(text)
    except ValueError:
        return text

def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument(
    "-l", "--list-devices", action="store_true",
    help="show list of audio devices and exit")
args, remaining = parser.parse_known_args()
if args.list_devices:
    print(sd.query_devices())
    parser.exit(0)
parser = argparse.ArgumentParser(
    description=__doc__,
    formatter_class=argparse.RawDescriptionHelpFormatter,
    parents=[parser])
parser.add_argument(
    "-f", "--filename", type=str, metavar="FILENAME",
    help="audio file to store recording to")
parser.add_argument(
    "-d", "--device", type=int_or_str,
    help="input device (numeric ID or substring)")
parser.add_argument(
    "-r", "--samplerate", type=int, help="sampling rate")
parser.add_argument(
    "-m", "--model", type=str, help="language model; e.g. en-us, fr, nl; default is en-us")
args = parser.parse_args(remaining)

try:
    if args.samplerate is None:
        device_info = sd.query_devices(args.device, "input")
        # soundfile expects an int, sounddevice provides a float:
        args.samplerate = int(device_info["default_samplerate"])

    if args.model is None:
        model = Model(lang="en-us")
    else:
        model = Model(lang=args.model)

    if args.filename:
        dump_fn = open(args.filename, "wb")
    else:
        dump_fn = None

    with sd.RawInputStream(samplerate=args.samplerate, blocksize=8000, device=args.device,
                           dtype="int16", channels=1, callback=callback):
        print("#" * 80)
        print("Press Ctrl+C to stop the recording")
        print("#" * 80)

        rec = KaldiRecognizer(model, args.samplerate)
        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                print(rec.Result())
            else:
                print(rec.PartialResult())
            if dump_fn is not None:
                dump_fn.write(data)

except KeyboardInterrupt:
    print("\nDone")
    parser.exit(0)
except Exception as e:
    parser.exit(type(e).__name__ + ": " + str(e))
And while I do appreciate the authors taking the time to provide a robust example, I wanted a slightly leaner approach.
Hence, the following code:
import queue, sys, json, requests
import sounddevice as sd
from vosk import Model, KaldiRecognizer
from gui import display_face

q = queue.Queue()
mic = 3  # microphone device index; list available devices with sd.query_devices()
device_info = sd.query_devices(mic, "input")
samplerate = int(device_info["default_samplerate"])
model = Model(model_path="vosk-model-small-en-us-0.15")

def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

def voice_inference(rec):
    """Block until VOSK produces a non-empty final transcription."""
    while True:
        data = q.get()
        if rec.AcceptWaveform(data):
            inference = json.loads(rec.Result())
            text = inference.get("text")
            if text:
                return text

def modelRequest(message):
    """Send the transcribed text to Ollama and return the model's reply."""
    url = "http://100.102.228.152:11434/api/generate"
    body = {
        "model": "llama3",
        "prompt": message + " Be very brief. In one sentence.",
        "stream": False  # respond once the entire response is ready
    }
    x = requests.post(url, json=body)
    response = json.loads(x.text)
    return response.get("response")

try:
    with sd.RawInputStream(samplerate=samplerate, blocksize=8000, device=mic,  # adjust device if needed
                           dtype="int16", channels=1, callback=callback):
        rec = KaldiRecognizer(model, samplerate)
        display_face("sleeping")
        while True:
            wakeWord = voice_inference(rec)
            print(wakeWord)
            if wakeWord == "hey robot":
                display_face("thinking")
                print("Wake word detected.")
                heard = voice_inference(rec)
                print("I heard: " + str(heard))
                response = modelRequest(heard)
                display_face("speaking", text=response)
                print(response)
            else:
                print("Sleeping")
                display_face("sleeping")
except KeyboardInterrupt:
    print("\nDone")
    exit(0)
I am by no means a well-seasoned developer, but I can work things out given enough time. The above code might not be the cleanest or the most efficient, but it does the job.
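One small improvement worth considering (not in the code above) is loosening the wake-word check. An exact comparison like wakeWord == "hey robot" fails whenever VOSK picks up filler words around the phrase. A substring test is more forgiving; here's a sketch:

WAKE_PHRASE = "hey robot"

def is_wake_word(transcript):
    """True if the wake phrase appears anywhere in the transcription."""
    return WAKE_PHRASE in (transcript or "").lower()

print(is_wake_word("uh hey robot please"))  # True
print(is_wake_word("hello robot"))          # False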
Adding a GUI
Multiple libraries can take care of creating a familiar GUI. My choice was pygame, though Tkinter seems to be a popular choice as well.
import pygame
import os

imagePath = "images"
X = 480
Y = 320

sprites = {}
# sprites = {"sleeping": [pygame.image1, pygame.image2],
#            "thinking": [pygame.image3, pygame.image4], ...
#           }

def init_pygame():
    pygame.init()
    window = pygame.display.set_mode((X, Y))
    pygame.display.set_caption('AI Assistant')
    return window

def load_sprites():
    # Each subdirectory of images/ becomes a named animation state
    for i in os.listdir(imagePath):
        currentDir = []
        for j in os.listdir(os.path.join(imagePath, i)):
            image = pygame.image.load(os.path.join(imagePath, i, j))
            image = pygame.transform.scale(image, (X, Y))
            currentDir.append(image)
        sprites[i] = currentDir
    return sprites

window = init_pygame()
sprites = load_sprites()

def display_face(face, text=""):
    displayImage = sprites[face][0]
    window.blit(displayImage, (0, 0))
    if face == "speaking":
        # The speaking state shows the response text on a white screen
        window.fill((255, 255, 255))
        font_size = 36
        font_color = (0, 0, 0)
        line_spacing = 1.2  # Adjust this value to set the spacing between lines
        pygame.font.init()
        font = pygame.font.Font(None, font_size)

        # Split the text into lines based on a maximum line width
        max_line_width = X - 20  # Adjust this value as needed
        lines = []
        current_line = ""
        for word in text.split():
            test_line = current_line + word + " "
            test_width, _ = font.size(test_line)
            if test_width <= max_line_width:
                current_line = test_line
            else:
                lines.append(current_line)
                current_line = word + " "
        lines.append(current_line)

        # Calculate the total text height
        total_height = len(lines) * int(font_size * line_spacing)

        # Position the text on the window, centered vertically and horizontally
        text_y = (Y - total_height) // 2
        for line in lines:
            text_surface = font.render(line, True, font_color)
            text_x = (X - text_surface.get_width()) // 2
            window.blit(text_surface, (text_x, text_y))
            text_y += int(font_size * line_spacing)

    pygame.display.update()

    # Check for any pygame events (necessary to allow the window to close)
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            return
Networking
Since one of the points of this project was for the device to be portable, I needed a way to access Ollama’s API easily.
There are multiple options: port forwarding with a reverse proxy and dynamic DNS, setting up WireGuard on the satellite device and the server, getting a static IP and forwarding ports, and so on.
However, I chose NordVPN's Meshnet, a WireGuard-based implementation that creates direct tunnels between devices. This way, I don't have to pay for a domain name or a static IP, set up port forwarding, or run dynamic DNS. All you need to do is download the NordVPN client and log in. No fees, no subscriptions, and no complicated setup.
Additionally, it's available on all major platforms (Windows, macOS, Linux, Android, and iOS) and comes with an open-source Linux client.
There is a handy command that downloads and installs the Linux client:
sh <(curl -sSf https://downloads.nordcdn.com/apps/linux/install.sh)
Once downloaded and installed, the easiest way to log in is to use a token. There is a straightforward guide available — How to use a token with NordVPN on Linux
The same goes for the host machine, regardless of whether it's Linux, macOS, or Windows-based.
All you need to do is install the NordVPN client and log in to the same account.
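With both machines logged in to the same account, the server becomes reachable at its Meshnet address; that's where the 100.x IP hardcoded in assistant.py comes from. A quick reachability check from the Pi might look like this (a sketch; substitute the Meshnet IP or Nord name of your own host):

import requests

# Meshnet address of the Ollama host (yours will differ); find it with
# `nordvpn meshnet peer list` on the Pi.
OLLAMA_HOST = "http://100.102.228.152:11434"

try:
    # Ollama's root endpoint answers with "Ollama is running"
    r = requests.get(OLLAMA_HOST, timeout=5)
    print("Reachable:", r.status_code, r.text)
except requests.RequestException as e:
    print("Not reachable over Meshnet:", e)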
Hardware
With a working AI API, automatic speech recognition, and a simple GUI to show us the results, it’s time for the hardware side of things.
A lightweight Linux-based single-board computer with at least 4 GB of RAM should be enough to handle sending POST requests and a simple UI. However, keeping the project budget-friendly is also a key point, so a cheap GPIO screen is a much better choice than a portable screen with an HDMI port.
The obvious choice is a Raspberry Pi.
Add a generic 3.5" GPIO screen with or without touch.
Finally, a microphone is needed, as RPis don't come with one built in. Again, staying budget-friendly means that a microphone array for around $90.00 is out of the question.
With the help of Charles Rouchon's Benchmarking Microphone Arrays article, I settled on the PS Eye (not to be confused with the PS EyeToy). While it's probably better known for being a camera, it also sports a 4-microphone array, all for around $4.00 (shipping not included).
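Once the PS Eye is plugged in, finding the right value for the hardcoded mic = 3 device index in assistant.py is a one-liner with sounddevice:

import sounddevice as sd

# Prints every audio device with its index; look for the PS Eye entry
# and use that index as `mic` in assistant.py.
print(sd.query_devices())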
I designed a 3D-printed shell and put it together like a fine BLT sandwich (Bicrophone, Lcreen, TaspberryPi anyone?) while using a liberal amount of electrical tape to make sure none of the components could get to know each other too intimately.
Putting everything together required a bit of hackery, since the case I designed is quite small and doesn't allow for anything to be plugged into the USB-A ports.
I spent a couple of minutes probing the pins on the underside of the USB headers to find out what each one was responsible for (a standard USB 2.0 port carries four lines: VBUS at +5 V, D−, D+, and GND).
Once I knew which pin was which, I soldered the PS Eye's wires directly to the corresponding pins.
As mentioned earlier, there isn’t too much room to spare inside the case. A liberal amount of electrical tape is advised.