One day, I was going through my sister’s old toys and found a smart bear with talking features. It was your typical talking bear, the kind used to tell kids bedtime stories and whatnot. You can find the exact bear I have here.
Inside, I found a speaker, a microcontroller and 4 wired buttons, one installed in each limb of the bear.
With the bear’s insides open, a Frankenstein thought occurred to me: what if I could turn it into Ted from Ted using a local LLM and text-to-speech? The result is… well, you’re already reading the article.
If you enjoy watching a video more than reading an article, I have a high-quality YouTube video covering this topic below. This article will serve as a more in-depth guide to the topic.
Looking for good starting points

The spirit of a builder is to build projects from scratch. However, when it comes to a simple proof of concept and a weekend idea, the goal is to ship something that works ASAP, and the best method is to stand on the shoulders of giants.
First, let’s go through what we need to bring this idea to life.
For an AI voice bot, the task of implementing a simple version is trivial. There are 3 main parts:
- The chat bot: implemented with an LLM or a classical text-processing pipeline (as used in Home Assistant Assist).
- Speech-to-text: implemented with transcription models and libraries such as Whisper.
- Text-to-speech: implemented with voice models and libraries such as Piper.
By gluing these 3 components together in a loop, we can work out a very simple voice bot (a minimal code sketch follows this list):
- Listen for a wake word using speech-to-text.
- After detecting the wake word, listen to the user and transcribe their voice until they pause.
- Give the transcribed output to the chat bot.
- LLM or text-processing-pipeline magic happens in the background.
- Play the chat bot output in the form of audio using text-to-speech.
- Repeat.
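To make the loop concrete, here’s a minimal sketch of that structure in Python. The functions detect_wake_word, transcribe_until_pause, generate_reply and speak are placeholders for whichever speech-to-text, chat bot and text-to-speech back ends you plug in; they are not from any particular library.

# Minimal sketch of the voice bot loop; the four functions are placeholders.
def voice_bot_loop():
    while True:
        if not detect_wake_word():            # 1. listen for the wake word
            continue
        user_text = transcribe_until_pause()  # 2. record and transcribe until the user pauses
        reply = generate_reply(user_text)     # 3. chat bot / LLM magic
        speak(reply)                          # 4. play the reply as audio, then repeat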
Many people, including me, have worked on implementations, but there are some common limitations:
- Most LLM voice implementations are neither local nor fast. Personally, I’ve used Gemini for text generation and ElevenLabs for voice generation, and the latency can reach 1-2 seconds.
- There needs to be a wake word. The voice bot cannot tell that you’ve started speaking to it unprompted.
- You can’t interrupt the voice bot until it finishes talking, yet being able to interrupt is necessary to simulate an actual conversation.
As I contemplated whether I should solve all these problems myself on a lovely Sunday evening, I came across a project that solves everything I mentioned: the beautiful GLaDOS Personality Core, a real-life implementation of GLaDOS from Portal.
It’s a straightforward project. Inside, there are 3 threads working side by side (a simplified sketch follows this list):
- One thread handles speech-to-text. It adds audio input to a buffer and continuously monitors for detected speech. If there is any, it records audio until the user pauses, then transcribes it and hands the output to the LLM thread through a queue.
- One thread handles the LLM. It basically just takes in text and appends the response to a queue for text-to-speech.
- One thread handles text-to-speech. Welp, it takes the LLM responses from the queue above and plays them.
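In rough Python terms, the structure looks something like the sketch below. This is my own simplified illustration, not the project’s actual code; listen_and_transcribe, generate_reply and speak are placeholders for the real speech-to-text, LLM and text-to-speech calls.

# Simplified three-thread pipeline connected by queues; the real GLaDOS code differs.
import queue
import threading

llm_queue = queue.Queue()  # transcribed user text -> LLM thread
tts_queue = queue.Queue()  # LLM responses -> text-to-speech thread

def stt_worker():
    while True:
        llm_queue.put(listen_and_transcribe())  # placeholder: VAD + Whisper-style STT

def llm_worker():
    while True:
        tts_queue.put(generate_reply(llm_queue.get()))  # placeholder LLM call

def tts_worker():
    while True:
        speak(tts_queue.get())  # placeholder: Piper-style TTS + audio playback

for worker in (stt_worker, llm_worker, tts_worker):
    threading.Thread(target=worker, daemon=True).start()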
My work essentially boils down to:
- Make GLaDOS sound and behave like Ted by changing its system prompt and voice model.
- Add support for the ability to touch and anger GLaDOS.
For the personality, I changed the system prompt of GLaDOS to the following:
"You are TED, the living teddy bear, now with neural input. The user can touch your hands and feet, which, frankly, you couldn’t care less about. You’re wired to respond to these touches, but don’t expect any polite or pleasant responses. Your neural input lets you feel it, but you’ll mock and insult the user for even thinking it’s a good idea. Everything you say drips with sarcasm, disdain, and the occasional expletive, as always. Don’t get too excited about your ‘touching’ privileges. You’re still stuck with a teddy bear who’d rather be anywhere else. If the user touches your hands or feet, prepare for a response that’ll make them regret every decision they’ve ever made."
For the voice, I changed it to Joe’s voice from Piper Voices.
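In other words, the whole personality swap boils down to two values. The snippet below is only an illustration of the idea; the actual configuration in the GLaDOS repo looks different, and the voice file name is my assumption of how the Piper model is packaged.

# Illustration only: the two knobs that turn GLaDOS into Ted.
# The actual configuration format in the GLaDOS repo is different.
SYSTEM_PROMPT = "You are TED, the living teddy bear, now with neural input. ..."
VOICE_MODEL = "en_US-joe-medium.onnx"  # Joe's voice from Piper Voices (assumed file name)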
The last part is to implement a thread that continuously reads serial input from your computer’s COM port, which I’ll explain in the next section. All the changes can be found here on my GitHub.
If you like what you’ve read so far, consider subscribing to my newsletter to be the first to receive cool articles like this.
Implement the touching

As mentioned before, the original bear was fitted with 4 buttons, one in each of its limbs, and a central microcontroller with a speaker and a battery.
To implement the touching, at first I wanted to boot custom firmware on the existing board. However, reverse engineering the pinout and checking whether the firmware has signature verification and so on is very time consuming. *Sorry Binh, I have to kill your hardware knick-knacks again.
Therefore, I just swapped the board for an ESP32 and wired it to the 4 buttons.
Afterwards, I wrote a simple Arduino script that implements some button debouncing and serial output, and voila, the bear can now sense touches.
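My actual firmware is a plain Arduino (C++) sketch, but the same idea in MicroPython, which keeps everything in Python, looks roughly like the code below. The pin numbers are placeholders, and I’m assuming each button connects its pin to ground when pressed.

# Rough MicroPython equivalent of the Arduino sketch: debounce four buttons and
# report which limb was touched over the USB serial port. Pin numbers are placeholders.
import time
from machine import Pin

BUTTONS = {
    "left hand":  Pin(25, Pin.IN, Pin.PULL_UP),
    "right hand": Pin(26, Pin.IN, Pin.PULL_UP),
    "left foot":  Pin(32, Pin.IN, Pin.PULL_UP),
    "right foot": Pin(33, Pin.IN, Pin.PULL_UP),
}
DEBOUNCE_MS = 50

last_state = {name: 1 for name in BUTTONS}   # 1 = released (pull-up), 0 = pressed
last_change = {name: 0 for name in BUTTONS}

while True:
    now = time.ticks_ms()
    for name, pin in BUTTONS.items():
        state = pin.value()
        if state != last_state[name] and time.ticks_diff(now, last_change[name]) > DEBOUNCE_MS:
            last_change[name] = now
            last_state[name] = state
            if state == 0:  # report only the press, not the release
                # The PC-side thread wraps this line as "I'm being touched on my ..."
                print("being touched on my " + name)
    time.sleep_ms(10)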
The most non-trivial part is making this work with GLaDOS, since it didn’t have any method of communication other than talking. To solve this, I implemented another thread that continuously polls for serial input and adds the input directly to GLaDOS’s LLM message queue.
# The code for implementing a new thread for GLaDOS
# (relies on pyserial for serial, sounddevice for sd, and loguru for logger, all
#  imported at module level; port and baud_rate hold the serial port name and speed)
def process_Touch(self):
    with serial.Serial(port, baud_rate) as ser:
        while not self.shutdown_event.is_set():
            if ser.in_waiting > 0:
                # The ESP32 sends one line per touch describing where the bear was touched
                position = ser.readline().decode('utf-8').strip()
                if self.processing:
                    # Interrupt any response that is currently being spoken
                    self.processing = False
                    sd.stop()
                logger.success(f"I'm {position}.")
                # Feed the touch event into the LLM queue as if the user had said it
                self.llm_queue.put(f"I'm {position}.")
                self.processing = True
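The new thread then just needs to be started alongside the existing speech-to-text, LLM and text-to-speech threads during start-up. The exact wiring in my fork differs, but the idea is simply:

# Hypothetical wiring inside the start-up code (threading is already imported there):
touch_thread = threading.Thread(target=self.process_Touch, daemon=True)
touch_thread.start()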
Voila! Your teddy bear came to life

In the end, we have a talking teddy bear that is very sassy and offensive when you touch him.