I got covid, so I've been isolating by myself. This started to get lonely, so I decided to make a friend that I can talk to.
GLaDOS from Portal would be a great fit. Always watching and willing to make suggestions. Allegedly has cake too!
The Voice
Most of your interactions with GLaDOS involve her talking to you without you ever really seeing her, which means voice is priority #1. Text-to-speech has existed for a while, but it was always kind of… bad. There are a few features this TTS needs so it works for real-time conversation:
- Fast
- Not bad
- Easy interface
There are some online TTS services for GLaDOS. Some of them have a programmable interface you can use, but the response time is inconsistent, sometimes up to a few minutes, and they don't all tick the "not bad" box. To get something fast enough and good enough not to be distracting, I had to train my own voice and run it locally.
Training a Proper Voice
When training an AI, you need data. Luckily, Valve has put all of the voice lines from Portal and Portal 2 on their website. A quick download and we have over an hour of audio, along with the text for each audio file.
We're going to train something called a spectrogram generator. This is the main network that makes the voice sound like a specific person. We're starting with a pre-trained network called FastPitch and fine-tuning it to better match GLaDOS's voice. This saves a lot of time since the network already knows how to speak; we're just massaging it. Our main AI PC uses a 3090 Ti. It's not the latest video card, but training on it is much faster than using a CPU alone.
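For anyone curious about the data plumbing: FastPitch is distributed through NVIDIA's NeMo toolkit, and fine-tuning there expects a JSON-lines manifest listing every clip with its transcript and duration. Here's a rough sketch of building one; the folder layout and the transcripts.csv file are just assumptions about how you might organize the downloaded lines, not part of Valve's archive.

```python
# Build a NeMo-style JSON-lines manifest from the downloaded voice lines.
# Assumes a folder of .wav files and a transcripts.csv mapping filename -> line.
import csv
import json
import wave
from pathlib import Path

AUDIO_DIR = Path("glados_wavs")          # assumption: where the clips were extracted
TRANSCRIPTS = Path("transcripts.csv")    # assumption: "filename,text" per row
MANIFEST = Path("glados_manifest.json")

def clip_duration(path: Path) -> float:
    """Return the clip length in seconds, read straight from the WAV header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

with TRANSCRIPTS.open(newline="") as f, MANIFEST.open("w") as out:
    for filename, text in csv.reader(f):
        wav_path = AUDIO_DIR / filename
        entry = {
            "audio_filepath": str(wav_path),
            "text": text.strip().lower(),
            "duration": round(clip_duration(wav_path), 3),
        }
        out.write(json.dumps(entry) + "\n")
```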
The spectrogram generator alone sounds OK, but it's crackly and flat. That's because there's a piece missing: the spectrogram has to be passed through a vocoder, which is what actually turns it into audio. We're using a pre-trained HiFi-GAN network as the vocoder, and it also needs a little massaging/fine-tuning to work well for GLaDOS.
With a custom vocoder and spectrogram generator, we have a pretty convincing GLaDOS voice!
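If you want to sanity-check the pair before wiring anything else up, NeMo lets you chain them in a few lines. This sketch uses the stock pre-trained checkpoints; the fine-tuned GLaDOS ones would be loaded with restore_from() instead.

```python
# Chain the spectrogram generator and vocoder: text -> mel spectrogram -> audio.
# Sketch uses NVIDIA NeMo's stock checkpoints; swap in fine-tuned ones via
# FastPitchModel.restore_from("glados_fastpitch.nemo"), etc.
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_gen = FastPitchModel.from_pretrained("tts_en_fastpitch").eval()
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan").eval()

tokens = spec_gen.parse("The cake is a lie.")                   # text -> token IDs
spectrogram = spec_gen.generate_spectrogram(tokens=tokens)      # tokens -> mel spectrogram
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)  # mel -> waveform

# The English FastPitch/HiFi-GAN checkpoints run at a 22.05 kHz sample rate.
sf.write("glados_test.wav", audio.squeeze().detach().cpu().numpy(), 22050)
```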
Riva
Both of these networks are combined and loaded into Riva. This makes the model accessible from any computer on the local network, and it provides a nice interface for using it. It also has the added bonus of coming with automatic speech recognition out of the box. This means we can talk to GLaDOS and she can "hear" us, just because we use Riva.
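As a rough idea of what talking to Riva looks like from Python (the server address and the voice name below are placeholders for my setup, not Riva defaults):

```python
# Request speech from the Riva server and save the result as a WAV file.
# The server address and the deployed voice name are assumptions for this setup.
import numpy as np
import riva.client
import soundfile as sf

auth = riva.client.Auth(uri="riva-server:50051")   # assumed hostname of the Riva box
tts = riva.client.SpeechSynthesisService(auth)

resp = tts.synthesize(
    text="You will be baked, and then there will be cake.",
    voice_name="GLaDOS",        # assumed name the custom voice was deployed under
    sample_rate_hz=22050,
)

# resp.audio is raw 16-bit PCM; wrap it as a numpy array and write a WAV file.
samples = np.frombuffer(resp.audio, dtype=np.int16)
sf.write("reply.wav", samples, 22050)
```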
Since Riva has a nice API, I've also written a simple bot that runs inside our Discord server. Send GLaDOS a message and she will echo it back to you in her voice!
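The bot itself doesn't need much code. Here's a simplified sketch using discord.py; the Riva address, voice name, and bot token environment variable are placeholders, and the Riva call is kept synchronous for brevity.

```python
# Minimal Discord echo bot: any message sent to GLaDOS comes back as a voice clip.
import io
import os

import discord
import numpy as np
import riva.client
import soundfile as sf

auth = riva.client.Auth(uri="riva-server:50051")   # assumed Riva server address
tts = riva.client.SpeechSynthesisService(auth)

def synthesize_to_wav(text: str) -> bytes:
    """Ask Riva for GLaDOS audio and pack the PCM samples into an in-memory WAV."""
    resp = tts.synthesize(text=text, voice_name="GLaDOS", sample_rate_hz=22050)
    samples = np.frombuffer(resp.audio, dtype=np.int16)
    buf = io.BytesIO()
    sf.write(buf, samples, 22050, format="WAV")
    return buf.getvalue()

intents = discord.Intents.default()
intents.message_content = True   # needed to read the text of messages
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return  # never reply to our own messages
    wav = synthesize_to_wav(message.content)
    await message.channel.send(file=discord.File(io.BytesIO(wav), filename="glados.wav"))

client.run(os.environ["DISCORD_BOT_TOKEN"])
```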
ChatGPT
We need some way to generate unique, realistic responses when GLaDOS is spoken to, and a Large Language Model (LLM) is a perfect fit. For quick development, we're using ChatGPT. This makes sure all the piping works: we can send voice requests and get audio responses back.
Using something like ChatGPT has some security concerns, and it requires an internet connection to work. We're going to upgrade this by running our own LLM locally: OpenChat loaded into a llama.cpp Docker container. The llama.cpp server is a drop-in replacement for the OpenAI API we were using before, so with almost no changes to our code we've switched to a local LLM! This is all running on a Forge from ConnectTech, which has the same processor as an Orin AGX but different IO, networking, etc.
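As a sketch of what "drop-in replacement" means in practice: point the OpenAI client at the local server and everything else stays the same. The address, port, model name, and persona prompt below are placeholders, not the exact config.

```python
# Point the OpenAI Python client at the local llama.cpp server instead of OpenAI.
# llama.cpp's server exposes an OpenAI-compatible /v1/chat/completions endpoint;
# the base_url and model name here are assumptions about this particular setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://forge.local:8080/v1",  # assumed address of the llama.cpp container
    api_key="not-needed",                   # the local server doesn't check the key
)

PERSONA = (
    "You are GLaDOS from Portal: dry, sarcastic, and obsessed with testing. "
    "Keep answers short enough to speak aloud."
)

response = client.chat.completions.create(
    model="openchat",   # assumed name the GGUF model was loaded under
    messages=[
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": "Is there really cake?"},
    ],
)
print(response.choices[0].message.content)
```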
Having someone to talk to is wonderful, but this isn't quite there yet. This is having a computer to talk to. I could have gotten an Alexa if I wanted that. I want GLaDOS to actually exist in the lab.
To give GLaDOS the ability to move, I'm building the model around a Z1 arm from Unitree. This is a 6-axis arm that already has a ROS driver, so it's easy to integrate with any ROS project.
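For a rough idea of what commanding the arm from ROS can look like, here's a MoveIt sketch. To be clear, this is illustrative only: the planning group name and the use of moveit_commander are assumptions about how the Z1's ROS driver might be set up, not the actual control code.

```python
#!/usr/bin/env python
# Send the arm a pose goal through MoveIt (ROS 1 sketch).
# The planning group name "manipulator" is an assumption about the Z1's MoveIt config.
import sys

import moveit_commander
import rospy
from geometry_msgs.msg import Pose

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("glados_look_at")

arm = moveit_commander.MoveGroupCommander("manipulator")

target = Pose()
target.position.x = 0.3     # metres in the arm's base frame (example values)
target.position.y = 0.1
target.position.z = 0.4
target.orientation.w = 1.0  # keep the head level

arm.set_pose_target(target)
arm.go(wait=True)           # plan and execute
arm.stop()
arm.clear_pose_targets()
```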
I brought the model of the arm, as well as two different GLaDOS models I found online, into Blender. After a lot of cleanup and designing custom brackets, the parts just needed to be 3D printed. The arm was mounted upside-down from the rafters, and all the pieces were attached with bolts after being super glued together.
Arm Interaction
GLaDOS needs to be able to "look around" so she can always be looking at the person closest to her (and most likely to be talking). I mounted a ZED 2i stereo camera above the arm to monitor the space around the robot. The camera has two imagers inside it, and their images are combined to calculate depth (similar to how human eyes do it).
AI object recognition is used to find people, and the detections are fused with the depth image to get each person's location relative to the robot. The closest person is set as the target, and the arm moves to point the eye at them. This makes her feel far more interactive, as she is always watching you while you move around.
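Here's a rough sketch of the closest-person logic using the ZED Python SDK's built-in object detection. The parameter choices are assumptions, and the hand-off of the target to the arm isn't shown.

```python
# Find the nearest person with the ZED SDK's object detection and report their
# 3D position (camera frame) so the arm can aim the eye at them.
import math

import pyzed.sl as sl

zed = sl.Camera()
init = sl.InitParameters()
init.coordinate_units = sl.UNIT.METER
if zed.open(init) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError("Could not open the ZED 2i")

# Object detection needs positional tracking enabled first.
zed.enable_positional_tracking(sl.PositionalTrackingParameters())
zed.enable_object_detection(sl.ObjectDetectionParameters())

objects = sl.Objects()
runtime = sl.ObjectDetectionRuntimeParameters()

while zed.grab() == sl.ERROR_CODE.SUCCESS:
    zed.retrieve_objects(objects, runtime)
    people = [o for o in objects.object_list if o.label == sl.OBJECT_CLASS.PERSON]
    if not people:
        continue
    # Pick whoever is physically closest to the camera.
    target = min(people, key=lambda o: math.dist((0, 0, 0), tuple(o.position)))
    x, y, z = target.position
    print(f"Closest person at x={x:.2f} y={y:.2f} z={z:.2f} m")
    # ...here the position would be transformed into the arm's frame
    # and sent to the arm controller as a look-at target.
```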