Don’t you hate it when you need to find the thingamajig, doodad, doohickey or doojigger that you know you have, but rarely use, so you have no idea where it is?
I DO!
I try to be organized, but I have too many things, in too many places, and even though I started a database (OK, a Spreadsheet) of “Where Things Are”, it’s not usually with me when I put things away, so it’s rarely used.
Ever since I first saw the Star Trek Communicator Badge, I've wished I had one, or something like it. I did consider that even if nobody else had one to communicate with, at least I could use it to talk to my computer. Alas, until recently it would have been too cumbersome. But now it's possible.
These days AI gives us very good Speech To Text (STT), NLP and LLMs can be used to make sense of the resulting text, and Text To Speech (TTS) can produce natural(ish)-sounding replies, closing the Human-Machine interaction loop.
It has been possible for a long time now to talk to Alexa, Google (BooBoo 🙂), etc. and find out lots of information about pretty much anything, but not, as far as I know, to make lists/tables of personal items and their locations in the home. And while using the APIs of these services might make that possible, anything can happen to online services (not to mention potential privacy needs/wants).
So, that's what my project is about: a Star-Trek-ish badge that I can have with me all day. I can tell it where things are, it will save that information in a database, and later I can ask it where said things are and it will tell me. I can also open the "database" with any text editor to browse/process/edit it with ease.
Mission (mostly) accomplished! Let’s get into it.
Here is a demo of the result. You can see the terminal windows in which the 3 programs run, and the "database" file updating in real time on the right. You can hear the initial prompt when the system starts, then my voice asking/telling and the replies, as well as the changes in the database.
The Star-Trekish (very ish) Badge (henceforth STB) is based on an ESP32 Dev board. I’m not good with mechanicals/enclosures so it’s currently just a collection of things hanging together.
I chose a "Wemos Lolin32 Lite" because it has battery management and a battery connector and pretty much everything else I need. I added an I2S amplifier (a MAX98357 breakout board, which converts the digital I2S signal to analog and drives a loudspeaker) and an I2S microphone (an INMP441 breakout board). I had used the same hardware for a "Home Assistant" "Voice Assistant", so I knew it works.
The ESP32 software is written on the Arduino platform and uses two sockets: one for sending audio from the microphone to the server running on the PC while the single button is pushed (PTT = Push To Talk), and one for playing audio sent from the PC through the server (always listening while the button is not pushed). I used Bing to help me with the code.
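To make the socket protocol concrete, here is a minimal sketch (in Python, since the PC side is Python anyway) of a test client that emulates the badge's PTT stream by pushing raw PCM from a WAV file to the audio-receiving server. The host, port and audio format are assumptions for illustration, not the exact values from my code.

```python
# badge_emulator.py - hypothetical test client that imitates the STB's PTT audio stream.
# Host, port and audio format (16-bit mono PCM) are assumptions, not my real settings.
import socket
import wave

AUDIO_SERVER = ("192.168.1.100", 5000)  # assumed address of the audio-receiving server
CHUNK_FRAMES = 512                       # send audio in small chunks, like the real badge

def stream_wav(path: str) -> None:
    """Open a WAV file and stream its raw PCM frames to the server over TCP."""
    with wave.open(path, "rb") as wav, socket.create_connection(AUDIO_SERVER) as sock:
        while True:
            frames = wav.readframes(CHUNK_FRAMES)
            if not frames:
                break
            sock.sendall(frames)
    # closing the socket stands in for "button released" / end of utterance

if __name__ == "__main__":
    stream_wav("where_is_the_screwdriver.wav")
```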
The PC runs multiple programs:
1. An audio-receiving server (written in Python) waits to receive audio from the STB and saves it to an incrementally named file.
2. The main program (Python) watches the received-audio folder for new files and sends them to an instance of Whisper STT. It receives the transcribed text and processes it with spaCy to determine whether we are creating a new record or interrogating the database, extracting the Thing we are talking about, a Room, and further Location information inside the room.
3. An STT server running Whisper that turns the speech audio into text and returns it to the calling program. It relies on the Ryzen AI Software to be fast and accurate enough. To set it up, I installed the Ryzen AI Software first and then followed Whisper's GitHub instructions (see the transcription sketch after this list).
4. An audio-sending server waits for a new file in the to-send-audio folder and sends it to the STB, where it is played to the user.
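As a rough illustration of the STT step, this is how the plain open-source openai-whisper package transcribes a file in Python. My actual STT server goes through the Ryzen AI Software stack and its own model/runtime, so treat the model name and options below as placeholders.

```python
# stt_sketch.py - illustrative transcription with the open-source openai-whisper package.
# My real STT server uses the Ryzen AI Software; the model name here is a placeholder.
import whisper

model = whisper.load_model("base.en")   # assumed model; pick whatever fits your hardware

def transcribe(wav_path: str) -> str:
    """Return the transcribed text for one received audio file."""
    result = model.transcribe(wav_path, fp16=False)  # fp16=False avoids a warning on CPU
    return result["text"].strip()

if __name__ == "__main__":
    print(transcribe("received_audio/utterance_0001.wav"))
```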
Initially I wrote code that can do CRUD operations (create, read, update and delete) on a MongoDB database, and that works very well, but then I realized that a human-readable format would be much easier to use in my application, so I settled on JSON for the database.
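To give an idea of what the JSON "database" looks like, here is a minimal sketch of a record and the create/read operations on it. The field names and file name are my guesses for illustration, not necessarily the ones used in my real file.

```python
# db_sketch.py - minimal sketch of the JSON "database"; field names and file name
# are assumptions, not necessarily what my actual code uses.
import json
from pathlib import Path

DB_FILE = Path("things.json")  # assumed filename

def add_record(thing: str, room: str, location: str) -> None:
    """Create: append one record to the JSON database."""
    records = json.loads(DB_FILE.read_text()) if DB_FILE.exists() else []
    records.append({"thing": thing, "room": room, "location": location})
    DB_FILE.write_text(json.dumps(records, indent=2))

def find_thing(thing: str) -> list:
    """Read: return all records matching a thing's name."""
    records = json.loads(DB_FILE.read_text()) if DB_FILE.exists() else []
    return [r for r in records if r["thing"].lower() == thing.lower()]

if __name__ == "__main__":
    add_record("multimeter", "garage", "top shelf above the workbench")
    print(find_thing("multimeter"))
```

Because it is plain JSON, the same file can be opened in any text editor, which is exactly the human-readable property I was after.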
Similarly, I first wrote code that uses an LLM to extract the Thing, Room and Location from the received text and turn them into JSON (which can also be used to talk to MongoDB), but the limited model size I could run on the available hardware meant that, while it sometimes worked, it often hallucinated, so it wasn't reliable. LLMs can be much more powerful in general, but they are also much more resource hungry, and for this application I found that spaCy was ideal.
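This is roughly the kind of spaCy-based extraction I mean; my actual rules are more involved, and the noun-chunk heuristics below are a simplified sketch rather than my real code.

```python
# nlp_sketch.py - simplified sketch of spaCy-based Thing/Room/Location extraction.
# My actual extraction logic is more involved; these heuristics are for illustration.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def extract(text: str) -> dict:
    """Pull a rough Thing / Room / Location out of a sentence like
    'The multimeter is in the garage on the top shelf'."""
    doc = nlp(text)
    chunks = list(doc.noun_chunks)
    thing = chunks[0].root.lemma_ if chunks else None                    # first noun chunk = the Thing
    places = [c.text for c in chunks[1:] if c.root.head.dep_ == "prep"]  # objects of prepositions
    room = places[0] if places else None
    location = places[1] if len(places) > 1 else None
    return {"thing": thing, "room": room, "location": location}

if __name__ == "__main__":
    print(extract("The multimeter is in the garage on the top shelf"))
```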
I had many difficulties getting this system to run.
Installing the Ryzen AI Software and getting Whisper to run fast and accurately enough was quite an adventure that, I'm sure, should be easier next time. Many trials were run with different models, environments, etc. I'm glad I found a combination that runs comfortably in real time.
The hardware badge was previously used in another project, but that initial design wasn't easy either. I would have preferred to use an analog microphone with AGC/compression but couldn't get that to work, so I tried the I2S microphone, which worked well enough. I might try analog again at some point for increased pickup distance.
Getting the Arduino program for the ESP32 to capture/send and receive/play audio through sockets took quite a while, only to find that I had to change the sampling rate and add file headers for the saved files to be playable/processable.
The code to receive audio through a socket and save it to a file started out easier, until I realized the file needs a header that includes the length, so Bing was again very helpful.
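For anyone hitting the same wall: the raw PCM coming off the socket has no header, so players and Whisper can't interpret it. A minimal fix (a sketch assuming 16 kHz, 16-bit mono audio and hypothetical file names) is to let Python's wave module write the RIFF/WAV header, including the length fields, for you:

```python
# wav_fix_sketch.py - wrap raw PCM (as received from the badge) in a proper WAV header.
# Sample rate, sample width and channel count are assumptions; match them to the badge firmware.
import wave

def pcm_to_wav(raw_path: str, wav_path: str, rate: int = 16000) -> None:
    """Read headerless 16-bit mono PCM and rewrite it as a playable WAV file."""
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as out:
        out.setnchannels(1)        # mono
        out.setsampwidth(2)        # 16-bit samples
        out.setframerate(rate)     # must match the badge's sampling rate
        out.writeframes(pcm)       # the wave module fills in the length fields

if __name__ == "__main__":
    pcm_to_wav("utterance_0001.pcm", "utterance_0001.wav")
```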
The main code was getting a bit unwieldy and repetitive, so I separated the NLP into its own file and moved a few functions into another "library".
I’m not a software developer so I’m sure the code would be written very differently by one :-)