Imagine you have a product that needs to react to a very specific keyword, and you're planning to sell millions of units worldwide. The problem is you don't speak French, Chinese, or Spanish, and hiring people to provide voice samples would be expensive and time-consuming. This is where generative AI comes in. In this article, I'll demonstrate how to use Edge Impulse to generate voice samples, train a keyword spotting model, and deploy it to the tiny ESP32-S3 based development board from Seeed Studio, the XIAO ESP32S3 (Sense).
Below is the accompanying video for this article. Make sure to watch it as well, since the article and video are meant to complement each other (the video ends with deployment to the Arduino RP2040 Connect, so that part is different and elaborated below).
To get started, we’ll use the Whisper API for text-to-speech. Whisper provides high-quality sound in various voices and supports multiple languages. First, create a new project in Edge Impulse. Once you’re in, head to the Data Acquisition tab and choose Synthetic data. You will need either a Pro Tier subscription or an Enterprise Plan; you can sign up for a free 14-day Enterprise Trial here. Additionally, you will need to enter your OPENAI_API_KEY in the Organization -> API Keys tab.
NB: There is no language selection here or in the Whisper API. Instead, you simply enter the words in the language of your choice. For Whisper to recognize the language correctly, you might need to enter a few extra words and then cut the unnecessary ones out afterwards, e.g. "Hospital (español)" for the Spanish pronunciation of the word hospital.
Next, generate voice samples for the labels of your choice. I picked stop (停), forward (前进), back (撤销), left (左转), and right (右转), commands suitable for a mobile robot platform. Enter each word, leave the other parameters at their default settings, and generate some samples. You’ll find that the generated samples are high quality, probably better than your own pronunciation if you’re not a native speaker.
Since we’re using Edge Impulse’s few-shot keyword spotting feature, we don’t need many samples. Generating about 50 samples for each label should be sufficient; for more robust results, aim for 100 samples per class. After creating all the samples, rebalance the classes if necessary, making sure there is an equal number of samples in each class.
Additionally, and this is very important, you need an "unknown" class that includes other words and background noise. You can either use samples from an existing dataset or generate them using the ElevenLabs Synthetic Sounds generator.
Pick MFE as the DSP block and Transfer Learning (Keyword Spotting) as the learning block.
Choose a smaller model (alpha 0.1) and reduce the validation percentage to 0.2; the other training parameters can be left at their defaults. You should get around 90% accuracy, both on validation and testing.
Before you start with deployment, make sure you have the XIAO ESP32S3 (Sense) set up in the Arduino IDE, following this wiki article. It is advised to use the 2.x version of the Arduino core for ESP32, not 3.x.
For deployment, we can use Edge Impulse's Arduino library deployment option (find it using the search function in the Deployment tab). Download the Arduino library, then open the microphone example sketch in the Arduino IDE (go to Examples -> name of your project in Edge Impulse -> esp32 -> esp32_microphone). We will need to slightly modify the sketch to fit the specifics of the XIAO ESP32S3 (Sense).
You can find the sketch in the article attachments; mainly the pin_config and i2s_port_t were modified. The instructions for the microphone sketch modification were taken from this forum post.
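For reference, the relevant changes look roughly like the following. This is a sketch fragment, not a drop-in replacement: it assumes the ESP32 Arduino core 2.x (legacy driver/i2s.h API), and the pin assignments (PDM clock on GPIO 42, PDM data on GPIO 41) are taken from the Seeed wiki for the XIAO ESP32S3 (Sense) on-board microphone. Check the fields against your generated example sketch before applying.

```cpp
// Hedged sketch of the XIAO ESP32S3 (Sense) microphone modifications,
// assuming ESP32 Arduino core 2.x and the legacy I2S driver API.
#include <driver/i2s.h>

// The stock esp32_microphone example may use a different port number.
static const i2s_port_t i2s_port = I2S_NUM_0;

static i2s_config_t i2s_config = {
    // The on-board mic is a PDM microphone, so RX + PDM mode.
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX | I2S_MODE_PDM),
    .sample_rate = 16000,                      // must match the impulse's sampling rate
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = 0,
    .dma_buf_count = 8,
    .dma_buf_len = 512,
    .use_apll = false,
};

static i2s_pin_config_t pin_config = {
    .mck_io_num = I2S_PIN_NO_CHANGE,   // field exists in core 2.x (IDF 4.4)
    .bck_io_num = I2S_PIN_NO_CHANGE,   // PDM has no bit clock
    .ws_io_num = 42,                   // PDM clock pin on the Sense board
    .data_out_num = I2S_PIN_NO_CHANGE, // capture only, no output
    .data_in_num = 41,                 // PDM data pin on the Sense board
};
```

The buffer sizes above are illustrative; the generated example's DMA settings can usually be kept as-is, since only the port and pin configuration need to change.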
Additionally, enable PSRAM in the Arduino IDE Tools menu. If you use the EON Compiler option for deployment, you may also need to increase EI_MAX_OVERFLOW_BUFFER_COUNT in /Arduino/libraries/[name-of-your-project]/src/edge-impulse-sdk/porting; you can find the correct value with a bit of trial and error. This last point is necessary because the Arduino library deployment is not geared specifically toward the ESP32-S3, and the arena size needed to utilize the ESP32 neural network optimizations is slightly larger than the default one.
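The change itself is a one-line edit to the existing definition in the porting headers. The file name and the value below are illustrative only (the exact header varies by SDK version, and the right value has to be found by trial and error, as noted above):

```cpp
// In /Arduino/libraries/[name-of-your-project]/src/edge-impulse-sdk/porting/
// (e.g. in the porting header that already defines this macro), bump the
// existing definition. The value 10 is purely illustrative; increase it
// until the model allocates its arena successfully.
#define EI_MAX_OVERFLOW_BUFFER_COUNT 10
```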
After that, upload the sketch and try it out!
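If you want the board to actually drive a robot rather than just print predictions, you need to turn the per-class scores into a command. The helper below is a simplified, self-contained stand-in: the ClassScore struct mimics the label/value pairs the Edge Impulse classifier returns (in the real sketch you would read them from ei_impulse_result_t after run_classifier), and the threshold value is an assumption to tune on your device.

```cpp
#include <cstring>
#include <cstddef>

// Minimal stand-in for the per-class output of the Edge Impulse classifier
// (label + confidence); in the real sketch these come from ei_impulse_result_t.
struct ClassScore {
    const char *label;
    float value;
};

// Return the label of the highest-scoring class if it clears `threshold`
// and is not the "unknown" class; otherwise return nullptr (no command).
const char *pick_command(const ClassScore *scores, size_t n, float threshold) {
    const ClassScore *best = nullptr;
    for (size_t i = 0; i < n; i++) {
        if (!best || scores[i].value > best->value) {
            best = &scores[i];
        }
    }
    if (!best || best->value < threshold) return nullptr;
    if (std::strcmp(best->label, "unknown") == 0) return nullptr;
    return best->label;
}
```

In loop(), you would call this on the classification results and map "forward", "back", "left", "right", and "stop" to your motor driver; the "unknown" class and low-confidence detections are deliberately ignored so background speech does not move the robot.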