The Chaihuo Makerspace showcases a wide array of innovative products and projects. However, the absence of front desk personnel has resulted in a lack of personalized introductions and guidance for incoming visitors. To address this issue, I have developed an intelligent voice recognition system that serves as an interactive tour guide. This system is designed to provide detailed explanations of the products and projects on display, as well as to offer navigational assistance to guests exploring the Chaihuo Makerspace.
The core of the system is the XIAO ESP32S3 microcontroller, which is integrated with the Edge Impulse platform to facilitate advanced speech recognition capabilities. When visitors issue voice commands, the system promptly recognizes them and executes the appropriate responses. By implementing this intelligent voice guide, the Chaihuo Makerspace can significantly enhance the visitor experience.
Introduction to Edge Impulse
Edge Impulse is an innovative platform tailored for the rapid development of machine learning models for edge devices and embedded systems. It arms developers with a robust toolkit and services that simplify the creation, training, and deployment of these models, all without necessitating an in-depth knowledge of machine learning theory.
The platform is equipped with user-friendly data collection utilities that streamline the process of gathering data from diverse sensors and devices. This data is then effortlessly uploaded to the Edge Impulse platform for efficient management and labeling. Advanced preprocessing and feature extraction algorithms are also at hand, automatically converting raw data into meaningful features that are essential for training accurate models.
Once a model is fully trained, Edge Impulse simplifies deployment to a spectrum of edge devices and embedded systems, including popular options like Arduino, Raspberry Pi, and various microcontrollers. Deployment methods are flexible, with options to generate optimized C++ code, binaries, or tailored SDKs.
One of Edge Impulse's standout features is its accessibility. The platform's intuitive graphical interface and guided workflow empower even novices in machine learning to quickly achieve proficiency and craft high-caliber models. A wealth of tutorials, sample projects, and a supportive community further facilitate learning and knowledge exchange. Edge Impulse's seamless integration with numerous hardware platforms and sensor ecosystems also accelerates the deployment of machine learning capabilities on edge devices.
In summary, Edge Impulse is a formidable platform that demolishes the entry barriers to machine learning, enabling developers of all levels to efficiently create and deploy sophisticated intelligent applications on edge devices. It stands as a versatile ally for both novices and seasoned professionals aiming to forge ahead in the realms of IoT and embedded intelligence.
XIAO ESP32S3 Sense Introduction
Features:
Powerful MCU board: integrated ESP32S3 32-bit dual-core Xtensa processor running at up to 240 MHz, multiple development ports, Arduino / MicroPython support
Advanced features: detachable OV2640 camera sensor with 1600×1200 resolution, compatible with the OV5640 camera sensor, plus an additional onboard digital microphone
Large memory for more possibilities: 8MB PSRAM and 8MB flash on board, plus an SD card slot supporting up to 32GB of external FAT storage
Outstanding RF performance: supports 2.4GHz Wi-Fi and BLE dual wireless communication, with 100m+ range when connected to the U.FL antenna
Thumb-sized compact design: 21 × 17.5mm in XIAO's classic form factor, for space-constrained projects such as wearables
Capture (local) audio data
Step 1. Save recorded sound samples as .wav audio files to a microSD card.
To save .wav audio files with the onboard SD card reader, we first need to enable the XIAO's PSRAM (in the Arduino IDE, set Tools > PSRAM to "OPI PSRAM").
Then compile and upload Sketch 1 to the XIAO ESP32S3.
After uploading the code to the XIAO, collect samples of the keywords (hello and others). You can also capture noise and other words. The Serial Monitor will prompt you for the label to be recorded.
Send the label (for example, hello). The program will wait for another command: rec.
The program will record a new sample every time the command rec is sent. The files are saved as hello.1.wav, hello.2.wav, hello.3.wav, and so on, until a new label (for example, stop) is sent. From then on, sending rec for each new sample saves files as stop.1.wav, stop.2.wav, stop.3.wav, and so on.
Ultimately, we will get the saved files on the SD card.
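For reference, here is a condensed sketch of what Sketch 1 does, written against the ESP_I2S and SD APIs shipped with recent arduino-esp32 cores. The microphone pins (42/41), the SD chip-select pin (21), and the recordWAV() call are assumptions taken from the XIAO ESP32S3 Sense documentation, so treat this as a sketch rather than a drop-in replacement for Sketch 1.

```cpp
// Condensed voice-sample recorder: send a label (e.g. "hello"), then send
// "rec" once per sample; each sample is saved as <label>.<n>.wav on the SD card.
#include "ESP_I2S.h"
#include "FS.h"
#include "SD.h"

I2SClass I2S;
String label = "";
int counter = 0;

void setup() {
  Serial.begin(115200);
  while (!Serial);

  // PDM microphone pins on the XIAO ESP32S3 Sense (assumed: clock = 42, data = 41)
  I2S.setPinsPdmRx(42, 41);
  if (!I2S.begin(I2S_MODE_PDM_RX, 16000, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("Failed to initialize I2S");
    while (1);
  }
  if (!SD.begin(21)) {  // SD chip-select pin on the XIAO ESP32S3 Sense expansion board
    Serial.println("Failed to mount the SD card");
    while (1);
  }
  Serial.println("Send a label (e.g. hello), then send 'rec' for each sample");
}

void loop() {
  if (!Serial.available()) return;
  String cmd = Serial.readStringUntil('\n');
  cmd.trim();

  if (cmd == "rec") {
    if (label == "") {
      Serial.println("Send a label first");
      return;
    }
    size_t wav_size = 0;
    // Record 10 seconds of audio and get back a complete WAV file in memory
    uint8_t *wav = I2S.recordWAV(10, &wav_size);
    String path = "/" + label + "." + String(++counter) + ".wav";
    File f = SD.open(path.c_str(), FILE_WRITE);
    if (f && f.write(wav, wav_size) == wav_size) {
      Serial.println("Saved " + path);
    } else {
      Serial.println("Write failed for " + path);
    }
    if (f) f.close();
    free(wav);
  } else if (cmd.length() > 0) {
    label = cmd;    // a new label restarts the sample counter
    counter = 0;
    Serial.println("Label set to '" + label + "'. Send 'rec' to record a sample");
  }
}
```

Open the Serial Monitor at 115200 baud, type a label, then send rec for each sample; the resulting label.N.wav files appear in the root of the SD card.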
Training data acquisition
Step 2. Uploading collected sound data
When the raw dataset has been defined and collected, we create a new project in Edge Impulse. Once the project is created, select the Upload Existing Data tool in the Data Acquisition section and choose the files to be uploaded.
Upload them to the Studio (you can have the data split automatically between train and test). Repeat for all classes and all raw data.
All data in the dataset must be 1 second long, but the samples recorded in the previous section are 10 seconds long and must be split into 1-second samples to be compatible. Click the three dots next to the sample name and select Split sample.
Once inside the tool, split the data into 1-second segments, adding or removing segments as necessary.
This procedure should be repeated for all samples.
Step 3. Creating Impulse (Pre-Process / Model definition)
An impulse takes raw data, uses signal processing to extract features, and then uses a learning block to classify new data.
First, we take the data points with a 1-second window, augmenting the data by sliding that window every 500 ms. Note that the Zero-pad data option is set; this is important to fill samples shorter than 1 second with zeros (in some cases, I reduced the 1000 ms window in the split tool to avoid noise and spikes).
Each 1-second audio sample is pre-processed and converted into an image (for example, 13 x 49 x 1). We will use MFCC, which extracts features from audio signals using Mel-Frequency Cepstral Coefficients and works very well for human voice.
Next, we select Keras for classification, which builds our model from scratch, performing image classification with a Convolutional Neural Network.
Step 4. Pre-Processing (MFCC)
The next step is to create the images to be used for training in the next phase. We can keep the default parameter values or take advantage of the DSP Autotune parameters option, which we will do.
Step 5. Model Design and Training
We will use a Convolutional Neural Network (CNN) model. The basic architecture consists of two blocks of Conv1D + MaxPooling (with 8 and 16 filters, respectively) and a 0.25 dropout. The last layer, after flattening, has four neurons, one for each class.
For hyperparameters, we will use a learning rate of 0.005 and train the model for 100 epochs. We will also include data augmentation, such as adding some noise. The result looks OK.
Deploying to XIAO ESP32S3 Sense
Step 6. Deploying to XIAO ESP32S3 Sense
1. After training is complete, click the Deployment option on the left-hand side.
2. Click the search text box and select Arduino library from the pop-up menu.
3. Click the Build button at the bottom to generate and download the library file.
4. After a short wait, a window will pop up indicating that the Arduino library has been generated, and a .zip library file will be downloaded automatically.
5. Add this library to the Arduino IDE (Sketch > Include Library > Add .ZIP Library...).
We can use Sketch 2 to test the model.
The idea of this sketch is that the onboard LED turns ON whenever the keyword HELLO is detected. In the same way, instead of turning on an LED, this could act as a trigger for an external device, as we saw in the introduction.
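As a rough guide to what Sketch 2 contains, the snippet below shows the keyword-trigger logic built on the exported Edge Impulse Arduino library. The header name, the 0.8 confidence threshold, and the LED handling are assumptions, and the microphone-capture step is left to the esp32 microphone example bundled with the exported library.

```cpp
// The header name below is a placeholder: the exported Arduino library is
// named after your Edge Impulse project (e.g. <project-name>_inferencing.h).
#include <your_project_inferencing.h>

#define LED_PIN LED_BUILTIN   // on the XIAO ESP32S3 the user LED is active-low

static int16_t sample_buffer[EI_CLASSIFIER_RAW_SAMPLE_COUNT]; // 1 s of 16 kHz audio

// Callback the classifier uses to pull audio data out of our buffer
static int get_audio_data(size_t offset, size_t length, float *out_ptr) {
  numpy::int16_to_float(&sample_buffer[offset], out_ptr, length);
  return 0;
}

void setup() {
  Serial.begin(115200);
  pinMode(LED_PIN, OUTPUT);
  digitalWrite(LED_PIN, HIGH);  // LED off (active-low)
}

void loop() {
  // Fill sample_buffer from the on-board microphone here; the
  // "esp32 microphone" example bundled with the exported library
  // shows the PDM capture code for this step.

  signal_t signal;
  signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
  signal.get_data = &get_audio_data;

  ei_impulse_result_t result = {0};
  if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK) {
    return;
  }

  // Switch the LED on only when "hello" is detected with high confidence
  bool hello_detected = false;
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (strcmp(result.classification[ix].label, "hello") == 0 &&
        result.classification[ix].value > 0.8f) {
      hello_detected = true;
    }
  }
  digitalWrite(LED_PIN, hello_detected ? LOW : HIGH);  // LOW = LED on
}
```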
Grove MP3 module
Reference code is provided to test whether the MP3 module is working correctly and whether the files on the TF card are correct. The library we need can be downloaded from https://github.com/Seeed-Studio/Seeed_Serial_MP3_Player.
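In case the original reference code is not at hand, a minimal test along the same lines might look like this. It assumes the Grove MP3 V2.0 (KT403A chip) and the library's KT403A class; the UART pins and method names are assumptions and should be checked against the examples bundled with the library (newer module revisions use the WT2003S classes instead).

```cpp
// Minimal MP3 module check: if you hear the first file on the TF card,
// the module, the wiring, and the card contents are all OK.
#include "KT403A_Player.h"

KT403A<HardwareSerial> Mp3Player;

void setup() {
  Serial.begin(115200);
  // Grove MP3 module on the XIAO's hardware UART (assumed wiring: RX = D7, TX = D6)
  Serial1.begin(9600, SERIAL_8N1, D7, D6);
  Mp3Player.init(Serial1);

  Mp3Player.SetVolume(0x1E);       // volume range is roughly 0x00-0x1E
  Mp3Player.SpecifyMusicPlay(1);   // play the first track on the TF card
  Serial.println("Playing track 1");
}

void loop() {
  // Nothing to do here; playback was started in setup()
}
```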
If an error like the following occurs:

```
fatal error: circular_queue.h: No such file or directory
#include <circular_queue.h>
         ^~~~~~~~~~~~~~~~~~
```
You might need to remove the EspSoftwareSerial library via the Library Manager and install version 8.1.0 instead.
Since the volume of the module's AUX audio output cannot be adjusted and is very low, we need to add an amplifier.
Button Control
In noisy environments, the speech recognition system may be disturbed, reducing recognition accuracy. To improve the user experience and system reliability, we can introduce a button control mechanism so that the user can easily manage audio playback with a physical button, even in noisy surroundings. This design not only increases the interactivity of the system, but also ensures that the user can accurately control music playback even with high background noise. By combining button control with speech recognition, we create a more flexible and user-friendly voice playback system. See Sketch 3 for reference; a minimal version of the button logic is shown below.
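The following is a minimal sketch of that button logic with simple debouncing. The button pin is an assumption (a push button wired between D1 and GND), and the comment marks where Sketch 3 would send the actual play/pause command to the MP3 module.

```cpp
// Debounced push button that toggles playback on each press
const int BUTTON_PIN = D1;             // assumed wiring: button between D1 and GND
const unsigned long DEBOUNCE_MS = 50;

bool playing = false;
int lastReading = HIGH;
int stableState = HIGH;
unsigned long lastChange = 0;

void setup() {
  Serial.begin(115200);
  pinMode(BUTTON_PIN, INPUT_PULLUP);   // pressed = LOW
}

void loop() {
  int reading = digitalRead(BUTTON_PIN);
  if (reading != lastReading) {
    lastChange = millis();             // input changed: restart the debounce timer
    lastReading = reading;
  }
  if ((millis() - lastChange) > DEBOUNCE_MS && reading != stableState) {
    stableState = reading;
    if (stableState == LOW) {          // a debounced button press
      playing = !playing;
      Serial.println(playing ? "Resume playback" : "Pause playback");
      // here Sketch 3 would send the MP3 module's play/pause command
    }
  }
}
```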
Multi-threaded Control
Multithreading is a technique that enables concurrent execution in a program. With multithreading, a program can perform multiple tasks at the same time, improving its efficiency and responsiveness. In the button control scenario, if the button logic is embedded directly in the main loop, the button signal is received with a delay, because speech recognition spends a noticeable amount of time recording audio, so the user has to hold the button down for the press to be captured. To solve this problem, we can use multithreading to receive the button signal.
Specifically, we run the reception and processing of the button signal in a separate thread. When the button is pressed, this independent thread responds immediately and executes the corresponding logic, without interference from the speech recognition task in the main loop. In this way, we achieve a fast response to the button signal and improve the user experience.
In conclusion, applying multithreading to button control effectively solves the delayed reception of button signals caused by the speech recognition task, and improves the program's response speed and user experience. Sketch 4 shows how to use multithreaded control on the XIAO ESP32S3; a condensed version appears below.
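Here is a condensed sketch of that approach using a FreeRTOS task, which the ESP32 Arduino core exposes directly. The button pin, stack size, and core assignment are assumptions; the point is that the button task keeps polling while loop() is busy with the slow speech-recognition work.

```cpp
// Button handling in its own FreeRTOS task, so it stays responsive while
// loop() is blocked by audio capture and inference.
const int BUTTON_PIN = D1;             // assumed wiring: button between D1 and GND
volatile bool playing = false;

// Independent task: polls the button every 10 ms regardless of what loop() does
void buttonTask(void *param) {
  int lastState = HIGH;
  for (;;) {
    int state = digitalRead(BUTTON_PIN);
    if (state == LOW && lastState == HIGH) {   // falling edge = button pressed
      playing = !playing;
      Serial.println(playing ? "Resume playback" : "Pause playback");
    }
    lastState = state;
    vTaskDelay(pdMS_TO_TICKS(10));             // yield so other tasks can run
  }
}

void setup() {
  Serial.begin(115200);
  pinMode(BUTTON_PIN, INPUT_PULLUP);
  // Pin the button task to core 0; loop() (and the classifier) runs on core 1
  xTaskCreatePinnedToCore(buttonTask, "button", 4096, NULL, 1, NULL, 0);
}

void loop() {
  // The blocking audio capture + inference from Sketch 2 would run here;
  // the button task above keeps reacting even while this takes a long time.
  delay(1000);  // stand-in for the slow speech-recognition work
}
```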
PIR Digital Sensor
In the final program design, we had to take full account of the working habits and needs of the space's long-term members, to avoid frequent voice announcements interfering with their concentration and efficiency. At the same time, since the project requires the hardware to run for long periods, the continuous accumulation of heat could lead to premature damage to the device and even affect the stability and reliability of the whole project. To achieve the dual goals of saving energy and prolonging device life, we enable the device's sleep mode so that it enters a low-power state during non-working hours, effectively reducing energy consumption and extending the device's lifespan.
The key issue, however, is how to wake the device instantly when needed, to keep the project running smoothly and preserve the members' experience. To this end, we use PIR motion sensing to automatically wake the XIAO ESP32S3 when someone is nearby, realizing intelligent wake-up. This design ensures an immediate response from the device while avoiding unnecessary energy waste, striking a balance between efficiency and energy saving. See Sketch 5 for reference; a minimal wake-up example follows.
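A minimal version of that wake-up behavior, using the ESP32's ext0 deep-sleep wake-up, might look like the following. The PIR pin (GPIO2, i.e. D1) and the awake period are assumptions; the PIR output must be wired to an RTC-capable GPIO for this wake-up source to work.

```cpp
// Sketch of the PIR wake-up: the XIAO sleeps until the PIR output goes HIGH.
#include "esp_sleep.h"

#define PIR_PIN GPIO_NUM_2          // PIR sensor output wired to GPIO2 (D1), assumed
#define AWAKE_TIME_MS 60000         // stay awake for one minute after motion

void setup() {
  Serial.begin(115200);
  pinMode(PIR_PIN, INPUT);

  if (esp_sleep_get_wakeup_cause() == ESP_SLEEP_WAKEUP_EXT0) {
    Serial.println("Woken up by the PIR sensor - visitor nearby");
    // here the voice-guide logic (Sketch 6) would run for AWAKE_TIME_MS
    delay(AWAKE_TIME_MS);
  }

  Serial.println("No motion - going to deep sleep");
  esp_sleep_enable_ext0_wakeup(PIR_PIN, 1);   // wake when the PIR output is HIGH
  esp_deep_sleep_start();                     // execution stops here until wake-up
}

void loop() {
  // never reached: the board restarts setup() after every deep-sleep wake-up
}
```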
Circuit Diagram
The final Sketch 6 combines and iterates on the code above.
In Summary
The project's development was marked by several challenges, predominantly stemming from my initial lack of familiarity with the hardware components. This learning curve inevitably prolonged the project timeline. Moreover, I observed that speech recognition and image recognition have distinct processing demands, which can result in noticeable latency if executed in a single-threaded manner. To address this, I explored the use of multi-threaded processing to optimize system performance. Multithreading enables concurrent processing of multiple tasks, enhancing the control system's responsiveness and ensuring a more fluid and intuitive user interaction.
In bringing this project to fruition, I chose the XIAO ESP32S3 as the core hardware platform. This microcontroller offers formidable processing power and a rich set of peripheral interfaces, making it highly suitable for sophisticated intelligent speech recognition tasks. To empower the system with the capabilities of an intelligent speech guide, I leveraged a speech model trained using the Edge Impulse platform. This model is designed to accurately recognize specific voice commands and execute the appropriate actions in response, thereby delivering the intended interactive functionality.