Inspired by the ShAIdes project, we wanted to build a similar wearable, but with all processing done onboard using tinyML. The project allows the user to control the device they are looking at through voice commands.
Architecture: The project consists of two main components:
- An OpenMV camera running the image classification model
- An ESP32 with an external microphone, running a keyword spotting model to detect voice commands and send them to the intended device via the ESP-NOW protocol
Whenever a known object is detected (in our case, a lamp or a television), the corresponding pin on the OpenMV camera is set high for 2 seconds. If, during this 2-second window, the keyword spotting model running on the ESP32 recognizes a command (in our case, either "On" or "Off"), the recognized command is sent to the selected device using the ESP-NOW protocol.
Image Classification model: We created an image classification model using the Edge Impulse platform and collected around 80 images for each class. There are three classes in total: lamp, television, and unknown.
It is very important to collect pictures of all other appliances in the room/house during the dataset creation and group them under the unknown category.
The MobileNetV2 0.05 model in Edge Impulse was used for transfer learning.
The trained model was then deployed on the OpenMV camera and was able to achieve 4 inferences per second.
Training the Keyword Spotting model: The keyword spotting model was trained in Google Colab to recognize the "On" and "Off" keywords. You can easily retrain the model to recognize other keywords present in the speech_commands dataset.
After training, the model was converted to a TFLite model, and post-training quantization was applied to reduce its size for deployment on embedded devices.
Save this TFLite model to your desktop and run the following command to convert it into a C character array. Keep the generated file, as we will be using it at a later stage.
xxd -i converted_model.tflite > model_data.cc
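For reference, the generated model_data.cc is simply the model serialized as a C array; it should look roughly like this (the byte values and the length shown here are placeholders, yours will differ):

// model_data.cc (generated by xxd -i)
unsigned char converted_model_tflite[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00, 0x00, 0x00,
  // ... thousands more bytes ...
};
unsigned int converted_model_tflite_len = 18712;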
Model training takes around 2 hours to complete, so we have added the trained model_data.cc file to the GitHub repo.
Setting up ESP-IDF: Initially, we tried deploying the keyword spotting model to the ESP32 using the Arduino TensorFlow Lite library for ESP32, but we ran into TFLite Micro version compatibility issues, which always threw the following error:
Didn't find op for builtin opcode 'CONV_2D' version '2'
This error was only observed after post-training quantization. Since we were unable to deploy using the Arduino IDE, we moved on to the official Espressif framework, ESP-IDF.
You can easily install the ESP-IDF extension in Visual Studio Code using this tutorial. Once the development environment is set up, we can move on to building our tinyML application.
Building and Deploying our TinyML Application: Since we will be building our application on top of the TensorFlow micro_speech example, we first need to clone the tflite-micro repository to our local machine.
git clone https://github.com/tensorflow/tflite-micro.git
Move into the cloned tflite-micro folder and run the following command to generate the micro_speech example project for the ESP32.
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=esp generate_micro_speech_esp_project
Now we can open the generated project in VS Code from the following location:
tensorflow/lite/micro/tools/make/gen/esp_xtensa-esp32_default/prj/micro_speech
Now copy the character array from the "model_data.cc" file into the "model.cc" file in the project.
Next, in the micro_model_settings.cc file, change the labels to the ones you have trained the model with. Finally, change the value of kCategoryCount in the micro_model_settings.h file to the total number of labels.
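For our "On"/"Off" model, the edited files could look roughly like the sketch below, assuming the example's default "silence" and "unknown" categories are kept. The labels must be in the same order as the model's output tensor.

// micro_model_settings.h (excerpt): one slot per label the model can output
constexpr int kCategoryCount = 4;
extern const char* kCategoryLabels[kCategoryCount];

// micro_model_settings.cc (excerpt): labels in the model's output order
const char* kCategoryLabels[kCategoryCount] = {
    "silence",
    "unknown",
    "on",
    "off",
};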
This basic application just prints the detected keyword to the serial monitor. If we want to take some action based on the detected keywords, we need to modify the command_responder files.
We have modified the files to send commands to the selected device using Espressif's ESP-Now protocol. You can check out the code in the GitHub repo link attached at the end.
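As a rough illustration of the idea (this is not the exact code from our repo: the function signature follows the micro_speech example this project is generated from, and the peer MAC address is a placeholder), the modified RespondToCommand() could look like this:

// command_responder.cc (sketch): forward the recognized keyword over ESP-NOW.
// Assumes Wi-Fi and ESP-NOW were initialized elsewhere (esp_now_init(),
// esp_now_add_peer(), etc.).
#include "command_responder.h"

#include <cstring>

#include "esp_now.h"

static const uint8_t kReceiverMac[6] = {0x24, 0x6F, 0x28, 0x00, 0x00, 0x01};  // placeholder

void RespondToCommand(tflite::ErrorReporter* error_reporter,
                      int32_t current_time, const char* found_command,
                      uint8_t score, bool is_new_command) {
  if (!is_new_command) {
    return;
  }
  TF_LITE_REPORT_ERROR(error_reporter, "Heard %s (%d) @%dms", found_command,
                       score, current_time);
  // Forward only the keywords we trained for; "silence"/"unknown" are ignored.
  if (strcmp(found_command, "on") == 0 || strcmp(found_command, "off") == 0) {
    esp_now_send(kReceiverMac, reinterpret_cast<const uint8_t*>(found_command),
                 strlen(found_command) + 1);
  }
}

In the full project, the destination address is not fixed but is chosen from the camera's object-selection pins, as described in the "Combining Visual and Audio Perception" section below.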
That's it! We can now build and flash our firmware onto the ESP32. This video shows how to build, flash, and monitor using the ESP-IDF VS Code extension.
Interfacing an External Microphone with the ESP32: After seeing the performance of different external microphones with the ESP32 in a blog post, we decided to use the INMP441 microphone. If you would like to use a different microphone, change the audio_provider files in the project accordingly. You can find the details about interfacing the microphone with the ESP32 at the end.
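For reference, a minimal ESP-IDF configuration (legacy I2S driver) for reading 16 kHz audio from an INMP441 could look like the sketch below; the pin numbers are assumptions and should match your wiring:

// audio_provider sketch: configure the ESP32 I2S peripheral for an INMP441
#include "driver/i2s.h"

void InitI2SMicrophone() {
  i2s_config_t i2s_config = {};
  i2s_config.mode = static_cast<i2s_mode_t>(I2S_MODE_MASTER | I2S_MODE_RX);
  i2s_config.sample_rate = 16000;                          // micro_speech expects 16 kHz audio
  i2s_config.bits_per_sample = I2S_BITS_PER_SAMPLE_32BIT;  // INMP441 sends 24-bit data in 32-bit slots
  i2s_config.channel_format = I2S_CHANNEL_FMT_ONLY_LEFT;   // L/R pin tied to GND -> left channel
  i2s_config.communication_format = I2S_COMM_FORMAT_STAND_I2S;
  i2s_config.intr_alloc_flags = 0;
  i2s_config.dma_buf_count = 4;
  i2s_config.dma_buf_len = 512;

  i2s_pin_config_t pin_config = {};
  pin_config.bck_io_num = 26;                  // INMP441 SCK (assumed)
  pin_config.ws_io_num = 25;                   // INMP441 WS  (assumed)
  pin_config.data_out_num = I2S_PIN_NO_CHANGE; // microphone is receive-only
  pin_config.data_in_num = 33;                 // INMP441 SD  (assumed)

  i2s_driver_install(I2S_NUM_0, &i2s_config, 0, nullptr);
  i2s_set_pin(I2S_NUM_0, &pin_config);
}

The 32-bit samples read from the I2S peripheral still need to be shifted down to the 16-bit range that the micro_speech feature provider expects.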
ESP-NOW is a protocol developed by Espressif that enables multiple devices to communicate with one another without a Wi-Fi connection. It is power-efficient and convenient to deploy.
We send the command from the ESP32 running the keyword spotting model to the ESP32 connected to the device we want to control. Here, we have connected the receiver ESP32 to a relay module to control the lamp.
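On the receiver side, a minimal sketch could look like the following; the relay GPIO, the payload format (a null-terminated keyword string), and the ESP-IDF 4.x receive-callback signature are all assumptions:

// esp_now_receiver sketch: switch a relay based on the received keyword
#include <cstring>

#include "driver/gpio.h"
#include "esp_event.h"
#include "esp_netif.h"
#include "esp_now.h"
#include "esp_wifi.h"
#include "nvs_flash.h"

static const gpio_num_t kRelayPin = GPIO_NUM_5;  // assumed relay control pin

static void OnDataReceived(const uint8_t* mac_addr, const uint8_t* data, int len) {
  if (len <= 0) return;
  const char* command = reinterpret_cast<const char*>(data);
  if (strncmp(command, "on", len) == 0) {
    gpio_set_level(kRelayPin, 1);   // switch the lamp on
  } else if (strncmp(command, "off", len) == 0) {
    gpio_set_level(kRelayPin, 0);   // switch the lamp off
  }
}

extern "C" void app_main() {
  // ESP-NOW needs the Wi-Fi driver started, even without joining a network.
  nvs_flash_init();
  esp_netif_init();
  esp_event_loop_create_default();
  wifi_init_config_t wifi_config = WIFI_INIT_CONFIG_DEFAULT();
  esp_wifi_init(&wifi_config);
  esp_wifi_set_mode(WIFI_MODE_STA);
  esp_wifi_start();

  gpio_set_direction(kRelayPin, GPIO_MODE_OUTPUT);

  esp_now_init();
  esp_now_register_recv_cb(OnDataReceived);
}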
Combining Visual and Audio Perception: When an object is detected by the OpenMV camera, it sets the corresponding pin high. When a keyword is detected on the ESP32, it reads the values of the pins connected to the camera to determine the selected device, and then sends the command to that device.
In our case, pin P0 is set high when a lamp is detected and pin P1 is set high when the television is detected by the camera. If a keyword is detected on the ESP32 at the same time, it checks the values of these pins to determine whether the command should be sent to the lamp, the TV, or neither of them.
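As a sketch, this selection step on the ESP32 could look like the code below; the GPIO numbers and MAC addresses are placeholders for whatever pins you wired to the camera's P0/P1 outputs and the actual MACs of your receiver boards:

// device_selector sketch: read the OpenMV object-selection pins to decide
// where a recognized keyword should be sent
#include "driver/gpio.h"

static const gpio_num_t kLampSelectPin = GPIO_NUM_21;  // wired to OpenMV P0 (assumed)
static const gpio_num_t kTvSelectPin = GPIO_NUM_22;    // wired to OpenMV P1 (assumed)

static const uint8_t kLampMac[6] = {0x24, 0x6F, 0x28, 0x00, 0x00, 0x01};  // placeholder
static const uint8_t kTvMac[6] = {0x24, 0x6F, 0x28, 0x00, 0x00, 0x02};    // placeholder

// Returns the ESP-NOW peer for the device currently in view, or nullptr if
// neither pin is high. Assumes both pins were configured as inputs with
// gpio_set_direction() during setup.
const uint8_t* SelectedDevice() {
  if (gpio_get_level(kLampSelectPin) == 1) {
    return kLampMac;
  }
  if (gpio_get_level(kTvSelectPin) == 1) {
    return kTvMac;
  }
  return nullptr;
}

The command responder then passes the returned address to esp_now_send(), so a recognized keyword is only forwarded when one of the known objects is actually in view.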
Results: When the ESP-EYE module is pointed towards the lamp and the "On" command is heard, a signal is sent via the ESP-NOW protocol from the ESP-EYE module to the ESP32 connected to the relay module, which switches on the lamp. This demonstrates that our entire pipeline works end to end!
Future Scope: Currently, the system used for this proof of concept is not compact enough to be used as a wearable, so we are working on porting the project to the compact ESP-EYE development board, which includes all three required peripherals: a camera, a microphone, and a Wi-Fi module.
To support speech-impaired users, a gesture-based control mechanism could be used instead of audio control. Such devices could be useful for people with cerebral palsy, helping them control home appliances with ease and giving them some independence.