(Note: the "easier" part I referred to in this article is in comparison to the original Tensorflow Lite C++ tutorials in books and on websites. You might well be interested in Edge Impulse, on which you can train and deploy an embedded AI model without writing a single line of code.)
(Also: the EloquentTinyML library has changed a few things after updates, and it seems to work only on ARM boards now. Please look at the library's official examples under the "2.4" sub-directory in Arduino IDE.)

Preface
Tensorflow Lite, also known as TinyML thanks to the O'Reilly book of the same name, has received a lot of attention. You can train and deploy a neural network prediction model - or simply call it an AI - on a microcontroller with limited processing power and memory. Sounds wonderful, doesn't it?
However, the steps and tools used in the book are really complicated. Just open any example from the Arduino_TensorflowLite library, and you'll see why. The TF Lite C++ APIs are not that well documented, either.
Thankfully, a smart guy named Simone Salerno (@EloquentArduino) has written a library for Arduino IDE called EloquentTinyML (a wrapped-up version of TF Lite), along with a Python tool package called tinymlgen. With both of them you can build and upload a TF Lite model to your board in a much, much simpler way.
In fact, I've found that EloquentTinyML (at least the sample code) can be uploaded to some of my boards, like the ESP32, ESP8266, Adafruit Metro M4 Express and Seeeduino XIAO. Strangely, the XIAO is the only SAMD21 board I have that works without compilation errors.
Simone did write a blog post about using the Nano 33 BLE Sense with its PDM microphone for voice classification (link), for which he used an SVM (support vector machine) from his own EloquentMicroML library. But how about TF Lite? Is it possible to recreate the micro speech example in the book without those complex tools and code?
So, based on Simone's and the TinyML book's work, this is what I was trying to do here. The basic goals were:
- To be able to train the model with any words (including non-English words) spoken by your own voice. (The TinyML book used Google's Speech Commands dataset as input.)
- To use as few libraries, files and development tools as possible.
- To provide a preliminary work for other people to produce better and even easier voice/speech recognition on edge devices in the future.
First of all, I am not an expert on neural networks or the Tensorflow framework; you are welcome to point out any mistakes of mine. Still, I assume here that you already have some basic understanding of Arduino, C++, Python and machine learning classification.
The O'Reilly book I'm currently reading, Hands-On Machine Learning with Scikit-Learn and TensorFlow, has good introductions to ML and TF. (There's some math, but you do not need to understand all of it.) The TinyML book is clearly written for people who already understand TF well enough.
Second, there is no guarantee that future changes to Tensorflow Lite or EloquentTinyML will remain compatible with this code. Tensorflow updates its APIs frequently. Who knows what will happen after some time?
Also, using neural networks doesn't necessarily mean better results - because they are essentially black boxes, you can only (blindly) try to see which settings get you better results. And training neural network models is a long and difficult process, easily full of frustration. Not to mention that the way you speak into the mic also has a lot of impact on how the model performs. Simone's EloquentMicroML and my modified SEFR classifier may still be better choices for simpler data.
Here I'll demonstrate a 3-word ("Yes", "No" and "OK") classifier. The model achieved good testing accuracy after training, but the actual successful prediction rate was noticeably lower on the device. Part of the reason may be that my neural network model wasn't good enough; more likely, it's hard to maintain the same speaking manner throughout the training and prediction phases. Since my own voice is deep and a bit hoarse, it may create additional difficulties for the model.
And I had to speak very, very close to the mic to make it work. Of course, if you provide A LOT of samples (like several hundred per word) and train the model long enough, you might get better predictions. If you ever try it, please tell me how it goes.

Setup
Arduino IDE:
- Install support for Arduino Nano 33 BLE Sense (add Arduino nRF528x Boards from your board manager), which will also install the PDM library
- Install EloquentTinyML library
Python:
- Tensorflow 2.x (64-bit Python is required. I use an IDE called Thonny on Windows 10 and select Python 3.8.5 as the interpreter, using pip3.8 to install packages.)
- tinymlgen (https://github.com/eloquentarduino/tinymlgen, which can also be found on PyPI)
- NumPy, matplotlib, scikit-learn
For this article I used Arduino IDE 1.8.13, Tensorflow 2.3.1, NumPy 1.18.5 (1.19.x not supported by TF), matplotlib 3.3.2 and scikit-learn 0.23.2.
Note: I've tested the training script in TF 2.3.1, 2.5.2 and 2.8.0 without issues. If your TF doesn't work, please try to upgrade it.

Hardware
The only hardware needed is an Arduino Nano 33 BLE Sense. I know, it's expensive (this is not sponsored, by the way). But considering the nRF52840 microcontroller (which has an enormous 1 MB flash/256 KB RAM) and the nice MP34DT05 PDM microphone, it's probably one of the best choices. The other boards supported by Tensorflow Lite or mentioned in the TinyML book may be usable too. Someone has already done speech recognition on the new micro:bit V2.
The mic is located directly below the microcontroller.
First, we need to sample voices or spoken words as training data. Every "word" sample or instance is a NumPy array of 32 floating-point numbers. (I did try 64 and 128 values per instance, but 32 has worked best so far.)
This is how the script "records" samples from the PDM mic:
- Upload the script. When done, open the serial monitor window and set the baud rate to 115200.
- In its callback function, the mic continuously fills a 256-byte buffer. The sampling rate is 16 KHz, which means it gets 16,000 readings per second. The 256 bytes in the buffer are then read as 128 16-bit PDM (pulse-density modulation) samples. (There's very little documentation about how the Arduino PDM library works, so I just copied what the official example did.)
- These 128 samples are then reduced to a single RMS (root mean square) value - in other words, a summary of this sampling window. (See the Python sketch after this list.)
- If the current RMS value gets higher than the threshold, it means the user has said something loud enough. This serves as the recording "trigger": the onboard LED lights up and the recording process starts. (The beginning of the word is lost, but we can still get the rest of it.)
- The board then generates an RMS value every 20 ms, 32 times. (So it covers 640 ms or 0.64 seconds, long enough for you to speak a single word.)
- This 32-value data is the representation of one spoken word (an instance). You'll see it printed out in the serial monitor window.
- Wait until the onboard LED blinks to say the word again.
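To make the sampling pipeline above concrete, here is a rough Python simulation of the same feature extraction. This is my own sketch, not part of the project files; it approximates the Arduino sketch's timing by treating the signal as contiguous 20 ms windows, and the names (window_rms, extract_instance) are hypothetical.
import numpy as np
SAMPLE_RATE = 16000   # 16 KHz, as in the Arduino sketch
WINDOW = 320          # 20 ms of samples per RMS value
FEATURE_SIZE = 32     # 32 RMS values = one voice instance (640 ms)
THRESHOLD = 900       # same role as SAMPLE_THRESHOLD
def window_rms(samples):
    # root mean square of one window, like the onPDMdata() callback
    return np.sqrt(np.mean(samples.astype(np.float64) ** 2))
def extract_instance(signal):
    # compute one RMS value per 20 ms window, wait for the trigger,
    # then collect the next 32 RMS values as one instance
    rms = [window_rms(signal[i:i + WINDOW])
           for i in range(0, len(signal) - WINDOW + 1, WINDOW)]
    for i, value in enumerate(rms):
        if value >= THRESHOLD and i + FEATURE_SIZE <= len(rms):
            return np.array(rms[i:i + FEATURE_SIZE])
    return None  # nothing loud enough was heard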
I didn't use an FFT (Fast Fourier Transform) to convert the original signal into frequencies, since the readings have already been processed by the mic's PDM interface. Simone also mentioned (before updating his blog post) that the classifier performed pretty much the same without FFT.
I actually did try FFT, but probably due to the nature of the sampling process, the 2 or 3 libraries I tried always "jammed" the program and thus became useless.
[Nano33ble_voice_sampler.ino]
/*
* Voice sampler for Arduino Nano 33 BLE Sense by Alan Wang
*/
#include <math.h>
#include <PDM.h>
#define SERIAL_PLOT_MODE false // set to true to test sampler in serial plotter
#define PDM_SOUND_GAIN 255 // sound gain of PDM mic
#define PDM_BUFFER_SIZE 256 // buffer size of PDM mic
#define SAMPLE_THRESHOLD 900 // RMS threshold to trigger sampling
#define FEATURE_SIZE 32 // sampling size of one voice instance
#define SAMPLE_DELAY 20 // delay time (ms) between sampling
#define TOTAL_SAMPLE 50 // total number of voice instance
double feature_data[FEATURE_SIZE];
volatile double rms;
unsigned int total_counter = 0;
// callback function for PDM mic
void onPDMdata() {
rms = -1;
short sample_buffer[PDM_BUFFER_SIZE];
int bytes_available = PDM.available();
PDM.read(sample_buffer, bytes_available);
// calculate RMS (root mean square) from sample_buffer
unsigned long long sum = 0; // wide type to avoid overflow when summing squared 16-bit samples
for (unsigned short i = 0; i < (bytes_available / 2); i++) sum += (long) sample_buffer[i] * sample_buffer[i];
rms = sqrt(double(sum) / (double(bytes_available) / 2.0));
}
void setup() {
Serial.begin(115200);
while (!Serial);
PDM.onReceive(onPDMdata);
PDM.setBufferSize(PDM_BUFFER_SIZE);
PDM.setGain(PDM_SOUND_GAIN);
if (!PDM.begin(1, 16000)) { // start PDM mic and sampling at 16 KHz
Serial.println("Failed to start PDM!");
while (1);
}
pinMode(LED_BUILTIN, OUTPUT);
// wait 1 second to avoid initial PDM reading
delay(900);
digitalWrite(LED_BUILTIN, HIGH);
delay(100);
digitalWrite(LED_BUILTIN, LOW);
if (!SERIAL_PLOT_MODE) Serial.println("# === Voice data start ===");
}
void loop() {
// waiting until sampling triggered
while (rms < SAMPLE_THRESHOLD);
digitalWrite(LED_BUILTIN, HIGH);
for (unsigned short i = 0; i < FEATURE_SIZE; i++) { // sampling
while (rms < 0);
feature_data[i] = rms;
delay(SAMPLE_DELAY);
}
digitalWrite(LED_BUILTIN, LOW);
// print out sampling data
if (!SERIAL_PLOT_MODE) Serial.print("[");
for (unsigned short i = 0; i < FEATURE_SIZE; i++) {
if (!SERIAL_PLOT_MODE) {
Serial.print(feature_data[i]);
Serial.print(", ");
} else {
Serial.println(feature_data[i]);
}
}
if (!SERIAL_PLOT_MODE) {
Serial.println("],");
} else {
for (unsigned short i = 0; i < (FEATURE_SIZE / 2); i++) Serial.println(0);
}
// stop sampling when enough samples are collected
if (!SERIAL_PLOT_MODE) {
total_counter++;
if (total_counter >= TOTAL_SAMPLE) {
Serial.println("# === Voice data end ===");
PDM.end();
while (1) {
delay(100);
digitalWrite(LED_BUILTIN, HIGH);
delay(100);
digitalWrite(LED_BUILTIN, LOW);
}
}
}
// wait for 1 second after one sampling
delay(900);
digitalWrite(LED_BUILTIN, HIGH);
delay(100);
digitalWrite(LED_BUILTIN, LOW);
}
Upload the script to your Nano 33 BLE Sense.
There are some parameters you can change. I set the RMS threshold high, otherwise the mic would often be triggered by random noises or your own breath.
Since the PDM mic is only sensitive enough at very close range, I decided to put my mouth very close to the mic and immediately move the board away afterwards to avoid breathing into it.
Testing/sampling voice data

You can change SERIAL_PLOT_MODE to true to test the sampler in the Arduino IDE serial plotter window (baud rate 115200):
The plot mode doesn't count samples and adds some 0s between instances to separate them. I recommend using this mode for practice, to find out how and where you can record reliable data. (Surprisingly, it's not as easy as you'd think.)
When SERIAL_PLOT_MODE is set to false, you'll get the data you need:
After a total of 50 samples are collected (this can be changed with the TOTAL_SAMPLE parameter), the board goes into an endless loop and keeps blinking its LED.
Now copy and paste the data into the dataset Python script. As you can see, the samples are output in the form of a Python list. You can remove the # comments if you like.
Reboot the Nano 33 BLE Sense and collect 50 samples for the next word.
The voice dataset

Now we put together the voice dataset in a Python script:
[voice_dataset.py]
import numpy as np
NUMBER_OF_LABELS = 3
DATA_SIZE_OF_LABEL = 50 # number of instances for each label
data = np.array([
[976.44, 809.81, 852.16, 795.61, 733.75, 743.48, 766.01, 643.91, 815.27, 541.93, 388.19, 466.88, 455.32, 410.88, 1723.84, 651.68, 1066.49, 1552.68, 1886.37, 1434.68, 700.44, 450.38, 136.17, 73.71, 220.99, 276.30, 421.08, 341.11, 306.07, 250.11, 317.13, 319.75, ],
[900.02, 1324.65, 1553.57, 1300.46, 768.41, 1315.89, 1572.04, 1284.38, 898.83, 725.21, 566.74, 449.95, 230.06, 97.65, 64.58, 171.64, 341.67, 407.33, 516.53, 607.64, 717.49, 753.78, 779.85, 760.53, 711.42, 669.78, 6
...(omitted)
[2247.54, 731.01, 225.72, 2644.80, 3746.85, 415.67, 712.12, 765.10, 769.43, 806.61, 683.78, 518.41, 161.55, 130.77, 120.98, 314.71, 476.08, 528.22, 561.84, 522.31, 189.19, 124.10, 88.45, 280.16, 348.51, 452.32, 348.11, 272.22, 153.52, 90.54, 22.94, 37.59, ],
])
target = np.array(
[label for label in range(NUMBER_OF_LABELS) for _ in range(DATA_SIZE_OF_LABEL)]
)
Replace your data between data = np.array([ and ]), and set the correct number of labels and the data size of each label. I also assume that each label has an equal number of instances in the dataset; the target (label) data is then generated automatically.
So
- label 0 = "Yes"
- label 1 = "No"
- label 2 = "OK"
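Before training, a quick sanity check (my own addition, not part of the original scripts) can catch miscounted rows after all that copying and pasting:
from voice_dataset import data, target
# every label should contribute DATA_SIZE_OF_LABEL rows of 32 RMS values
print(data.shape)               # expect (150, 32) for 3 labels x 50 instances
print(target[:3], target[-3:])  # expect [0 0 0] ... [2 2 2]
assert data.shape[0] == target.shape[0]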
Now comes the hardest and most mysterious part: trying to train a neural network that is good enough for prediction. This will take a lot of time and effort.
In the following code, I used a model like this:
model = Sequential()
model.add(layers.Dense(data.shape[1], activation='relu', input_shape=(data.shape[1],)))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(np.unique(target).size * 4, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(np.unique(target).size, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
This defines a simple neural network with five layers: three of them are fully-connected (Dense) layers and the other two are dropout layers. Each layer has nodes which pass data on to the next layer, and the result comes out of the final one.
Note: I did try to use a Conv1D or Conv2D (with reshaped input) convolution layer to scan for patterns, but the model didn't really run on the Nano 33 BLE. So right now I'm stuck with Dense layers.
The first layer is as big as the length of a data instance (32 nodes). The third one is the number of labels x 4 (= 12 nodes). The final one has 3 nodes, from which we get the prediction results. The Dropout layers are used to prevent over-fitting: both of them randomly discard 1 input in 4, in order to force the rest of the nodes to adapt.
Activation functions are like filters which control how a node sends data to the next ones (or whether a node "fires" to send information, like the neurons in our brain).
When training the model, Tensorflow tries to optimize the weights of each node based on the prediction accuracy and loss from the previous iteration (or epoch). It's like trying to find a way downhill by blindly walking around. However, it may also get stuck in the same place for a very, very long time, unable to improve the model further.
Softmax (the multi-class version of logistic regression) and the loss function sparse_categorical_crossentropy are used for classification; in the final layer of the model they generate three floating-point numbers as the probability of each label. The label with the highest probability is the final "predicted" word.
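As a concrete example of what softmax does (the logits here are made up, just for illustration):
import numpy as np
logits = np.array([2.0, 0.1, -1.0])            # raw outputs of the 3-node final layer
probs = np.exp(logits) / np.exp(logits).sum()  # softmax
print(probs.round(2))    # [0.83 0.12 0.04] - probabilities that sum to 1
print(np.argmax(probs))  # 0, so the predicted word is label 0 ("Yes")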
If the model doesn't train well, you'll need to change some parameters (number of nodes, dropout ratio, batch size (training speed) and number of training iterations) to see if it gets better.
(I cannot tell you how, and as far as I know, there is no best practice to follow unless you understand the math behind it very well.)
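As a sanity check on the summary below, a Dense layer has (inputs + 1 bias) x nodes parameters, which you can verify by hand:
# Dense layer parameters = (inputs + 1 bias) * nodes
print(32 * 32 + 32)  # dense:   1056 (32 inputs -> 32 nodes)
print(32 * 12 + 12)  # dense_1: 396  (32 inputs -> 12 nodes)
print(12 * 3 + 3)    # dense_2: 39   (12 inputs -> 3 nodes)
# the Dropout layers have no parameters, so the total is 1,491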
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 32) 1056
_________________________________________________________________
dropout (Dropout) (None, 32) 0
_________________________________________________________________
dense_1 (Dense) (None, 12) 396
_________________________________________________________________
dropout_1 (Dropout) (None, 12) 0
_________________________________________________________________
dense_2 (Dense) (None, 3) 39
=================================================================
Total params: 1,491
Trainable params: 1,491
Non-trainable params: 0
Note that EloquentTinyML (and TF Lite?) currently only supports the ReLU, ReLU6 and softmax activation functions (see here) and certain types of layers. This makes the model training in the next step a bit difficult. Using unsupported features will make the model fail to allocate memory on the Nano 33 BLE Sense. A model with a lot of nodes may also fail to initialize.
[Nano33ble_voice_trainer.py]
'''
Voice trainer for Arduino Nano 33 BLE Sense and Tensorflow Lite by Alan Wang
Required packages:
Tensorflow 2.x
tinymlgen (https://github.com/eloquentarduino/tinymlgen)
NumPy
matplotlib
scikit-learn
'''
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # only print out fatal log
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import tensorflow as tf
from tensorflow.keras import layers, Sequential
# check GPU availability (these calls only query devices; TF falls back to the CPU if no GPU is present)
tf.config.list_physical_devices('GPU')
tf.test.is_gpu_available()  # deprecated, see the warning below
# set random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
# import dataset and transform target (label data) to categorical arrays
from voice_dataset import data, target
# create training data (60%), validation data (20%) and testing data (20%)
data_train, data_test, target_train, target_test = train_test_split(
data, target, test_size=0.2, random_state=RANDOM_SEED)
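# (0.25 of the remaining 80% = 20% of the whole dataset, hence the 60/20/20 split)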
data_train, data_validate, target_train, target_validate = train_test_split(
data_train, target_train, test_size=0.25, random_state=RANDOM_SEED)
# create a TF model
model = Sequential()
model.add(layers.Dense(data.shape[1], activation='relu', input_shape=(data.shape[1],)))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(np.unique(target).size * 4, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(np.unique(target).size, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.summary()
# training TF model
ITERATION = 2000
BATCH_SIZE = 16
history = model.fit(data_train, target_train, epochs=ITERATION, batch_size=BATCH_SIZE,
validation_data=(data_validate, target_validate))
predictions = model.predict(data_test)
test_score = model.evaluate(data_test, target_test)
# get the predicted label based on probability
predictions_categorical = np.argmax(predictions, axis=1)
# display prediction performance on validation data and test data
print('Prediction Accuracy:', accuracy_score(target_test, predictions_categorical).round(3))
print('Test accuracy:', round(test_score[1], 3))
print('Test loss:', round(test_score[0], 3))
print('')
print(classification_report(target_test, predictions_categorical))
# convert TF model to TF Lite model as a C header file (for the classifier)
from tinymlgen import port
with open('tf_lite_model.h', 'w') as f: # change path if needed
f.write(port(model, optimize=False))
# visualize prediction performance
DISPLAY_SKIP = 100
import matplotlib.pyplot as plt
accuracy = history.history['accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
val_accuracy = history.history['val_accuracy']
epochs = np.arange(len(accuracy)) + 1
plt.rcParams['font.size'] = 12
plt.figure(figsize=(14, 8))
plt.subplot(211)
plt.title(f'Test accuracy: {round(test_score[1], 3)}')
plt.plot(epochs[DISPLAY_SKIP:], accuracy[DISPLAY_SKIP:], label='Accuracy')
plt.plot(epochs[DISPLAY_SKIP:], val_accuracy[DISPLAY_SKIP:], label='Validate accuracy')
plt.grid(True)
plt.legend()
plt.subplot(212)
plt.title(f'Test loss: {round(test_score[0], 3)}')
plt.plot(epochs[DISPLAY_SKIP:], loss[DISPLAY_SKIP:], label='Loss', color='green')
plt.plot(epochs[DISPLAY_SKIP:], val_loss[DISPLAY_SKIP:], label='Validate loss', color='red')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
Run the script and patiently wait for the result. By default the tf_lite_model.h file is generated in the same folder as this Python script.
Note: you might see some warning messages when this script starts:
WARNING:tensorflow:From C:\xxx\Nano33ble_voice_trainer.py:28: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
Also when using tinymlgen, you may see
WARNING:tensorflow:From C:\Users\xxx\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\training\tracking\tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
WARNING:tensorflow:From C:\Users\xxx\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\training\tracking\tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Both warnings are currently normal and do not affect the output model.
Personally, I have a GTX 1660 Ti graphics card (bought for gaming purposes) with the necessary libraries already installed, so TF uses it for training on my computer. You can see the GPU actually running, with its temperature rising slightly. But the script can still be run on computers without a supported GPU - at least for now.
And below is pretty much the best result I could get:
Prediction Accuracy: 0.9
Test accuracy: 0.9
Test loss: 0.95
              precision    recall  f1-score   support

           0       0.89      0.80      0.84        10
           1       0.80      0.89      0.84         9
           2       1.00      1.00      1.00        11

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.89        30
weighted avg       0.90      0.90      0.90        30
Believe me, 90% overall accuracy looks great, but it doesn't account for the human error of speaking into the mic in the wrong way.
Here's the visualization of the training process, which is useful for seeing how the training went:
Ideally, we want accuracy/validation accuracy >= 0.8-0.9 and loss/val_loss as low as possible. Validation accuracy should also be as close to training accuracy as possible, and validation loss as close to training loss, to make sure the model isn't over-fitting (over-trained on the training data and bad at predicting test data).
Again, with more samples available in the dataset, the training result may improve.
The generated TF Lite model

The tinymlgen package basically automates the process of converting the TF model to its Lite version and then converting that to C++ (in the TinyML book you have to do all of this yourself).
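For reference, this is roughly the manual equivalent of what tinymlgen's port() does - a minimal sketch using the standard TF Lite converter; the hex-formatting part is my own simplification, not tinymlgen's exact output:
import tensorflow as tf
# convert the trained Keras model to a TF Lite flatbuffer (a bytes object)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
# dump the bytes as a C array, like the xxd step in the TinyML book
hex_values = ', '.join(f'0x{b:02x}' for b in tflite_bytes)
with open('tf_lite_model.h', 'w') as f:
    f.write('const unsigned char model_data[] = {' + hex_values + '};\n')
    f.write('const int model_data_len = %d;\n' % len(tflite_bytes))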
I simply write the result (a C++ code string) into a .h file. (It can be found in the same directory as Nano33ble_voice_trainer.py, unless you changed the output path in the script.)
[tf_lite_model.h]
#ifdef __has_attribute
#define HAVE_ATTRIBUTE(x) __has_attribute(x)
#else
#define HAVE_ATTRIBUTE(x) 0
#endif
#if HAVE_ATTRIBUTE(aligned) || (defined(__GNUC__) && !defined(__clang__))
#define DATA_ALIGN_ATTRIBUTE __attribute__((aligned(4)))
#else
#define DATA_ALIGN_ATTRIBUTE
#endif
const unsigned char model_data[] DATA_ALIGN_ATTRIBUTE = {0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00,
...(omitted) 0x00, 0x00, 0x00};
const int model_data_len = 7644;
This file is the one to be used by the classifier below.
Part 3: voice classifier

Save the following script in Arduino IDE, then copy tf_lite_model.h into the same directory as the Nano33ble_voice_classifier.ino file below. Close Arduino IDE and re-open the script, and the header file should now be imported as well. (This is also how you update the model if needed.)
[Nano33ble_voice_classifier.ino]
/*
* Voice classifier for Arduino Nano 33 BLE Sense by Alan Wang
*/
#include <math.h>
#include <PDM.h>
#include <EloquentTinyML.h> // https://github.com/eloquentarduino/EloquentTinyML
#include "tf_lite_model.h" // TF Lite model file
#define PDM_SOUND_GAIN 255 // sound gain of PDM mic
#define PDM_BUFFER_SIZE 256 // buffer size of PDM mic
#define SAMPLE_THRESHOLD 900 // RMS threshold to trigger sampling
#define FEATURE_SIZE 32 // sampling size of one voice instance
#define SAMPLE_DELAY 20 // delay time (ms) between sampling
#define NUMBER_OF_LABELS 3 // number of voice labels
const String words[NUMBER_OF_LABELS] = {"Yes", "No", "OK"}; // words for each label
#define PREDIC_THRESHOLD 0.6 // prediction probability threshold for labels
#define RAW_OUTPUT true // output prediction probability of each label
#define NUMBER_OF_INPUTS FEATURE_SIZE
#define NUMBER_OF_OUTPUTS NUMBER_OF_LABELS
#define TENSOR_ARENA_SIZE 4 * 1024
Eloquent::TinyML::TfLite<NUMBER_OF_INPUTS, NUMBER_OF_OUTPUTS, TENSOR_ARENA_SIZE> tf_model;
float feature_data[FEATURE_SIZE];
volatile float rms;
bool voice_detected;
// callback function for PDM mic
void onPDMdata() {
rms = -1;
short sample_buffer[PDM_BUFFER_SIZE];
int bytes_available = PDM.available();
PDM.read(sample_buffer, bytes_available);
// calculate RMS (root mean square) from sample_buffer
unsigned long long sum = 0; // wide type to avoid overflow when summing squared 16-bit samples
for (unsigned short i = 0; i < (bytes_available / 2); i++) sum += (long) sample_buffer[i] * sample_buffer[i];
rms = sqrt(float(sum) / (float(bytes_available) / 2.0));
}
void setup() {
Serial.begin(115200);
while (!Serial);
PDM.onReceive(onPDMdata);
PDM.setBufferSize(PDM_BUFFER_SIZE);
PDM.setGain(PDM_SOUND_GAIN);
if (!PDM.begin(1, 16000)) { // start PDM mic and sampling at 16 KHz
Serial.println("Failed to start PDM!");
while (1);
}
pinMode(LED_BUILTIN, OUTPUT);
// wait 1 second to avoid initial PDM reading
delay(900);
digitalWrite(LED_BUILTIN, HIGH);
delay(100);
digitalWrite(LED_BUILTIN, LOW);
// start TF Lite model
tf_model.begin((unsigned char*) model_data);
Serial.println("=== Classifier start ===\n");
}
void loop() {
// waiting until sampling triggered
while (rms < SAMPLE_THRESHOLD);
digitalWrite(LED_BUILTIN, HIGH);
for (int i = 0; i < FEATURE_SIZE; i++) { // sampling
while (rms < 0);
feature_data[i] = rms;
delay(SAMPLE_DELAY);
}
digitalWrite(LED_BUILTIN, LOW);
// predict voice and put results (probability) for each label in the array
float prediction[NUMBER_OF_LABELS];
tf_model.predict(feature_data, prediction);
// print out prediction results;
// in theory, you'd look for the highest probability in the array,
// but in practice only one of them gets high enough (over 0.5~0.6)
Serial.println("Predicting the word:");
if (RAW_OUTPUT) {
for (int i = 0; i < NUMBER_OF_LABELS; i++) {
Serial.print("Label ");
Serial.print(i);
Serial.print(" = ");
Serial.println(prediction[i]);
}
}
voice_detected = false;
for (int i = 0; i < NUMBER_OF_LABELS; i++) {
if (prediction[i] >= PREDIC_THRESHOLD) {
Serial.print("Word detected: ");
Serial.println(words[i]);
Serial.println("");
voice_detected = true;
}
}
if (!voice_detected && !RAW_OUTPUT) Serial.println("Word not recognized\n");
// wait for 1 second after one sampling/prediction
delay(900);
digitalWrite(LED_BUILTIN, HIGH);
delay(100);
digitalWrite(LED_BUILTIN, LOW);
}
Now upload this script to your board (it will take a while to compile with a new TF model).
Model prediction in action

You can see that the classifier script collects voice data in exactly the same way as the sampler script. The difference is that the classifier feeds the data to the model and gets predictions.
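The decision rule at the end of the sketch, mirrored in Python for clarity (the probabilities below are made up): report any label whose probability clears PREDIC_THRESHOLD, which in practice is at most one of them.
import numpy as np
words = ['Yes', 'No', 'OK']
prediction = np.array([0.12, 0.00, 0.88])  # example output of tf_model.predict()
detected = [word for word, p in zip(words, prediction) if p >= 0.6]
if detected:
    print('Word detected:', detected[0])  # OK
else:
    print('Word not recognized')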
Here is some example output. (The first five lines are generated by EloquentTinyML itself. Also, the RAW_OUTPUT parameter in the script (set to true) prints out the prediction probability of each label.)
Start
GetModel done
Version check done
AllocateTensors done
Begin done
=== Classifier start ===
Predicting the word:
Label 0 = 0.99
Label 1 = 0.00
Label 2 = 0.01
Word detected: Yes
Predicting the word:
Label 0 = 0.20
Label 1 = 0.80
Label 2 = 0.00
Word detected: No
Predicting the word:
Label 0 = 0.12
Label 1 = 0.00
Label 2 = 0.88
Word detected: OK
Predicting the word:
Label 0 = 1.00
Label 1 = 0.00
Label 2 = 0.00
Word detected: Yes
Predicting the word:
Label 0 = 0.20
Label 1 = 0.80
Label 2 = 0.00
Word detected: No
Predicting the word:
Label 0 = 0.00
Label 1 = 0.00
Label 2 = 1.00
Word detected: OK
Final thoughts

As I mentioned before, I had trouble maintaining the same way of speaking into the mic during the lengthy sampling process. Also, the model is quite limited by the device's memory. This is probably why I couldn't get highly reliable results. There is clearly a lot of room for future improvement.
However, this project successfully demonstrated that you can simplify the Tensorflow Lite training/deployment process down to a total of only 5 files, and that everything can be done in Arduino IDE plus a standard Python environment on your local machine. You can train with your own voice/words and customize the neural network model in any way you like.
That's pretty much it! Now, I'm going to use my graphics card to run some games for my next training session...
=== Some Follow-ups ===

I finally tried Edge Impulse in mid-2021, using Google's Speech Commands dataset to train a model that can classify 5 words along with background noise. Since Edge Impulse can process sound files a lot better, and I could upload 2000+ samples per word, the result was astonishingly accurate.
So is there anything I can do to improve this experiment?
I did try modifying the sampler code to use frequency data instead, generated by the arduinoFFT library. The code still uses RMS as the volume threshold trigger, but after that the collected data is 32 peak frequencies. It actually works very well.
However... my Tensorflow model seemed to perform worse than with RMS data, achieving only 50-60% accuracy, probably because voice frequency doesn't change much during speech and there weren't enough samples. So for now, this is a no-go.
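If you want to reproduce that experiment on a PC first, here is the idea of "peak frequency per window" as a hedged numpy sketch (the actual test used the arduinoFFT library on-device; this only shows the equivalent math):
import numpy as np
SAMPLE_RATE = 16000
def peak_frequency(window):
    # frequency (Hz) of the strongest FFT bin in one window of samples
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window))))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / SAMPLE_RATE)
    return freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
# a 440 Hz test tone should land near 440 (bin resolution is 62.5 Hz here)
t = np.arange(256) / SAMPLE_RATE
print(peak_frequency(np.sin(2 * np.pi * 440 * t)))  # 437.5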
I've also tried to add some additional stuff in the trainer script, to save the best model and generate log (for Tensorboard):
# callbacks for the TF model (save best model and log)
checkpoint_file = 'voice_classifier'
log_file = 'voice_classifier_log'
tf_callback = [
tf.keras.callbacks.ModelCheckpoint(
filepath=checkpoint_file,
save_weights_only=True,
monitor='accuracy',
mode='max',
save_best_only=True),
tf.keras.callbacks.TensorBoard(log_file)
]
# create a TF model
model = Sequential()
model.add(layers.Dense(data.shape[1], activation='relu', input_shape=(data.shape[1],)))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(np.unique(target).size * 4, activation='relu'))
model.add(layers.Dropout(0.25))
model.add(layers.Dense(np.unique(target).size, activation='softmax'))
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.summary()
# training TF model
ITERATION = 3000
BATCH_SIZE = 4
history = model.fit(data_train, target_train,
epochs=ITERATION, batch_size=BATCH_SIZE,
validation_data=(data_validate, target_validate),
callbacks=tf_callback)
model.load_weights(checkpoint_file) # load best model
predictions = model.predict(data_test) # make predictions
test_score = model.evaluate(data_test, target_test)
And this is the result I got:
Prediction Accuracy: 0.833
Test accuracy: 0.833
Test loss: 0.872
              precision    recall  f1-score   support

           0       0.78      0.70      0.74        10
           1       0.88      0.78      0.82         9
           2       0.85      1.00      0.92        11

    accuracy                           0.83        30
   macro avg       0.83      0.83      0.83        30
weighted avg       0.83      0.83      0.83        30
So the "best model" only has 83% accuracy on test data, which may explain why I couldn't get good results.
Anyway, if you are interested in how to do FFT on the Arduino Nano 33 BLE Sense, check out this article. The script can be uploaded to the Nano RP2040 Connect as well, but its PDM mic seems to work a bit differently.