In my tutorial TinyML Made Easy: Anomaly Detection & Motion Classification, we explored Embedded Machine Learning, or simply TinyML, running on the robust and still very tiny Seeed XIAO BLE Sense. In that tutorial, besides installing and testing the device, we explored motion classification using real data signals from its onboard accelerometer. In this new project, we will use the same XIAO BLE Sense to classify sound, specifically building a "Keyword Spotting" (KWS) application. A KWS is a typical TinyML application and an essential part of a voice assistant.
But how does a voice assistant work?
To start, it is essential to realize that voice assistants on the market, like Google Home or Amazon Echo Dot, only react to humans when they are “woken up” by particular keywords, such as “Hey Google” for the first one and “Alexa” for the second.
In other words, the complete process of recognizing voice commands is based on a multi-stage model or Cascade Detection.
Stage 1: A smaller microprocessor inside the Echo-Dot or Google Home continuously listens to the sound, waiting for the keyword to be spotted. For such detection, a TinyML model at the edge is used (KWS application).
Stage 2: Only when triggered by the KWS application on Stage 1 is the data sent to the cloud and processed on a larger model.
In this project, we will focus on Stage 1 (KWS, or Keyword Spotting), where we will use the XIAO BLE Sense, whose digital microphone will be used to spot the keyword.
If you want to go deeper into a complete project, please see my tutorial Building an Intelligent Voice Assistant From Scratch, where I emulate a Google Assistant on a Raspberry Pi and an Arduino Nano 33 BLE.
The KWS Project
The diagram below gives an idea of how the final KWS application should work (during inference):
Our KWS application will recognize three classes of sound:
- Keyword 1: UNIFEI (the name of my university)
- Keyword 2: IESTI (the name of my institute)
- "SILENCE" (no keywords spoken, only background noise is present)
Optionally, for real-world projects, it is advisable to include words other than keywords 1 and 2 in the "SILENCE" (or background) class, or even to create an extra class with such words (for example, a class "others").
The Machine Learning workflow
The main component of the KWS application is its model. So, we must train such a model with our specific keywords:
The critical component of the Machine Learning workflow is the dataset. Once we have decided on specific keywords (UNIFEI and IESTI), the whole dataset must be created from scratch. When working with accelerometers, it was essential to create the dataset with data captured by the same type of sensor. In the case of sound, it is different because what we will classify is audio data.
The key difference between sound and audio is their form of energy. Sound is mechanical wave energy (longitudinal sound waves) that propagates through a medium, causing variations in pressure within it. Audio is electrical energy (analog or digital signals) that represents sound electrically.
When we speak a keyword, the sound waves must be converted into audio data. The conversion is done by sampling the signal generated by the microphone at 16 kHz with a 16-bit depth.
So, any device that can generate audio data with this basic specification (16 kHz/16 bits) will work fine. We can use the XIAO BLE Sense itself, a computer, or even a mobile phone.
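To give an idea of what this looks like on the device side, here is a minimal sketch of how a PDM microphone such as the one on the XIAO BLE Sense could be configured for 16 kHz/16-bit capture with the Arduino PDM library. The buffer size and gain value are illustrative assumptions, not taken from the project code used later:
#include <PDM.h>

// Buffer for raw 16-bit PCM samples (size chosen arbitrarily for this sketch)
static short sampleBuffer[512];
static volatile int samplesRead = 0;

// Callback: copy the PDM data available into the buffer
void onPDMdata() {
  int bytesAvailable = PDM.available();
  PDM.read(sampleBuffer, bytesAvailable);
  samplesRead = bytesAvailable / 2;  // two bytes per 16-bit sample
}

void setup() {
  Serial.begin(115200);
  while (!Serial);

  PDM.onReceive(onPDMdata);
  PDM.setGain(30);                   // gain value here is only an assumption

  // one channel (mono) at 16 kHz; samples are 16-bit signed integers
  if (!PDM.begin(1, 16000)) {
    Serial.println("Failed to start PDM!");
    while (1);
  }
}

void loop() {
  if (samplesRead) {
    // At 16 kHz / 16 bits, the microphone produces 32 KB of audio data per second
    Serial.println(sampleBuffer[0]);
    samplesRead = 0;
  }
}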
Capturing (online) Audio Data with Edge Impulse and a smartphone
In the tutorial TinyML Made Easy: Anomaly Detection & Motion Classification, we learned how to install and test our device using the Arduino IDE and how to connect it to Edge Impulse Studio for data capture. For that, we used the EI CLI function Data Forwarder, but according to Jan Jongboom, Edge Impulse CTO, audio goes too fast for the Data Forwarder to capture it; if you already have PCM data, turning it into a WAV file and uploading it with the Uploader is the easiest way. With accelerometers, our sampling frequency was around 100 Hz, while audio requires 16 kHz.
So, we cannot connect the XIAO directly to the Studio for audio capture yet (but Edge Impulse should support it soon!). However, we can capture sound using any smartphone connected online to the Studio. We will not explore this option here, but you can easily follow the EI documentation and tutorials.
Capturing (offline) Audio Data with the XIAO BLE Sense
The easiest way to capture audio and save it locally as a .wav file is to use an expansion board for the XIAO family of devices, the Seeed Studio XIAO Expansion board.
This expansion board makes it easy and quick to build prototypes and projects, thanks to its rich set of peripherals, such as an OLED display, SD card interface, RTC, passive buzzer, RESET/User button, 5V servo connector, and multiple data interfaces.
This tutorial will focus on classifying keywords, and the MicroSD card available on the device will be very important in helping us with data capture.
Saving recorded audio from the microphone on an SD card
Connect the XIAO BLE Sense to the Expansion Board and insert an SD card into the SD card slot at the back.
The SD card should be pre-formatted as FAT or exFAT.
Next, download the Seeed_Arduino_Mic library as a zip file and install the downloaded library (Seeed_Arduino_Mic-master.zip) in your Arduino IDE:
Sketch -> Include Library -> Add .ZIP Library...
Next, navigate to File > Examples > Seeed Arduino Mic > mic_Saved_OnSDcard to open the sketch mic_Saved_OnSDcard.
Each time you press the reset button, a 5-second audio sample is recorded and saved on the SD card. I changed the original sketch, adding LEDs to help during the recording process, as described below:
- While the red LED is ON, it is possible to record ==> RECORD
- While the file is being written, the red LED is OFF ==> WAIT
- When writing is finished, the green LED turns ON ==> press the Reset button once, wait for the red LED to turn ON again, and proceed with a new sample recording
I realized that a "spike" was sometimes recorded at the beginning and at the end of each sample, so I cut the initial 300 ms from each 5 s sample. The spike observed at the end always happened after the recording process and should be eliminated in Edge Impulse Studio before training. Also, I increased the microphone gain to 30 dB.
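For reference, the LED convention above can be sketched with a few small helper functions like the ones below. This is only an illustration with hypothetical function names, not the actual modified sketch (which is linked next); LEDR, LEDG, and LEDB are the onboard RGB LEDs of the XIAO BLE Sense and are active-low:
#include <Arduino.h>

// Turn off all RGB LEDs (active-low: HIGH = off)
void leds_off() {
  digitalWrite(LEDR, HIGH);
  digitalWrite(LEDG, HIGH);
  digitalWrite(LEDB, HIGH);
}

void show_recording() {  // red LED ON  ==> RECORD (5-second window being captured)
  leds_off();
  digitalWrite(LEDR, LOW);
}

void show_writing() {    // red LED OFF ==> WAIT (sample being written to the SD card)
  leds_off();
}

void show_done() {       // green LED ON ==> press Reset to start a new recording
  leds_off();
  digitalWrite(LEDG, LOW);
}

void setup() {
  pinMode(LEDR, OUTPUT);
  pinMode(LEDG, OUTPUT);
  pinMode(LEDB, OUTPUT);
  leds_off();
}

void loop() {
  // Demo only: cycle through the three states
  show_recording();
  delay(5000);
  show_writing();
  delay(1000);
  show_done();
  delay(3000);
}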
The complete file (Xiao_mic_Saved_OnSDcard.ino) can be found on my GitHub (3_KWS): Seeed-XIAO-BLE-Sense.
During the recording process, the .wav file names are shown on the Serial Monitor:
Take the SD card from the Expansion Board and insert it into your computer:
The files are ready to be uploaded to Edge Impulse Studio
Capturing (offline) Audio Data with a smartphone or PC
Alternatively, you can use your PC or smartphone to capture audio data with a sampling frequency of 16 kHz and a bit depth of 16 bits. A good app for that is Voice Recorder Pro (iOS). You should save your recordings as .wav files and send them to your computer.
Note that any smartphone recording app can be used, or even your computer, for example, using Audacity.
Training the model with Edge Impulse Studio
When the raw dataset is created, you should initiate a new project at Edge Impulse Studio:
Once the project is created, go to the Data Acquisition section and select the Upload Existing Data tool. Choose the files to be uploaded; for example, I started by uploading the samples recorded with the XIAO BLE Sense:
The samples will now appear in the Data acquisition section:
Click on the three dots after the sample name and select Split sample.
Once inside the tool, split the data into 1-second records (try to avoid the start and end portions):
This procedure should be repeated for all samples. After that, upload the other class samples (IESTI and SILENCE) captured with the XIAO and with your PC or smartphone.
Note: For longer audio files (minutes), first split them into 10-second segments and then use the tool again to get the final 1-second splits.
In the end, my dataset has around 70 1-second samples for each class:
Now, you should split the dataset into Train/Test sets. You can do it manually (using the three dots menu and moving samples individually), or you can use the option Perform Train / Test Split on the Dashboard (Danger Zone).
We can optionally inspect the whole dataset using the Data Explorer tab. The data points of the different classes appear well separated, which suggests that the classification model should work:
An impulse takes raw data, uses signal processing to extract features, and then uses a learning block to classify new data.
First, we will take the data points with a 1-second window, augmenting the data by sliding that window every 500 ms. Note that the option Zero-pad data is set. This is important to fill with zeros any sample smaller than 1 second (in some cases, I reduced the 1000 ms window in the split tool to avoid noise and spikes).
Each 1-second audio sample should be pre-processed and converted into an image. For that, we will use MFCC, which extracts features from audio signals using Mel-Frequency Cepstral Coefficients, great for the human voice.
For classification, we will select Keras, which means we will build our model from scratch (image classification using a Convolutional Neural Network).
Pre-Processing (MFCC)
The next step is to create the images that will be used for training in the next phase:
We will keep the default parameter values. Pre-processing the data does not take much memory (only 17 KB), but the processing time is relatively high (177 ms for a Cortex-M4 CPU such as our XIAO's). Save the parameters and generate the features:
If you want to go further into how to convert time-series data into images using FFT, Spectrogram, etc., you can play with this CoLab notebook: IESTI01_Audio_Raw_Data_Analisys.ipynb
Model Design and Training
The model that we will use is a Convolutional Neural Network (CNN). We will use two blocks of Conv1D + MaxPooling (with 8 and 16 neurons, respectively) and a 0.25 dropout. On the last layer, after flattening, there are three neurons, one for each class:
As hyperparameters, we will use a learning rate of 0.005, and the model will be trained for 100 epochs. The result seems OK:
If you want to understand what is going on "under the hood", you can download the Keras model as a Jupyter notebook (use the three dots menu) and play with the code by yourself. This CoLab notebook gives an idea of how you can go further: KWS Classifier Project - Looking “Under the hood”
Testing the model with the data set apart before training (Test Data), we got an accuracy of 75%. Given the small amount of data used, this is OK, but I strongly suggest increasing the number of samples.
After collecting more data, the test accuracy moved up by around five points, from 75% to approximately 81%:
Now we can proceed with the project, but before deploying it to our device, it is possible to perform Live Classification using a smartphone, confirming that the model works with live, real data:
The Studio will package all the needed libraries, pre-processing functions, and trained models and download them to your computer. You should select the Arduino Library option and, at the bottom, select Quantized (Int8) and Build.
A Zip file will be created and downloaded to your computer:
In your Arduino IDE, go to the Sketch tab, select the option Add .ZIP Library, and choose the .zip file downloaded from the Studio:
Now it is time for a real test. We will make inferences wholly disconnected from the Studio. Let's change one of the code examples created when you deploy the Arduino Library.
In your Arduino IDE, go to the File/Examples tab, look for your project, and among its examples select nano_ble33_sense_microphone_continuous:
Even though the XIAO is not the same board as the Arduino Nano 33 BLE Sense, both share the same MCU (nRF52840) and a PDM microphone, so the code works as it is. Upload the sketch to the XIAO and open the Serial Monitor. Start speaking one keyword or the other and confirm that the model is working correctly:
Now that we know that the model is working by detecting our two keywords, let's modify the code so we can see the result with the XIAO BLE Sense completely offline (disconnected from the PC and powered by a battery).
The idea is that whenever the keyword UNIFEI is detected, the LED Red will be ON; if it is IESTI, LED Green will be ON, and if it is SILENCE (No Keyword), both LEDs will be OFF.
If you have the XIAO BLE Sense installed on the Expansion Board, we can display the class label and its probability. Otherwise, use only the LEDs.
Let's go by Parts:
Installing and Testing the OLED Display (SSD1306)
In your Arduino IDE, install the u8g2 library and run the code below for testing:
#include <Arduino.h>
#include <U8x8lib.h>
#include <Wire.h>

U8X8_SSD1306_128X64_NONAME_HW_I2C u8x8(PIN_WIRE_SCL, PIN_WIRE_SDA, U8X8_PIN_NONE);

void setup(void) {
  u8x8.begin();
  u8x8.setFlipMode(0);  // set a number from 1 to 3 to rotate the screen content by 180 degrees
}

void loop(void) {
  u8x8.setFont(u8x8_font_chroma48medium8_r);
  u8x8.setCursor(0, 0);
  u8x8.print("Hello World!");
}
And you should see "Hello World!" displayed on the OLED:
Now, let's create some functions that, depending on the values of pred_index and pred_value, will trigger the proper LED and display the class and its probability. The code below simulates some inference results and presents them on the display and LEDs:
/* Includes ---------------------------------------------------------------- */
#include <Arduino.h>
#include <U8x8lib.h>
#include <Wire.h>

#define NUMBER_CLASSES 3

/** OLED */
U8X8_SSD1306_128X64_NONAME_HW_I2C oled(PIN_WIRE_SCL, PIN_WIRE_SDA, U8X8_PIN_NONE);

int pred_index = 0;
float pred_value = 0;
String lbl = " ";

void setup() {
  pinMode(LEDR, OUTPUT);
  pinMode(LEDG, OUTPUT);
  pinMode(LEDB, OUTPUT);

  // The RGB LEDs are active-low: HIGH turns them off
  digitalWrite(LEDR, HIGH);
  digitalWrite(LEDG, HIGH);
  digitalWrite(LEDB, HIGH);

  oled.begin();
  oled.setFlipMode(2);
  oled.setFont(u8x8_font_chroma48medium8_r);
  oled.setCursor(0, 0);
  oled.print(" XIAO Sense KWS");
}

/**
 * @brief turn_off_leds function - turn-off all RGB LEDs
 */
void turn_off_leds() {
  digitalWrite(LEDR, HIGH);
  digitalWrite(LEDG, HIGH);
  digitalWrite(LEDB, HIGH);
}

/**
 * @brief Show Inference Results on OLED Display
 */
void display_oled(int pred_index, float pred_value) {
  switch (pred_index) {
    case 0:  // IESTI ==> green LED on
      turn_off_leds();
      digitalWrite(LEDG, LOW);
      lbl = "IESTI ";
      break;
    case 1:  // SILENCE ==> all LEDs off
      turn_off_leds();
      lbl = "SILENCE";
      break;
    case 2:  // UNIFEI ==> red LED on
      turn_off_leds();
      digitalWrite(LEDR, LOW);
      lbl = "UNIFEI ";
      break;
  }
  oled.setCursor(0, 2);
  oled.print(" ");
  oled.setCursor(2, 4);
  oled.print("Label:");
  oled.print(lbl);
  oled.setCursor(2, 6);
  oled.print("Prob.:");
  oled.print(pred_value);
}

void loop() {
  // Simulate inference results, cycling through the three classes
  for (int i = 0; i < NUMBER_CLASSES; i++) {
    pred_index = i;
    pred_value = 0.8;
    display_oled(pred_index, pred_value);
    delay(2000);
  }
}
Running the above code, you should get the below result:
Now, you should merge the above code (initialization and functions) with the nano_ble33_sense_microphone_continuous.ino sketch that you used initially to test your model. Also, you should include the code below in loop(), between the lines:
ei_printf(": \n");
...
#if EI_CLASSIFIER_HAS_ANOMALY == 1
This snippet replaces the original loop that prints the inference results to the Serial Monitor:
int pred_index = 0;    // Initialize pred_index
float pred_value = 0;  // Initialize pred_value

// Print each class probability and keep track of the highest one
for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
  ei_printf(" %s: %.5f\n", result.classification[ix].label, result.classification[ix].value);
  if (result.classification[ix].value > pred_value) {
    pred_index = ix;
    pred_value = result.classification[ix].value;
  }
}

// Show the winning class and its probability on the OLED and LEDs
display_oled(pred_index, pred_value);
Here you can see how the final project turned out:
The complete code can be found on my GitHub (3_KWS): Seeed-XIAO-BLE-Sense
Conclusion
The Seeed XIAO BLE Sense is really a "giant" tiny device! It is powerful, reliable, inexpensive, low power, and has suitable sensors for the most common embedded machine learning applications, such as movement and sound. Even though Edge Impulse does not officially support the XIAO BLE Sense (yet!), we also realized that it can use the Studio for training and deployment.
In my GitHub repository, you will find the latest version of the code in the 3_KWS folder: Seeed-XIAO-BLE-Sense
Before we finish, consider that Sound Classification is much more than just voice. For example, you can develop TinyML projects around sound in several areas, such as:
- Security (Broken Glass detection)
- Industry (Anomaly Detection)
- Medical (snoring, coughing, pulmonary diseases)
- Nature (Beehive control, insect sound)
Knowing more
If you want to learn more about Embedded Machine Learning (TinyML), please see these references:
- "TinyML - Machine Learning for Embedding Devices" - UNIFEI
- "Professional Certificate in Tiny Machine Learning (TinyML)" – edX/Harvard
- "Introduction to Embedded Machine Learning" - Coursera/Edge Impulse
- "Computer Vision with Embedded Machine Learning" - Coursera/Edge Impulse
- "Deep Learning with Python" by François Chollet
- “TinyML” by Pete Warden, Daniel Situnayake
- "TinyML Cookbook" by Gian Marco Iodice
Also, you can take a look at the TinyML4D website. TinyML4D is an initiative to make TinyML education available to everyone globally.
That's all, folks!
As always, I hope this project can help others find their way in the exciting world of AI!
link: MJRoBot.org
Greetings from the south of the world!
See you at my next project!
Thank you
Marcelo