Imagine you have a product that needs to react to a very specific keyword, and you're planning to sell millions of units worldwide. The problem is you don't speak French, Chinese, or Spanish, and hiring people to provide voice samples would be expensive and time-consuming. This is where generative AI comes in. In this article, I'll demonstrate how to use Edge Impulse to generate voice samples, train a keyword spotting model, and deploy it to the tiny ESP32-S3 based development board from Seeed Studio, the XIAO ESP32S3 (Sense).
Below is the accompanying video for this article. Make sure to watch it as well, since the article and video are meant to complement each other (the video ends with deployment to the Arduino RP2040 Connect, so that part is different and elaborated below).
To get started, we’ll use the Whisper API for text-to-speech. Whisper provides high-quality sound in various voices and supports multiple languages. First, create a new project in Edge Impulse. Once you’re in, head to the Data Acquisition tab and choose Synthetic data. You will need either a Pro Tier subscription or an Enterprise Plan; you can sign up for a free 14-day Enterprise Trial here. Additionally, you will need to enter your OPENAI_API_KEY in the Organization -> API Keys tab.
NB: There is no language selection here or in the Whisper API. Instead, you simply enter the words in the language of your choice. For Whisper to recognize the language correctly, you might need to enter a few extra words and then cut the unnecessary ones out afterwards, e.g. "Hospital (español)" for the Spanish pronunciation of the word hospital.
Next, generate voice samples for the labels of your choice. I picked stop (停), forward (前进), back (撤销), left (左转), and right (右转), commands suitable for a mobile robot platform. Enter each word, leave the other parameters at their default settings, and generate some samples. You’ll find that the generated samples are high quality, probably better than your own pronunciation if you’re not a native speaker.
Since we’re using Edge Impulse’s few-shot keyword spotting feature, we don’t need many samples. Generating about 50 samples for each label should be sufficient; for more robust results, aim for 100 samples per class. After creating all the samples, rebalance the classes if necessary, making sure there is an equal number of samples in each class.
Additionally, and this is very important, you need an "unknown" class that includes other words and background noise. You can either use samples from an existing dataset or generate them using the ElevenLabs Synthetic Sounds generator.
Pick MFE as the DSP block and Transfer Learning (Keyword Spotting) as the learning block.
Choose a smaller model (alpha 0.1) and reduce the validation percentage to 0.2; the other training parameters can be left at their defaults. You should get around 90% accuracy, both on validation and testing.
Before you start with deployment, make sure you have the XIAO ESP32S3 (Sense) set up in the Arduino IDE, following this wiki article. It is advised to use the 2.x version of the Arduino core for ESP32, not 3.x.
For deployment, we can use Edge Impulse's Arduino library deployment option (find it using the search function in the Deployment tab). Download the Arduino library, then open the microphone example sketch in the Arduino IDE (go to Examples -> name of your project in Edge Impulse -> esp32 -> esp32_microphone). We will need to slightly modify the sketch to fit the specifics of the XIAO ESP32S3 (Sense).
You can find the sketch in the article attachments; mainly the pin_config and i2s_port_t were modified. The instructions for the microphone sketch modification were taken from this forum post.
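For reference, the relevant changes look roughly like the following. This is a sketch fragment, not a drop-in replacement: it assumes the ESP32 Arduino core 2.x (legacy driver/i2s.h API), and the pin assignments (PDM clock on GPIO 42, PDM data on GPIO 41) are taken from the Seeed wiki for the XIAO ESP32S3 (Sense) on-board microphone. Check the fields against your generated example sketch before applying.

```cpp
// Hedged sketch of the XIAO ESP32S3 (Sense) microphone modifications,
// assuming ESP32 Arduino core 2.x and the legacy I2S driver API.
#include <driver/i2s.h>

// The stock esp32_microphone example may use a different port number.
static const i2s_port_t i2s_port = I2S_NUM_0;

static i2s_config_t i2s_config = {
    // The on-board mic is a PDM microphone, so RX + PDM mode.
    .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX | I2S_MODE_PDM),
    .sample_rate = 16000,                      // must match the impulse's sampling rate
    .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
    .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
    .communication_format = I2S_COMM_FORMAT_STAND_I2S,
    .intr_alloc_flags = 0,
    .dma_buf_count = 8,
    .dma_buf_len = 512,
    .use_apll = false,
};

static i2s_pin_config_t pin_config = {
    .mck_io_num = I2S_PIN_NO_CHANGE,   // field exists in core 2.x (IDF 4.4)
    .bck_io_num = I2S_PIN_NO_CHANGE,   // PDM has no bit clock
    .ws_io_num = 42,                   // PDM clock pin on the Sense board
    .data_out_num = I2S_PIN_NO_CHANGE, // capture only, no output
    .data_in_num = 41,                 // PDM data pin on the Sense board
};
```

The buffer sizes above are illustrative; the generated example's DMA settings can usually be kept as-is, since only the port and pin configuration need to change.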
Additionally, enable PSRAM in the Arduino IDE Tools menu. If you use the EON Compiler option for deployment, you may also need to increase EI_MAX_OVERFLOW_BUFFER_COUNT in /Arduino/libraries/[name-of-your-project]/src/edge-impulse-sdk/porting; you can find the correct value with a bit of trial and error. This last point is necessary because the Arduino library deployment is not geared specifically toward the ESP32-S3, and the arena size needed to utilize the ESP32 neural network optimizations is slightly larger than the default one.
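The change itself is a one-line edit to the existing definition in the porting headers. The file name and the value below are illustrative only (the exact header varies by SDK version, and the right value has to be found by trial and error, as noted above):

```cpp
// In /Arduino/libraries/[name-of-your-project]/src/edge-impulse-sdk/porting/
// (e.g. in the porting header that already defines this macro), bump the
// existing definition. The value 10 is purely illustrative; increase it
// until the model allocates its arena successfully.
#define EI_MAX_OVERFLOW_BUFFER_COUNT 10
```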
After that, upload the sketch and try it out!
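If you want the board to actually drive a robot rather than just print predictions, you need to turn the per-class scores into a command. The helper below is a simplified, self-contained stand-in: the ClassScore struct mimics the label/value pairs the Edge Impulse classifier returns (in the real sketch you would read them from ei_impulse_result_t after run_classifier), and the threshold value is an assumption to tune on your device.

```cpp
#include <cstring>
#include <cstddef>

// Minimal stand-in for the per-class output of the Edge Impulse classifier
// (label + confidence); in the real sketch these come from ei_impulse_result_t.
struct ClassScore {
    const char *label;
    float value;
};

// Return the label of the highest-scoring class if it clears `threshold`
// and is not the "unknown" class; otherwise return nullptr (no command).
const char *pick_command(const ClassScore *scores, size_t n, float threshold) {
    const ClassScore *best = nullptr;
    for (size_t i = 0; i < n; i++) {
        if (!best || scores[i].value > best->value) {
            best = &scores[i];
        }
    }
    if (!best || best->value < threshold) return nullptr;
    if (std::strcmp(best->label, "unknown") == 0) return nullptr;
    return best->label;
}
```

In loop(), you would call this on the classification results and map "forward", "back", "left", "right", and "stop" to your motor driver; the "unknown" class and low-confidence detections are deliberately ignored so background speech does not move the robot.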