For many years now, devices capable of running machine learning models have gotten smaller and smaller, and the Arduino Nicla Voice and its Vision and Sense siblings might be the tiniest boards yet. Pictures don't really do it justice; it is tiny.
But even with its diminutive size, it's a VERY capable development board. The board features a Nordic Semiconductor nRF52832 Arm Cortex-M4 MCU with 64 KB of SRAM and 512 KB of flash, Bluetooth LE connectivity, 10 digital I/O pins, a 6-axis IMU, a 3-axis magnetometer, a microphone, and a Syntiant NDP120 Neural Decision Processor. The NDP120 gives the Nicla Voice the ability to perform ultra-low-power audio detection and recognition for speech, words, phrases, noises, sounds, and other audio events.
In this tutorial, we'll use the Nicla to perform keyword spotting: recognizing audible events, particularly your voice, through audio classification. This is the same functionality behind "Hey Siri" or "OK, Google". It's essentially the ability to recognize keywords, even in the presence of background noise or background chatter. Once a keyword is detected, your application can take an action. This could be a physical interaction, such as opening or closing a door or turning an appliance on or off, or it could be an input to another application, such as sending a message, setting a reminder, completing a purchase, or calling some other web-based service.
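To make that last idea concrete, here is a minimal sketch of the "keyword detected, take an action" pattern. Everything in it is illustrative: the relay pin, the confidence threshold, and the onKeyword() callback are placeholders rather than part of any particular library, but this is roughly the shape your application code will take once a model is running on the board.

```cpp
// Illustrative only: the pin, threshold, and callback name are hypothetical,
// not part of the Edge Impulse or Arduino Nicla APIs.

const int DOOR_RELAY_PIN = 2;             // hypothetical GPIO driving a relay
const float CONFIDENCE_THRESHOLD = 0.8f;  // only act on confident detections

void setup() {
  Serial.begin(115200);
  pinMode(DOOR_RELAY_PIN, OUTPUT);
  digitalWrite(DOOR_RELAY_PIN, LOW);
}

// Imagine the inference library calling this with each result's label and score.
void onKeyword(const char *label, float confidence) {
  if (confidence < CONFIDENCE_THRESHOLD) {
    return;  // ignore uncertain predictions and background noise
  }
  if (strcmp(label, "go") == 0) {
    digitalWrite(DOOR_RELAY_PIN, HIGH);  // e.g. open a door or switch an appliance on
    Serial.println("'go' detected -> action triggered");
  } else if (strcmp(label, "stop") == 0) {
    digitalWrite(DOOR_RELAY_PIN, LOW);   // reverse the action
    Serial.println("'stop' detected -> action cleared");
  }
}

void loop() {
  // In a real application the inference library would invoke onKeyword();
  // nothing else to do for this illustration.
}
```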
Hardware
The only hardware used in this project is the Arduino Nicla Voice, as we'll focus on the machine learning model portion, but you'll probably want to go a step further and trigger an action via the GPIO pins or over the BLE connectivity on the Nicla, so keep that in mind.
Software
Edge Impulse Studio will be used for building the audio classification model and creating firmware to test the ML inference capability.
Collect a Dataset
As with other Edge Impulse projects, and most machine learning projects in general, we'll begin by collecting data to use in the creation of our model. The data samples we collect are used to train our model, which should then be able to predict future, unseen data that matches our desired state, sound, or image. Many Hackster tutorials cover the process of collecting data directly from a sensor, microphone, or camera, but there are several other ways to bring a dataset into Edge Impulse as well: uploading data files from a local folder, connecting to an AWS S3 bucket or Google Cloud location that holds data samples, or simply cloning an existing Studio project that already includes a dataset.
We'll clone an existing project in this tutorial, as that approach is less often documented and is a great way to get started quickly.
To begin, log in to your Edge Impulse Studio account, then navigate to this ready-made project: https://studio.edgeimpulse.com/public/42868/latest. This project contains a subset of the TensorFlow Speech Commands dataset already uploaded, in particular the "Stop" and "Go" keywords, along with background noise and other random words. The machine learning blocks are already set up as well, which we'll get to in just a bit.
Then, at the top-right corner, click the "Clone this project" button.
Upon completion, you will have a duplicate of the project in your Studio, ready for use. If you then click on "Data acquisition" on the left, you can see that you have nearly 3,000 recordings of the word "Stop" and another nearly 3,000 recordings of the word "Go". If you click on individual samples, you can play back the audio recording, and you'll notice they vary in length, tone, dialect, speaker, gender, etc. This wide variety of data helps build a robust model. If, for example, we only had my own voice saying the word "Stop" 3,000 times, we run the risk of building a model that **ONLY** recognizes my voice, and no one else's. There is also a "z_openset" class, which is made up of background noise, random words, and other sounds.
To visualize our dataset, click on "Data explorer" above the "Data collected" pie chart. You can see a clustered representation of the audio samples. There are a few outliers, which might warrant a closer look (or a listen) to make sure they are labeled correctly, but overall the data is clustered quite well.
So, by simply cloning an existing project, we got nearly 3 hours of audio to use for building our model, without recording a single sample! (You can, of course, continue to add data by recording your own voice or that of others saying "Go" or "Stop", using the Studio or other methods.)
Build a Model
The next step is to use the data we collected (well, borrowed in this case!) to construct our machine learning algorithm. Here again, thanks to cloning an existing project, much of this work is already done for us. Let's walk through the steps just to outline the process, though.
First, click on "Impulse design" on the left. The model training pipeline will load, and you'll notice that instead of having to choose your Processing and Learning blocks like usual, they will already be selected for you, as they were copied over from the original project. You'll see that the audio is recognized as time-series data, and the window size and window increase are already set. The "Audio (Syntiant)" Processing block is chosen, and the Learning block is a Keras neural network classifier. Finally, the output features are the 3 classes of data in the dataset: "Go", "Stop", and the "Other" category (called z_openset). Click on "Save Impulse" once you are ready to proceed.
On the left, click on "Syntiant" and the feature generation page will load. The parameters should already be set, but double-check that "log-bin (NDP120/200)" is selected in the Feature extractor near the bottom of the page, then click "Save parameters". On the next page, you can click on "Generate features" to build a visual representation of the features extracted from the dataset, and once again you should find nice clustering of the three classes, though some overlap will likely occur.
With those steps complete, it is time to begin the actual training of the machine learning model. Click on "NN Classifier" on the left navigation. Here you will have all of the Neural Network settings available, though the default selections will likely work fine for training. At the bottom of the page, click the "Start training" button.
It will take a few minutes to build the model, but upon completion you will see the accuracy and classification results on a validation set of data.
You can test your model against some live data at this point if you'd like, or against a test set of unseen data that the Studio set aside, but I'll skip those and instead place the model onto the Nicla to see how it performs.
Deploy to the Nicla
To load your new keyword spotting model onto the Nicla Voice, there are two options. The first is to generate and flash a ready-to-run binary firmware onto the board, which is what we will do in this demo, and the second is to generate a Syntiant NDP Arduino library that you can include in your own custom application and code. That method is more flexible, and will be needed to take an "action" when the keywords are identified, such as triggering a BLE communication, toggling a GPIO pin high or low, or otherwise interacting with the surrounding environment. For testing purposes, though, to ensure the model works, the ready-made firmware method is quick and easy.
To begin, click on "Deployment" on the left navigation and type "Nicla" into the search box. Select the Nicla Voice, then click the "Build model" button at the bottom. It will take a few minutes to go through the build process, but once finished, a `.zip` file will be automatically downloaded.
Unzip the file, and inside you will find the firmware as well as flashing utilities for Windows, Mac, and Linux. Attach the Nicla Voice to your computer via USB, and then run the command corresponding to your development machine's OS. I am using an Ubuntu VM, so I ran `./flash_linux.sh`.
Once the utility writes the binary to the board, we can run our model directly on the device by running `edge-impulse-run-impulse` with the USB cable still attached to the Nicla. (You will need to install the Edge Impulse CLI in order to have that command; follow the instructions here if you need to install it.)
With the inference running, simply say "Stop" or "Go" near the Nicla, and you should receive real-time predictions in the CLI.
At this point, we have built our keyword spotting model and proven that it works by running it directly on the Nicla Voice! But saying "Go" or "Stop" and receiving confirmation in the CLI is just the beginning. Next, you'll want to go back to the Edge Impulse Studio, click on "Deployment" once again, and this time select the Syntiant NDP Arduino library. Run the build and wait for the package to be generated, and you will once again receive a `.zip` file that is automatically downloaded. You can then integrate this library into your own applications and code, in order to perform the inference and then take appropriate actions based on your use-case. As mentioned earlier, this could be triggering data communications over BLE to ultimately send an email, text message, or other alert via another system; it could be opening or closing gates, valves, doors, or locks; or maybe you need to turn a relay on (or off) in order to power a device. With the expansion and communication protocols available on the board, nearly anything is possible!
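As a rough sketch of what that integration could look like, the example below forwards each detected keyword over BLE using the ArduinoBLE library. Treat it as a starting point under stated assumptions: the on_keyword_detected() hook is a stand-in for whatever entry point the generated Syntiant library or Nicla Voice firmware actually exposes (check the exported library's examples), the service and characteristic UUIDs are arbitrary, and you'll want to confirm ArduinoBLE support in your board package version.

```cpp
#include <ArduinoBLE.h>

// Hypothetical UUIDs for a simple "keyword" service and characteristic.
BLEService keywordService("180C");
BLEStringCharacteristic keywordChar("2A56", BLERead | BLENotify, 20);

void setup() {
  Serial.begin(115200);

  if (!BLE.begin()) {
    Serial.println("Starting BLE failed");
    while (1);
  }
  BLE.setLocalName("NiclaKeyword");
  BLE.setAdvertisedService(keywordService);
  keywordService.addCharacteristic(keywordChar);
  BLE.addService(keywordService);
  keywordChar.writeValue("none");
  BLE.advertise();
}

// Placeholder for the hook your sketch gets when the classification changes.
void on_keyword_detected(const char *label) {
  keywordChar.writeValue(label);  // notify any connected BLE central
  Serial.print("Detected keyword: ");
  Serial.println(label);
}

void loop() {
  BLE.poll();  // service BLE events; inference itself runs on the NDP120
}
```

From there, a phone app or another BLE central can subscribe to the characteristic and relay the event onward, for example to send a text message, log the detection, or toggle a smart plug.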
Demo Video
If you'd prefer to watch a video version of this tutorial instead, I cover all of these same steps and show off the board a bit, here:
If you build something cool, be sure to let me know!