I have always been fascinated by how Google Assistant begins listening the moment we say "Okay Google," or how Apple devices respond to "Hey Siri." This inspired me to explore how wake word detection can be incorporated into embedded devices. Working on this project also made me aware of the optimizations required to run such a model on SBCs like the Raspberry Pi. I have documented the entire process of building my wake word recognition model, which I then used to capture images and to show the time on an OLED display whenever the wake word is detected.
1. Hardware Setup: Setting up the INMP441 Microphone with the Raspberry Pi
I used an INMP441 MEMS microphone for audio acquisition and inference on the Raspberry Pi 5. The wiring is fairly simple and uses the Pi's standard I2S (PCM) pins. Below are the wiring details:
1. VDD(INMP441) --> 3.3V(rpi)
2. GND(INMP441) --> GND(rpi)
3. SCK(INMP441) --> PCM_CLK(rpi)
4. WS(INMP441) --> PCM_FS(rpi)
5. SD(INMP441) --> PCM_DIN(rpi)
6. (optional) L/R(INMP441) --> GND for left channel and VDD for right channel.
Although I initially used a pair of INMP441 microphones for stereo audio, I noticed a slight amount of noise in the left channel that did not clear up even after swapping the jumper cables and adjusting the software configuration. Therefore, I decided to proceed with only the right channel for audio acquisition and inference.
Since I am building a wake word detection system, it is crucial that my model recognizes only my specified wake word and ignores all other words or background noise—essentially functioning as a "one vs. many" classifier. To achieve this, I have recorded several samples of my wake word, as well as samples of other random words.
The wake word chosen for my project is "raspi," which is apt since I am using a Raspberry Pi; the word also has variations in amplitude that help the classifier identify it uniquely. Audio acquisition can be performed using the INMP441 connected to the Pi or through a PC's built-in microphone.
Below is a Python script that facilitates the audio acquisition process by capturing 2-second audio samples and storing them in the respective directories:
import sounddevice as sd
from scipy.io.wavfile import write
import os

def record_audio(duration=2, fs=16000, file_prefix="raspi"):
    """
    Records audio for a given duration and sample rate, saving it with an iterative filename.
    """
    # Ensure the 'ww' directory exists
    if not os.path.exists('ww'):
        os.makedirs('ww')

    # Get the next file number
    files = os.listdir('ww')
    count = len([f for f in files if f.startswith(file_prefix) and f.endswith('.wav')])

    # Record the audio
    print(f"Recording {file_prefix}_{count+1}.wav")
    input("Press Enter to start recording")
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
    sd.wait()  # Wait until recording is finished

    # Save the recording
    write(f'ww/{file_prefix}_{count+1}.wav', fs, recording)
    print(f"Saved {file_prefix}_{count+1}.wav")

# Example usage: Record 50 samples
for _ in range(50):
    record_audio()
For the script above, I have chosen a sample rate of 16 kHz as an optimization for running on the Raspberry Pi. The standard sampling rate of 44,100 Hz also works fine when executing on a laptop or PC, but it tends to be less responsive on the Pi.
I have collected approximately 100 samples of my wake word 'raspi' and about 200 samples of other random words, including 45 samples of empty background audio. Additionally, I recorded some samples of continuous speech, such as myself reading a book, to ensure the model does not falsely classify the wake word if it appears in running speech.
The wake word samples and non-wake word samples are stored in two separate folders to facilitate easy bifurcation during the preprocessing stage.
Optional (for the Raspberry Pi): The arecord command can also be used to take test samples directly on the Pi. Below is an example command that records a single 2-second sample.
$ arecord -D plughw:2 -c1 -r 16000 -f S32_LE -t wav -d 2 -V stereo -v file_name.wav
Explanation:
- -D plughw:2: Specifies the hardware device with card number 2. Use 'arecord -l' to check the card number of your device.
- -c1: Sets the number of channels to 1, which is easier to process for the wake word model.
- -r 16000: Sets the sample rate to 16,000 Hz.
- -f S32_LE: Sets the format to 32-bit Little Endian.
- -t wav: Specifies the file type to be WAV.
- -d 2: Records the audio for 2 seconds.
- -V stereo: Enables stereo VU-meter during recording.
- -v: Enables verbose mode, providing detailed information about the recording process.
- file_name.wav: The name of the file to save the recording to.
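To sanity-check a clip recorded this way, a short script like the one below can read the file back and report its basic properties. This is my own addition rather than part of the original workflow, and the filename is just a placeholder.

# Quick sanity check for a recorded clip (my own addition, not part of the
# original workflow). Reads the WAV back and prints its basic properties.
from scipy.io import wavfile

sr, data = wavfile.read("file_name.wav")            # placeholder filename
duration = data.shape[0] / sr                       # length in seconds
channels = 1 if data.ndim == 1 else data.shape[1]   # mono vs. stereo
print(f"rate: {sr} Hz, duration: {duration:.2f} s, channels: {channels}, dtype: {data.dtype}")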
Among the 300 total samples acquired, an 80/20 split is performed to create the training and test datasets. A further 20% of the training data is held out as a validation set so I can monitor the validation loss and avoid overfitting or underfitting the model.
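The splitting code itself is not shown in this write-up, so here is a minimal sketch of how the 80/20 split plus the validation hold-out could be done, assuming scikit-learn and placeholder feature/label arrays:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the real dataset: one 13-dimensional MFCC
# vector per clip, label 1 for the wake word and 0 for everything else.
X = np.random.randn(300, 13).astype(np.float32)
y = np.concatenate([np.ones(100), np.zeros(200)]).astype(np.float32)

# 80/20 split into training and test sets, stratified so both classes appear in each
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# 20% of the training data held out as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0, stratify=y_train)

print(len(X_train), len(X_val), len(X_test))  # e.g. 192, 48, 60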
4. Preprocessing
For preprocessing and extracting MFCCs (Mel Frequency Cepstral Coefficients) from the .wav audio files, I have chosen to use the Librosa Python library.
Extracting 13 features has proven to be quite optimal for the wake word detection model. I also experimented with the FFT window size and hop length to achieve optimum performance, as these parameters must stay consistent during live inferencing. Setting n_fft to 1024 and hop_length to 512 allowed for a smooth transition and a quick response from the Raspberry Pi. Initially, n_fft was set to 2048, which worked well for inferencing on my computer but was less effective on the Pi.
Below is a code snippet that extracts the 13 features from a single sample input:
import numpy as np
import librosa

def extract_mfcc(file_path, n_mfcc=13, n_fft=1024, hop_length=512, sr=16000):
    # Load the audio file at the specified sampling rate
    y, sr = librosa.load(file_path, sr=sr)
    # Extract MFCCs
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
    # Mean normalization across time
    mfccs = np.mean(mfccs, axis=1)
    return mfccs
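Building on the extract_mfcc() function above, the snippet below sketches how the full feature matrix and label vector could be assembled. Only the 'ww' folder comes from the recording script; 'nww' is my own placeholder name for the folder holding the non-wake-word samples.

import os
import numpy as np

# Assemble the dataset by running extract_mfcc() over both folders.
# 'ww' matches the recording script above; 'nww' is a placeholder name
# for the folder holding the non-wake-word samples.
def build_dataset(ww_dir="ww", nww_dir="nww"):
    features, labels = [], []
    for folder, label in [(ww_dir, 1.0), (nww_dir, 0.0)]:
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(".wav"):
                features.append(extract_mfcc(os.path.join(folder, fname)))
                labels.append(label)
    return np.array(features, dtype=np.float32), np.array(labels, dtype=np.float32)

X, y = build_dataset()
print(X.shape, y.shape)  # roughly (300, 13) and (300,)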
The MFCC spectrogram shown below plots the cepstral coefficients against time for a wake word sample and a non-wake word sample. A noticeable change can be seen in both sets of features after the 1-second mark, and a large variation in coefficients appears in the wake word sample between 0.5 and 1 second, which is most likely the interval in which I spoke the wake word.
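The plotting code is not part of this write-up, but an MFCC-versus-time view like the one described can be produced with librosa's display helpers; the file paths below are placeholders.

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Side-by-side MFCC plots for one wake-word and one non-wake-word sample.
# File paths are placeholders; 'nww' is my assumed non-wake-word folder.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
samples = [("ww/raspi_1.wav", "wake word"), ("nww/other_1.wav", "non-wake word")]
for ax, (path, title) in zip(axes, samples):
    y, sr = librosa.load(path, sr=16000)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=512)
    librosa.display.specshow(mfccs, sr=sr, hop_length=512, x_axis="time", ax=ax)
    ax.set(title=title, ylabel="MFCC index")
plt.tight_layout()
plt.show()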
Since the input data is a time series, detecting the word 'raspi' involves understanding the sequence in which its phonemes occur. This makes it logical to use a Recurrent Neural Network (RNN) for handling this type of sequential data.
Given that 13 MFCCs were extracted during the preprocessing stage, the training phase involves the model learning the sequence of these features.
Below is the RNN model used for the wake word detector. It is a fairly simple model configured with a hidden size of 64 and a single RNN layer, which is suitable for our requirement.
import torch
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(RNNModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Set the initial hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        # Forward propagate the RNN
        out, _ = self.rnn(x, h0)
        # Pass the output of the last time step through the classifier
        out = self.fc(out[:, -1, :]).squeeze()
        return out
I also came across Long Short-Term Memory (LSTM) models, but for the sake of simplicity I decided to proceed with the plain RNN.
The model is trained for 100 epochs with a learning rate of 0.01. Below are the initial and final loss results. Increasing the number of epochs further reduces the training loss, but the model then appears to overfit: the wake word gets detected even when other random words are spoken.
Epoch [1/100], Train Loss: 0.6779, Validation Loss: 0.6612
Epoch [2/100], Train Loss: 0.6429, Validation Loss: 0.6435
Epoch [3/100], Train Loss: 0.6323, Validation Loss: 0.6438
Epoch [4/100], Train Loss: 0.6190, Validation Loss: 0.6391
Epoch [5/100], Train Loss: 0.6171, Validation Loss: 0.6391
Epoch [6/100], Train Loss: 0.6141, Validation Loss: 0.6406
Epoch [7/100], Train Loss: 0.6095, Validation Loss: 0.6376
Epoch [8/100], Train Loss: 0.6027, Validation Loss: 0.6343
Epoch [9/100], Train Loss: 0.6089, Validation Loss: 0.6330
Epoch [10/100], Train Loss: 0.6023, Validation Loss: 0.6294
..........
..........
Epoch [90/100], Train Loss: 0.2749, Validation Loss: 0.3497
Epoch [91/100], Train Loss: 0.2772, Validation Loss: 0.3436
Epoch [92/100], Train Loss: 0.2742, Validation Loss: 0.3314
Epoch [93/100], Train Loss: 0.2642, Validation Loss: 0.3419
Epoch [94/100], Train Loss: 0.2678, Validation Loss: 0.3225
Epoch [95/100], Train Loss: 0.2603, Validation Loss: 0.3312
Epoch [96/100], Train Loss: 0.2569, Validation Loss: 0.3228
Epoch [97/100], Train Loss: 0.2555, Validation Loss: 0.3223
Epoch [98/100], Train Loss: 0.2536, Validation Loss: 0.3280
Epoch [99/100], Train Loss: 0.2540, Validation Loss: 0.3197
Epoch [100/100], Train Loss: 0.2428, Validation Loss: 0.3112
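The training code itself is not shown here, so below is a minimal sketch that is consistent with the setup described above (a single RNN layer, hidden size 64, 100 epochs, learning rate 0.01). The choice of Adam and BCEWithLogitsLoss is my assumption, based on the sigmoid being applied to the raw model output during inference.

import torch
import torch.nn as nn

# Minimal full-batch training loop sketch (the original may use mini-batches).
# X_train/X_val are float tensors of shape (N, 1, 13): the 13 mean MFCCs
# treated as a single time step; y_train/y_val are float tensors of shape (N,).
model = RNNModel(input_size=13, hidden_size=64, num_layers=1, num_classes=1)
criterion = nn.BCEWithLogitsLoss()                      # assumed loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val)
    print(f"Epoch [{epoch+1}/100], Train Loss: {loss.item():.4f}, "
          f"Validation Loss: {val_loss.item():.4f}")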
6. Model Testing
After training the model for 100 epochs, I was able to achieve an overall accuracy of 91.6% on the test set. The test accuracy can be improved by increasing the number of epochs, but that overfits the model, which makes detection during live inferencing problematic. After tweaking the number of epochs and the learning rate, this was the best result I obtained.
Here is how the live inferencing is performed:
i. An audio frame is acquired at each instant.
ii. The frame is appended to a buffer along with the frames captured during previous timesteps.
iii. Once the buffer holds 2 seconds x 16 kHz = 32,000 samples, its MFCCs are computed and averaged.
iv. The resulting features are fed to the pretrained model and the predicted probability is computed using the sigmoid function.
v. The process repeats from step (i), with the buffer continually updated and processed like a sliding window.
The snippet below processes the audio frames contained in intdata:
def audio_callback(intdata, frames, time, status):
    global audio_buffer
    if status:
        print("Error: ", status)
    # Slide the window: drop the oldest samples and append the new frame
    audio_buffer = np.roll(audio_buffer, -len(intdata))
    audio_buffer[-len(intdata):] = intdata[:, 0]
    if len(audio_buffer) >= buffer_size:
        # Compute and average the MFCCs over the 2-second window
        mfccs = librosa.feature.mfcc(y=audio_buffer, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
        mfccs = np.mean(mfccs, axis=1)
        feature = torch.tensor([mfccs]).float().unsqueeze(0)
        with torch.no_grad():
            output = model(feature)
            predicted_prob = torch.sigmoid(output).item()
            print(f"{predicted_prob}")
Here is how the predicted probability shoots up when the wake word is detected.
The wake word prediction usually stays around 20% when there is only background noise, and typically exceeds 70% when the wake word is spoken. The probability climbs beyond 80% when the user speaks close to the microphone or speaks loudly. Occasionally, a single word such as "hello" pushes the prediction to around 45%. Given these observations, I have set the detection threshold to 65%.
8. Application-1: Time display
One application where I have implemented the wake word model is a simple time display: an OLED screen shows the current time for a specified duration when the user speaks the wake word.
8.1. Setting up the OLED display
I used a 128x64 pixel SSD1306 OLED display, which interfaces with the Pi over I2C. The OLED is connected to the standard I2C pins of the Raspberry Pi 5 as described below.
SDA(SSD1306) --> GPIO2(rpi)
SCL(SSD1306) --> GPIO3(rpi)
VCC(SSD1306) --> 3.3V(rpi)
GND(SSD1306) --> GND(rpi)
To scan the I2C bus and confirm the display's address, use the following command:
$ sudo i2cdetect -y 1
To install the necessary modules and set up the SSD1306 with the Raspberry Pi, follow the steps mentioned in this article (link here).
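The display code is not included here, so below is a rough sketch of the time-display step using the luma.oled library (my choice; the linked article may use a different driver). The 0x3C address is the one most SSD1306 boards report in the i2cdetect scan above.

import time
from datetime import datetime
from luma.core.interface.serial import i2c
from luma.core.render import canvas
from luma.oled.device import ssd1306

# Sketch of the time display using luma.oled (my assumption; the linked
# article may use a different library). 0x3C is the usual SSD1306 address.
serial = i2c(port=1, address=0x3C)
device = ssd1306(serial, width=128, height=64)

def show_time(duration=5):
    """Show the current time on the OLED for `duration` seconds, then clear."""
    end = time.time() + duration
    while time.time() < end:
        with canvas(device) as draw:
            draw.text((10, 25), datetime.now().strftime("%H:%M:%S"), fill="white")
        time.sleep(0.5)
    device.clear()

# In the live-inference loop: call show_time() whenever predicted_prob > 0.65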
8.2. Demo
9. Application-2: Camera capture on wake word
Inspired by how a GoPro starts and stops recording with the wake word "GoPro," I wanted to experiment with capturing a picture when the wake word "raspi" is detected. For this purpose, the additional hardware required was a camera module; specifically, I used the Raspberry Pi Camera Module 3.
Connecting the Camera Module 3 to the Raspberry Pi is fairly straightforward; it involves attaching the ribbon cables from the camera module to the CAM/DISP slot on the Raspberry Pi. I followed several online tutorials to get it operational.
To perform a test to check the camera, use the command below:
$ rpicam-hello
In case you are using RDP to access the Pi, the command will most likely throw an error (as discussed on the forum here). In that case, use this command instead:
$ rpicam-hello --qt-preview
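Once the preview works, the capture itself can be done from Python with the Picamera2 library. The article's own capture code is not shown here, so the snippet below is only a sketch, with a file-naming scheme of my own.

from datetime import datetime
from picamera2 import Picamera2

# Sketch of capturing a still when the wake word fires (Picamera2 library).
# The timestamped file-naming scheme is my own choice.
picam2 = Picamera2()
picam2.configure(picam2.create_still_configuration())
picam2.start()

def capture_image():
    """Save a still image with a timestamped filename."""
    filename = datetime.now().strftime("capture_%Y%m%d_%H%M%S.jpg")
    picam2.capture_file(filename)
    print(f"Saved {filename}")

# In the live-inference loop: call capture_image() whenever predicted_prob > 0.65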
9.2. Demo
In the live inferencing demo, the camera module faces the wall while I make different hand gestures during image capture. The camera captures an image only when the wake word 'raspi' is spoken, while other words such as 'hello' and 'porcupine' are ignored.
References
1. Tutorials on building wake word detection models:
The AI Hacker YouTube
Pritish Mishra YouTube Channel
2. Setting up INMP441 Microphone to Raspberry Pi 5: Here
3. Configuring the OLED display for the Raspberry Pi: Here
4. Raspberry Pi Camera module troubleshoot: Here