Abstract:
The project's aim was to solve a pattern-recognition task on a raw time-domain signal. We employed the SparkFun RedBoard Artemis ATP module and its integrated MEMS microphone to record and classify environmental sounds. In this project summary, we present a simple pipeline that beginners can follow to train and deploy a simple neural network (NN), as well as advanced methods that can be used to improve model performance. We hope that the experience gathered during this challenge can be utilized in our gunshot detector designed for elephants.
Introduction:
Neural networks used for sound classification usually interpret their inputs as images, obtained by computing the 2D spectrogram of the raw audio recording. However, there are situations where the spectrogram conversion discards relevant information. One example is gunshot detection, where the ballistic shockwave has such a distinctive shape (resembling a capital letter N) that detectors based on the raw signal work more accurately than spectrogram-based solutions. Our idea stems from this scenario: there may be many other events with a characteristic shape in the time domain.
In our case, these uniquely shaped audio signals were generated by sparks. A spark is an abrupt electrical discharge that produces a brief emission of light and a sharp crack or snapping sound. This sound contains very high frequencies and is short in time (around 4 ms). The spark acoustic event can be recorded by the MEMS microphone integrated on the RedBoard Artemis ATP; an example recording is illustrated in Figure 1.
The recorded spark noises do not have exactly the same shape, but all of them contain several spikes of similar length. These similarities must be learned by a NN to perform the detection task.
Goals, experimental setup, and data collection:
In summary, we built a classifier that can detect spark noises. To achieve this goal, we collected spark sounds mixed with different impulsive background noises, employing a loudspeaker, a spark generator, and the RedBoard Artemis as the data collector. The background noises help the detector generalize its knowledge and make the detection task harder.
The background noises used were: car horn, spoken digits, dog barking, Gaussian noise, gunshots, jackhammer, various music, siren, and silence.
The basic pipeline was the following:
- Record sparks with different background noises produced by a speaker
- Record only background noises as negative samples
- Collect these recordings into a dataset with binary labels – 0: no spark; 1: contains a spark
- Train a simple model and deploy it on the Sparkfun Redboard Artemis ATP
- Train various models with advanced methods involved
- Evaluate models
The data collection setup used the RedBoard Artemis as the recording device. An extra device, an Arduino Due, controlled a relay that let through the high current of a DC-DC booster to produce sparks. The whole procedure was synchronized by a PC, which also played back various background noises through a loudspeaker. The setup is illustrated in Figure 2. The RedBoard Artemis ATP recorded the superposition of the background noise and the spark sound. One such combined recording is illustrated in Figure 3, where car_horn noise was played during the measurement. The impulsive region in the middle of the recording, corresponding to the spark sound, is easy to spot. Within the recordings, the locations of the sparks vary to prevent over-fitting to a specific location.
The resulting dataset was split as follows: from each class, 100 samples were added to the training set, 30 samples to the test set, and 20 samples to the validation set. The remaining samples were left out for the possible future work directions mentioned later.
The model is initially fit on the training dataset, a set of examples used to fit the model's parameters.
The fitted model is then used to predict the responses for the observations in a second dataset, the validation dataset, which provides an unbiased evaluation of the fit while the model's hyperparameters are tuned.
Finally, the test dataset is used to provide an unbiased evaluation of the final model fit on the training dataset.
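As a rough illustration, such a split could be implemented as in the following minimal sketch; the directory layout and file naming are our assumptions, not the project's actual structure.
import glob
import random

TRAIN_N, VAL_N, TEST_N = 100, 30, 20   # per-class counts described above

rng = random.Random(42)
train, val, test = [], [], []
for class_dir in sorted(glob.glob('recordings/*')):     # one folder per class (assumed)
    files = sorted(glob.glob(class_dir + '/*.npy'))
    rng.shuffle(files)
    train += files[:TRAIN_N]
    test  += files[TRAIN_N:TRAIN_N + TEST_N]
    val   += files[TRAIN_N + TEST_N:TRAIN_N + TEST_N + VAL_N]
    # the remaining files in each class are left out for future work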
In the following, we will describe a simple pipeline for beginners that will result in a deployable model. Later, more advanced techniques will be presented that help to improve accuracy and robustness.
Simple pipeline:
In this section, we cover the major steps that are required to train and deploy a baseline neural network architecture. The methods and results are limited, but this could serve as a good starting point for further improvements and a stable initial project that can be upgraded.
The major steps, presented in detail below with additional example code, are:
- Data collection on the RedBoard Artemis ATP with the MEMS microphone
- Baseline model training and conversion to a quantized model.
- Model deployment and inference.
[Data collection details] -> [Training on a GPU] -> [Model deployment] -> [Inference]
Let's start with the data collection! In the previous section, we already presented the measurement setup. The interesting code related to the measurements is the audio recording on the RedBoard Artemis ATP through the PDM interface. You will need the AmbiqSuite-R2.3.2 SDK with the SparkFun board BSP files included. You can get the SparkFun extension from here:
https://github.com/sparkfun/SparkFun_Apollo3_AmbiqSuite_BSPs
Just move the boards_sfe folder next to the AmbiqSuite-R2.3.2/boards folder.
We started from the AmbiqSuite-R2.3.2/boards_sfe/common/examples/pdm_fft example code. This code continuously collects audio and computes the Fourier transform of the recorded signal. It also processes this information, then transmits it to the user through the serial port. Our code implements the following:
- PDM interface initialization with a reduced sampling frequency (11718 Hz)
- UART initialization
- Waiting for a command from the user: 'r' represents the recording command
- If 'r' was received, start recording 12000 samples (around 1 second)
- Once finished, send the data to the user at a speed of 1 MB/sec
- Waiting for another command, and so on...
We included the source code with a bunch of comments, so interested readers can go through and understand the details. The interesting part of the code implements the functionality explained above:
/* waiting for a trigger from the PC */
while (am_bsp_com_uart_transfer(&transfer_config) != AM_HAL_STATUS_SUCCESS)
    ;

/* if the received character is an 'r', start a ~1 second long recording */
if (readBuffer[0] == 'r') {
    am_devices_led_on(am_bsp_psLEDs, 0);
    g_bPDMDataReady = false;
    am_hal_pdm_fifo_flush(PDMHandle);

    /* start data collection by utilizing the DMA */
    pdm_data_get();

    /* go to sleep; the wake-up trigger comes from the DMA */
    am_hal_sysctrl_sleep(AM_HAL_SYSCTRL_SLEEP_DEEP);
    while (!g_bPDMDataReady)
        ;

    am_devices_led_off(am_bsp_psLEDs, 0);

    /* send the data through the UART */
    am_bsp_com_uart_transfer(&transfer_config_writebuffer);
}
With this application, we could perform data collection triggered by the PC. We also implemented the PC-side data collector in Python, using the PySerial module to transmit commands and to receive the recorded data as a response.
The Python program also sent commands to an Arduino Due board, which controlled a relay to generate sparks. Before sending the 'recording' and 'spark generation' commands, the PC program started playing a randomly chosen audio file through a loudspeaker. These clips served as the background noises explained earlier.
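For reference, the PC-side logic could look like the following minimal sketch. The port names, baud rates, and the 16-bit sample width are assumptions; the 'r' and 's' commands come from the description above.
import serial                  # PySerial
import numpy as np

N_SAMPLES = 12000              # one ~1 second recording

artemis = serial.Serial('/dev/ttyUSB0', 921600, timeout=5)   # hypothetical port/baud
arduino = serial.Serial('/dev/ttyUSB1', 115200, timeout=5)   # hypothetical port/baud

def record(spark=False):
    # (background-noise playback on the loudspeaker would be started here)
    artemis.write(b'r')                  # trigger a ~1 second recording
    if spark:
        arduino.write(b's')              # ask the Arduino Due to generate a spark
    raw = artemis.read(2 * N_SAMPLES)    # assuming 16-bit samples, 2 bytes each
    return np.frombuffer(raw, dtype=np.int16)

recording = record(spark=True)
np.save('spark_example.npy', recording)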
A video recording of the data collection experiment can be accessed here:
[Data collection details] -> [Training on a GPU] -> [Model deployment] -> [Inference]
Once we had collected enough data, we could implement our first, simple training process, which produced a baseline NN model.
We have added the source code of the training process to this project, but it is also available as a Python notebook here: Training notebook. Google Colaboratory is a great place for beginners to test out their ideas in a controlled environment with freely available GPUs.
This notebook contains the main steps of model training, which include:
- Data loading: positive and negative examples
- Data separation into train, validation and test sets: 100 + 20 + 30 samples
- Neural network model creation: simple convolutional neural network
- Model fitting - training: default training parameters
- Model evaluation: evaluated on the test dataset
- Model conversion to TensorFlow Lite model.
- Model conversion to a byte array, which can be uploaded to the Artemis Board.
The model used in this example consists of a convolutional layer with 2 kernels and a maximum pooling layer that covers the whole feature vector produced by the convolutions. With its simplicity, this model can hardly generalize, but it achieved an accuracy of around 94% on the test dataset, which is in the acceptable range. The architecture is illustrated in Figure 4. (Note: Conv2D is used, because TF Lite Micro only supports this operation; otherwise Conv1D would be the natural choice.) A minimal sketch of the model and its conversion is shown below.
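The sketch assumes a 33-tap kernel, which is our guess (it reproduces the 71-parameter count); the quantized conversion follows the standard TensorFlow Lite path, not necessarily the project's exact settings.
import tensorflow as tf

INPUT_LEN = 12000

# A Conv2D with a (k, 1) kernel acts as a 1D convolution over the waveform.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(INPUT_LEN, 1, 1)),
    tf.keras.layers.Conv2D(2, kernel_size=(33, 1), activation='relu'),
    tf.keras.layers.MaxPool2D(pool_size=(INPUT_LEN - 33 + 1, 1)),  # full-window pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),   # 1: spark, 0: no spark
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()   # should report 71 trainable parameters

# Convert the (trained) model to a quantized TF Lite flatbuffer ...
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open('spark_model.tflite', 'wb') as f:
    f.write(converter.convert())
# ... then turn it into a C byte array, e.g. with: xxd -i spark_model.tflite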
The trained model had pros and cons:
+ small size, only 71 trainable parameters
+ can handle the 12000 samples long input
+ the Artemis board can run it within 1 sec
- very sensitive to noise, cannot generalize
[Data collection details] -> [Training on a GPU] -> [Model deployment] -> [Inference]
The deployment starts from the byte array generated during training in the previous section. To implement an application that continuously records and processes audio signals, we started from the micro_speech example application found in the TensorFlow Lite Micro examples. This example code collects audio and tries to detect and classify the spoken keywords 'yes' and 'no'. We changed the following parts of the code:
- Data acquisition: 12000-sample buffers were filled at a sampling rate of ~12 kHz (11718 Hz)
- The model structure: the model trained earlier was used
- Detection response: UART communication instead of LED blinking
The TensorFlow library does not support the Artemis boards out of the box, but there is a repository in which the porting has already started, and e.g. the micro_speech example can already be compiled and uploaded: The repository.
Note that during the first build, the makefile triggers the download of an outdated version of the Ambiq SDK, which contains errors that were fixed in newer versions. One such error is related to the PDM clock configuration; because of it, the sampling frequency of the audio recording cannot be changed through the corresponding am_hal_ interface. A possible solution is to modify the base makefile to download a newer version of the SDK.
To deploy our model, we only substituted the array found in micro_features/tiny_conv_micro_features_model_data.cc with the byte array generated earlier during model training, so no Makefile manipulation was required.
The data acquisition part is similar to the method introduced earlier. When the DMA finishes collecting a 1 second long chunk, it generates an interrupt and starts recording the next 1 second long period into another buffer. Meanwhile, the already filled buffer is processed by invoking the trained neural network, and the result is forwarded to the PC through UART.
The operation of the detector is presented in Video 2. On the left side, the Arduino terminal is visible, showing that a spark is generated whenever an 's' character is sent. Shortly afterwards, a "Spark detected!" message is expected on the right side, which prints the messages from the Artemis board.
Summary:
In this section, we presented a baseline solution for a detection problem that classifies audio recordings according to whether they contain a spark sound. We included source code for data collection, model training, and model deployment and inference.
Advanced methods:
The simple model trained in the previous section achieved acceptable accuracy on the test dataset. However, during its real-world evaluation we could test its robustness against other loud, impulsive events like claps or knocks. Based on these experiments, we concluded that the model recognized loud, impulsive events in general, rather than spark sounds only. It is understandable that such a simple architecture cannot generalize well enough to detect these complicated patterns in complex background noises.
In the current section, we demonstrate advanced methods that help to find more suitable models with enhanced accuracy and robustness, while keeping memory and computational complexity low. The relevant metrics are:
accuracy: the ratio of correctly classified examples
robustness: the average amplitude of the input perturbation that misleads a classifier
memory complexity: the total amount of memory required to run a model
computational complexity: the total number of floating point operations that must be executed to run a model
The simple model presented earlier was created in an ad-hoc way, based on some experience. Even if an initial architecture is known, the hyper-parameters that provide the best results are unknown. Therefore, we started from the baseline model and implemented a search algorithm capable of finding a superior hyper-parameter set. This approach is called grid search: it collects hyper-parameter values from given intervals into sets and tests each configuration against some metrics (a minimal sketch of the search loop follows the parameter list below). In our case, the following parameters were taken into consideration:
- the number of kernels in the convolutional layer: [3, 5, 8, 13]
- dilation rate of the convolutional kernels: [1, 2, 3]
- size of the convolutional kernels: [15, 36, 57, 93, 150]
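The enumeration of the grid itself is straightforward; in this minimal sketch, model_for() and train_and_evaluate() are hypothetical helpers standing in for the project's model builder and training loop.
import itertools

KERNEL_COUNTS  = [3, 5, 8, 13]
DILATION_RATES = [1, 2, 3]
KERNEL_SIZES   = [15, 36, 57, 93, 150]
NOISE_LEVELS   = [0.00, 0.01, 0.05, 0.1]   # see the noise discussion below

grid = list(itertools.product(KERNEL_COUNTS, DILATION_RATES, KERNEL_SIZES, NOISE_LEVELS))
assert len(grid) == 240                    # 4 * 3 * 5 * 4 configurations

results = []
for n_kernels, dilation, k_size, noise in grid:
    model = model_for(n_kernels, dilation, k_size)    # hypothetical builder
    results.append(train_and_evaluate(model, noise))  # hypothetical helper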
To evaluate a particular hyper-parameter set, we employed the accuracy and robustness metrics. Accuracy is simple: it is the ratio between the number of correctly classified examples and the total number of examples. Robustness is more complicated. Without the full scientific background, it can be summarized as the model's insensitivity to input perturbations, measured by the average amplitude of the perturbations that cause false classifications. The research field that studies this property is called adversarial machine learning. We used a slightly modified version of the DeepFool method to measure this property of our NNs.
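For reference, the original DeepFool paper defines the average robustness of a classifier f over a dataset D as

\[
\hat{\rho}_{adv}(f) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \frac{\lVert \hat{r}(x) \rVert_2}{\lVert x \rVert_2}
\]

where \(\hat{r}(x)\) is the minimal perturbation DeepFool finds to push the input x across the decision boundary; the modified variant used here follows the same idea.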
Besides the hyper-parameter optimization, we also extended the evaluation by adding Gaussian noise with different standard deviations to the inputs. As the noise level increases, the signal-to-noise ratio decreases, which makes the detection problem even harder. The noise parameters were chosen from the set [0.00, 0.01, 0.05, 0.1]. To make these values interpretable, an example recording is shown in Figure 5 with the different noise levels. It can be observed that in the most extreme case the spark shape is completely lost in the noise.
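A minimal sketch of such an augmentation step, assuming the recordings are normalized to the [-1, 1] range:
import numpy as np

def add_gaussian_noise(batch, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to a batch."""
    rng = rng or np.random.default_rng()
    if sigma == 0.0:
        return batch
    return batch + rng.normal(0.0, sigma, size=batch.shape).astype(batch.dtype)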
All combinations of the presented parameter values were enumerated and the corresponding neural networks were generated accordingly. This resulted in 240 generated models. Each network was trained on the same training dataset and evaluated on the validation dataset. The Gaussian noise was generated on-the-fly during the training runs, which were carried out with the following parameters (a minimal sketch follows the list):
- Batch size: 5
- Early Stopping: monitored the training loss with the patience of 10 epochs
- Optimizer: Adam
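This configuration maps directly onto Keras; here, model is one of the generated networks from the earlier sketches, x_train/y_train/x_val/y_val are placeholders for the loaded datasets, and the epoch cap is our assumption.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=10)  # training loss
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=5,
          epochs=200,                      # upper bound; early stopping ends sooner
          validation_data=(x_val, y_val),
          callbacks=[early_stop])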
The result of the grid search is illustrated in Figure 6. Here, the x-axis represents the accuracy, and the y-axis shows the logarithm of the average perturbation size; larger perturbations represent better robustness. Each symbol on the plot has a shape that encodes the noise level, a diameter that expresses the memory complexity, and a color that encodes the computational complexity of a neural network. The noise levels are encoded as follows: star - no noise added; circle - noise level 0.01; square - noise level 0.05; triangle - noise level 0.1.
In Figure 6, several point clusters can be identified. For example, it is observable that a higher noise level reduces accuracy but enhances robustness (triangles in the upper-left corner). Another example is the cluster of squares in the middle, which evolves from left to right and from bottom to top at the same time, meaning that some parameter sets improve accuracy and robustness simultaneously.
In our case, a model with good performance and robustness is required, but as we want to deploy it on a microcontroller, the memory and computational complexities must be taken into consideration too. These parameters are encoded in the color and size of a point. According to the color bar, we need a small-diameter blue point from the right side of the plot that also maximizes the robustness. We selected the model represented by the single outlying blue circle above the cluster of circles, to the right of the top of the square cluster. This model was evaluated on the test dataset. The parameters and the performance of the model are:
Accuracy on the test dataset: 0.99074
Accuracy on the training dataset: 0.99444
Robustness: 0.00136
---------------------------------------------
Dilation rate: 1
Kernel size: 57
Number of kernels: 5
Added noise level: 0.01
---------------------------------------------
Memory complexity: 238 KB
Computational complexity: 3.4 MFLOPs (12 kS input)
This model has a higher computational complexity than our baseline model had, so the inference requires the activation of the Burst Mode of the Apollo 3 MCU. In this state, the core clock frequency is doubled from 48 MHz to 96 MHz.
Another advantage of the proposed NN architecture is that the full-window maximum pooling (equivalent to GlobalMaxPooling, which is not supported in TF Lite Micro) enables the model to accept various input lengths. For example, we found that reducing the input length from 12000 to 3000 samples reduces the memory complexity significantly: from 238 KB to 14 KB. A disadvantage is that if we want to run the detector with overlapping regions to ensure that a spark event is fully contained in one window, we must invoke the inference 7 times instead of the previous 2. However, the MCU is fast enough to handle the computational overhead (a total of 5.6 MFLOPs).
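A minimal sketch of this length-independence, using Conv1D and GlobalMaxPooling1D for brevity (as noted earlier, the deployed model had to express the same structure with Conv2D/MaxPool2D because of TF Lite Micro's operator support):
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 1)),            # input length left unspecified
    tf.keras.layers.Conv1D(5, 57, activation='relu'),  # 5 kernels of size 57, as selected
    tf.keras.layers.GlobalMaxPooling1D(),              # output is length-independent
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# The same weights accept both window sizes:
model(np.zeros((1, 12000, 1), dtype=np.float32))
model(np.zeros((1, 3000, 1), dtype=np.float32))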
As we applied adversarial attacks to measure the robustness of the NN structures, it is straightforward to visualize some of these adversarial examples. One such example is shown in Figure 7. Here, the goal was to generate a recording that lies on the edge of the decision surface of an already trained neural network. This example was generated from an originally negative sample (absolute silence), but in its current form it fools the network into producing a positive label.
These methods are complicated, and we think that publishing the source code would not contribute to the general applicability of the mentioned directions; therefore, we only share these files via e-mail upon request.
Project summary:
We implemented a spark sound detector based on a neural network that can be deployed on the SparkFun RedBoard Artemis ATP. The data was collected using the same device and its integrated MEMS microphone. The data acquisition employed spark generation with different background noises.
A simple pipeline for beginners was explained and a baseline neural network model was deployed. We shared the source code for all the major steps required to solve a similar problem.
Additionally, more advanced methods and ideas were included that enable the enhancement of model performance and robustness.
In the future, we plan to integrate the Artemis board into our animal-borne gunshot detector, which is under active development. The advanced results presented in this report may provide the basis for research in these directions.