I did this project in order to explore i2s audio recording and real-time digital audio signal processing using an ESP32 pico. Using the M5StickC for this project allowed me to acquire digital audio via the built-in microphone and to attach an led strip for visualization via the Grove connector. Thanks to its internal battery, the device is also portable.
Key Features
- Digital audio capture using the built-in microphone at a sample rate of 44.1 kHz
- Real-time analysis of the audio signal based on an FFT with a sample size of 2048 (ArduinoFFT)
- Frequency-based audio visualization using an RGB led strip (FastLED)
- Configurable frequency bands between 21.5 Hz and 20 kHz at a resolution of 21.5 Hz.
- Visualization based on beat detection
- Auto-levelling to adapt to the recorded audio volume
- Detection of timing problems and frame loss for debugging
- Portable battery-powered device
As in my previous projects, I used Visual Studio Code with the PlatformIO IDE. In the platformio.ini file of the project, dependencies on the M5StickC library, the ArduinoFFT library [1], and the FastLED library [6] are declared.
Furthermore, I used the build_flags option to enable debug-level log messages, and the monitor_filters option to enable file logging and exception stack trace decoding:
[env:M5StickC_AudioVisLed_Debug]
platform = espressif32
board = m5stick-c
framework = arduino
lib_deps =
M5StickC
FastLED
https://github.com/kosme/arduinoFFT.git#develop
upload_speed = 1500000
monitor_speed = 115200
build_type = debug
build_flags = -D CORE_DEBUG_LEVEL=4 ; 'Debug'
monitor_filters = log2file, esp32_exception_decoder, default
Hints Concerning the Use of ArduinoFFT on ESP32
I use the develop branch of the ArduinoFFT library, which allows setting the type of the FFT input data to float instead of double. This significantly reduces the computation time of the FFT on the ESP32 pico: the ESP32 has a floating point unit that performs arithmetic operations on float data in hardware, whereas arithmetic operations on double data have to be performed in software and take 20 to 60 times longer (see the discussion and performance tests in the ESP32 forum).
For building the project in Visual Studio Code, the initialization order of some variables in arduinoFFT.h needs to be corrected as follows (see https://github.com/kosme/arduinoFFT/pull/59):
/* Variables */
T *_vReal = nullptr;
T *_vImag = nullptr;
uint_fast16_t _samples = 0;
#ifdef FFT_SPEED_OVER_PRECISION
T _oneOverSamples = 0.0;
#endif
T _samplingFrequency = 0;
Without this correction, the build process terminates with errors.
Source Code
The commented source code is available in the GitHub repository of this project.
Overview
The following figure shows an overview of the audio visualization processing.
- At startup, the application configures the recording and transfer of microphone audio samples using the i2s driver of the operating system (see [4], [5]).
- From then on, the operating system continuously transfers audio samples into a chain of memory buffers using the DMA hardware of the ESP32 (DMA = Direct Memory Access).
- Each time a DMA transfer buffer is full, the operating system generates an event of the type I2S_EVENT_RX_DONE and goes on transferring audio samples into the next DMA transfer buffer. After the last buffer of the chain has been filled, the DMA transfer continues with the first buffer (which gets overwritten whether or not the application has processed the data).
- The application can request chunks of audio samples by calling i2s_read on the i2s driver. Therein, a "chunk" represents a recording duration of, for example, 10 ms or 20 ms. The application itself is blocked until the i2s driver has copied the requested amount of data from its DMA transfer buffers to the destination buffer of the application.
- In case the i2s driver does not have the required amount of data stored in its DMA transfer buffers, the application remains blocked until the i2s driver has acquired the requested amount via DMA transfer and copied it to the destination buffer.
- An essential step of the audio data processing within the application is the transformation of the recorded chunk of audio data from the time domain into the frequency domain. For this purpose, typically an FFT (fast Fourier transform) algorithm is used. The result of the transformation is a representation of the recorded chunk as a weighted sum of sine and cosine waves at distinct frequencies. I use the arduinoFFT library to accomplish this step.
- The visualization processes the frequency components of the recorded audio signal in order to compute intensities for the RGB leds of the led strip. As in previous projects, I use the FastLED library for rendering the led output. The interplay of these steps is sketched in the simplified loop below.
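To make the sequence of steps more concrete, the following simplified sketch shows how they could be arranged in the Arduino loop. The helper functions computeSpectrum() and updateLedStrip() are hypothetical placeholders for the processing explained in the sections below; identifiers such as kI2S_Port, micReadBuffer_, and kI2S_ReadSizeBytes are introduced there as well.
void loop()
{
    size_t i2sBytesRead = 0;

    // Block until one chunk of audio samples has been copied out of the DMA buffers.
    esp_err_t i2sErr = i2s_read(kI2S_Port, micReadBuffer_, kI2S_ReadSizeBytes,
                                &i2sBytesRead, 100 / portTICK_PERIOD_MS);

    if (i2sErr != ESP_OK || i2sBytesRead != kI2S_ReadSizeBytes)
    {
        log_w("i2s_read: error %d, %u bytes read", i2sErr, i2sBytesRead);
        return;
    }

    computeSpectrum();   // hypothetical: normalise samples, run the FFT, compute magnitudes
    updateLedStrip();    // hypothetical: map frequency bands to led intensities via FastLED
}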
For the visualization to work properly, I had to pay attention to the time consumption of the audio data processing. If the processing took longer than the recording of the next chunk of audio samples, some chunks would remain unprocessed and would be overwritten by newer samples. The visualization would then omit part of the input audio signal.
In its current state, the processing takes about 6.7 ms in the debug build and about 4.6 ms in the release build. Skipping of audio chunks would only occur if the processing took longer than the recording time of one chunk, i.e. 46.4 ms.
Hence, the processing consumes only about 15% of the available time slot in the debug build and about 10% in the release build.
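To verify this during development, the processing time per chunk can be measured with micros() and compared against the chunk duration, for example as in the following sketch (processAudioChunk() is a hypothetical placeholder for the whole per-chunk processing):
// Available time budget per chunk: 2048 samples at 44100 Hz ≈ 46440 us.
const uint32_t kChunkDurationUs = (uint32_t)(1000000.0f * kFFT_SampleCount / kSampleRate);

uint32_t tStart = micros();
processAudioChunk();                     // hypothetical: FFT, band mapping, led update
uint32_t tElapsed = micros() - tStart;

if (tElapsed > kChunkDurationUs)
{
    log_w("Processing took %u us, longer than one chunk (%u us): frame loss likely",
          tElapsed, kChunkDurationUs);
}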
Recording of Audio Data using I2S
When setting up the i2s recording, the desired audio frequency spectrum and frequency resolution need to be accounted for. Basically, you should think about the kind of audio content (e.g. guitar, piano, rock music, classical music) and the audio source (e.g. real instruments, hifi stereo speakers, smartphone speaker) that you want to visualize.
The i2s configuration is created by the following lines of code:
i2s_config_t i2sConfig = {
.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX | I2S_MODE_PDM),
.sample_rate = kSampleRate,
.bits_per_sample = kI2S_BitsPerSample,
.channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT,
.communication_format = I2S_COMM_FORMAT_I2S,
.intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
.dma_buf_count = kI2S_BufferCount,
.dma_buf_len = kI2S_BufferSizeSamples
};
The sample rate (kSampleRate) must be at least twice as high as the maximum frequency you want to be able to detect and visualize. A typical value for music is 44100 Hz, which allows capturing frequencies up to 22050 Hz.
A couple of settings are determined by your hardware and can be taken from examples of the manufacturer.
- mode, communication_format, intr_alloc_flags: Configuration of the data acquisition from the microphone and of the use of interrupts by the DMA controller
- bits_per_sample: Number of bits per microphone sample (e.g. 16 bits = 2 bytes). For instance, an unsigned 16-bit sample can take values between 0 and 65535.
- channel_format: For a microphone, only a single audio channel should be configured. It may be declared as left or right at will.
- dma_buf_count: Number of DMA transfer buffers that the buffer chain is composed of. There need to be at least two buffers, so that while the content of one buffer is copied to the application, the other buffer can receive data by a DMA transfer.
- dma_buf_len: The size of each DMA buffer in terms of the number of samples. Note that the size of a DMA buffer is limited to 4092 bytes (see [2]), which means 2046 samples in the case of 2 bytes per sample.
In my setup, I have configured three DMA transfer buffers with a size of 1024 samples. This data is read by the application in chunks of 2048 samples, so that each FFT is fed by 2048 samples which represent about 46.44 ms duration of audio recording at a sample rate of 44100 Hz.
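Expressed as constants, these settings correspond to the following values (the names follow those used in this article; the repository may define them slightly differently):
const uint32_t kSampleRate            = 44100;                                // samples per second
const int      kI2S_BufferCount       = 3;                                    // DMA transfer buffers
const int      kI2S_BufferSizeSamples = 1024;                                 // samples per DMA buffer
const int      kFFT_SampleCount       = 2048;                                 // samples per chunk / FFT
const size_t   kI2S_ReadSizeBytes     = kFFT_SampleCount * sizeof(int16_t);   // 4096 bytes per chunk
// One chunk covers 2048 / 44100 s ≈ 46.44 ms of audio.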
If you want to be able to evaluate i2s events, you also need to provide a variable for the i2s event queue:
QueueHandle_t pI2S_Queue_ = nullptr;
The i2s driver is initialised using the previously defined configuration, additionally setting the i2s port number and the length of the event queue:
i2sErr = i2s_driver_install(kI2S_Port, &i2sConfig, kI2S_QueueLength, &pI2S_Queue_);
Next, the pins for i2s data acquisition from the microphone are set according to the manufacturer's specification:
i2s_pin_config_t i2sPinConfig = {
.bck_io_num = I2S_PIN_NO_CHANGE,
.ws_io_num = kI2S_PinClk,
.data_out_num = I2S_PIN_NO_CHANGE,
.data_in_num = kI2S_PinData
};
i2sErr = i2s_set_pin(kI2S_Port, &i2sPinConfig);
Eventually, the application needs a buffer in which the sampled data is stored for further processing:
int16_t micReadBuffer_[kFFT_SampleCount] = {0};
Now, the application can read sampled audio data chunks using i2s:
i2sErr = i2s_read(kI2S_Port, micReadBuffer_, kI2S_ReadSizeBytes, &i2sBytesRead, 100 / portTICK_PERIOD_MS);
Analysing the return value i2sErr, the i2sBytesRead variable, and the event queue pI2S_Queue_ helps to identify errors in the application and its configuration. For instance, one can check whether the number of I2S_EVENT_RX_DONE events is as expected in order to detect "frame loss":
if (i2sEventRxDoneCount > kI2S_BufferCountPerFFT)
{
log_w("Frame loss. Number of I2S_EVENT_RX_DONE events: %d", i2sEventRxDoneCount);
}
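The counter i2sEventRxDoneCount can be obtained by draining the event queue after each call to i2s_read, for instance as in the following sketch (using the i2s_event_t type of the i2s driver and the FreeRTOS queue API):
// Count the i2s events that occurred since the previous chunk was read.
int i2sEventRxDoneCount = 0;
i2s_event_t i2sEvent;

while (xQueueReceive(pI2S_Queue_, &i2sEvent, 0) == pdTRUE)
{
    if (i2sEvent.type == I2S_EVENT_RX_DONE)
    {
        i2sEventRxDoneCount++;
    }
}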
Transformation of Recorded Data into the Frequency Domain
Computing the FFT on chunks of 2048 samples which are sampled at 44100 Hz results in 1024 frequency bins and a frequency resolution of about 21.5 Hz. Hence, the lowest detectable frequency above zero is 21.5 Hz, and adjacent audio frequencies are distinguishable in 21.5 Hz increments up to 22050 Hz.
In order to use the FFT library, the ArduinoFFT class is instantiated, passing the desired sample value type (in my case: float) as template parameter:
fftData_t fftDataReal_[kFFT_SampleCount] = {0.0};
fftData_t fftDataImag_[kFFT_SampleCount] = {0.0};
ArduinoFFT<fftData_t> fft_ =
ArduinoFFT<fftData_t>(fftDataReal_, fftDataImag_,
kFFT_SampleCount, kFFT_SamplingFreq);
The first two parameters, fftDataReal_ and fftDataImag_, are two arrays that hold both the input and the output values of the FFT algorithm:
- For each chunk of recorded data, the sample values are normalised and written into fftDataReal_. Normalisation is performed by subtracting the chunk average value from each sample and dividing by the constant __INT16_MAX__. Hence, the input values have a maximum range of -1.0 to +1.0 (see the sketch after this list).
- The array fftDataImag_ is initialised with 0.0, since the input data is real-valued, i.e. the imaginary parts of the input samples are zero.
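As an illustration, the preparation of the FFT input arrays could look as follows (a sketch using the buffer and array names introduced above):
// Compute the average of the chunk (its DC offset).
float chunkSum = 0.0f;
for (int i = 0; i < kFFT_SampleCount; i++)
{
    chunkSum += micReadBuffer_[i];
}
const float chunkAvg = chunkSum / kFFT_SampleCount;

// Normalise each sample to the range -1.0 .. +1.0 and clear the imaginary parts.
for (int i = 0; i < kFFT_SampleCount; i++)
{
    fftDataReal_[i] = (micReadBuffer_[i] - chunkAvg) / __INT16_MAX__;
    fftDataImag_[i] = 0.0f;
}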
After providing the input data, the (forward) FFT is computed:
fft_.compute(FFTDirection::Forward);
Subsequently, the FFT results are contained in the same two arrays fftDataReal_ and fftDataImag_. Therein, only the first half of the elements of each array is relevant, since, for real-valued input, the second half is the complex conjugate of the first half.
The first half of the elements of fftDataReal_[i] and fftDataImag_[i], in my case elements 0 to 1023, contains the complex coefficients associated with the frequency bins f(i):
f(i) = i * (sample_rate / number_of_samples)
For example, bin i = 10 corresponds to 10 * (44100 / 2048) ≈ 215 Hz.
The bin with f(0) = 0 Hz contains the so-called "DC part" of the input signal, which is not used further here.
As we are interested only in the amplitude of each wave i with frequency f(i), we compute the magnitudes of the complex values:
sqrtf( fftDataReal_[i] * fftDataReal_[i] + fftDataImag_[i] * fftDataImag_[i] )
This is the main result of the FFT that is used for visualizing the audio signal.
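In code, the magnitudes of the relevant bins can be computed in a simple loop, for instance (a sketch; the array name is illustrative):
// Magnitude of each frequency bin (bins 0 .. kFFT_SampleCount / 2 - 1).
float magnitudeSpectrum[kFFT_SampleCount / 2];

for (int i = 0; i < kFFT_SampleCount / 2; i++)
{
    magnitudeSpectrum[i] = sqrtf(fftDataReal_[i] * fftDataReal_[i]
                               + fftDataImag_[i] * fftDataImag_[i]);
}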
The other information that could be retrieved from fftDataReal_[i] and fftDataImag_[i] would be the phase shift of each wave. For more details on the Fourier transform and its results, see e.g. [2] and [3].
The magnitude values resulting from the FFT are not used directly to drive the visualization but are smoothed in order to achieve more fluid visuals:
const float w1 = 16.0f / 128.0f;
const float w2 = 1 - w1;
magnitudeSpectrumAvg_[i] = magValNew * w1 + magnitudeSpectrumAvg_[i] * w2;
The smoothed magnitude values are computed by means of exponential smoothing, which is a linear interpolation of the current magnitude value and the previous smoothed value at each audio chunk.
Furthermore, since there are more frequency bins than leds on the led strip, I have grouped the frequency bins into frequency bands. The frequency bands are defined by a sequence of frequencies (constant kFreqBandEndHz) that separate the bands. To determine the value of each frequency band, the maximum value over all frequency bins contained in this band is computed, as sketched after the following definition.
const float kFreqBandEndHz[kFreqBandCount] = {60, 125, 250, 375, 500, 750, 1000, ...}
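The grouping itself could be implemented along the following lines (a sketch; it uses the smoothed magnitudes magnitudeSpectrumAvg_ from above, and the exact loop structure is illustrative):
// Determine the value of each frequency band as the maximum over its bins.
float magnitudeBand[kFreqBandCount] = {0.0f};
int band = 0;

for (int i = 1; i < kFFT_SampleCount / 2; i++)   // bin 0 (DC part) is skipped
{
    const float binFreq = i * ((float)kSampleRate / kFFT_SampleCount);   // ~21.5 Hz per bin

    // Advance to the band this bin belongs to.
    while (band < kFreqBandCount && binFreq > kFreqBandEndHz[band])
    {
        band++;
    }
    if (band >= kFreqBandCount)
    {
        break;   // bins above the last band end are ignored
    }
    magnitudeBand[band] = max(magnitudeBand[band], magnitudeSpectrumAvg_[i]);
}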
So far, the visualization would strongly depend on the volume level recorded by the microphone. To allow the visualization to adjust itself to the currently recorded audio volume, a sensitivity variable is introduced. The sensitivity is updated at each audio chunk, taking into account the previous sensitivity, the current maximum amplitude over all frequency bands, and the maximum allowed sensitivity:
const float s1 = 8.0f / 1024.0f;
const float s2 = 1.0f - s1;
sensitivityFactor_ = min( (250.0f / magnitudeBandWeightedMax) * s1
+ sensitivityFactor_ * s2, kSensitivityFactorMax );
Eventually, the lightness values of the leds are set according to the computed amplitudes and sensitivity:
lightness = min( int(magnitudeBand[k] * kFreqBandAmp[k] * sensitivityFactor_), 255);
ledStrip_[k+numBassLeds-1].setHSV(color, 255, lightness);
Furthermore, part of the led strip is set using a slightly different mode. The algorithm detects beats and, on each detected beat, sets the lightness to a specified value. When no beat is detected, the lightness gradually decreases. For detecting beats, the last three magnitude values are considered in order to detect a local maximum.
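As an illustration of this approach (not the exact code from the repository), the local-maximum check and the decaying lightness could look like this; kBeatThreshold, kBeatColor, and the decay step are hypothetical values:
// Keep the last three (sensitivity-corrected) magnitude values of the observed band.
static float magHistory[3] = {0.0f, 0.0f, 0.0f};
static int   beatLightness = 0;

magHistory[0] = magHistory[1];
magHistory[1] = magHistory[2];
magHistory[2] = magnitudeBand[0] * sensitivityFactor_;

// A beat is assumed when the middle value is a local maximum above a threshold.
const bool beatDetected = (magHistory[1] > magHistory[0]) &&
                          (magHistory[1] >= magHistory[2]) &&
                          (magHistory[1] > kBeatThreshold);

if (beatDetected)
{
    beatLightness = 255;                                              // flash on a detected beat
}
else
{
    beatLightness = (beatLightness >= 10) ? beatLightness - 10 : 0;   // gradual decay
}

for (int k = 0; k < numBassLeds; k++)
{
    ledStrip_[k].setHSV(kBeatColor, 255, beatLightness);
}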
References
[1] arduinoFFT - An FFT library for Arduino: https://github.com/kosme/arduinoFFT
[2] Online article: "Interpret FFT, complex DFT, frequency bins & FFTShift", 2015: https://www.gaussianwaves.com/2015/11/interpreting-fft-results-complex-dft-frequency-bins-and-fftshift/
[3] Online article: "Interpret FFT results - obtaining magnitude and phase information", 2015: https://www.gaussianwaves.com/2015/11/interpreting-fft-results-obtaining-magnitude-and-phase-information/
[4] Espressif ESP-IDF, API reference for i2s: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/i2s.html
[5] Espressif ESP-IDF, source code of i2s driver: https://github.com/espressif/esp-idf/blob/master/components/driver/i2s.c
[6] FastLED: https://github.com/FastLED/FastLED