I did this project in order to explore i2s audio recording and real-time digital audio signal processing using an ESP32 pico. Using the M5StickC for this project allowed me to acquire digital audio via the built-in microphone and to attach an led strip for visualization via the Grove connector. Thanks to its internal battery, the device is also portable.
Key Features
- Digital audio capture using the built-in microphone at a sample rate of 44.1 kHz
- Real-time analysis of the audio signal based on an FFT with a sample size of 2048 (ArduinoFFT)
- Frequency-based audio visualization using an RGB led strip (FastLED)
- Configurable frequency bands between 21.5 Hz and 20 kHz at a resolution of 21.5 Hz.
- Visualization based on beat detection
- Auto-levelling to adapt to the recorded audio volume
- Detection of timing problems and frame loss for debugging
- Portable battery-powered device
As in my previous projects, I used Visual Studio Code with the PlatformIO IDE. In the platformio.ini file of the project, dependencies on the M5StickC library, the ArduinoFFT library [1], and the FastLED library [6] are declared.
Furthermore, I used the build_flags option to enable debug-level log messages, and the monitor_filters option to enable file logging and exception stack trace decoding:
[env:M5StickC_AudioVisLed_Debug]
platform = espressif32
board = m5stick-c
framework = arduino
lib_deps =
M5StickC
FastLED
https://github.com/kosme/arduinoFFT.git#develop
upload_speed = 1500000
monitor_speed = 115200
build_type = debug
build_flags = -D CORE_DEBUG_LEVEL=4 ; 'Debug'
monitor_filters = log2file, esp32_exception_decoder, default
Hints Concerning the Use of ArduinoFFT on ESP32
I use the develop branch of the ArduinoFFT library, which allows setting the type of the FFT input data to float instead of double. This significantly reduces the computation time of the FFT on the ESP32 pico: the ESP32 has a floating point unit that performs arithmetic operations on float data in hardware, whereas arithmetic operations on double data have to be performed in software and take 20 to 60 times longer (see the discussion and performance tests in the ESP32 forum).
For building the project in Visual Studio Code, the initialization order of some variables in arduinoFFT.h needs to be corrected as follows (see https://github.com/kosme/arduinoFFT/pull/59):
/* Variables */
T *_vReal = nullptr;
T *_vImag = nullptr;
uint_fast16_t _samples = 0;
#ifdef FFT_SPEED_OVER_PRECISION
T _oneOverSamples = 0.0;
#endif
T _samplingFrequency = 0;
Without this correction, the build process terminates with errors.
Source Code
The commented source code is available in the GitHub repository of this project.
Overview
The following figure shows an overview of the audio visualization processing.
- At startup, the application configures the recording and transfer of microphone audio samples using the i2s driver of the operating system (see [4], [5]).
- From then on, the operating system continuously transfers audio samples into a chain of memory buffers using the DMA hardware of the ESP32 (DMA = Direct Memory Access).
- Each time a DMA transfer buffer is full, the operating system generates an event of the type I2S_EVENT_RX_DONE and goes on transferring audio samples into the next DMA transfer buffer. After the last buffer of the chain has been filled, the DMA transfer continues with the first buffer (which gets overwritten whether or not the application has processed the data).
- The application can request chunks of audio samples by calling i2s_read on the i2s driver. Therein, a "chunk" represents a recording duration of, for example, 10 ms or 20 ms. The application itself is blocked until the i2s driver has copied the requested amount of data from its DMA transfer buffers to the destination buffer of the application.
- In case the i2s driver does not have the required amount of data stored in its DMA transfer buffers, the application remains blocked until the i2s driver has acquired the requested amount via DMA transfer and copied it to the destination buffer.
- An essential step of the audio data processing within the application is the transformation of the recorded chunk of audio data from the time domain into the frequency domain. For this purpose, typically an FFT (fast Fourier transform) algorithm is used. The result of the transformation is a representation of the recorded chunk as a weighted sum of sine and cosine waves at distinct frequencies. I use the arduinoFFT library to accomplish this step.
- The visualization processes the frequency components of the recorded audio signal in order to compute intensities for the RGB leds of the led strip. As in previous projects, I use the FastLED library for rendering the led output. The interplay of these steps is sketched in the simplified loop below.
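To make the sequence of steps more concrete, the following simplified sketch shows how they could be arranged in the Arduino loop. The helper functions computeSpectrum() and updateLedStrip() are hypothetical placeholders for the processing explained in the sections below; identifiers such as kI2S_Port, micReadBuffer_, and kI2S_ReadSizeBytes are introduced there as well.
void loop()
{
    size_t i2sBytesRead = 0;

    // Block until one chunk of audio samples has been copied out of the DMA buffers.
    esp_err_t i2sErr = i2s_read(kI2S_Port, micReadBuffer_, kI2S_ReadSizeBytes,
                                &i2sBytesRead, 100 / portTICK_PERIOD_MS);

    if (i2sErr != ESP_OK || i2sBytesRead != kI2S_ReadSizeBytes)
    {
        log_w("i2s_read: error %d, %u bytes read", i2sErr, i2sBytesRead);
        return;
    }

    computeSpectrum();   // hypothetical: normalise samples, run the FFT, compute magnitudes
    updateLedStrip();    // hypothetical: map frequency bands to led intensities via FastLED
}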
For the visualization to work properly, I had to pay attention to the time consumption of the audio data processing. If the processing took longer than the recording of the next chunk of audio samples, some chunks would remain unprocessed and would be overwritten by newer samples. The visualization would then omit part of the input audio signal.
In its current state, the processing takes about 6.7 ms in the debug build and about 4.6 ms in the release build. Skipping of audio chunks would only occur if the processing took longer than the recording time of one chunk, i.e. 46.4 ms.
Hence, the processing consumes only about 15% of the available time slot in the debug build and about 10% in the release build.
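To verify this during development, the processing time per chunk can be measured with micros() and compared against the chunk duration, for example as in the following sketch (processAudioChunk() is a hypothetical placeholder for the whole per-chunk processing):
// Available time budget per chunk: 2048 samples at 44100 Hz ≈ 46440 us.
const uint32_t kChunkDurationUs = (uint32_t)(1000000.0f * kFFT_SampleCount / kSampleRate);

uint32_t tStart = micros();
processAudioChunk();                     // hypothetical: FFT, band mapping, led update
uint32_t tElapsed = micros() - tStart;

if (tElapsed > kChunkDurationUs)
{
    log_w("Processing took %u us, longer than one chunk (%u us): frame loss likely",
          tElapsed, kChunkDurationUs);
}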
Recording of Audio Data using I2S
When setting up the i2s recording, the desired audio frequency spectrum and frequency resolution need to be accounted for. Basically, you should think about the kind of audio content (e.g. guitar, piano, rock music, classical music) and the audio source (e.g. real instruments, hifi stereo speakers, smartphone speaker) that you want to visualize.
The i2s configuration is created by the following lines of code:
i2s_config_t i2sConfig = {
.mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX | I2S_MODE_PDM),
.sample_rate = kSampleRate,
.bits_per_sample = kI2S_BitsPerSample,
.channel_format = I2S_CHANNEL_FMT_ONLY_RIGHT,
.communication_format = I2S_COMM_FORMAT_I2S,
.intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
.dma_buf_count = kI2S_BufferCount,
.dma_buf_len = kI2S_BufferSizeSamples
};
The sample rate (kSampleRate) must be at least twice as high as the maximum frequency you want to be able to detect and visualize. A typical value for music is 44100 Hz, which allows capturing frequencies up to 22050 Hz.
A couple of settings are determined by your hardware and can be taken from examples of the manufacturer.
- mode, communication_format, intr_alloc_flags: Configuration of the data acquisition from the microphone and of the use of interrupts by the DMA controller
- bits_per_sample: Number of bits per microphone sample (e.g. 16 bits = 2 bytes). For instance, an unsigned 16-bit sample can take values between 0 and 65535.
- channel_format: For a microphone, only a single audio channel should be configured. It may be declared as left or right at will.
- dma_buf_count: Number of DMA transfer buffers that the buffer chain is composed of. There need to be at least two buffers, so that while the content of one buffer is copied to the application, the other buffer can receive data by a DMA transfer.
- dma_buf_len: The size of each DMA buffer in terms of the number of samples. Note that the size of a DMA buffer is limited to 4092 bytes (see [2]), which means 2046 samples in the case of 2 bytes per sample.
In my setup, I have configured three DMA transfer buffers with a size of 1024 samples. This data is read by the application in chunks of 2048 samples, so that each FFT is fed by 2048 samples which represent about 46.44 ms duration of audio recording at a sample rate of 44100 Hz.
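Expressed as constants, these settings correspond to the following values (the names follow those used in this article; the repository may define them slightly differently):
const uint32_t kSampleRate            = 44100;                                // samples per second
const int      kI2S_BufferCount       = 3;                                    // DMA transfer buffers
const int      kI2S_BufferSizeSamples = 1024;                                 // samples per DMA buffer
const int      kFFT_SampleCount       = 2048;                                 // samples per chunk / FFT
const size_t   kI2S_ReadSizeBytes     = kFFT_SampleCount * sizeof(int16_t);   // 4096 bytes per chunk
// One chunk covers 2048 / 44100 s ≈ 46.44 ms of audio.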
If you want to be able to evaluate i2s events, you also need to provide a variable for the i2s event queue:
QueueHandle_t pI2S_Queue_ = nullptr;
The i2s driver is initialised using the previously defined configuration, additionally setting the i2s port number and the length of the event queue:
i2sErr = i2s_driver_install(kI2S_Port, &i2sConfig, kI2S_QueueLength, &pI2S_Queue_);
Next, the pins for i2s data acquisition from the microphone are set according to the manufacturer's specification:
i2s_pin_config_t i2sPinConfig = {
.bck_io_num = I2S_PIN_NO_CHANGE,
.ws_io_num = kI2S_PinClk,
.data_out_num = I2S_PIN_NO_CHANGE,
.data_in_num = kI2S_PinData
};
i2sErr = i2s_set_pin(kI2S_Port, &i2sPinConfig);
Eventually, the application needs a buffer in which the sampled data is stored for further processing:
int16_t micReadBuffer_[kFFT_SampleCount] = {0};
Now, the application can read sampled audio data chunks using i2s:
i2sErr = i2s_read(kI2S_Port, micReadBuffer_, kI2S_ReadSizeBytes, &i2sBytesRead, 100 / portTICK_PERIOD_MS);
Analysing the return value i2sErr, the i2sBytesRead variable, and the event queue pI2S_Queue_ helps to identify errors in the application and its configuration. For instance, one can check whether the number of I2S_EVENT_RX_DONE events is as expected in order to detect "frame loss":
if (i2sEventRxDoneCount > kI2S_BufferCountPerFFT)
{
log_w("Frame loss. Number of I2S_EVENT_RX_DONE events: %d", i2sEventRxDoneCount);
}
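The counter i2sEventRxDoneCount can be obtained by draining the event queue after each call to i2s_read, for instance as in the following sketch (using the i2s_event_t type of the i2s driver and the FreeRTOS queue API):
// Count the i2s events that occurred since the previous chunk was read.
int i2sEventRxDoneCount = 0;
i2s_event_t i2sEvent;

while (xQueueReceive(pI2S_Queue_, &i2sEvent, 0) == pdTRUE)
{
    if (i2sEvent.type == I2S_EVENT_RX_DONE)
    {
        i2sEventRxDoneCount++;
    }
}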
Transformation of Recorded Data into the Frequency Domain
Computing the FFT on chunks of 2048 samples which are sampled at 44100 Hz results in 1024 frequency bins and a frequency resolution of about 21.5 Hz. Hence, the lowest detectable frequency above zero is 21.5 Hz, and adjacent audio frequencies are distinguishable in 21.5 Hz increments up to 22050 Hz.
In order to use the FFT library, the ArduinoFFT class is instantiated, passing the desired sample value type (in my case: float) as template parameter:
fftData_t fftDataReal_[kFFT_SampleCount] = {0.0};
fftData_t fftDataImag_[kFFT_SampleCount] = {0.0};
ArduinoFFT<fftData_t> fft_ =
ArduinoFFT<fftData_t>(fftDataReal_, fftDataImag_,
kFFT_SampleCount, kFFT_SamplingFreq);
The first two parameters, fftDataReal_ and fftDataImag_, are two arrays that hold both the input and the output values of the FFT algorithm:
- For each chunk of recorded data, the sample values are normalised and written into fftDataReal_. Normalisation is performed by subtracting the chunk average value from each sample and dividing by the constant __INT16_MAX__. Hence, the input values have a maximum range of -1.0 to +1.0 (see the sketch after this list).
- The array fftDataImag_ is initialised with 0.0, since the input data is real-valued, i.e. the imaginary parts of the input samples are zero.
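As an illustration, the preparation of the FFT input arrays could look as follows (a sketch using the buffer and array names introduced above):
// Compute the average of the chunk (its DC offset).
float chunkSum = 0.0f;
for (int i = 0; i < kFFT_SampleCount; i++)
{
    chunkSum += micReadBuffer_[i];
}
const float chunkAvg = chunkSum / kFFT_SampleCount;

// Normalise each sample to the range -1.0 .. +1.0 and clear the imaginary parts.
for (int i = 0; i < kFFT_SampleCount; i++)
{
    fftDataReal_[i] = (micReadBuffer_[i] - chunkAvg) / __INT16_MAX__;
    fftDataImag_[i] = 0.0f;
}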
After providing the input data, the (forward) FFT is computed:
fft_.compute(FFTDirection::Forward);
Subsequently, the FFT results are contained in the same two arrays fftDataReal_ and fftDataImag_. Therein, only the first half of the elements of each array is relevant, since, for real-valued input, the second half is the complex conjugate of the first half.
The first half of the elements of fftDataReal_[i] and fftDataImag_[i], in my case elements 0 to 1023, contains the complex coefficients associated with the frequency bins f(i):
f(i) = i * (sample_rate / number_of_samples)
For example, bin i = 10 corresponds to 10 * (44100 / 2048) ≈ 215 Hz.
The bin with f(0) = 0 Hz contains the so-called "DC part" of the input signal, which is not used further here.
As we are interested only in the amplitude of each wave i with frequency f(i), we compute the magnitudes of the complex values:
sqrtf( fftDataReal_[i] * fftDataReal_[i] + fftDataImag_[i] * fftDataImag_[i] )
This is the main result of the FFT that is used for visualizing the audio signal.
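In code, the magnitudes of the relevant bins can be computed in a simple loop, for instance (a sketch; the array name is illustrative):
// Magnitude of each frequency bin (bins 0 .. kFFT_SampleCount / 2 - 1).
float magnitudeSpectrum[kFFT_SampleCount / 2];

for (int i = 0; i < kFFT_SampleCount / 2; i++)
{
    magnitudeSpectrum[i] = sqrtf(fftDataReal_[i] * fftDataReal_[i]
                               + fftDataImag_[i] * fftDataImag_[i]);
}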
The other information that could be retrieved from fftDataReal_[i] and fftDataImag_[i] would be the phase shift of each wave. For more details on the Fourier transform and its results, see e.g. [2] and [3].
The magnitude values resulting from the FFT are not used directly to drive the visualization but are smoothed in order to achieve more fluid visuals:
const float w1 = 16.0f / 128.0f;
const float w2 = 1 - w1;
magnitudeSpectrumAvg_[i] = magValNew * w1 + magnitudeSpectrumAvg_[i] * w2;
The smoothed magnitude values are computed by means of exponential smoothing, which is a linear interpolation of the current magnitude value and the previous smoothed value at each audio chunk.
Furthermore, since there are more frequency bins than leds on the led strip, I have grouped the frequency bins into frequency bands. The frequency bands are defined by a sequence of frequencies (constant kFreqBandEndHz) that separate the bands. To determine the value of each frequency band, the maximum value over all frequency bins contained in this band is computed, as sketched after the following definition.
const float kFreqBandEndHz[kFreqBandCount] = {60, 125, 250, 375, 500, 750, 1000, ...}
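The grouping itself could be implemented along the following lines (a sketch; it uses the smoothed magnitudes magnitudeSpectrumAvg_ from above, and the exact loop structure is illustrative):
// Determine the value of each frequency band as the maximum over its bins.
float magnitudeBand[kFreqBandCount] = {0.0f};
int band = 0;

for (int i = 1; i < kFFT_SampleCount / 2; i++)   // bin 0 (DC part) is skipped
{
    const float binFreq = i * ((float)kSampleRate / kFFT_SampleCount);   // ~21.5 Hz per bin

    // Advance to the band this bin belongs to.
    while (band < kFreqBandCount && binFreq > kFreqBandEndHz[band])
    {
        band++;
    }
    if (band >= kFreqBandCount)
    {
        break;   // bins above the last band end are ignored
    }
    magnitudeBand[band] = max(magnitudeBand[band], magnitudeSpectrumAvg_[i]);
}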
So far, the visualization would strongly depend on the volume level recorded by the microphone. To allow the visualization to adjust itself to the currently recorded audio volume, a sensitivity variable is introduced. The sensitivity is updated at each audio chunk, taking into account the previous sensitivity, the current maximum amplitude over all frequency bands, and the maximum allowed sensitivity:
const float s1 = 8.0f / 1024.0f;
const float s2 = 1.0f - s1;
sensitivityFactor_ = min( (250.0f / magnitudeBandWeightedMax) * s1
+ sensitivityFactor_ * s2, kSensitivityFactorMax );
Eventually, the lightness values of the leds are set according to the computed amplitudes and sensitivity:
lightness = min( int(magnitudeBand[k] * kFreqBandAmp[k] * sensitivityFactor_), 255);
ledStrip_[k+numBassLeds-1].setHSV(color, 255, lightness);
Furthermore, part of the led strip is set using a slightly different mode. The algorithm detects beats and, on each detected beat, sets the lightness to a specified value. When no beat is detected, the lightness gradually decreases. For detecting beats, the last three magnitude values are considered in order to detect a local maximum.
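As an illustration of this approach (not the exact code from the repository), the local-maximum check and the decaying lightness could look like this; kBeatThreshold, kBeatColor, and the decay step are hypothetical values:
// Keep the last three (sensitivity-corrected) magnitude values of the observed band.
static float magHistory[3] = {0.0f, 0.0f, 0.0f};
static int   beatLightness = 0;

magHistory[0] = magHistory[1];
magHistory[1] = magHistory[2];
magHistory[2] = magnitudeBand[0] * sensitivityFactor_;

// A beat is assumed when the middle value is a local maximum above a threshold.
const bool beatDetected = (magHistory[1] > magHistory[0]) &&
                          (magHistory[1] >= magHistory[2]) &&
                          (magHistory[1] > kBeatThreshold);

if (beatDetected)
{
    beatLightness = 255;                                              // flash on a detected beat
}
else
{
    beatLightness = (beatLightness >= 10) ? beatLightness - 10 : 0;   // gradual decay
}

for (int k = 0; k < numBassLeds; k++)
{
    ledStrip_[k].setHSV(kBeatColor, 255, beatLightness);
}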
References
[1] arduinoFFT - An FFT library for Arduino: https://github.com/kosme/arduinoFFT
[2] Online article: "Interpret FFT, complex DFT, frequency bins & FFTShift", 2015: https://www.gaussianwaves.com/2015/11/interpreting-fft-results-complex-dft-frequency-bins-and-fftshift/
[3] Online article: "Interpret FFT results - obtaining magnitude and phase information", 2015: https://www.gaussianwaves.com/2015/11/interpreting-fft-results-obtaining-magnitude-and-phase-information/
[4] Espressif ESP-IDF, API reference for i2s: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/i2s.html
[5] Espressif ESP-IDF, source code of i2s driver: https://github.com/espressif/esp-idf/blob/master/components/driver/i2s.c
[6] FastLED: https://github.com/FastLED/FastLED