Two factors contribute to our sense of sound direction:
- the difference in sound pressure of the sound in both ears
- the time difference of the sound in both ears
In this project I'm building a robot car that uses two microphones connected to a microcontroller to determine the direction of a sound source and move towards the sound.
The hardware

The board

The CY8CPROTO-062-4343W from Infineon is a microcontroller evaluation board which, among other cool stuff, contains a pair of microphones. The microphones provide sound data, which in the end is converted into a 16-bit, 8000 Hz sound wave. Well, actually two sound waves, because we have two microphones.
I really hate this kind of naming convention. I don't even know what 062 refers to, or 4343. I'm going to call the board CY8CPROTO.
An accelerometer/gyro is used to make the robot car turn exact angles.
An mp3-player module is used to make the robot speak pre-recorded phrases.
Simple sound pressure comparison

For a layman (like me), sound pressure, sound volume and sound loudness are more or less the same thing. For the sake of simplicity, let's say that's true. I'm going to talk about sound pressure, and I'm going to measure and calculate it in a probably very unscientific manner. Long story short: this method is unreliable! Jump to the chapter Time difference analysis if you want.
To get the CY8CPROTO working and programmed, I downloaded and installed Eclipse IDE for ModusToolbox 3.1. With it installed, I opened the PDM_PCM_Audio project as a template project. The crucial part of the code goes as follows:
for(;;)
{
    /* Check if any microphone has data to process */
    if (pdm_pcm_flag)
    {
        /* Clear the PDM/PCM flag */
        pdm_pcm_flag = 0;

        /* Reset the volume */
        volume = 0;

        /* Calculate the volume by summing the absolute value of all the
         * audio data from a frame */
        for (uint32_t index = 0; index < FRAME_SIZE; index++)
        {
            volume += abs(audio_frame[index]);
        }

        /* Prepare line to report the volume */
        printf("\n\r");

        /* Report the volume */
        for (uint32_t index = 0; index < (volume/VOLUME_RATIO); index++)
        {
            printf("-");
        }
        ...
What happens here is that audio_frame[] is filled with sound data. It holds signed integers describing individual samples, with values in the range -32768 to 32767. As the comments in the code imply, the "volume" is calculated as the sum of the absolute values of the samples in a frame. A dashed line is then printed, the length of which corresponds to the "volume".
The audio_frame[] array actually holds the data from both microphones: every second value in the array is from one channel. I changed the code so that it separates the channels into left and right (volumeL and volumeR):
for(;;)
{
    /* Check if any microphone has data to process */
    if (pdm_pcm_flag)
    {
        /* Clear the PDM/PCM flag */
        pdm_pcm_flag = 0;

        /* Reset the volumes */
        volumeL = 0;
        volumeR = 0;

        /* Calculate the volume by summing the absolute value of all the
         * audio data from a frame */
        for (uint32_t index = 0; index < FRAME_SIZE; index += 2)
        {
            volumeL += abs(audio_frame[index]);
            volumeR += abs(audio_frame[index+1]);
        }

        /* Prepare line to report the volumes */
        printf("\n\r");

        /* Report the volumes */
        printf("%d, %d", volumeL, volumeR);
        ...
The last line prints both values on one line. A typical output could look like this:
1996, 2066
1774, 1611
2783, 2842
1582, 1646
1850, 1433
2145, 2085
1892, 1983
2023, 2229
1943, 2439
2064, 2424
2528, 2468
2133, 2756
...
This is basically just background noise. Instead of viewing a text output, I want to see this plotted. The serial plotter in the Arduino IDE is perfect for that. Here's an example output:
The peak happened when I snapped my fingers at one of the microphones. The red line corresponds to that microphone. When I snap my fingers at the other microphone, I get this:

Now the colours have swapped: the blue peak is higher. We could develop this into a system that detects the direction of the sound, but testing showed that measuring the sound pressure is not of much use for detecting the direction. Different sounds behave differently. To get it working, a machine learning model would probably have to be trained with a lot of data.
Defining directions

The PCB where the microphones sit has the markings LEFT and RIGHT. So my directions will be as follows: left is 0°, straight ahead is 90°, right is 180°.
Time difference analysis

Instead of measuring the sound volume (or pressure or loudness, whatever), we record the sound wave itself. Snapping my fingers to the left of my computer screen and recording the sound with Audacity, I get this:
The lower sound wave was obviously recorded with the microphone to the left of my camera. Magnifying the view, I can see that the upper signal is 18 samples later. The recording was done at 44100 Hz, so 18 samples means 0.4 ms, which in turn means 139 mm, which would be the distance between my microphones. And that is actually exactly the case! My laptop is a Microsoft Surface, and it indeed has two microphones: one facing me, next to the camera, and the other on the opposite side, next to the other camera. I bet these microphones are not the same model, just as the cameras aren't. That might be the reason why the lower sound track appears weaker, even though the corresponding microphone was nearer my snapping fingers.
The distance between the microphones on the CY8CPROTO is 41 mm. With 44100 Hz sampling, I would get at most a 5 sample difference in time, and with a 16000 Hz sampling rate only 1.93 samples.
So, if I assume the two channels can be shifted ±2 samples from each other, depending on the direction of the sound, I get 5 cases:
- -2: far to the left, or 0°
- -1: a bit to the left
- 0: straight ahead, or 90°
- 1: a bit to the right
- 2: far to the right, or 180°
The method for finding how much the channels are shifted goes as follows: place both channels on top of each other, first shifted by -2, then by -1, then unshifted, then shifted by 1 and then by 2. For each shift, calculate the sum of the squared differences between the samples of the first channel and the samples of the second channel. Choose the shift with the smallest sum. That shift tells the time difference between the sound reaching each microphone. Since we only deal with 5 discrete cases, the accuracy is not that good: we essentially divide 180° into 5 sectors. The code goes like this:
smallest = 1E38;
for (int i = -2; i <= 2; i++)
{
    summer = 0;
    for (int j = 2; j < FRAME_SIZE * 2 - 2; j++)
    {
        summer += (interpolatedL[j] - interpolatedR[j + i]) *
                  (interpolatedL[j] - interpolatedR[j + i]);
    }
    if (summer < smallest)
    {
        smallest = summer;
        smallest_i = i;
    }
}
printf("\n\r%d, %f", smallest_i, smallest);
Instead of increasing the sampling frequency, we can interpolate some values, kind of guessing what the values would be if we increased the sampling frequency. I used a cubic interpolation method, which increased the number of sectors from 5 to 17: a middle sector straight ahead and 8 sectors on each side. They don't divide the 180 degrees evenly, due to trigonometry. The following table shows the direction of each sector in degrees (using the directions I defined earlier):
Sector  Direction (°)
  0        0
  1       29
  2       41
  3       51
  4       60
  5       68
  6       76
  7       83
  8       90
  9       97
 10      104
 11      112
 12      120
 13      129
 14      139
 15      151
 16      180
From #0 to #1 there are 29 degrees, but from #7 to #8 only 7 degrees. That suits a robot tracking sound: straight ahead (#8, or 90 degrees) we have better resolution, and when the sound is 90 degrees off, the robot can make bigger turns to get on the right track more quickly.
The function for detecting the sound direction

There's a sample project for the CY8CPROTO named PDM_PCM_Audio. I started with that project (checking that it worked as such). A part of the main.c:
/* Define how many samples in a frame */
#define FRAME_SIZE (512) // 1024
/* Noise threshold hysteresis */
#define THRESHOLD_HYSTERESIS 3u
/* Volume ratio for noise and print purposes */
#define VOLUME_RATIO (4*FRAME_SIZE)
/* Desired sample rate. Typical values: 8/16/22.05/32/44.1/48kHz */
#define SAMPLE_RATE_HZ 16000u
/* Decimation Rate of the PDM/PCM block. Typical value is 64 */
#define DECIMATION_RATE 64u
/* Audio Subsystem Clock. Typical values depend on the desired sample rate:
- 8/16/48kHz : 24.576 MHz
- 22.05/44.1kHz : 22.579 MHz */
#define AUDIO_SYS_CLOCK_HZ 24576000u
/* PDM/PCM Pins */
#define PDM_DATA P10_5
#define PDM_CLK P10_4
I changed the frame size from 1024 to 512, just to make the program faster; I'll need lots of speed later on. For just detecting the sound direction, an even smaller frame size might be possible. The sample rate is 16 kHz.
From the main() function I moved the crucial parts to a separate function:
int direction(int *significance);
This function listens to the microphones and returns a value in the range 0 to 16, which tells what it thinks is the direction of the sound. The function also writes to the variable significance a value telling how distinctive the recorded sound was: 0 means no sound ready, 1 means insignificant background noise and 100 means a very clear and constant sound source with a distinctive direction.
Interpolating a sound wave

This is probably a bit of overkill, but just because of the beauty of it, I used cubic interpolation to get a more accurate value of the time shift between the left and right channels. The original frame was 512 points, divided into 256 points for the left channel and 256 points for the right channel. So if the sound wave looked like this:
... I wanted to find 3 values between each pair of points. The idea is to turn this into a Bézier curve:
This is what Inkscape gave me when I just turned the nodes into smooth nodes. But that's a graphical approach; I want something that still looks like a sound curve. Each node has tangents, and the endpoints of each tangent define the slope and curvature of the curve at the node. What I did was define the slope to be parallel to a line going through the two neighbouring nodes, and place the endpoints at one third of the horizontal distance between the nodes:
Each tangent is parallel to the line through the two neighbouring nodes. An end node (the leftmost node in the image) has only one neighbour; I define its tangent to go through the left endpoint of the next node's tangent (blue line). Further overkill would be to actually test this interpolation method on real sound recordings: sample at 96 kHz, convert the recording down to 24 kHz, "recreate" a 96 kHz version using this kind of interpolation, and compare the interpolated wave with the original.
Anyway, after finding these tangents of each node, I can resample the sound in any sampling frequency. What I'm interested in is finding three values between each original sampled point. An interpolation function will give them to me:
float splinterpol(int16_t y0, int16_t y3, float sly0, float sly1, float t)
{
    float u = 1.0 - t;
    float y1, y2;

    y1 = y0 + sly0 / 3.;
    y2 = y3 - sly1 / 3.;

    return y0 * u * u * u + y1 * u * u * t * 3 + y2 * u * t * t * 3 + y3 * t * t * t;
}
This is the cubic interpolation. We have four values: y0, y1, y2 and y3. y0 and y3 are two nodes (two consecutive samples); y1 and y2 are the endpoints of the tangents in between. They are calculated from sly0 and sly1, which are the slopes of the tangents. t is a value between 0 and 1 defining where on the curve between the two samples we want a value. Since we want three values, we call this function with t=0.25, t=0.5 and t=0.75.
17 directions

So, with a mic distance of 41 mm and a sampling frequency of 16 kHz, we can have at most 2 samples of time shift. With interpolation, we can increase the sample rate virtually: adding 3 interpolated values between each real value makes the maximum time shift 8 samples. This gives 17 distinct directions, as the following video shows:
In the video I play some Thad Jones big band music, but out of respect for the rights to the music, I decided to mute the video. I used the serial plotter in the Arduino IDE to plot the measured directions. The video shows that the interpolation works perfectly: moving the sound source smoothly makes the value pass through every step. The five original direction categories without interpolation were 0, 4, 8, 12 and 16. If 17 categories is not enough, further interpolation would increase the count, but it would need more memory and more computation power. And that's the wrong path to go: there are alternatives for finding the time shift.
The whole main.c

The work in progress

This was supposed to be a completed project for the Hackster contest arranged by Infineon. Due to hardware failure, it wasn't completed.