Background Information
VocalVision is a speech interpreter that listens in on meetings or hard-to-follow conversations with a single microphone, recognizes who is speaking, and transcribes what they are saying onto a display.
This device has a variety of applications; the intended one is helping deaf people follow large-scale in-person meetings. It is generally very difficult to track who is speaking and what they are saying when many people are participating in a conversation, so visualizing both on a display can be an immense help.
Utilization
On startup, VocalVision displays instructions on how to use the device. You are then given the option to begin recording audio. Once you press the stop button, the PocketBeagle analyzes the audio and displays the transcription. It then sends the recorded file to the host PC for heavier classification processing: there, the audio file is compared against a pre-trained model, and the final speaker prediction is output to the display.
Wiring
First, connect the PocketBeagle's 3.3V (P2_23) to the + rail and GND (P2_21) to the - rail of the breadboard. In the instructions that follow, arrows go from peripheral ---> PocketBeagle.
Now connect the USB breakout. On the PocketBeagle, make sure the VBUS and VIN pins are jumpered together, as are ID and GND:
- VBUS ---> VIN/VBUS
- D- ---> DN
- D+ ---> DP
- GND ---> ID/GND
Next, connect the TSC2007 to the PocketBeagle.
- VIN ---> + rail
- GND ---> - rail
- SCL ---> P2_09 (Be sure to add a 220 ohm pullup resistor to the power rail)
- SDA ---> P2_11 (Be sure to add a 220 ohm pullup resistor to the power rail)
Now connect the SPI screen to the PocketBeagle. Before doing so, solder the IM1/IM2/IM3 jumpers closed to put the screen in SPI mode:
- GND ---> - rail
- VIN ---> + rail
- CLK ---> CLK (P1_08)
- MISO ---> MISO (P1_10)
- MOSI ---> MOSI (P1_12)
- CS ---> CS (P1_06)
- D/C ---> D/C (P1_04)
- RST ---> RST (P1_02)
Finally, connect the +X, -X, +Y, -Y ports on the SPI side of the screen to the corresponding ports on the TSC2007.
Now plug the network adapter and the USB microphone into the USB hub, and plug the hub into your PocketBeagle.
Configuration
Ensure your SD card is flashed; then you can begin configuration.
Install Python and the Adafruit libraries:
$ sudo apt-get update
$ sudo apt-get install build-essential python-dev python-setuptools python-smbus -y
$ sudo apt-get install python-pip python3-pip -y
$ sudo apt-get install zip -y
$ sudo pip3 install --upgrade setuptools
$ sudo pip3 install --upgrade Adafruit_BBIO
$ sudo pip3 install adafruit-blinka
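With Blinka installed, you can sanity-check the I2C wiring from the Wiring section before going further. A minimal sketch that scans the bus (the TSC2007's default address, 0x48, and Blinka's PocketBeagle pin mapping are assumptions here):

# i2c_check.py -- quick sanity check that the TSC2007 shows up on the bus.
import board

i2c = board.I2C()  # uses the SCL/SDA pins wired earlier
while not i2c.try_lock():
    pass
try:
    addresses = i2c.scan()
    print("I2C devices found:", [hex(a) for a in addresses])
    # The TSC2007 should appear at its default address, 0x48.
finally:
    i2c.unlock()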
To set up the network adapter, run the following commands:
$ sudo connmanctl
>> enable wifi
>> scan wifi
>> services
>> agent on
>> connect <network>
>> services
>> quit
Next, set up the microphone:
$ sudo apt-get update
$ sudo apt-get install -y swig libpulse-dev libasound2-dev
$ nano ~/.asoundrc
$ sudo apt-get install -y python-pyaudio python3-pyaudio
$ arecord -f S16_LE -r 48000 -d 5 test.wav
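The arecord line above records a 5 second test clip; if it works, recording from Python should too. A minimal sketch using PyAudio (installed above), recording from the default input device; the file name is just for illustration:

# record_test.py -- record 5 seconds of 48 kHz mono audio to test.wav.
import wave
import pyaudio

RATE, SECONDS, CHUNK = 48000, 5, 1024
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(RATE * SECONDS // CHUNK)]
stream.stop_stream()
stream.close()
pa.terminate()

with wave.open("test.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))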
Now install the SPI screen requirements:
$ sudo apt-get update
$ sudo pip3 install --upgrade Pillow
$ sudo pip3 install adafruit-circuitpython-busdevice
$ sudo pip3 install adafruit-circuitpython-rgb-display
$ sudo apt-get install ttf-dejavu -y
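To verify the display, here is a minimal sketch using the adafruit-circuitpython-rgb-display driver. It assumes the screen is an ILI9341-based panel (the IM1/IM2/IM3 jumpers suggest Adafruit's 2.8" TFT), and the Blinka pin constants below may be named P1_06-style depending on your image:

# display_test.py -- initialize the screen and draw a line of text.
import board
import digitalio
from PIL import Image, ImageDraw, ImageFont
from adafruit_rgb_display import ili9341

cs = digitalio.DigitalInOut(board.P1_6)    # CS  (P1_06)
dc = digitalio.DigitalInOut(board.P1_4)    # D/C (P1_04)
rst = digitalio.DigitalInOut(board.P1_2)   # RST (P1_02)

display = ili9341.ILI9341(board.SPI(), cs=cs, dc=dc, rst=rst,
                          baudrate=24000000)

# Draw some text with the DejaVu font installed above.
image = Image.new("RGB", (display.width, display.height))
draw = ImageDraw.Draw(image)
font = ImageFont.truetype(
    "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 24)
draw.text((10, 10), "Hello, VocalVision!", font=font, fill=(255, 255, 255))
display.image(image)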
Now install the touchscreen requirements:
$ sudo pip3 install adafruit-circuitpython-tsc2007
$ sudo pip3 install circup
$ sudo circup install adafruit_tsc2007
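You can then confirm touch input with a short loop; this follows the adafruit_tsc2007 library's touch API:

# touch_test.py -- print raw touch points from the TSC2007.
import board
import adafruit_tsc2007

i2c = board.I2C()
tsc = adafruit_tsc2007.TSC2007(i2c)

while True:
    if tsc.touched:
        point = tsc.touch  # dict with raw "x", "y", and "pressure"
        print("Touch at (%d, %d), pressure %d" %
              (point["x"], point["y"], point["pressure"]))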
Now install the required audio processing libraries:
$ sudo pip3 install SpeechRecognition
$ sudo pip3 install soundfile
$ sudo pip3 install sounddevice
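SpeechRecognition is what lets the PocketBeagle turn a recorded .wav into text. A minimal sketch, assuming the Google Web Speech backend (which needs the network adapter connected) and the test.wav recorded earlier:

# transcribe_test.py -- transcribe a recorded wav file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("test.wav") as source:
    audio = recognizer.record(source)
try:
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio.")
except sr.RequestError as err:
    print("Speech service unavailable:", err)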
Now, on both your host computer and the PocketBeagle, clone the git repository. Then, on the host computer, install all of the requirements listed in pyAudioAnalysis's requirements.txt file. We use a host computer rather than performing the classification on the PocketBeagle because the PocketBeagle does not support some of the library versions pyAudioAnalysis requires.
Host Computer Pre-Processing
Since this project relies on data to develop an audio classification model, you need to collect speech samples from whoever you'd like in your database. I asked 8 of my classmates for roughly 1.5-minute-long speech samples, reciting the first 3 lists of Harvard sentences because they are phonetically balanced. Make sure these are .wav files. Next, in the data folder of pyAudioAnalysis, create a folder for each voice sample and place its .wav file in it. Run the splitter.py code with the correct file names to split your audio into 10-second-long .wav files (a sketch of such a splitter appears below). Note that the minimum number of .wav files in any one folder restricts the maximum number of people in your database; in other words, if you want more people in the model, you need much longer speech samples.
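splitter.py itself lives in the repository; in case it helps, here is a minimal sketch of the same idea using soundfile (the paths are placeholders):

# split_sample.py -- cut one speech sample into 10 second .wav chunks.
import soundfile as sf

data, rate = sf.read("data/Speaker/sample.wav")
chunk = 10 * rate
for i in range(len(data) // chunk):
    sf.write("data/Speaker/sample_%02d.wav" % i,
             data[i * chunk:(i + 1) * chunk], rate)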
After you've compiled the samples, you can now generate the model you will use to classify inputs. Be sure you are in the pyAudioAnalysis folder when you run this command in your terminal:
$ python audioAnalysis.py trainClassifier -i <directory1> ... <directoryN> --method <svm, svm_rbf, knn, extratrees, gradientboosting or randomforest> -o <modelName>
In my case I ran:
$ python audioAnalysis.py trainClassifier -i data/Brendan/ data/Peter/ data/Sunny/ data/Parnika/ data/Sahana/ data/Mr.Welsh/ --method svm -o svmVoice
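Once the model exists, the host can classify a new recording against it. pyAudioAnalysis exposes this from the command line (classifyFile) and from Python; a minimal sketch of the latter, assuming a recent pyAudioAnalysis where the helper is named file_classification, and that the recording the PocketBeagle sends over is saved as recorded.wav:

# classify_test.py -- predict the speaker of a recording with the SVM model.
from pyAudioAnalysis import audioTrainTest as aT

class_id, probabilities, class_names = aT.file_classification(
    "recorded.wav", "svmVoice", "svm")
print("Predicted speaker:", class_names[int(class_id)])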
Now you are ready to run!
Running the Program
On your host computer, run the file_transfer.py script. Configure your PocketBeagle to run the main.py script on startup, and you have just built VocalVision!
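file_transfer.py and main.py are the repository's own scripts; as a rough illustration of the host side, here is a minimal sketch that waits for the PocketBeagle to push one recording over a TCP socket (the port and file name are assumptions, not the repository's actual protocol):

# receive_sketch.py -- accept one pushed .wav file from the PocketBeagle.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("0.0.0.0", 5005))  # assumed port
    server.listen(1)
    conn, addr = server.accept()
    print("Connection from", addr)
    with conn, open("recorded.wav", "wb") as out:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            out.write(data)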
Next Steps
I hope to eventually make this device work in real time, like live captions: while the speaker is talking, it would analyze their speech and print out both what they are saying and who it has classified them as. Additionally, I'd love to be able to run this without the need for a host computer.