I will test OpenAI Whisper audio transcription models on a Raspberry Pi 5. The main goal is to understand if a Raspberry Pi can transcribe audio from a microphone in real-time.
I will test the Tiny, Base, Small, Medium, and Large models and compare the results.
If you are too lazy to read this article, you can watch the video version instead 😁
Let’s start with understanding what real-time transcription is in the following example.
How will real-time transcription work?

When Rick says something, a sound wave goes to the microphone. The microphone converts it into an electric signal and sends it to the Raspberry Pi.
The Raspberry Pi writes this input stream into a WAV file, an uncompressed audio format.
It could write all the data into one big file, but I want to transcribe audio while someone speaks, not, let's say, a day later. So, after every 10 seconds of recording, I close the file and add it to a queue. The sound recorder keeps writing the stream into a new file.
The audio transcriber then gets the first element from the queue and transcribes it. It saves the result text from Whisper into a file or a database and grabs the following audio file from the queue.
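To make the pipeline concrete, here is a minimal single-process sketch of the idea (the real project splits it into two scripts, linked later); the recording and transcription steps are stubbed out:

import queue
import threading
import time

audio_queue = queue.Queue()

def record_chunks(chunk_seconds: float = 10.0):
    # Producer: in the real setup, pyaudio writes a 10-second WAV file,
    # closes it, and pushes its path into the queue. Here we only simulate it.
    chunk_index = 0
    while True:
        time.sleep(chunk_seconds)  # "recording" a chunk
        audio_queue.put(f"data/chunk_{chunk_index:04d}.wav")
        chunk_index += 1

def transcribe_chunks():
    # Consumer: grab the oldest chunk and hand it to Whisper (stubbed out here)
    while True:
        file_path = audio_queue.get()
        print(f"Transcribing {file_path}, backlog: {audio_queue.qsize()}")
        audio_queue.task_done()

threading.Thread(target=record_chunks, daemon=True).start()
transcribe_chunks()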
Straightforward.
But as usual, it’s not so simple.
Real-time transcription can work if the recognition time is less than the length of a recording chunk (in our case, 10 seconds). Then the queue will have 1-2 files in line, and everything will go smoothly.
However, if transcription takes longer than a recording chunk, the queue will constantly grow and never be fully drained.
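A quick back-of-the-envelope check makes the problem obvious; the 30-second figure below is only an illustration, not a measurement yet:

CHUNK_SECONDS = 10       # length of each recorded chunk
TRANSCRIBE_SECONDS = 30  # hypothetical transcription time per chunk

if TRANSCRIBE_SECONDS <= CHUNK_SECONDS:
    print("The queue stays bounded: real-time transcription is possible")
else:
    growth_per_minute = 60 / CHUNK_SECONDS - 60 / TRANSCRIBE_SECONDS
    print(f"The queue grows by about {growth_per_minute:.0f} chunks every minute")

With these numbers, the backlog grows by roughly 4 chunks per minute and never shrinks.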
The OpenAI Whisper model can run on a CPU or an Nvidia graphics card. Unfortunately, the Raspberry Pi has no Nvidia chip, and Whisper doesn't work with TPUs like the Google Coral.
So, I will run it on a CPU only.
The Raspberry Pi 5 has a 64-bit 2.4 GHz quad-core ARM processor.
Can the Raspberry Pi accomplish real-time transcription with Whisper on a CPU?
And that’s what we are going to find out next.
OS setup

I'll install the latest Debian image using the Raspberry Pi Imager. There's nothing special; everything is straight out of the box.
Then, I'll insert the SD card into the Raspberry Pi and plug in an external USB microphone.
Next, I'll connect to the Raspberry Pi via SSH. The Pi Imager sets the username and password, so I'll use those. Then, I'll clone my repository, which has all the required code and scripts.
git clone https://github.com/Nerdy-Things/openai-whisper-raspberry-pi.git
Once downloaded, I'll navigate to the openai-whisper-raspberry-pi folder and then to its subfolder named system. The script install.sh will set up all the dependencies, mainly several Python libraries like numpy and torch, which Whisper requires.
cd openai-whisper-raspberry-pi/system
./install.sh
The content of the file:
sudo apt update
sudo apt-get install -y ffmpeg sqlite3 portaudio19-dev python3-pyaudio
pip install numpy==1.26.4 --break-system-packages
pip install -U openai-whisper --break-system-packages
pip install pyaudio --break-system-packages
pip install pydub --break-system-packages
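Once the dependencies are installed, a quick way to confirm the USB microphone is visible is to list PyAudio's input devices. This little check is my own helper, not part of the repository:

import pyaudio

# List every audio device that offers at least one input channel
audio = pyaudio.PyAudio()
for index in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(index)
    if info.get("maxInputChannels", 0) > 0:
        print(f"{index}: {info['name']} ({int(info['maxInputChannels'])} input channels)")
audio.terminate()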
Setting up the test environment

Now, let's talk about Whisper AI models. They come in various sizes: large, medium, small, base, and tiny.
For the first test, I'll use the medium English-only model. The large model requires 10GB of memory, but I have a budget version of the Raspberry Pi with only 4GB, so it won't run.
In this experiment, I'll open several terminal windows for the test:
- The first window will run and show the AI transcription process.
- The second window will handle the audio recording.
- The third window will display the transcribed text.
- The fourth window will show memory usage and CPU information.
- A YouTube video on the laptop will play some recordings for transcription.
The laptop plays audio, and the Raspberry Pi listens to it and transcribes it into a local file.
I'll see all the results in terminal windows connected to the Raspberry Pi over SSH.
For recognition, I am using the following simple code (print statements removed for brevity):
import whisper
from time_util import TimeUtil


class AiWhisper:
    _models = ["tiny.en", "base.en", "small.en", "medium.en"]
    _model = None

    def __init__(self, model_index: int = 0):
        # Load the selected English-only Whisper model by index
        self._model = whisper.load_model(self._models[model_index])

    def transcode(self, file_path: str):
        TimeUtil.start("AiWhisper transcode")
        # fp16=False: the Pi runs the model on a CPU, so half precision is off
        result = self._model.transcribe(file_path, fp16=False, language='English')
        return result["text"]
For the audio recording, I use pyaudio and wave:
https://github.com/Nerdy-Things/openai-whisper-raspberry-pi/blob/master/python/audio_recorder.py
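The linked file does the real work; below is only a hedged, minimal sketch of the same idea, with constants and the output file name of my own choosing: record one 10-second chunk from the default microphone and save it as an uncompressed WAV file.

import wave
import pyaudio

RATE = 16000   # Whisper models work with 16 kHz audio
CHUNK = 1024   # frames per read
SECONDS = 10   # chunk length, matching the article

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

# Read enough buffers to cover 10 seconds of audio
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]

stream.stop_stream()
stream.close()
sample_width = audio.get_sample_size(pyaudio.paInt16)
audio.terminate()

# Save the captured frames as an uncompressed WAV file
with wave.open("chunk_0000.wav", "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(sample_width)
    wav_file.setframerate(RATE)
    wav_file.writeframes(b"".join(frames))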
Python threads can't run CPU-bound work in parallel (because of the GIL), so for simplicity, I am running 2 Python processes in 2 separate terminals by running these files:
https://github.com/Nerdy-Things/openai-whisper-raspberry-pi/blob/master/python/daemon_ai.py
https://github.com/Nerdy-Things/openai-whisper-raspberry-pi/blob/master/python/daemon_audio.py
Medium.en model

I'll run a Python script called python/daemon_ai.py to start the transcription queue, passing the integer 3 as an argument.
cd openai-whisper-raspberry-pi/python
python daemon_ai.py 3
It will download the medium.en model and attempt to open it. However, the Raspberry Pi will freeze.
I'll have to reboot it manually. After re-establishing the SSH connection, I'll demonstrate why this happens by running the htop command and the Python script again.
You can see that the model consumes all 4GB of memory plus the swap file, eventually causing the system to freeze.
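A tiny helper I find handy in this situation (my own, not part of the repo): check how much RAM is actually available before loading a model, straight from /proc/meminfo, so it needs no extra packages:

def available_memory_mb() -> float:
    # MemAvailable is reported in kB
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024
    return 0.0

print(f"Available memory: {available_memory_mb():.0f} MB")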
That's it for the medium model. It requires a more powerful device. Let's move on to the small model.
Small.en model

I'll use the same terminals, but with an argument of 2 this time.
python daemon_ai.py 2
The model will take some time to download, so I'll speed this up. Initializing the model takes about 2 minutes—it's just a start-up action, but it is still significant. Half of the memory is used, leaving 2GB free.
The AI queue indicates there's no audio, so I'll start the audio recording in the second window by running the following:
cd openai-whisper-raspberry-pi/python
python daemon_audio.py
After that, I will play the YouTube video for transcription.
The audio recorder creates chunks that are 10 seconds long. These recordings are added to a queue and stored in a data folder with the recording date. I can open this in the third window. When the AI transcriber processes the first audio chunk, it will create a text file.
In the third window, we will open the data folder:
cd openai-whisper-raspberry-pi/data
Then, open a path with the current date. For me, it's:
The text will indicate that the AI has processed the first item from the queue. The full text of the video chunk was saved to a file in a data folder.
Let’s review how fast it was processed on a Raspberry PI.
As we can see, we have ten elements in the queue waiting for transcription.
Processing a 10-second audio chunk takes over 30 seconds.
In other words, I record three audio chunks while Whisper transcribes one.
That's unfortunate.
It works, but it makes live transcription impossible with these timings because the queue will grow infinitely.
Base.en model

I am using the same set of windows and the same daemon_ai.py command. But this time, I will use the base model with an index of 1.
python daemon_ai.py 1
The library downloads the required files the first time it runs. In the htop output, the system and Whisper consumed about eight hundred megabytes of memory, which is pretty low.
Then I'll run audio in the second window:
python daemon_audio.py
And I will start the YouTube video.
As a result, I have far more transcribed text than in the previous case, and that’s a good sign.
If we look at the queue, we will see that it has only three elements, compared to 10 in the previous case.
The transcription time of each chunk is nearly 10 seconds, sometimes less, sometimes more. But overall, the queue is still growing.
So even though recording and transcribing take approximately the same time, this slight difference makes it unsuitable for live transcription.
Theoretically, we could win back this time on I/O operations by writing to memory instead of the SD card.
However, it’s not the topic of this video. So the verdict is “kind of possible, but needs adjustments.”
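If you want to experiment with that idea anyway, one low-effort option is to point the recorder's output at a RAM-backed path such as /dev/shm instead of the SD card; the directory names in this sketch are my own, not the repo's:

import os

# Prefer RAM-backed storage for audio chunks when it is available
RAM_DIR = "/dev/shm/whisper-chunks"
DATA_DIR = RAM_DIR if os.path.isdir("/dev/shm") else "data"
os.makedirs(DATA_DIR, exist_ok=True)
print(f"Audio chunks will be written to {DATA_DIR}")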
Let’s move on to the final test of the tiny model.
Tiny.en model

Again, the same set of windows and the same commands. The index of the tiny model is zero.
python daemon_ai.py 0
Looking into htop, approximately 700 megabytes are used, which is similar to what we've seen with the base model.
I'll play the video and collect some data.
After the same video chunk, I have even more text than in the previous test. I ended the video at exactly the same moment, but here we have a few more sentences.
Next, let's look at the queue. It has zero elements that are waiting for their turn. A transcription process takes approximately 6 seconds on average for a 10-second audio.
It looks like we can transcribe faster than real time.
It's a win.
Also, if we look at the transcription quality, we will find that it understands everything fine. Of course, Robert Downey Jr. has excellent pronunciation, and it was easy for a tiny model, but I am not trying to test the model's quality.
I am trying to test performance.
And the performance is good.
We can use a Raspberry Pi 5 and OpenAI Whisper for live transcription without an additional graphics card.
And that's cool.
Let me know what you think in the comments.
Thank you for reading!