A couple of years ago, my friends and I created a bird sound classifier and posted it here. We received quite a bit of recognition for it. A lot has changed since then, but one thing that hasn't changed is the excitement around TinyML. Now, after a couple of years of inactivity, we're back with another exciting project!
There is one small change: I'm writing this as co-founder of Tiny Prism Labs, and this project is something my team and I developed over the weekend to showcase our capabilities.
EdgeGPT - LLM on the Edge

We wanted to try out the voice-to-prompt feature of ChatGPT/Gemini, but on the edge. It's not a new project or concept, but as a TinyML company, we wanted to give it our own spin. So we searched around the office for whatever hardware we had available, and I found a Wio Terminal stashed behind all the boxes of microcontrollers. After looking at the specs, I chose it for three reasons:
- Microphone: Essential for capturing voice input.
- WiFi Module: Enables communication with our server.
- Display: Allows us to show the LLM's response.
Next, we needed to run an LLM, so we created a rough framework:
- The microphone records audio for a fixed number of seconds.
- The audio data is sent over WiFi to our laptop.
- An API running on the laptop receives the data.
- The API performs speech-to-text conversion.
- The generated text is used as a prompt for the LLM.
- The LLM's result is sent back to the Wio Terminal.
- The result is displayed on the screen.
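The framework above can be sketched as a single Flask endpoint. This is a minimal sketch, not our exact code: the `/edgegpt` route name is an assumption, and the two stub functions stand in for the Whisper transcription and PicoLLM generation steps described later in this post.

```python
# Minimal sketch of the laptop-side API (steps 3-7 of the framework).
# transcribe() and ask_llm() are placeholder stubs for the real
# speech-to-text and LLM calls.
from flask import Flask, request

app = Flask(__name__)

def transcribe(wav_bytes: bytes) -> str:
    """Stand-in for the Whisper speech-to-text step."""
    return "hello from the stub transcriber"

def ask_llm(prompt: str) -> str:
    """Stand-in for the PicoLLM text-generation step."""
    return f"LLM answer to: {prompt}"

@app.route("/edgegpt", methods=["POST"])
def edgegpt():
    raw_audio = request.get_data()   # raw PCM bytes from the Wio Terminal
    prompt = transcribe(raw_audio)   # speech-to-text
    reply = ask_llm(prompt)          # prompt the LLM
    return reply, 200                # the Terminal shows this on its display

# To serve on the LAN: app.run(host="0.0.0.0", port=5000)
```

The Wio Terminal only ever sees plain text going out and plain text coming back, which keeps the firmware side simple.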
The Seeed Studio Wio Terminal is like the Swiss Army knife of microcontrollers. This compact powerhouse packs a ton of features into its small form factor, making it perfect for embedded projects and IoT applications. Imagine a device with a built-in screen, WiFi, Bluetooth, sensors, and the ability to run machine learning models – that's the Wio Terminal in a nutshell. It's incredibly versatile, allowing you to create everything from simple gadgets to complex AI-powered devices. Whether you're a beginner or a seasoned maker, the Wio Terminal offers an accessible and powerful platform for bringing your ideas to life.
To get the display, WiFi, and microphone working, I started from the example programs and combined them. Since this was all done over a weekend, I wasn't too concerned about reliability, and after some code surgery we got everything working together.
The major hurdle was getting the microphone working. Because it's an analog input, noise reduction and data acquisition were tricky. Next came setting the sampling frequency and recording the audio, which we handled through a DMA buffer. Finally, I sent the 3-second audio clip to the API in a POST request.
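A handy trick while debugging the firmware was to fake the Wio Terminal's POST request from the laptop itself. This Python stand-in builds a payload the same shape as the device's; the 16 kHz sample rate, 16-bit mono format, and the `/edgegpt` URL are assumptions — the layout just has to match whatever the firmware actually sends.

```python
# Python stand-in for the Wio Terminal's 3-second audio POST,
# useful for testing the API without the hardware.
import struct

SAMPLE_RATE = 16_000  # assumed sampling frequency
SECONDS = 3

samples = [0] * (SAMPLE_RATE * SECONDS)               # silence stands in for mic data
payload = struct.pack(f"<{len(samples)}h", *samples)  # little-endian 16-bit PCM

# 3 s * 16,000 samples/s * 2 bytes/sample = 96,000 bytes on the wire
print(len(payload))

# With the API running and the requests package installed:
#   import requests
#   requests.post("http://<laptop-ip>:5000/edgegpt", data=payload)
```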
Building a Seamless Speech-to-Text API with Whisper and PicoLLM

The Flask-based application demonstrates an efficient approach to processing audio and converting it into meaningful text using OpenAI's Whisper model and PicoLLM. The application's API endpoint handles the raw audio input, processes it into a usable format, transcribes it, and generates a meaningful response.
Audio Processing and Preparation:
The API receives binary data from the Wio Terminal and converts it into a WAV format that meets the requirements for model input. This conversion ensures the audio is compatible and ready for further analysis. Critical steps include setting parameters such as the sample rate, number of channels, and bit depth to keep the audio consistent across the pipeline.
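The conversion step can be sketched with Python's standard `wave` package. The 16 kHz / mono / 16-bit parameters here are assumptions — they simply have to match what the Wio Terminal recorded.

```python
# Wrap raw little-endian PCM bytes in a WAV header using the
# standard-library wave module.
import io
import wave

def pcm_to_wav(raw: bytes, sample_rate: int = 16_000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Return a complete WAV file (as bytes) around the raw PCM data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)      # 1 = mono
        w.setsampwidth(sample_width)  # 2 bytes = 16-bit depth
        w.setframerate(sample_rate)   # samples per second
        w.writeframes(raw)
    return buf.getvalue()
```

Writing to an in-memory buffer avoids touching disk for every request; the resulting bytes can be saved or handed straight to the transcription step.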
Transcription with Whisper:
Once the audio is prepared, the application uses a locally downloaded base model from Hugging Face to perform transcription. This model processes the WAV file and generates text representing the spoken content. The transcription captures the key details of the audio, serving as a foundation for further analysis. This step ensures that the audio is transcribed locally, eliminating the need for internet connectivity and maintaining full control over the transcription process.
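With the open-source `openai-whisper` package (`pip install openai-whisper`), the transcription step looks roughly like this. The post mentions a base model downloaded from Hugging Face; `load_model("base")` fetches the same-sized checkpoint, so treat the exact loading mechanism as an assumption.

```python
# Transcription sketch using the openai-whisper package.
def transcribe_wav(path: str) -> str:
    import whisper  # imported lazily so the rest of the app loads without it
    model = whisper.load_model("base")  # cached locally after the first run
    result = model.transcribe(path)     # runs fully offline once downloaded
    return result["text"].strip()
```

After the first download, everything runs locally, which is what keeps this step internet-free.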
Text Processing with PicoLLM:
The transcribed text is then processed by the PicoLLM model for advanced text generation. PicoLLM Compression is an advanced algorithm developed by Picovoice for quantizing large language models (LLMs). Unlike traditional methods that rely on fixed bit allocation schemes, which often fall short in performance, PicoLLM Compression automatically determines the optimal bit allocation across and within the LLM's weights based on a task-specific cost function. This dynamic approach ensures more efficient compression, improving model performance and resource usage.
PicoLLM, accessed via its Python package, loads the required LLM model and generates tailored responses based on the transcription. This step enriches the transcribed data with meaningful responses, much like other open-source LLMs, but operates entirely locally without internet access, keeping the whole pipeline self-contained at the edge.
Prominent packages/tools used:
- Flask: For building the API
- Whisper: Base model for transcription
- PicoLLM: Gemma-2b-it model for generating prompt-based responses
- wave: Python's standard-library package for packaging the audio as WAV files
Now, on to EdgeGPT in action.
This weekend project was a fun exploration of how we can bring the power of LLMs to the edge. While there's still room for improvement in terms of reliability and efficiency, we're excited by the potential of this technology to create truly interactive and intelligent devices.
Imagine a future where tiny, low-power devices can understand and respond to our voice commands, providing us with information, assistance, and even companionship. This is the future we're building at Tiny Prism Labs, and we believe that the Wio Terminal, with its versatility and accessibility, is the perfect platform for making this vision a reality.
Stay tuned for more updates on this project and other exciting TinyML developments from Tiny Prism Labs! We're always pushing the boundaries of what's possible with embedded machine learning, and we can't wait to share our journey with you.