ASR Gets a Shot of Moonshine

Moonshine is a speech-to-text model optimized for fast, accurate on-device speech recognition, and it outperforms Whisper.

Torre uses a Moonshine model for real-time translations (📷: Useful Sensors)

They may be tried and true, but keyboards and touchscreens are not always the ideal input devices. For applications ranging from live translation to accessibility tools, personal assistants, and smart home devices, voice control is often much more natural and efficient. Or at least it could be. The problem is that many automatic speech recognition algorithms (the top-performing ones, anyway) require substantial computing horsepower. As such, requests are typically sent to a cloud-based service for processing, and that can mean waiting several seconds for a response.

That delay does not make for a good user experience. In a smart home, it might be little more than a minor annoyance. But in live translation, it can disengage the participants and make it difficult to communicate. The team at Useful Sensors took on this problem recently and came up with a novel speech-to-text model called Moonshine that has been optimized for fast and accurate automatic speech recognition on resource-constrained devices. The flexibility of this approach allows it to outperform even state-of-the-art models like OpenAI's Whisper, in both speed and accuracy.

Moonshine excels with short audio clips (📷: N. Jeffries et al.)

Traditional approaches, such as Whisper, achieve high accuracy but face significant latency issues, especially when deployed on low-cost hardware. Whisper's fixed-length encoder-decoder transformer architecture requires 30-second chunks of audio input, padding shorter segments with zeros, which results in a constant processing cost no matter how short the utterance is. This imposes a firm lower bound on latency: in Whisper's case, around 500 milliseconds even for short audio inputs.
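To make that overhead concrete, here is a minimal sketch of fixed-window padding. This is illustrative only, not Whisper's actual preprocessing code; the constants and function names are our own:

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second, typical for ASR models
WINDOW_SECONDS = 30    # Whisper's fixed input window

def pad_to_fixed_window(audio: np.ndarray) -> np.ndarray:
    """Zero-pad (or truncate) a clip to exactly 30 seconds,
    mirroring Whisper-style fixed-length preprocessing."""
    target = SAMPLE_RATE * WINDOW_SECONDS
    if len(audio) >= target:
        return audio[:target]
    return np.pad(audio, (0, target - len(audio)))

# A 2-second utterance still becomes a 30-second encoder input,
# so most of the compute is spent on trailing silence.
clip = np.zeros(SAMPLE_RATE * 2, dtype=np.float32)
padded = pad_to_fixed_window(clip)
print(len(clip) / SAMPLE_RATE, "s of speech ->", len(padded) / SAMPLE_RATE, "s of input")
```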

The Moonshine family of models aims to preserve Whisper's accuracy while improving computational efficiency by adopting a variable-length processing approach. Moonshine eliminates the need for zero-padding, so processing requirements scale in proportion to the actual length of the audio input. This allows Moonshine to avoid the fixed overhead of Whisper's architecture; in empirical testing, that yielded up to a 35x speed-up in ideal conditions and roughly a 5x speed-up overall.
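As a back-of-the-envelope illustration of that scaling (assuming encoder compute grows roughly linearly with input length, which is a simplification):

```python
SAMPLE_RATE = 16_000
FIXED_WINDOW = 30 * SAMPLE_RATE   # Whisper-style: always process 30 s

def encoder_cost(num_samples: int) -> float:
    """Toy stand-in for encoder compute, assumed linear in input length.
    Cost is expressed as 'seconds of audio processed'."""
    return num_samples / SAMPLE_RATE

for seconds in (2, 10, 30):
    actual = seconds * SAMPLE_RATE
    fixed = encoder_cost(FIXED_WINDOW)   # zero-padded input
    variable = encoder_cost(actual)      # Moonshine-style: actual length only
    print(f"{seconds:>2}s clip: speed-up ~{fixed / variable:.1f}x")

# Shorter clips benefit most; a full 30-second clip sees no advantage,
# which is why the headline gains appear on short utterances.
```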

Moonshine model architecture (📷: N. Jeffries et al.)

Moonshine has already moved from theory to practice with Useful Sensors' recent release of a device called Torre. It is a dual-screened tablet that was designed from the ground up for live translation tasks. The idea is that people can sit across from one another and speak in their own language, and the other person's display will show a translation of what is being said in real time. Speed is crucial for such an application, as is privacy (another strike against cloud-based services), so Torre runs a Moonshine model directly on-device.

Benchmarks show that Moonshine has a slight edge over Whisper in terms of word error rate, on top of the significant speed increases. If you would like to give a Moonshine model a whirl yourself, the source code and model weights are available on GitHub under a permissive MIT license.
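As a quick start, the snippet below follows the small Python API documented in the project's README; treat the package name and the transcribe() helper as assumptions and check the repository for current instructions:

```python
# Install per the repo's README (verify before running; the package name
# here is taken from the project's documentation at the time of writing):
#   pip install useful-moonshine@git+https://github.com/usefulsensors/moonshine.git
import moonshine

# Transcribe a local 16 kHz WAV clip with the smallest model variant.
# 'clip.wav' is a placeholder; 'moonshine/tiny' follows the README's example.
text = moonshine.transcribe('clip.wav', 'moonshine/tiny')
print(text)
```

Happy hacking!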
