NVIDIA Releases Ultra-Compact Automatic Speech Recognition Model, QuartzNet

Achieves accuracy similar to the Jasper model with as few as 6.4 million parameters, down from 201 million.

While the model is edge-suitable, the training required NVIDIA's DGX-2 SuperPOD systems. (📷: NVIDIA)

NVIDIA has announced the release of QuartzNet, an end-to-end neural automatic speech recognition (ASR) model which it claims is small enough to implement at the edge — meaning that lower-specification devices wouldn't need to offload speech recognition to remote servers.

"As computers and other personal devices have become increasingly prevalent, interest in conversational AI has grown due to its multitude of potential applications in a variety of situations," the researchers — Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang, who built on an initial model designed by University of Illinois Urbana-Champaign intern Samuel Kriman — explain.

"Each conversational AI framework is comprised of several more basic modules such as automatic speech recognition (ASR), and the models for these need to be lightweight in order to be effectively deployed on the edge, where most of the devices are smaller and have less memory and processing power. However, most state-of-the-art (SOTA) ASR models are extremely large — they tend to have on the order of a few hundred million parameters. This makes them hard to deploy on a large scale given current limitations of devices on the edge."

The solution: QuartzNet, a new end-to-end neural model based on Jasper but considerably smaller, at between 6.4 million and 18.9 million parameters, down from 201 million for the smallest Jasper model. "QuartzNet replaces Jasper's 1D convolutions with 1D time-channel separable convolutions," the team shares, "which use many fewer parameters. This means that convolutional kernels can be made much larger, and the network can be made much deeper, without having too large of an impact on the total number of parameters in the model."
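The parameter savings the team describes can be seen with a little arithmetic. The sketch below is purely illustrative (not NVIDIA's code, and the layer sizes are hypothetical, not QuartzNet's actual configuration): a standard 1D convolution needs weights for every input-channel/output-channel/kernel-position combination, while a time-channel separable convolution splits this into a depthwise convolution over time followed by a 1x1 pointwise convolution across channels.

```python
# Illustrative parameter counts for a plain 1D convolution versus a
# time-channel separable (depthwise + pointwise) one. Biases omitted.
# Layer sizes below are hypothetical, chosen only to show the ratio.

def conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard 1D convolution: every input channel is
    connected to every output channel at every kernel position."""
    return c_in * c_out * k

def separable_conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise convolution over time (c_in * k weights) followed by
    a 1x1 pointwise convolution mixing channels (c_in * c_out weights)."""
    return c_in * k + c_in * c_out

if __name__ == "__main__":
    c_in, c_out, k = 256, 256, 33  # example sizes; a large kernel stays cheap
    plain = conv1d_params(c_in, c_out, k)
    sep = separable_conv1d_params(c_in, c_out, k)
    print(f"plain: {plain:,}  separable: {sep:,}  ratio: {plain / sep:.1f}x")
    # plain: 2,162,688  separable: 73,984  ratio: 29.2x
```

Because the depthwise term grows only linearly with kernel size, widening the kernel or deepening the network adds far fewer parameters than it would with plain convolutions, which is the trade-off the researchers exploit.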

Despite being small enough to run on edge devices, QuartzNet is claimed to offer impressive accuracy: a 2.6 percent word error rate (WER) on the LibriSpeech test-clean set with Transformer-XL rescoring.

The team has released a paper on QuartzNet, as well as making both the source code and pre-trained models available via the Neural Modules (NeMo) toolkit on the NVIDIA GitHub repository.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.