Can You Hear Me Now?
A new system combines the Whisper foundation model with a modified multi-modal LLaMA model to improve automatic speech recognition accuracy.
Automatic Speech Recognition (ASR) is a technology that enables machines to convert spoken language into written text. This technological innovation has found widespread applications in consumer devices, particularly in smart speakers and other digital assistants. Smart speakers, such as Amazon Echo, Google Home, and Apple HomePod, leverage ASR to understand and respond to user voice commands, making them an integral part of modern smart homes.
One of the key benefits of ASR in consumer devices is the convenience it offers. Users can control various aspects of their smart homes effortlessly through voice commands, eliminating the need for more cumbersome inputs. Moreover, ASR contributes to accessibility by enabling voice-based interfaces for individuals with disabilities, making technology more inclusive.
For ASR systems to be useful, especially in consumer devices, accuracy is of paramount importance. Incorrect transcriptions can lead to misinterpretation of user commands, resulting in inappropriate device behavior or frustrating user experiences. For instance, a misheard command might cause a smart speaker to turn all of the lights in a home off instead of on. To mitigate such issues, ASR systems must continually improve their accuracy through advanced machine learning algorithms and robust training datasets.
Many such improvements have been proposed, with two-pass approaches that feed the ASR results into a large language model for correction gaining a lot of steam lately. While these techniques have improved the state of the art, there is still plenty of room for improvement. A multi-institutional research effort led by teams at the King Abdullah University of Science and Technology and NVIDIA is seeking to further improve ASR accuracy by incorporating additional data modalities. They reasoned that since speech recognition depends on both acoustic information (e.g., sounds in the speaker's environment) and linguistic information (e.g., domain-specific knowledge), both types of data should be captured and processed by the system.
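To make the general two-pass idea concrete, here is a minimal Python sketch of how an ASR system's candidate transcripts might be handed off to a language model for correction. The prompt wording and the example hypotheses are illustrative assumptions, not the exact setup used by any particular system.

```python
# Minimal sketch of a two-pass ASR correction pipeline (illustrative only).
# Pass 1: an ASR model produces an n-best list of candidate transcripts.
# Pass 2: a language model is prompted to produce a corrected transcript.

def build_correction_prompt(nbest_hypotheses):
    """Format an n-best hypothesis list into a prompt for an LLM corrector."""
    numbered = "\n".join(
        f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest_hypotheses)
    )
    return (
        "The following are candidate transcripts of the same utterance, "
        "ranked by an ASR system. Produce the single most likely correct "
        f"transcript.\n\n{numbered}\n\nCorrected transcript:"
    )


if __name__ == "__main__":
    # Hypothetical n-best list for a smart-home voice command.
    hypotheses = [
        "turn the lights off in the living room",
        "turn the light off in the living room",
        "turn the lights of in the living room",
    ]
    print(build_correction_prompt(hypotheses))
    # In the second pass, this prompt would be fed to a language model,
    # which returns the error-corrected transcript.
```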
Toward this goal, the team developed a system that they call Whispering-LLaMA. Given the name, you can probably guess that the first component is the Whisper ASR foundation model, which was trained on hundreds of thousands of hours of multilingual audio data. Presented with a speech sample, this portion of the pipeline produces transcripts of the n-best hypotheses. Also implied by the name, the second piece of the system is the large language model LLaMA, which generates error-corrected transcripts by drawing on the knowledge of language encoded within it. Unlike previous approaches, the language model was also modified to accept features generated by the Whisper model, which provides it with additional acoustic information to help it make more accurate predictions.
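The architectural twist here is letting the language model attend to acoustic features from Whisper's encoder. The PyTorch snippet below is a rough, hypothetical sketch of one way such a fusion adapter could look; the layer sizes, the zero-initialized gate, and the cross-attention design are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class AcousticFusionAdapter(nn.Module):
    """Hypothetical adapter that lets an LLM's hidden states attend to
    Whisper encoder features. This is a sketch of the general fusion idea,
    not the exact Whispering-LLaMA implementation."""

    def __init__(self, llm_dim=4096, whisper_dim=768, num_heads=8):
        super().__init__()
        # Project Whisper encoder features into the LLM's hidden dimension.
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)
        # Cross-attention: text hidden states query the acoustic features.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=llm_dim, num_heads=num_heads, batch_first=True
        )
        # Gate initialized to zero so the adapter starts as a no-op and
        # the pretrained language model's behavior is preserved early on.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, whisper_features):
        # text_hidden: (batch, text_len, llm_dim) from a LLaMA layer
        # whisper_features: (batch, audio_len, whisper_dim) from Whisper's encoder
        audio = self.audio_proj(whisper_features)
        attended, _ = self.cross_attn(text_hidden, audio, audio)
        # Residual connection, scaled by the learned gate.
        return text_hidden + self.gate * attended


# Example forward pass with random tensors standing in for real features.
adapter = AcousticFusionAdapter()
text_hidden = torch.randn(1, 32, 4096)        # placeholder LLaMA hidden states
whisper_features = torch.randn(1, 1500, 768)  # placeholder Whisper encoder output
fused = adapter(text_hidden, whisper_features)
print(fused.shape)  # torch.Size([1, 32, 4096])
```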
The Whispering-LLaMA approach was evaluated against a wide variety of existing ASR datasets. It was found that fusing the data modalities led to a 37.66% relative improvement in word error rate. These very encouraging results suggest that the methods employed in developing Whispering-LLaMA could have value in producing a new generation of more accurate ASR tools. The team hopes that their work will encourage other researchers to further explore this possibility. They have also open-sourced all of their code and pre-trained models to give other teams a running start.
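For context, a relative improvement is measured against the baseline's error rate rather than as an absolute difference in percentage points. The quick illustration below uses made-up numbers, not figures from the paper.

```python
# Relative word error rate (WER) improvement, illustrated with made-up numbers.
# A 37.66% relative improvement means the error rate drops by 37.66% of its
# original value, not by 37.66 percentage points.

baseline_wer = 0.20000   # hypothetical baseline: 20% of words transcribed incorrectly
improved_wer = 0.12468   # hypothetical WER after fusing acoustic and linguistic data

relative_improvement = (baseline_wer - improved_wer) / baseline_wer * 100
print(f"Relative WER improvement: {relative_improvement:.2f}%")  # 37.66%
```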