An LLM So Good, It’s Scary
SPIRIT LM leverages internet-scale text data in conjunction with the expressiveness of speech to create a model that better understands us.
After being trained on massive, internet-scale datasets, large language models (LLMs) with billions of parameters, such as Llama 2 and GPT-4o, have achieved impressive general-purpose language understanding and generation capabilities. These models are nothing if not versatile, performing a wide range of tasks from text summarization to translation and even complex reasoning. However, a notable limitation of text-based LLMs is that they miss the nuances present in verbal communication. Emotional cues, tone, and style, all key elements in conveying meaning in human interactions, are simply ignored.
On the other hand, speech-language models (SpeechLMs) are trained specifically to handle spoken language, which includes not only the words themselves but also their delivery, with its differences in pitch, intonation, and emotional content. These models are particularly useful in applications like automatic speech recognition, text-to-speech, and translation. However, SpeechLMs tend to be specialized for particular tasks and trained on narrower datasets, which limits their ability to generalize across linguistic tasks in the way text-based LLMs can.
A team led by researchers at Meta AI has recently created what they call SPIRIT LM, which seeks to address the shortcomings of both text-based LLMs and speech-language models by combining the strengths of each. SPIRIT LM was trained on interleaved speech and text data, allowing it to understand and generate both text and speech while retaining the expressive qualities of spoken language. This dual capability makes it more effective at tasks that require both language understanding and expression across modalities.
SPIRIT LM was built on top of a text-based model, Llama 2, and was further trained with a mixture of text-only, speech-only, and aligned speech-text datasets. The speech data was tokenized using HuBERT tokens, which are designed to capture phonetic information. The model interleaves speech and text data at the word level during training to help it learn how the two modalities align, enabling better speech-to-text and text-to-speech transfer.
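To make the interleaving idea concrete, the sketch below shows one way word-aligned text and speech tokens could be merged into a single training stream, switching modality at word boundaries. This is a simplified illustration rather than Meta's actual pipeline: the AlignedWord structure, the [TEXT]/[SPEECH] markers, the [Hu...] token spelling, and the fixed switching interval are all assumptions made for clarity.

```python
# Illustrative sketch, not Meta's implementation: interleaving word-aligned text
# and HuBERT speech tokens into a single training stream, switching modality at
# word boundaries. The data structure, markers, and token spellings are assumptions.

from dataclasses import dataclass


@dataclass
class AlignedWord:
    text: str                # the written form of the word
    hubert_units: list[int]  # HuBERT unit IDs covering the same audio span (made-up values)


TEXT_MARKER = "[TEXT]"       # hypothetical modality markers
SPEECH_MARKER = "[SPEECH]"


def interleave(words: list[AlignedWord], switch_every: int = 3) -> list[str]:
    """Alternate between text and speech spans every `switch_every` words."""
    stream: list[str] = []
    for start in range(0, len(words), switch_every):
        span = words[start:start + switch_every]
        if (start // switch_every) % 2 == 0:
            # Text span: emit the words themselves.
            stream.append(TEXT_MARKER)
            stream.extend(w.text for w in span)
        else:
            # Speech span: emit the HuBERT unit tokens covering the same words.
            stream.append(SPEECH_MARKER)
            stream.extend(f"[Hu{u}]" for w in span for u in w.hubert_units)
    return stream


sentence = [
    AlignedWord("the", [12, 12, 87]),
    AlignedWord("cat", [44, 3]),
    AlignedWord("sat", [91, 91, 5]),
    AlignedWord("on", [7, 22]),
    AlignedWord("the", [12, 87]),
    AlignedWord("mat", [63, 63, 18]),
]
print(" ".join(interleave(sentence)))
# -> [TEXT] the cat sat [SPEECH] [Hu7] [Hu22] [Hu12] [Hu87] [Hu63] [Hu63] [Hu18]
```

The point of switching modality mid-sentence, rather than training on separate text and speech corpora, is to push the model toward a shared representation in which a written word and its spoken realization are interchangeable.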
The model comes in two versions: BASE and EXPRESSIVE. SPIRIT LM BASE uses only HuBERT tokens for speech representation, providing a strong foundation for tasks that involve both text and speech processing. SPIRIT LM EXPRESSIVE extends this by adding pitch and style tokens to capture the expressiveness of speech. The pitch tokens are derived from the fundamental frequency of the speech, while the style tokens are extracted from features that convey its expressive characteristics, such as emotion or intonation. These additional tokens allow the model to understand and generate speech that is not only phonetically accurate but also emotionally expressive, a key advance over models that focus solely on text.
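As a rough illustration of how pitch and style information could ride alongside the phonetic units, the sketch below quantizes a fundamental-frequency contour into coarse bins and interleaves the resulting pitch tokens with HuBERT unit tokens, with a single style token prefixed to the utterance. The bin count, frame stride, and [Pi...]/[St...] token names are assumptions for illustration, not the EXPRESSIVE model's actual tokenizer.

```python
# Illustrative sketch, not the paper's tokenizer: adding coarse pitch tokens and a
# style token to a stream of HuBERT unit tokens, roughly in the spirit of
# SPIRIT LM EXPRESSIVE. Bin count, stride, and token names are assumptions.

N_PITCH_BINS = 16    # hypothetical number of quantized F0 bins
PITCH_STRIDE = 4     # emit a pitch token every 4 speech frames (assumption)


def quantize_f0(f0_hz: float, f0_min: float = 60.0, f0_max: float = 400.0) -> int:
    """Map a fundamental-frequency value onto one of N_PITCH_BINS coarse bins."""
    clamped = min(max(f0_hz, f0_min), f0_max)
    frac = (clamped - f0_min) / (f0_max - f0_min)
    return min(int(frac * N_PITCH_BINS), N_PITCH_BINS - 1)


def expressive_stream(hubert_units: list[int], f0_track: list[float], style_id: int) -> list[str]:
    """Prefix one style token for the utterance, then interleave HuBERT unit
    tokens with periodic pitch tokens (style extraction itself is not shown)."""
    tokens = [f"[St{style_id}]"]
    for frame, unit in enumerate(hubert_units):
        if frame % PITCH_STRIDE == 0:
            tokens.append(f"[Pi{quantize_f0(f0_track[frame])}]")
        tokens.append(f"[Hu{unit}]")
    return tokens


# A short fragment with rising pitch (all values fabricated for illustration).
units = [12, 12, 87, 87, 44, 3, 91, 5]
f0 = [110.0, 112.0, 118.0, 125.0, 140.0, 155.0, 170.0, 190.0]
print(" ".join(expressive_stream(units, f0, style_id=2)))
# -> [St2] [Pi2] [Hu12] [Hu12] [Hu87] [Hu87] [Pi3] [Hu44] [Hu3] [Hu91] [Hu5]
```

A single per-utterance style token is used here only to keep the sketch small; in a real system, style would more plausibly be a separate token stream sampled at a coarser rate than pitch, since emotion and speaking style change more slowly than the pitch contour.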
Finding and curating datasets for multimodal models is still a big challenge. As a result, SPIRIT LM was not able to match Llama 2's performance, even when working with text-only data. This gap will have to be closed to keep this line of research progressing. Working with larger models may also help; to date, the team has only experimented with 7-billion-parameter models, which are relatively small in the world of LLMs.