Canopy Labs Releases Orpheus, a Permissively Licensed LLM for Convincing Text-to-Speech
Three-billion-parameter "medium" model available to download now, with smaller models due in the near future.
Artificial intelligence startup Canopy Labs, founded on a mission no smaller than building "digital humans that are indistinguishable from real humans," has released Orpheus, a family of large language models designed for text-to-speech generation, with the three-billion-parameter "medium" model available now.
"To date, open-source TTS [Text To Speech] models have not been competitive with closed source models," the company claims. "Nor have TTS models been capable of expressing empathy, consistent of the emotional intelligence of a human. We're introducing Orpheus, a family of state-of-the-art speech-LLMs, for human level speech generation. We demonstrate extremely high quality, aesthetically pleasing, speech generation even through very tiny model sizes."
Named for the Thracian bard in Greek mythology, known for entering Hades in search of his wife Eurydice, Orpheus is being released as pre-trained models based on Meta's Llama-3b, created using a dataset of over 100,000 hours of English speech and billions of text tokens, the company claims, though without specifying whether said dataset is itself open. The result is a text-to-speech model that can be prompted into conveying apparent emotions, and which can even be used for zero-shot voice cloning, a property the company claims emerged owing to the large pre-training corpus.
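As a rough illustration of what emotion prompting can look like, a cue can be embedded directly in the text handed to the model; the tag name, speaker name, and "voice: text" layout in the sketch below are assumptions made for illustration rather than confirmed Orpheus syntax.

```python
# Illustrative only: the "<laugh>" tag, the "tara" speaker name, and the
# "voice: text" prompt layout are assumptions, not confirmed Orpheus syntax.
voice = "tara"
text = "I honestly didn't think that would work <laugh> but here we are."
prompt = f"{voice}: {text}"
print(prompt)
```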
In addition to high accuracy, Canopy Labs claims that Orpheus is performant, streaming its output in real time with 200ms latency and dropping to 25-50ms if input text is streamed into the cache. This, the company says, is down to two key design choices: "We get 7 tokens per frame which we decode as a single flattened sequence rather than using 7 LM heads," the company explains, "[and] we use a non-streaming (CNN [Convolutional Neural Network]-based) tokenizer."
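To make the flattened-sequence idea concrete, the minimal sketch below (not Canopy Labs' code) shows how seven codec tokens per audio frame can be serialized into a single stream for one language-model head to predict, then regrouped into frames for the audio decoder.

```python
# A minimal sketch of the "flattened sequence" idea described by Canopy Labs:
# each audio frame is represented by 7 codec tokens, and rather than predicting
# them with 7 separate LM heads, the model emits them one after another in a
# single stream that is later regrouped into frames for the audio decoder.

FRAME_SIZE = 7  # tokens per audio frame, per the company's description

def flatten_frames(frames: list[list[int]]) -> list[int]:
    """Turn [[t0..t6], [t0..t6], ...] into one flat token sequence."""
    return [token for frame in frames for token in frame]

def unflatten_tokens(tokens: list[int]) -> list[list[int]]:
    """Regroup the LM's flat output into 7-token frames for the codec decoder."""
    return [tokens[i:i + FRAME_SIZE] for i in range(0, len(tokens), FRAME_SIZE)]

# Example: two frames of dummy codec tokens survive the round trip intact.
frames = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
assert unflatten_tokens(flatten_frames(frames)) == frames
```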
Canopy Labs has released pre-trained and fine-tuned "medium" models with three billion parameters on GitHub and Hugging Face under the permissive Apache 2.0 license; it has also promised to release pre-trained and fine-tuned models in smaller one-billion, 400-million, and 150-million-parameter sizes in the near future, for use on resource-constrained devices.
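For those wanting to experiment, the released checkpoint can in principle be loaded like any other Hugging Face causal language model; the repository ID and prompt format in the sketch below are assumptions, and turning the generated audio tokens into a waveform requires the codec decoder provided in the project's own repository.

```python
# Hedged sketch: the repository ID and prompt format are assumptions; consult
# the Orpheus GitHub and Hugging Face pages for the exact usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "canopylabs/orpheus-3b-0.1-ft"  # assumed name of the fine-tuned 3B checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The model emits audio-codec tokens rather than text; decoding them into a
# waveform needs the codec decoder shipped with the project.
prompt = "tara: Hey, I didn't expect to see you here!"  # assumed "voice: text" format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
audio_tokens = model.generate(**inputs, max_new_tokens=1024)
```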