Canopy Labs Releases Orpheus, a Permissively-Licensed LLM for Convincing Text to Speech

Three-billion-parameter "medium" model available to download now, with smaller models due in the near future.

Artificial intelligence startup Canopy Labs, founded on no smaller a mission than to build "digital humans that are indistinguishable from real humans," has released Orpheus — a family of large language models designed for text-to-speech generation, with the three-billion-parameter "medium" model out now.

"To date, open-source TTS [Text To Speech] models have not been competitive with closed source models," the company claims. "Nor have TTS models been capable of expressing empathy, consistent [with] the emotional intelligence of a human. We're introducing Orpheus, a family of state-of-the-art speech-LLMs, for human level speech generation. We demonstrate extremely high quality, aesthetically pleasing, speech generation even through very tiny model sizes."

Canopy Labs has released pre-trained and fine-tuned 3B-parameter models of Orpheus, a new LLM for speech generation. (📷: Canopy Labs)

Named for the Thracian bard of Greek mythology who entered Hades in search of his wife Eurydice, Orpheus is being released as pre-trained models based on Meta's three-billion-parameter Llama 3 — created, the company claims, using a dataset of over 100,000 hours of English speech and billions of text tokens, though it does not specify whether that dataset is open. The result is a text-to-speech model that can be prompted into conveying apparent emotions — and which can even perform zero-shot voice cloning, a capability the company says emerged thanks to the large pre-training corpus.
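Canopy Labs doesn't detail the exact prompt syntax in its announcement, but emotion-conditioned speech-LLMs are typically driven by a speaker-prefixed text prompt with inline cue tags. As a rough illustration only — the `{voice}: {text}` format and tags such as `<laugh>` here are assumptions modeled on common speech-LLM conventions, not confirmed Orpheus syntax — such a request might be assembled like this:

```python
# Hypothetical sketch of an emotion-conditioned TTS prompt; the
# "{voice}: {text}" layout and the inline <laugh> tag are assumptions,
# not confirmed Orpheus syntax.

def build_prompt(voice: str, text: str) -> str:
    """Prefix the text with a speaker name, leaving inline emotion tags as-is."""
    return f"{voice}: {text}"

prompt = build_prompt("tara", "I can't believe it worked <laugh> on the first try.")
```

The resulting string would then be tokenized and fed to the model like any other LLM prompt, with the emotion tags steering delivery rather than being spoken aloud.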

In addition to high accuracy, Canopy Labs claims that Orpheus is performant — streaming its output in real time with around 200ms of latency, dropping to 25-50ms if the input text is streamed into the model's cache. This, the company says, is down to two key design decisions: "We get 7 tokens per frame which we decode as a single flattened sequence rather than using 7 LM heads," the company explains, "[and] we use a non-streaming (CNN [Convolutional Neural Network]-based) tokenizer."
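The flattened-sequence design can be illustrated with a short sketch — hypothetical code, not Canopy Labs' implementation. Rather than predicting all seven codes of an audio frame in parallel through seven separate LM heads, the model emits them one after another in a single stream, which is then regrouped into seven-code frames before being handed to the audio decoder:

```python
# Illustrative sketch only, assuming a multi-codebook audio tokenizer in
# which each audio frame is represented by 7 discrete codes, as Canopy Labs
# describes for Orpheus.

CODES_PER_FRAME = 7

def flatten_frames(frames):
    """Flatten a list of 7-code frames into one autoregressive token sequence,
    suitable for generation by a single-headed LLM."""
    return [code for frame in frames for code in frame]

def unflatten_sequence(tokens):
    """Regroup a flat token sequence into 7-code frames for the audio decoder."""
    if len(tokens) % CODES_PER_FRAME != 0:
        raise ValueError("sequence length must be a multiple of 7")
    return [
        tokens[i : i + CODES_PER_FRAME]
        for i in range(0, len(tokens), CODES_PER_FRAME)
    ]

frames = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
flat = flatten_frames(frames)
assert unflatten_sequence(flat) == frames
```

The trade-off is sequence length — seven tokens per frame instead of one — in exchange for reusing a standard single-head LLM architecture unchanged.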

Canopy Labs has released pre-trained and fine-tuned "medium" models with three billion parameters on GitHub and Hugging Face under the permissive Apache 2.0 license; it has also promised to release pre-trained and fine-tuned models in smaller one-billion, 400 million, and 150 million parameter sizes in the near future, for use on resource-constrained devices.

Gareth Halfacree
Freelance journalist, technical author, hacker, tinkerer, erstwhile sysadmin. For hire: freelance@halfacree.co.uk.