Picovoice Aims to Deliver Snappier LLM-Powered AI Voice Assistants with Orca
By beginning speech synthesis while the LLM is still generating tokens for its response, Orca eliminates awkward delays.
Lightweight machine-learning voice recognition and synthesis specialist Picovoice is aiming to make artificial intelligence (AI) systems driven by large language models (LLMs) like OpenAI's ChatGPT more natural, by giving them a voice that doesn't pause before delivering its response.
"Latency is a major drawback of LLM-based voice assistants," Picovoice explains of the problem its Orca streaming text-to-to-speech engine aims to solve. "The awkward silence when waiting for the AI agent's response defeats the use of cutting-edge genAI [generative artificial intelligence] to create humanlike interactions. The root cause is the combined delay of the LLM generating the response token-by-token and then the text-to-speech (TTS) synthesizing the audio."
Currently, the company explains, the voice assistant industry has focused on speeding up the speech synthesis stage while ignoring the far longer delay in the LLM itself, which responds to a prompt by chaining tokens into a plausible, if not always factual, response that is only then handed to the TTS engine.
Orca, by contrast, takes what Picovoice calls a "Plan Ahead, Don't Rush It" (PADRI) approach to the problem. Rather than waiting for the LLM to finish generating its response in full, Orca begins speaking during generation, meaning a near-two-second pause present in OpenAI's own text-to-speech service is cut to 0.19s.
"Orca isn't necessarily faster than OpenAI's TTS [at speech synthesis]," the company explains. "It may even be slower because OpenAI TTS runs on a data-center-grade NVIDIA GPU, while Orca TTS in [our] demo runs on a consumer-grade x86 AMD CPU. Yet, since Orca can start much earlier, it finishes reading before OpenAI TTS can even start."
More information on Orca is available on the Picovoice website.