
A Turbo Mode for AI

Mercury LLMs have taken a cue from diffusion-based models, like text-to-image generators, to speed up execution by up to 10 times.

Nick Bild
Machine Learning & AI
Diffusion large language models are faster than traditional options (📷: Inception Labs)

At the beginning of the 1980s, when the personal computer revolution was still in its infancy, Steve Jobs' analogy that computers are like bicycles for the mind may have seemed just a tad far-fetched. Pac-Man is great and all, but these early machines were extremely limited. However, the most recent artificial intelligence (AI) boom has changed the vibe completely. The latest batch of generative AI tools, in particular, has given rise to a widespread belief that Jobs’ analogy has finally started to ring true. These applications augment our natural abilities to give us a major boost in efficiency and productivity.

Large language models (LLMs) are perhaps the most used of these new tools, as they can assist with anything from research to language translation and robot control systems. But, at least when it comes to commercial-grade tools, LLMs are major resource hogs. They require massive and expensive clusters of GPUs to handle requests, so only large organizations can host them. We know that LLMs are useful, but given these realities, figuring out how to make a profit from them is still a work in progress.

Advances in optimization techniques are certainly helping, but so far they alone are not sufficient. A team at Inception Labs believes that the best path forward is not optimizations, but a complete redesign of the traditional LLM architecture. At present, these models generate their responses one token at a time, from left to right. A given token cannot be generated until the previous token has been determined, and each token is determined by evaluating a model with billions or trillions of parameters. This is why so much compute power is needed — the algorithm is just very, very heavy.
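To make that bottleneck concrete, here is a minimal sketch of the left-to-right loop in Python. The model and tokenizer objects are placeholders for illustration, not any particular library's API; the point is that every new token requires its own full forward pass through the model, and the loop cannot be parallelized across positions.

```python
# Sketch of autoregressive decoding. Assumes a callable `model` that maps a
# batch of token IDs to logits, and a `tokenizer` with encode/decode and an
# eos_token_id attribute -- both hypothetical placeholders.
import torch

def generate_autoregressive(model, tokenizer, prompt, max_new_tokens=64):
    token_ids = tokenizer.encode(prompt)           # list[int]
    for _ in range(max_new_tokens):
        inputs = torch.tensor([token_ids])         # shape: (1, seq_len)
        logits = model(inputs)                     # shape: (1, seq_len, vocab)
        next_id = int(logits[0, -1].argmax())      # greedy pick of the next token
        token_ids.append(next_id)
        if next_id == tokenizer.eos_token_id:      # stop at end of sequence
            break
    return tokenizer.decode(token_ids)
```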

To sidestep this bottleneck, the team borrowed a page from another modern generative AI tool: the text-to-image generator. These models start from a noisy initial image and iteratively denoise it until the requested image emerges. This is not done sequentially, one pixel after another; rather, the entire image is refined in each pass. Inception Labs wondered whether the same technique could be applied to tokens instead of pixels to produce a faster LLM.
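The article does not detail Mercury's exact algorithm, but a common way to frame diffusion over tokens is masked denoising: every position in the answer starts out as a mask placeholder, and a fixed number of refinement passes fills in the whole sequence in parallel, committing the most confident positions at each step. The sketch below illustrates that idea with the same hypothetical model and tokenizer placeholders as above; it is not Inception Labs' implementation.

```python
# Sketch of diffusion-style text generation via iterative masked denoising.
# Assumes `model` returns logits for every position and `tokenizer` exposes
# a mask_token_id -- hypothetical placeholders, not a real API.
import torch

def generate_diffusion(model, tokenizer, prompt, answer_len=64, num_steps=8):
    prompt_ids = tokenizer.encode(prompt)
    answer = [tokenizer.mask_token_id] * answer_len    # fully "noisy" start
    for step in range(num_steps):
        inputs = torch.tensor([prompt_ids + answer])   # (1, prompt + answer)
        logits = model(inputs)[0, len(prompt_ids):]    # (answer_len, vocab)
        conf, ids = logits.softmax(dim=-1).max(dim=-1) # per-position confidence
        # Commit a growing fraction of the most confident positions each step,
        # refining the entire answer in parallel rather than token by token.
        num_commit = answer_len * (step + 1) // num_steps
        for i in conf.topk(num_commit).indices.tolist():
            answer[i] = int(ids[i])
    return tokenizer.decode(answer)
```

Because the number of forward passes is fixed (here, eight) rather than one per generated token, the sequential cost no longer grows with the length of the answer, which is where the claimed speedup comes from.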

Their work in this area resulted in the development of the Mercury family of diffusion LLMs. At speeds of over 1,000 tokens per second on an NVIDIA H100 GPU, Mercury models are up to ten times faster than traditional LLMs.

The team’s first publicly released model is Mercury Coder, which, as you may have guessed, is tailored to code generation tasks. Against other leading LLMs, Mercury performs very favorably across a battery of benchmarks. The comparisons are all against mini versions of existing models, however, so how Mercury fares against flagship models is not yet known.

If you are looking for a faster way to run LLMs, Mercury models are available either through an API or as an on-premises deployment. More information is available from Inception Labs.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.