LLMs Learn to Phone a Friend
Co-LLM combines a general-purpose chatbot with expert models via a token-level interleaving process to produce responses that are far less prone to hallucination.
The huge amount of knowledge encoded in large language models (LLMs) makes these algorithms suitable for assisting with a wide range of tasks, from text summarization to complex reasoning and mathematics. But as we continue to expand the use cases for LLMs, we are finding that they have definite limits. In general, models that are trained on huge bodies of text tend to become a jack of all trades, but a master of none. They can wax eloquent about nearly any topic, but are prone to slipping falsehoods into the specifics.
An approach commonly used to overcome these challenges involves training a model heavily on data from a particular domain to give it expert-level knowledge in that area. These fine-tuned models are much less likely to make a mistake when asked about their area of expertise, but they are no longer general-purpose. Accordingly, they are not very useful as chatbots that users can query on a wide range of topics.
Researchers at MIT have recently developed a technique called Co-LLM that offers the best of both worlds: a general-purpose chatbot with broad knowledge, backed by subject matter experts that keep hallucinations in check. When a Co-LLM system is queried, it uses a novel switching mechanism to determine which of the available models should produce each part of the response. The result is a single, coherent answer to the user's query that is much less likely to contain errors than one generated by a general-purpose LLM alone.
The Co-LLM approach relies on a small base model that is trained to recognize when it should defer to a larger or more specialized model to produce the best response. This deferral happens at the token level, so the final result is composed of interleaved tokens from the available models, with each token chosen because it is the most likely to be correct or the most informative of the options. The researchers found that, under this scheme, the models organically learn to collaborate, much as a person learns when to call on an expert to fill in unfamiliar details.
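The deferral gate itself is a learned component of the system, but the decoding loop around it is easy to picture. The sketch below illustrates the general idea in Python; the function names (base_step, expert_step) and the random stand-ins inside them are hypothetical placeholders for real model calls, not the team's actual implementation. At each step, the base model proposes a token along with a probability that it should defer, and the expert is consulted only when that probability crosses a threshold.

```python
import random

THRESHOLD = 0.5  # deferral threshold (a tunable hyperparameter)

def base_step(context):
    """Stand-in for the small base model: returns a proposed next token
    and a learned probability that it should defer to the expert.
    (Placeholder logic -- a real system would run an LLM here.)"""
    token = random.choice(["the", "a", "patient", "dose", "."])
    defer_prob = random.random()
    return token, defer_prob

def expert_step(context):
    """Stand-in for the larger or domain-specialized expert model."""
    return random.choice(["ibuprofen", "800", "mg", "daily", "."])

def co_generate(prompt, max_tokens=20, threshold=THRESHOLD):
    """Build a response from interleaved tokens, one token at a time."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        proposal, defer_prob = base_step(tokens)
        # Hand this token to the expert only when the base model's
        # learned gate signals it is likely to get it wrong.
        token = expert_step(tokens) if defer_prob > threshold else proposal
        tokens.append(token)
        if token == ".":
            break
    return " ".join(tokens)

print(co_generate("The recommended treatment is"))
```

Because the decision is made per token, the expert only has to be invoked for the handful of positions where its knowledge actually matters, which keeps the cost well below running the large model for the entire response.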
To test their approach, the researchers paired a base model with a Meditron model, which is tailored to answering biomedical questions. Co-LLM could then answer medical questions in great detail, even spelling out the mechanisms behind a particular disease. In a similar experiment, a Llemma model was brought in to help with math questions; where the base model alone would have given incorrect answers, Llemma was able to insert the correct ones into the response.
At present, deferral to expert models is governed by a predetermined threshold value. The appropriate threshold may vary by task, so it is not yet clear how to set it in a general way. The team also noted that a poorly tuned expert model can derail text generation, introducing a cascade of errors. They are currently working to solve these problems.
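To make the tuning challenge concrete, one plausible (though entirely hypothetical) way to set the threshold for a specific task is a simple sweep over a labeled validation set, reusing the co_generate sketch from above. This illustrates why the setting is task-dependent rather than describing the team's method.

```python
# Toy threshold sweep, reusing co_generate() from the earlier sketch.
# The validation examples and scoring rule are made up for illustration;
# a real sweep would use task-specific benchmarks.
def accuracy_at(threshold, examples):
    correct = 0
    for prompt, answer in examples:
        output = co_generate(prompt, threshold=threshold)
        correct += int(answer in output)
    return correct / len(examples)

examples = [("The recommended treatment is", "ibuprofen")]
candidates = [t / 10 for t in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best = max(candidates, key=lambda t: accuracy_at(t, examples))
print(f"Best threshold for this task: {best}")
```

A threshold that works well for medical question answering may defer too often, or not often enough, on math problems, which is why a single fixed value is hard to justify in the general case.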