A Hot Take on LLM Trustworthiness
The "Thermometer" approach makes it practical to calibrate an LLM so that we know how much trust to place in its answers to our questions.
Having access to a large language model (LLM) is something like having a team of experts at your disposal, standing ready to answer your every question on virtually any topic. Sometimes these experts are wrong, yet that never stops them from speaking confidently. That is the situation we find ourselves in with today's LLMs. They can provide us with a seemingly limitless amount of information with previously unheard-of efficiency, but how do we know we can trust what they tell us? Sure, we can independently fact-check them, but spending the time to verify every point defeats the purpose of using them in the first place.
This is not a new problem in machine learning. Even prior to the rise of LLMs, many types of algorithms would go through a calibration process that could then be used to give a measure of how confident they were in their predictions. Traditionally, this would be done by compiling a labeled dataset and comparing the model's predictions with the ground-truth labels to see where its confidence went astray. That worked well enough for a model designed for a specific task, but with LLMs, which can handle many different types of tasks, collecting a labeled dataset that is large and diverse enough to be useful quickly becomes impractical.
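To make that traditional recipe concrete, here is a minimal sketch (an illustration, not any particular system's code) of post-hoc temperature scaling: a single temperature is fit on held-out labeled predictions by minimizing the negative log-likelihood, then used to rescale the model's raw scores. The function name and toy data below are assumptions for demonstration only.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Fit one global temperature T on held-out (logits, labels) by minimizing
    negative log-likelihood -- the classic post-hoc calibration recipe."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log(T) so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Toy usage: 1,000 held-out examples over 10 classes with overconfident raw scores.
val_logits = torch.randn(1000, 10) * 3.0
val_labels = torch.randint(0, 10, (1000,))
T = fit_temperature(val_logits, val_labels)
calibrated_probs = F.softmax(val_logits / T, dim=-1)
```

The catch, as noted above, is that this fit requires a labeled validation set that matches the task at hand, which is exactly what becomes impractical to assemble for a general-purpose LLM.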
To calibrate an LLM efficiently, a different way of thinking is needed. One such solution has recently been put forth in a joint effort between MIT and the MIT-IBM Watson AI Lab. The researchers have developed a method called Thermometer that can be applied to LLMs without requiring as much labeled data as other techniques. Instead of a large labeled dataset, Thermometer uses a secondary model that works in conjunction with the primary LLM to assess how much confidence one should place in a response.
Thermometer relies on the standard temperature-scaling approach to calibration, but the scaling is handled by a smaller, auxiliary model rather than by fitting the LLM itself. Using a secondary model substantially reduces computational costs, as a modern LLM can easily have many billions of parameters. The auxiliary model is trained on a smaller labeled dataset covering a handful of areas that are representative of the tasks the primary LLM was designed for. Once trained, the Thermometer model needs access to only small portions of the LLM in order to predict the right temperature for calibrating the LLM's response to a given data point.
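The sketch below illustrates the core idea as described here, under assumed shapes and layer sizes: a small auxiliary network reads features extracted from the frozen LLM and predicts a per-example temperature, which is then used to rescale the LLM's output scores. It is not the authors' implementation; `TemperaturePredictor`, `hidden_dim`, and the toy tensors are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperaturePredictor(nn.Module):
    """Illustrative auxiliary network: maps a hidden representation taken from
    the frozen LLM to a single positive temperature for that example."""
    def __init__(self, hidden_dim, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, width),
            nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, features):
        # softplus keeps the predicted temperature strictly positive
        return F.softplus(self.net(features)) + 1e-3

# Usage sketch: the LLM itself is never updated; only the small predictor is trained.
# features: per-example hidden states pulled from the frozen LLM (shape [batch, hidden_dim])
# logits:   the LLM's raw scores for, e.g., a multiple-choice task (shape [batch, num_choices])
predictor = TemperaturePredictor(hidden_dim=4096)
features = torch.randn(8, 4096)
logits = torch.randn(8, 4)
temps = predictor(features)                     # one temperature per example
calibrated = F.softmax(logits / temps, dim=-1)  # confidence after scaling
```

Because only this small network is trained, the cost of calibration stays tiny relative to the LLM, and the LLM's own weights, and therefore its answers, are left untouched.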
Experimentation showed that this approach was very efficient. Calibration resulted in only a 0.5 percent reduction in model execution speed. And crucially, since Thermometer does not alter the primary model, it does not cause any decrease in performance.
It was also demonstrated that Thermometer did a good job of assessing an LLM's level of certainty in its responses. When compared with existing methods, Thermometer produced better-calibrated uncertainty measures, all while requiring far fewer computational resources.
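"Better calibrated" is typically quantified in this literature with a metric such as expected calibration error (ECE), which compares a model's stated confidence with how often it is actually right. The generic function below shows one common way to compute it; it is included only to make the comparison concrete, is not the team's evaluation code, and uses made-up toy numbers.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence and compare
    each bin's average confidence with its empirical accuracy."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of examples
    return ece

# Toy usage: confidence assigned to each chosen answer, and whether it was right.
conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7])
hit = np.array([1, 1, 0, 1, 0], dtype=float)
print(expected_calibration_error(conf, hit))
```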
Thermometer does still rely on some amount of labeled data, and it cannot generalize to tasks that are far outside the areas it was trained on, so it is not a perfect solution. But looking ahead, the team plans to better quantify how much data is needed, and how diverse it must be, to create a Thermometer model that can generalize to new tasks.