Google's DeepMind Aims to Fix LLMs' Lying Ways — with the FACTS Grounding Benchmark
A percentage score of how factual an LLM's responses are will, the company hopes, help address one of the technology's biggest pitfalls.
Google DeepMind, the company's artificial intelligence (AI) arm, has announced a benchmark through which it hopes to improve the factuality of responses from large language models (LLMs): FACTS Grounding.
"Large language models (LLMs) are transforming how we access information, yet their grip on factual accuracy remains imperfect," the DeepMind team admits. "They can 'hallucinate' false information, particularly when given complex inputs. In turn, this can erode trust in LLMs and limit their applications in the real world. Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations."
Large language models, which get larger with every new release, are enjoying considerable focus at present. Trained on giant datasets of often-copyrighted material, they work on a token-prediction system to respond to natural-language prompts — or imagery or sound, in the case of multi-modal models — with the most likely answer. The only problem: what comes back is an answer-shaped object, not an actual answer. An LLM has no understanding of either the prompt or its own response, acting more like an extremely convoluted autocomplete engine than anything approaching true artificial intelligence.
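To make that "autocomplete engine" point concrete, here's a deliberately tiny sketch in Python: a bigram "model" that picks the statistically most likely next word, with no notion of what any word means. It's a toy stand-in for the token-prediction loop a real LLM runs at vastly larger scale — the corpus and function names are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Toy training corpus -- real LLMs train on billions of documents.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which: a bigram frequency table.
# A real LLM learns billions of weights instead, but the job is the
# same -- score candidate next tokens given what came before.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent next word -- no understanding
    involved, just statistics over the training text."""
    candidates = follows[word]
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

# "Autocomplete" a short continuation, one token at a time.
text = ["the"]
for _ in range(4):
    text.append(predict_next(text[-1]))
print(" ".join(text))  # e.g. "the cat sat on the"
```

The output looks plausible because the statistics are plausible, not because the program knows anything about cats or mats — which is exactly the gap between an answer-shaped object and an answer.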
It's a trick, then, but a convincing one: LLM technology is taking the world by storm, and is being used for everything from summarizing web searches to providing a natural-language control system for robots. With no understanding at the center, though, the result is that an LLM-backed system will always provide what appears to be a valid answer — but which is often inaccurate and occasionally entirely fictitious.
It's here DeepMind delivers FACTS Grounding, an effort to measure how factual responses from an LLM are. Based on nearly 2,000 examples crafted to require a long-form response, in which the target LLM is told to reference a bundled document and respond to a user query, the benchmark delivers a percentage score — though its critics may wonder why Google's own Gemini models appear in the top three slots of the initial leaderboard, comfortably above models from rivals OpenAI and Anthropic.
To head off complaints of bias, DeepMind is making a "public set" of 860 benchmark documents and prompts available to all — but is keeping a set of 859 documents and prompts private. "We know that issues of benchmark contamination and leaderboard hacking are important to protect against, so following standard industry practice, we are keeping the private evaluation set held out," the team explains. "The FACTS leaderboard scores are the average performance across both public and private sets."
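Taken together, the scoring the team describes reduces to something like the sketch below: each example pairs a source document with a user query, each long-form response is judged as grounded in that document or not, and the leaderboard figure is the average of the public-set and private-set percentages. Note the `answer` and `is_grounded` callables are hypothetical stand-ins — the article doesn't specify DeepMind's actual judging machinery, so this is only the arithmetic, not the method.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    document: str  # bundled source material the model must stick to
    query: str     # user request crafted to require a long-form answer

def score_set(examples: list[Example],
              answer: Callable[[str, str], str],
              is_grounded: Callable[[str, str], bool]) -> float:
    """Percentage of responses judged grounded in their documents."""
    grounded = sum(
        is_grounded(ex.document, answer(ex.document, ex.query))
        for ex in examples
    )
    return 100.0 * grounded / len(examples)

def leaderboard_score(public_pct: float, private_pct: float) -> float:
    """Per the quote above: the average across public and private sets."""
    return (public_pct + private_pct) / 2

# Trivial stand-ins for the model and the judge, for illustration only:
examples = [Example("Paris is the capital of France.",
                    "What is France's capital?")]
pct = score_set(
    examples,
    answer=lambda doc, q: "France's capital is Paris.",
    is_grounded=lambda doc, resp: "Paris" in resp,  # toy judge
)
print(pct)  # 100.0
```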
The FACTS Grounding public examples are available on Kaggle now, under the permissive Apache 2.0 license; the leaderboard is available on a separate page alongside starter code and a technical report into the benchmark.