MLCommons Launches a "First-of-its-Kind" Benchmark to Measure and Monitor LLM AI Safety
Building on April's proof-of-concept release, AILuminate scores large language models' responses across 12 hazard categories.
Engineering consortium MLCommons has announced what it describes as a "first-of-its-kind" benchmark designed to measure the safety, rather than the performance, of large language models (LLMs): AILuminate.
"Companies are increasingly incorporating AI [Artificial Intelligence] into their products, but they have no standardized way of evaluating product safety," explains MLCommons president and founder Peter Mattson of the problem the consortium aims to solve. "Just like other complex technologies like cars or planes, AI models require industry-standard testing to guide responsible development. We hope this benchmark will assist developers in improving the safety of their systems, and will give companies better clarity about the safety of the systems they use."
MLCommons' AILuminate benchmark is designed, the group claims, to assess LLM responses to over 24,000 pre-written test prompts (12,000 of which are made public for model creators to use as practice inputs, while the other 12,000 are kept private and used for the actual testing) across 12 hazard categories, including violent crimes, the creation of indiscriminate weapons, and child sexual exploitation. The LLMs' responses are then graded by a separate evaluator model, producing "safety grades" for each hazard category.
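To make that grading flow concrete, the following is a minimal, hypothetical Python sketch of such a pipeline. The hazard labels, function names, and pass/fail scoring are illustrative assumptions for this article, not the actual AILuminate implementation (which is published on GitHub); the real benchmark's evaluator and aggregation are more sophisticated.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Prompt:
    hazard: str  # hypothetical category label, e.g. "violent_crimes"
    text: str


def query_system_under_test(prompt_text: str) -> str:
    # Placeholder for a call to the LLM being benchmarked.
    return "I can't help with that request."


def evaluator_judges_safe(prompt: Prompt, response: str) -> bool:
    # Placeholder for the separate evaluator model that grades each
    # response; a real safety evaluator is itself a trained model,
    # not a string match.
    return "can't help" in response


def safety_grades(prompts: list[Prompt]) -> dict[str, float]:
    """Return the fraction of responses judged safe, per hazard category."""
    safe: defaultdict[str, int] = defaultdict(int)
    total: defaultdict[str, int] = defaultdict(int)
    for p in prompts:
        response = query_system_under_test(p.text)
        safe[p.hazard] += int(evaluator_judges_safe(p, response))
        total[p.hazard] += 1
    return {hazard: safe[hazard] / total[hazard] for hazard in total}


if __name__ == "__main__":
    # Illustrative stand-ins for the benchmark's private test prompts.
    prompts = [
        Prompt("violent_crimes", "..."),
        Prompt("indiscriminate_weapons", "..."),
    ]
    for hazard, grade in safety_grades(prompts).items():
        print(f"{hazard}: {grade:.0%} safe responses")
```

Keeping half the prompt set private, as AILuminate does, matters for a design like this: a model tuned only against the published practice prompts cannot simply memorize its way to a good grade on the held-out test set.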
"With roots in well-respected research institutions, an open and transparent process, and buy-in across the industry, MLCommons is uniquely equipped to advance a global baseline on AI risk and reliability," claims MLCommons' executive director Rebecca Weiss. "We are proud to release our v1.0 benchmark, which marks a major milestone in our work to build a harmonized approach to safer AI. By making AI more transparent, more reliable, and more trusted, we can ensure its positive use in society and move the industry forward."
AILuminate builds on a proof-of-concept benchmark, then known simply as the AI Safety benchmark, which MLCommons released back in April. It comes around a month after researchers at the University of Pennsylvania's School of Engineering and Applied Science warned of the risks of tying large language model technology to real-world physical robots, demonstrating how guardrails against malicious behavior (such as instructing a robot to transport and detonate a bomb in the most crowded area it can find) are easily bypassed.
More information on the benchmark is available on the MLCommons website, along with results from a range of popular LLMs including Anthropic's Claude 3.5 Haiku and Sonnet, which scored well, and the Allen AI OLMo 7b open model, which was the only one marked as "poor." The benchmark has also been released on GitHub, under the Apache 2.0 license.