Poster in Workshop: AI4MAT-ICLR-2025: AI for Accelerated Materials Design
Lifting the benchmark iceberg with item-response theory
Mara Schilling-Wilhelmi · Nawaf Alampara · Kevin Maik Jablonka
Keywords: [ Evaluation ] [ LLM ] [ Benchmark ] [ IRT ]
The evaluation of large language models (LLMs) through benchmarks has become a cornerstone of AI development, guiding critical decisions about model deployment and research directions. However, as benchmarks evolve from narrow task-specific assessments to broad capability evaluations, they become more difficult to develop, understand, and analyze. Here, we report a "benchmark iceberg" phenomenon, in which much of the variability in model rankings stems not from true capability differences but from hidden implementation choices beneath the surface of reported scores. Our analysis demonstrates how minor changes to these implementation details can alter model rankings, a concerning finding given benchmarks' role in shaping the AI landscape. To address this, we leverage psychometric principles from educational testing. By adapting item response theory (IRT), we transform benchmarks from opaque leaderboards into transparent measurement instruments, revealing how hidden implementation choices currently distort our perception of model capabilities.
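For context, IRT models the probability that a test-taker (here, an LLM) answers a given benchmark item correctly as a function of a latent ability parameter and per-item parameters. A common formulation is the two-parameter logistic model sketched below; the poster itself may use a different IRT variant, so the symbols here are illustrative rather than taken from the work:

$$
P(x_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}
$$

where $\theta_i$ is the latent ability of model $i$, $b_j$ is the difficulty of benchmark item $j$, and $a_j$ is that item's discrimination, i.e., how sharply it separates weaker from stronger models. Fitting such a model to the matrix of per-item correctness scores yields item-level diagnostics rather than a single aggregate leaderboard score.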