

Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

Model Evaluations Need Rigorous and Transparent Human Baselines

Kevin Wei · Patricia Paskov · Sunishchal Dev · Michael Byun · Anka Reuel · Xavier Roberts-Gaal · Rachel Calcott · Evie Coxon · Chinmay Deshpande


Abstract:

This position paper argues that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework for assessing human baselining methods. We then use our framework to systematically review 113 human baselines in foundation model evaluations, identifying shortcomings in existing baselining methods. We publish our framework as a reporting checklist for researchers conducting human baseline studies. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: [GITHUB LINK].
