Teach2Eval: An Interaction-Driven LLM Evaluation Method via Teaching Effectiveness
Abstract
Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Evaluating LLMs with static, task-specific benchmarks is increasingly fragile due to data contamination and benchmark saturation, and it fails to capture interactive reasoning. We introduce Teach2Eval, which reframes evaluation as teaching: a candidate model guides weaker student models, and the students’ performance gains constitute the candidate’s score. This interaction is robust to contamination and exposes abilities orthogonal to direct answering, with fine-grained metrics across four dimensions: Application, Judgment, Guidance, and Reflection. The framework scales automatically by exploiting the natural error distributions of weak students, requiring neither bespoke rubrics nor human graders. Across 30 LLMs and 60 datasets, Teach2Eval achieves a Spearman correlation above 0.95 with human-preference leaderboards (e.g., Chatbot Arena and LiveBench), surpassing direct-evaluation baselines, while offering actionable training signals, such as capability hierarchies and early signs of overfitting, at low cost.
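
To make the teaching-as-evaluation loop concrete, the sketch below (Python) implements one plausible gain-based protocol consistent with the abstract: each weak student attempts a task, the candidate critiques the attempt, the student retries with that guidance, and the candidate is scored by the average accuracy improvement it induces. The names `Model`, `Task`, `is_correct`, and `teach2eval_score` are illustrative assumptions rather than the paper's actual interface, and the four fine-grained dimensions are collapsed into a single gain for brevity.

```python
"""Minimal sketch of a Teach2Eval-style scoring loop (illustrative, not the
paper's reference implementation)."""
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    question: str
    reference: str  # gold answer, used only for the correctness check


class Model:
    """Placeholder for an LLM wrapper; replace the stubs with real API calls."""

    def __init__(self, name: str):
        self.name = name

    def answer(self, question: str, hint: str = "") -> str:
        raise NotImplementedError  # query the underlying LLM here

    def teach(self, question: str, student_answer: str) -> str:
        """Return guidance/feedback on the student's attempt."""
        raise NotImplementedError  # query the underlying LLM here


def is_correct(prediction: str, reference: str) -> bool:
    """Task-specific correctness check (exact match as a stand-in)."""
    return prediction.strip().lower() == reference.strip().lower()


def teach2eval_score(candidate: Model, students: List[Model], tasks: List[Task]) -> float:
    """Score the candidate by the mean accuracy gain it induces in weak students."""
    gains = []
    for student in students:
        pre_hits, post_hits = 0, 0
        for task in tasks:
            # 1) Student attempts the task alone (pre-teaching baseline).
            first_try = student.answer(task.question)
            pre_hits += is_correct(first_try, task.reference)

            # 2) Candidate inspects the attempt and produces guidance.
            guidance = candidate.teach(task.question, first_try)

            # 3) Student retries with the candidate's guidance (post-teaching).
            second_try = student.answer(task.question, hint=guidance)
            post_hits += is_correct(second_try, task.reference)

        # Gain for this student: accuracy improvement attributable to teaching.
        gains.append((post_hits - pre_hits) / len(tasks))

    # Candidate's score: average gain across all weak students.
    return sum(gains) / len(gains)
```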