LiveResearchBench: Benchmarking Single- and Multi-Agent Systems for Citation-Grounded Deep Research
Abstract
Deep research---producing comprehensive, citation-backed reports by searching across hundreds of live websites---marks an important frontier for agentic systems. Rigorous evaluation of this ability rests on three principles: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, and (3) unambiguous, ensuring consistent interpretation across users. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we present DeepEval, a comprehensive suite covering both content- and report-level quality: checklists for coverage and presentation, rubric-tree assessments of citation accuracy and traceability, and metrics for consistency and depth of analysis. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and the key system components needed to advance reliable, insightful deep research.