Turing: A Framework for Evaluating Agents on Stateful, Multi-Step Real-World Workflows
Abstract
Agent benchmarks are evolving from static prediction tasks to multi-step tasks requiring tool use, yet most evaluations remain short-horizon, loosely coupled, and state-agnostic. These settings fail to capture the properties that determine reliability in real workflows: persistent state, dense relational structure, role-based access control, and policy-constrained execution. This talk introduces an evaluation framework built on three components: workflow-grounded task definitions, scenario construction derived from operational traces, and a stateful execution environment with persistent databases and verifiable outcomes. Agents are evaluated on tasks that require multi-step planning over interconnected subsystems, exposing failure modes that do not appear in short-horizon benchmarks. Scoring combines automated outcome verification with rubrics that assess domain reasoning quality, including constraint prioritization and decision-making under ambiguity. We show that performance degrades predictably with increasing horizon length and constraint density. The primary bottleneck for frontier models is strategic planning under constraint, not tool invocation accuracy. Structural failures surface reliably through outcome checks, while subtler breakdowns in judgment require reasoning-quality assessment. These results suggest that deployable autonomy requires evaluation frameworks that integrate realistic workflows, operational constraints, stateful simulation, and scoring that captures both outcome correctness and the quality of domain-grounded reasoning. The talk concludes with the design principles, empirical findings, and open research challenges that define this next generation of agent evaluation.