Scale AI: New Frontier of AI: Eval, RL, and What's Next
Abstract
As large language models (LLMs) evolve from short-burst chatbots into long-horizon autonomous agents, progress is increasingly bottlenecked by verification asymmetry: rapid gains in domains with cheap correctness signals (e.g., math and code) contrast sharply with limited progress in tasks with weak or delayed verification, such as research planning and strategic decision-making. This talk argues that evaluation and reinforcement learning (RL) beyond easily verifiable domains are the next critical frontier for AI capability. We present results from three new evaluation frameworks. Humanity’s Last Exam (HLE) shows that frontier models are frequently wrong and overconfident at the human-expert level. The Remote Labor Index (RLI) demonstrates that current agents automate only ~2.5% of real, paid freelance work. Visual ToolBench reveals that 70–80% of multimodal agent failures stem from visual perception rather than reasoning. To close these gaps, we introduce Rubrics as Rewards (RaR) within a Group Relative Policy Optimization (GRPO) framework. We show that Dynamic Rubrics, which adaptively elicit evaluation criteria by contrasting model outputs during training, outperform static human-written rubrics and reduce reward hacking in the high-reward regime. These findings motivate a shift from static benchmarks to high-fidelity RL environments, such as Scale Gymnasium, that train agents through interaction rather than imitation.
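As a rough illustration of the Rubrics as Rewards idea (a sketch of the general pattern, not the talk's actual implementation), a rubric can be treated as a weighted checklist whose score stands in for a verifier reward, with GRPO computing advantages relative to the sampled group rather than via a learned critic. The rubric criteria, weights, and keyword-matching judge below are all hypothetical placeholders for an LLM-based grader:

```python
# Hypothetical sketch: rubric-scored, group-relative advantages (GRPO-style).
# Criteria names, weights, and the substring "judge" are illustrative only.
from statistics import mean, pstdev

def rubric_reward(response: str, rubric: dict[str, float]) -> float:
    """Score a response as the weighted sum of rubric criteria it satisfies.
    Substring matching is a toy stand-in for an LLM judge per criterion."""
    return sum(w for criterion, w in rubric.items() if criterion in response)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's core trick: z-score each reward against its own sampled group,
    so no separate value network is needed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Toy rubric and a group of sampled responses for one prompt.
rubric = {"cites evidence": 0.5, "states limitations": 0.3, "gives answer": 0.2}
responses = [
    "gives answer and cites evidence",
    "gives answer",
    "off-topic rambling",
]
rewards = [rubric_reward(r, rubric) for r in responses]   # [0.7, 0.2, 0.0]
advantages = group_relative_advantages(rewards)           # zero-mean per group
```

The group-relative normalization is what lets a soft, rubric-derived score slot into the same policy-gradient update that binary math/code verifiers use: only the ranking within each sampled group matters.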