Group Verification-based Policy Optimization for Interactive Coding Agents
Abstract
Recent advancements in reinforcement learning from verifiable rewards (RLVR), particularly through Group Relative Policy Optimization (GRPO), have significantly improved the capabilities of large language models (LLMs) as interactive coding agents. However, these methods overlook process-verifiable environment feedback (e.g., code execution failures), leading to inaccurate advantage estimation at each reasoning step and inefficient learning. To address this issue, we propose Group Verification-based Policy Optimization (GVPO), a novel RL algorithm that introduces an advantage-shaping framework integrating both outcome-verifiable and process-verifiable signals. While outcome-verifiable rewards ensure alignment with long-term task objectives, process-verifiable feedback derived from intermediate execution traces (e.g., syntax errors, runtime exceptions) serves as corrective shaping terms at the step level. By jointly leveraging these two forms of verifiability, GVPO achieves more accurate credit assignment, balancing short-term process guidance with long-term outcome alignment. This unified formulation yields more stable optimization, faster convergence, and stronger generalization in complex interactive environments. A 32B-parameter agent trained with GVPO in the AppWorld environment outperforms OpenAI's o1 agent by 12.6\% on the more challenging Test-C split and surpasses the strongest 32B RL-trained baseline by 3.6\%.
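To make the advantage-shaping idea above concrete, the sketch below combines a GRPO-style group-normalized outcome advantage with a weighted step-level penalty derived from execution feedback. The function name, the linear combination, and the weight `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gvpo_advantages(outcome_rewards, step_penalties, beta=0.1, eps=1e-8):
    """Illustrative GVPO-style advantage shaping (assumed form, not the
    paper's exact definition).

    outcome_rewards: length-G sequence -- one verifiable outcome reward per
                     rollout in a group of G rollouts for the same task.
    step_penalties:  list of G arrays -- per-step process-verifiable penalties
                     (e.g., 1.0 if that step's code raised a syntax or runtime
                     error, else 0.0) for each rollout.
    beta:            weight of the process-level shaping term (assumed).
    Returns:         list of G arrays of per-step shaped advantages.
    """
    r = np.asarray(outcome_rewards, dtype=float)
    # GRPO-style group-normalized outcome advantage, shared by every step
    # of a rollout (long-term outcome alignment).
    outcome_adv = (r - r.mean()) / (r.std() + eps)

    shaped = []
    for g in range(len(r)):
        p = np.asarray(step_penalties[g], dtype=float)
        # Step-level corrective term from execution feedback
        # (short-term process guidance).
        shaped.append(outcome_adv[g] - beta * p)
    return shaped

# Example: a group of 3 rollouts; rollout 0 completes the task, the others
# fail, and rollout 2 also hit a runtime error at its second step.
advantages = gvpo_advantages(
    outcome_rewards=[1.0, 0.0, 0.0],
    step_penalties=[[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]],
)
print(advantages)
```

Under these assumptions, steps whose executed code failed receive a strictly lower advantage than other steps of the same rollout, while the group-normalized outcome term preserves the relative ranking of rollouts by final task success.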