Poster Thu, Apr 23, 2026 • 11:15 AM – 1:45 PM PDT

Squeeze the Soaked Sponge: Efficient Off-policy RFT for Large Language Model

Jing Liang · Hongyao Tang · Yi Ma · Jinyi Liu · YAN ZHENG · Shuyue Hu · LEI BAI · Jianye HAO

Abstract

Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs), yet most existing Reinforcement Finetuning (RFT) methods are inherently \textit{on-policy} RL, failing to reuse historical data and thus preventing efficient scaling. In this work, we explore the potential of \textit{off-policy} RL to leverage historical data for rollout-efficient RFT. Specifically, we propose \textbf{Re}incarnating \textbf{Mix}-policy Proximal Policy Gradient (\textbf{ReMix}), which enables on-policy RFT methods to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio that utilizes the data from both current and past policies for efficient training; (2) KL-Convex policy constraint that combines the KL constraints on the base and precedent model to balance stability and flexibility; (3) Policy reincarnation that replaces the base model with the mix-policy RFT model in the mid way of training and restarts on-policy training, to achieve a seamless transition from early efficiency to steady convergence. In our experiments, we train a series of ReMix models based on PPO, GRPO from 1.5B, 7B base models. On five math reasoning benchmarks (i.e., AIME'24, AMC'23, Minerva, OlympiadBench, and MATH500), ReMix achieves an average Pass@1 accuracy of \textbf{52.10\%} (with \textbf{0.079M rollouts}) and \textbf{64.39\%} (with \textbf{0.011M rollouts}) on 1.5B and 7B models, respectively. Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over \textbf{30x to 450x reduction in training cost in terms of rollout data volume}, demonstrating superior training efficiency. Additionally, our multifaceted analysis reveals insightful findings, including the implicit preference for shorter responses of off-policy RFT, the collapse mode of self-reflection under severe off-policyness, etc.