Poster Fri, Apr 24, 2026 • 11:15 AM – 1:45 PM PDT Pavilion 4 P4-#4715

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Yongliang Wu ⋅ Yizhou Zhou ⋅ Ziheng Zhou ⋅ Yingzhe Peng ⋅ Xinyu Ye ⋅ Xinting Hu ⋅ Wenbo Zhu ⋅ Lu Qi ⋅ Ming-Hsuan Yang ⋅ Xu Yang

Abstract

In this work, we present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model compared to RL. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes the gradient update for each token by dynamically rescaling the objective function with the probability of that token. With just a single-line code change, the method outperforms standard SFT on multiple difficult benchmarks and base models, from math reasoning to code generation and multi-modal tasks, demonstrating improved generalization. Additionally, DFT achieves competitive results in offline RL settings, providing an effective yet streamlined alternative. By bridging theoretical insights with practical solutions, this work advances the state of SFT. The source code will be available at https://github.com/yongliang-wu/DFT.
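
The per-token rescaling described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names are hypothetical, and treating the token probability as a constant weight (no gradient flowing through it) is an assumption that the abstract does not state explicitly. The sketch uses the standard fact that the gradient of the cross-entropy loss -log p[target] with respect to the logits is softmax(logits) minus the one-hot target vector.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_grad(logits, target):
    """Gradient of the SFT token loss -log p[target] w.r.t. the logits."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]


def dft_token_grad(logits, target):
    """SFT gradient rescaled by the target token's probability.

    Assumption: the weight p[target] is treated as a constant,
    so no gradient flows through it.
    """
    weight = softmax(logits)[target]
    return [weight * g for g in sft_token_grad(logits, target)]


def token_losses(logits, target):
    """Return (SFT loss, probability-rescaled loss) for a single token."""
    p_target = softmax(logits)[target]
    sft = -math.log(p_target)
    return sft, p_target * sft


# Two equally likely tokens: the target has probability 0.5,
# so the rescaled gradient is exactly half the SFT gradient.
print(sft_token_grad([0.0, 0.0], 0))  # [-0.5, 0.5]
print(dft_token_grad([0.0, 0.0], 0))  # [-0.25, 0.25]
```

Under these assumptions, a target token that the model currently considers unlikely receives a proportionally smaller update than it would under standard SFT, while a token with probability close to 1 is updated almost as in SFT.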
