Poster

Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Zhiheng Xi · Xin Guo · Yang Nan · Enyu Zhou · Junrui Shen · Wenxiang Chen · Jiaqi Liu · Jixuan Huang · Xun Deng · Zhihao Zhang · Honglin Guo · Zhikai Lei · Miao Zheng · Guoteng Wang · Peng Sun · Rui Zheng · Hang Yan · Tao Gui · Qi Zhang · Xuanjing Huang

Abstract
