Poster Thu, Apr 23, 2026 • 11:15 AM – 1:45 PM PDT Pavilion 4 P4-#5113

Boosting Multi-Domain Reasoning of LLMs via Curvature-Guided Policy Optimization

Xize Liang ⋅ Lin Yang ⋅ Jie Wang ⋅ Rui Liu ⋅ Yang Lu ⋅ Jinliang Zeng ⋅ Hanzhu Chen ⋅ Dong Li ⋅ Jianye Hao

Abstract

Multi-domain reinforcement learning (RL) for large language models (LLMs) involves highly intricate reward surfaces, which makes it difficult to find parameters that excel across all domains. Recent empirical studies have further highlighted conflicts among domains, where gains in one capability often come at the expense of another. However, approaches that mitigate such conflicts and enhance multi-domain reasoning remain largely underexplored. To address this challenge, we propose Curvature-Guided Policy Optimization (CGPO), a principled and scalable training framework for advancing the multi-domain reasoning of LLMs. Inspired by Newton's method, CGPO exploits the geometric structure of the reward surface while sidestepping the prohibitive cost of Hessian computation. At each update, CGPO processes domains in random order, preconditioning their gradients with curvature information from other domains to foster richer cross-domain interactions. This mechanism further promotes implicit gradient alignment by maximizing inter-domain inner products in expectation, steering the parameters toward regions that jointly enhance multi-domain performance. Extensive experiments on a mixed dataset covering math, coding, science, and creative writing, evaluated across seven widely used benchmarks, show that CGPO significantly outperforms all baselines, with faster reward improvement and stronger multi-domain capability.
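
The abstract does not spell out the CGPO update rule, so the following is only a generic illustration and not the authors' algorithm. It is a minimal sketch of a well-known effect: taking one gradient step per domain, in random order, introduces a Hessian-vector term from one domain acting on another domain's gradient, without ever forming a Hessian explicitly. Everything here is an assumption made for illustration: the toy quadratic losses (written as minimization rather than reward maximization), the helper names `make_domain`, `grad`, and `sequential_update`, and the step size `lr`.

```python
import numpy as np

def make_domain(rng, dim):
    """Hypothetical toy domain: loss 0.5 * (x - c)^T A (x - c) with SPD Hessian A."""
    m = rng.normal(size=(dim, dim))
    a = m @ m.T + dim * np.eye(dim)  # symmetric positive definite
    c = rng.normal(size=dim)
    return a, c

def grad(domain, theta):
    """Gradient of the toy quadratic loss at theta."""
    a, c = domain
    return a @ (theta - c)

def sequential_update(domains, theta, lr, order):
    """Take one plain gradient step per domain, in the given order."""
    for i in order:
        theta = theta - lr * grad(domains[i], theta)
    return theta

rng = np.random.default_rng(0)
dim = 4
domains = [make_domain(rng, dim) for _ in range(2)]
theta0 = rng.normal(size=dim)
lr = 1e-2

# Random domain order, as in the abstract's description.
order = rng.permutation(len(domains))
theta1 = sequential_update(domains, theta0, lr, order)

# Closed form for two quadratic domains: the second step sees a gradient
# shifted by the second domain's Hessian applied to the first domain's gradient.
first, second = order
g_first = grad(domains[first], theta0)
g_second = grad(domains[second], theta0)
hessian_second = domains[second][0]
predicted = theta0 - lr * (g_first + g_second) + lr**2 * hessian_second @ g_first
```

For these quadratics, `theta1` and `predicted` agree exactly; for general losses the same expression holds only to second order in `lr`. Averaged over the two possible orders, the extra term is `lr**2 / 2` times the gradient of the inner product between the two domain gradients, which is one standard way that randomly ordered sequential steps can encourage gradient alignment in expectation. Whether CGPO uses this exact construction is not stated in the abstract.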
