OC-PRM: Overcredit-Contrastive Training for Precision-First Process Reward Models
Aakriti Agrawal ⋅ Souradip Chakraborty ⋅ Armin Saghafian ⋅ Nihal Sharma ⋅ Rizal Fathony ⋅ Nam Nguyen ⋅ C. Bruss ⋅ Amrit Bedi ⋅ Furong Huang
Abstract
Process reward models (PRMs) offer step-level supervision for reasoning LLMs, but in practice they often \emph{overcredit} incorrect steps, inducing high false positive rates that distort decoding and compound over long chains. We show analytically that in Best-of-$N$ selection, false positives impose an asymptotic alignment ceiling (set by the PRM's precision), whereas false negatives primarily increase sample complexity and slow convergence. Motivated by this asymmetry, we introduce a label-efficient training recipe that requires no new human annotation: we convert existing step labels into matched positive-negative comparisons, optimize a novel \emph{Overcredit Contrastive (OC)} objective, and rebalance supervision using lightweight negative augmentation and a simple difficulty curriculum. On PRMBench~\citep{song2025prmbench}, our method sharply reduces false positives and improves macro F1 over strong discriminative and generative PRMs. When deployed for guided beam search and Best-of-$N$ selection, the resulting PRMs yield higher downstream task accuracy and improved robustness. Overall, our results suggest that comparison-centered training with balanced step data provides a practical path to trustworthy process supervision without additional human labels.
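The asymmetry between false positives and false negatives in Best-of-$N$ selection can be illustrated with a toy model. The sketch below is not the paper's analysis; it assumes a simplified setting in which each of the $N$ candidates is independently correct with probability `p_correct`, the PRM acts as a binary verifier with true positive rate `tpr` and false positive rate `fpr`, and the selector picks uniformly among accepted candidates (or uniformly among all candidates if none is accepted). The function name and all parameters are hypothetical and chosen for illustration only.

```python
def best_of_n_accuracy(p_correct: float, tpr: float, fpr: float, n: int) -> float:
    """Closed-form accuracy of Best-of-N under the toy model described above.

    Assumptions (not taken from the paper): candidates are i.i.d., the PRM
    gives a binary accept/reject verdict, and ties are broken uniformly.
    """
    # Probability that a single candidate is accepted by the verifier.
    accept = p_correct * tpr + (1.0 - p_correct) * fpr
    if accept == 0.0:
        # Nothing is ever accepted, so selection is uniform over all candidates.
        return p_correct

    # Precision: probability that an accepted candidate is actually correct.
    precision = p_correct * tpr / accept

    # Probability that none of the n candidates is accepted.
    none_accepted = (1.0 - accept) ** n

    # If every candidate was rejected, we pick one at random; it is correct
    # with probability P(correct | rejected).
    if accept < 1.0:
        fallback = p_correct * (1.0 - tpr) / (1.0 - accept)
    else:
        fallback = 0.0  # unreachable for n >= 1 because none_accepted == 0

    return (1.0 - none_accepted) * precision + none_accepted * fallback


if __name__ == "__main__":
    # Compare a verifier with false positives against one with only false negatives.
    for n in (1, 4, 16, 64, 256):
        with_fp = best_of_n_accuracy(p_correct=0.3, tpr=0.9, fpr=0.2, n=n)
        only_fn = best_of_n_accuracy(p_correct=0.3, tpr=0.3, fpr=0.0, n=n)
        print(f"N={n:4d}  fpr=0.2: {with_fp:.4f}   fpr=0, low tpr: {only_fn:.4f}")
```

In this toy model the accuracy converges to the precision $\frac{p \cdot \mathrm{TPR}}{p \cdot \mathrm{TPR} + (1-p) \cdot \mathrm{FPR}}$ as $N$ grows, so any nonzero false positive rate caps the achievable accuracy below 1. With `fpr = 0`, a low `tpr` only makes the term `none_accepted` decay more slowly, which means more samples are needed to approach the same limit.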