VisualPRM400K: An Effective Dataset for Training Multimodal Process Reward Models
Abstract
We construct VisualPRM400K, a dataset comprising approximately 400K multimodal process-supervision samples. Building upon this dataset, we develop VisualPRM, an advanced multimodal Process Reward Model (PRM) capable of estimating the value score of each step in the reasoning process. Under the Best-of-N (BoN) evaluation setting, our model improves the reasoning performance of multimodal large language models (MLLMs) across three model families and four model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that the PRM trained on VisualPRM400K outperforms Outcome Reward Models and Self-Consistency under BoN evaluation. To further facilitate the development of multimodal PRMs, we construct VisualProcessBench, a benchmark designed to measure the ability of PRMs and MLLMs to detect incorrect steps in multimodal reasoning tasks. We hope that our work inspires further research and contributes to the development of MLLMs. Our model, data, and benchmark will be released.
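To make the BoN setting concrete, the sketch below shows one common way a PRM can rank candidate responses: score each reasoning step, aggregate the step scores (here by a simple mean), and keep the highest-scoring candidate. The scoring interface (`prm_step_scores`) and the mean aggregation are illustrative assumptions, not the exact formulation used by VisualPRM.

```python
# Minimal sketch of Best-of-N (BoN) selection with a process reward model.
# `prm_step_scores` is a hypothetical callable mapping a list of reasoning
# steps to per-step scores; the mean-score aggregation is an assumption.
from typing import Callable, List, Sequence


def best_of_n(
    candidates: List[List[str]],  # N candidate responses, each a list of reasoning steps
    prm_step_scores: Callable[[Sequence[str]], List[float]],  # steps -> per-step scores
) -> List[str]:
    """Return the candidate response with the highest aggregated step score."""

    def aggregate(steps: Sequence[str]) -> float:
        scores = prm_step_scores(steps)
        return sum(scores) / len(scores) if scores else 0.0

    return max(candidates, key=aggregate)
```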