Poster Fri, Apr 24, 2026 • 11:15 AM – 1:45 PM PDT Pavilion 4 P4-#3501

Vision-SR1: Self-Rewarding Vision-Language Model via Reasoning Decomposition and Multi-Reward Policy Optimization

Zongxia Li ⋅ Wenhao Yu ⋅ Chengsong Huang ⋅ Zhenwen Liang ⋅ Rui Liu ⋅ Fuxiao Liu ⋅ Jingxi Chen ⋅ Dian Yu ⋅ Jordan Boyd-Graber ⋅ Haitao Mi ⋅ Dong Yu


Abstract

Vision-Language Models (VLMs) often suffer from visual hallucinations (generating content that is inconsistent with the visual input) and language shortcuts, where they bypass visual perception and rely on text priors alone. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and external signals can add substantial latency. In this paper, we introduce Vision-SR1, a three-stage self-rewarding reinforcement learning method that improves visual reasoning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two components, visual reasoning and language reasoning. The model is first prompted to produce a self-contained visual description sufficient to answer the question without referring back to the input image; visual and language reasoning are then jointly optimized through our multi-reward loss objective. To validate this self-containment, the same VLM is re-prompted to perform language reasoning using only the generated visual reasoning as input, and the outcome is used to compute the visual reward. The final reward is computed through a decoupled reward-advantage framework, in which the visual reward and the language reasoning reward each have their advantages, log probabilities, and KL divergence calculated separately. This decoupling enables more fine-grained reward computation by preventing the entanglement of heterogeneous reward signals. Our experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks. It is also more efficient than methods that rely on external visual reward models, which require additional GPUs to host; Vision-SR1 introduces no extra GPU overhead beyond that of standard training.
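
The self-rewarding check and the decoupled advantage computation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`visual_reward`, `group_advantages`, `decoupled_advantages`), the exact-match scoring, and the group-wise mean/std normalization are all assumptions made for illustration; the abstract states only that the visual and language rewards have their advantages computed separately. The log-probability and KL terms are omitted.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


def visual_reward(
    answer_from_text: Callable[[str, str], str],  # hypothetical: (question, description) -> answer
    question: str,
    description: str,
    gold_answer: str,
) -> float:
    """Return 1.0 if the answer can be recovered from the description alone.

    `answer_from_text` stands in for re-prompting the same VLM WITHOUT the image.
    Exact string match is an assumed scoring rule, not one taken from the paper.
    """
    predicted = answer_from_text(question, description)
    return 1.0 if predicted.strip().lower() == gold_answer.strip().lower() else 0.0


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled rollouts (assumed scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(
    visual_rewards: Sequence[float],
    language_rewards: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Compute one advantage stream per reward type instead of summing rewards first."""
    return group_advantages(visual_rewards), group_advantages(language_rewards)


if __name__ == "__main__":
    # Stub "model" that always answers "Red"; a real setup would call the VLM.
    stub = lambda question, description: "Red"
    print(visual_reward(stub, "What color is the car?", "A red car on a road.", "red"))

    # Four rollouts for one prompt, with separate visual and language rewards.
    vis_adv, lang_adv = decoupled_advantages([1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0])
    print(vis_adv)
    print(lang_adv)
```

Keeping the two streams separate means a rollout whose description is self-contained but whose final answer is wrong still receives a positive visual advantage, which a single summed reward could blur.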
