P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
Ruipeng Zhang · Zhihao Li · Haozhang Yuan · C. L. Philip Chen · Tong Zhang
Abstract
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) mitigates hallucination by learning directly from human-corrected preferences. Despite its success, this paradigm has yet to specifically target two critical causes of visual hallucination: the perceptual bottleneck in attended regions and insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are constructed by directly editing textual outputs without visual signals, and their off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, directly addressing the identified visual bottlenecks while inherently avoiding vision-agnostic and off-policy data. It introduces: (1) an on-policy preference-pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss that precisely aligns visual signals with the causal generation of text. Experimental results demonstrate that, with a comparable amount of training data and cost, P$^2$-DPO outperforms even state-of-the-art methods that rely on costly human feedback on benchmarks such as POPE and MMHal-Bench. Furthermore, evaluations on Attention Region Fidelity (ARF) and image-degradation scenarios validate the effectiveness of P$^2$-DPO in addressing perceptual bottlenecks in attended regions and improving Visual Robustness against degraded inputs.
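As context for the summary above, P$^2$-DPO builds on the standard DPO objective, which contrasts a preferred response $y_w$ with a rejected response $y_l$ for the same input $x$ (here, an image-text prompt), training the policy $\pi_\theta$ against a frozen reference model $\pi_{\mathrm{ref}}$. The sketch below shows only this standard objective; the exact form of the proposed Calibration Loss and on-policy pair construction is not specified in this section and is therefore not reproduced here.
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]
where $\sigma$ is the logistic function and $\beta$ controls the strength of the implicit KL constraint toward $\pi_{\mathrm{ref}}$. In P$^2$-DPO, the pairs $(y_w, y_l)$ are generated on-policy by the model itself rather than obtained by editing model outputs without visual signals.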