Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models
Logan Mann ⋅ Yi Xia ⋅ Ajit Saravanan ⋅ Ishan Dave ⋅ Saadullah Ismail ⋅ Shikhar Shiromani ⋅ Emily Huang ⋅ Ruizhe Li ⋅ Kevin Zhu
Abstract
Reliable deployment of vision-language models (VLMs) requires detecting hallucinations. The Attention-Confidence Assumption states that answers are trustworthy when cross-attention is spatially focused on relevant image regions. We test this hypothesis with LLaVA Probe, analyzing three VLMs (LLaVA-1.5, PaliGemma, and Qwen2-VL) on POPE, LLaVA-Bench, and custom counting/spatial tasks. We extract cross-attention maps and quantify their spatial structure using cluster count ($C_{k}$), spatial entropy ($H_{s}$), and layer-wise dynamics ($\Delta H_{s}$). Across 50k+ samples, these structural attention metrics provide almost no signal for correctness (e.g., $R\approx0.001$), even though attention is causally necessary for performance. In contrast, reliability is best predicted by generation dynamics: self-consistency across sampled outputs correlates with accuracy ($R=0.429$), and perfect agreement yields precision above 90%. Overall, reliability signals in current VLMs are largely detached from attention heatmaps and are better recovered from the language model's generative behavior.
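To make the two kinds of signal concrete, the following is a minimal sketch and not the authors' implementation. It assumes that spatial entropy $H_{s}$ is the Shannon entropy of an attention map normalized to sum to 1, and that self-consistency is the fraction of sampled answers matching the most common answer; the abstract fixes neither definition. The function names `spatial_entropy` and `agreement` are hypothetical.

```python
import math
from collections import Counter


def spatial_entropy(attn):
    """Shannon entropy (in nats) of a 2D attention map.

    Assumption: the map is normalized to a probability distribution
    over image patches before the entropy is taken.
    """
    flat = [v for row in attn for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    # Zero-weight patches contribute nothing to the entropy.
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)


def agreement(answers):
    """Fraction of sampled answers equal to the most common answer.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; a value of 1.0 corresponds to perfect agreement.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)
```

Under these assumptions, an attention map focused on a single patch has entropy 0 and a uniform map over $n$ patches has entropy $\log n$, while `agreement` returns 1.0 only when every sampled answer matches.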