Poster in Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI
DETECTING UNRELIABLE RESPONSES IN GENERATIVE VISION-LANGUAGE MODELS VIA VISUAL UNCERTAINTY
Kiana Avestimehr · Emily Aye · Zalan Fabian · Erum Mushtaq
Keywords: [ Visual Uncertainty ] [ Multimodal Uncertainty Estimation ]
Uncertainty estimation (UE) is critical for assessing the reliability of generative vision-language models (VLMs). Existing UE approaches often require access to internal model representations to train an uncertainty estimator, which may not always be feasible. Black-box methods primarily rely on language-based augmentations, such as question rephrasings or sub-question modules, to detect unreliable generations. However, the role of visual information in UE remains largely underexplored. To study this aspect of the problem, we investigate a visual contrast approach that perturbs the input image by removing visual evidence relevant to the question and measures the resulting change in the output distribution. We hypothesize that for unreliable generations, the output token distributions produced from the augmented and unaugmented images remain similar despite the removal of key visual information in the augmented image. We evaluate this method on the adversarial OKVQA dataset using four popular pre-trained VLMs. Our results demonstrate that visual contrast, even when applied only at the first token, can be as effective as, and in some cases superior to, existing state-of-the-art probability-based black-box methods.
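
The sketch below illustrates one possible reading of the visual contrast idea described above; it is not the authors' implementation. The functions first_token_probs and mask_relevant_evidence are hypothetical placeholders for querying a VLM's first-token distribution and for removing question-relevant visual evidence, and the symmetric KL divergence is only one plausible choice of contrast measure.

# Minimal sketch of the visual-contrast idea from the abstract (assumptions:
# hypothetical helpers for VLM inference and evidence removal; symmetric KL
# as the divergence between first-token distributions).

import numpy as np

def first_token_probs(image, question):
    """Hypothetical: query a VLM and return its probability distribution over the first generated token."""
    raise NotImplementedError

def mask_relevant_evidence(image, question):
    """Hypothetical: remove the image regions relevant to the question (e.g., by masking or blurring)."""
    raise NotImplementedError

def visual_contrast_score(image, question, eps=1e-12):
    """Divergence between first-token distributions with and without key visual evidence.

    A small divergence means the answer distribution barely changes when the
    relevant evidence is removed, which the abstract hypothesizes signals an
    unreliable (not visually grounded) generation.
    """
    p = np.asarray(first_token_probs(image, question), dtype=float)
    q = np.asarray(first_token_probs(mask_relevant_evidence(image, question), question), dtype=float)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    # Symmetric KL divergence between the two first-token distributions.
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Usage (threshold selection is not specified in the abstract):
# score = visual_contrast_score(image, question)
# unreliable = score < threshold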