Poster in Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI
[TINY] Vision language models can implicitly quantify aleatoric uncertainty
Xi Wang · Eric Nalisnick
Keywords: [ Vision language model; Uncertainty; Multi-modality ]
Recent advances in vision language models (VLMs), such as GPT-4o, have revolutionized visual reasoning by enabling zero-shot task completion through natural language instructions. In this paper, we study VLMs' ability to detect ambiguous inputs, i.e., to capture aleatoric uncertainty. Our key finding is that VLMs can effectively identify ambiguous inputs when the prompt simply instructs them to output "Unknown" when uncertain. Through experiments on corrupted ImageNet and "OOD" detection tasks, we demonstrate that VLMs reject uncertain inputs while maintaining high accuracy on confident predictions. This ability to quantify uncertainty implicitly emerges without additional training or in-context learning, distinguishing VLMs from traditional vision models, which often produce overconfident predictions on ambiguous inputs.
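To make the prompting idea concrete, below is a minimal sketch (not the authors' code) of how one might instruct a VLM to abstain with "Unknown". It assumes the OpenAI Python client and a model name of "gpt-4o"; the prompt wording, label set, and function name are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: ask a VLM to classify an image, or abstain with "Unknown".
# Assumes the OpenAI Python client (`pip install openai`) and OPENAI_API_KEY set.
import base64
from openai import OpenAI

client = OpenAI()

def classify_or_abstain(image_path: str, labels: list[str]) -> str:
    """Ask a VLM for a label, instructing it to answer 'Unknown' when uncertain."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Illustrative prompt; the paper's exact instruction may differ.
    prompt = (
        "Classify the image into exactly one of these labels: "
        + ", ".join(labels)
        + ". If the image is ambiguous or fits none of the labels, answer 'Unknown'. "
        "Reply with the label only."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Example usage: a reply of "Unknown" is treated as rejecting the ambiguous input.
# print(classify_or_abstain("corrupted_sample.jpg", ["cat", "dog", "car"]))
```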