Vision Language Models are Biased
Abstract
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but can also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy on counting (e.g., counting the stripes in an Adidas-like logo) across 7 diverse domains, from animals, logos, chess, and game boards to optical illusions and patterned grids. Removing image backgrounds nearly doubles accuracy (+21.09 points), revealing that background visual cues trigger these biased responses. Further analysis of VLMs’ reasoning patterns shows that counting accuracy initially rises with the number of thinking tokens, peaking at ∼40%, before declining as models overthink. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases.
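
The counterfactual counting evaluation summarized above can be pictured as a loop that shows a VLM a subtly modified image, asks an objective counting question, and checks whether the model reports the true (modified) count or the canonical count it has memorized. The following is a minimal sketch of such a harness under stated assumptions: the query_vlm stub, the file names, and the example record are hypothetical placeholders, not the authors' released framework.

"""
Hypothetical sketch of a counterfactual counting evaluation.
The query_vlm stub, file names, and example data are illustrative assumptions.
"""
import re
from dataclasses import dataclass

@dataclass
class CounterfactualExample:
    image_path: str    # e.g., an Adidas-like logo edited to have 4 stripes
    prompt: str        # objective counting question about the image
    true_count: int    # ground-truth count in the modified image
    biased_count: int  # count a model relying on prior knowledge would give

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a real VLM API; replace with an actual client."""
    return "The logo has 3 stripes."  # canned response so the sketch runs end to end

def extract_count(answer: str) -> int | None:
    """Pull the first integer out of the model's free-form answer."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

def evaluate(examples: list[CounterfactualExample]) -> dict:
    """Compute counting accuracy and how often the memorized (biased) count is given."""
    correct = biased = 0
    for ex in examples:
        predicted = extract_count(query_vlm(ex.image_path, ex.prompt))
        correct += predicted == ex.true_count
        biased += predicted == ex.biased_count
    n = len(examples)
    return {"accuracy": correct / n, "bias_rate": biased / n}

if __name__ == "__main__":
    examples = [
        CounterfactualExample(
            "adidas_4_stripes.png",
            "How many stripes are in this logo? Answer with a number.",
            true_count=4, biased_count=3,
        ),
    ]
    print(evaluate(examples))

With the canned stub response, the harness reports 0% accuracy and a 100% bias rate on the single example, illustrating how the two metrics separate genuine counting from answers driven by memorized prior knowledge.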