Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Abstract
World models (WMs) serve as an agent’s compressed internal representation of its environment, enabling the simulation of state transitions. Traditional, environment-specific WMs struggle to generalize across domains with distinct dynamics. Recent large-scale Vision-Language Models (VLMs) offer a potential path toward “generalist” WMs, yet their world-modeling abilities have not been systematically evaluated. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that characterizes a world model in terms of perception (visual, spatial, temporal, quantitative, and motion) and prediction (mechanistic simulation, transitive inference, and compositional inference). Guided by this framework, we introduce a large-scale, simulator-based dataset with controlled interventions and counterfactual simulations to assess whether VLMs possess an internal world model. Our findings show that although VLMs perform well on low-level visual perception, they struggle with spatiotemporal perception and fail to capture core real-world dynamics, remaining far from human-level competency.