Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Abstract
World models (WMs) serve as an agent’s compressed internal representation of its environment, enabling the simulation of state transitions. Traditional, environment-specific WMs struggle to generalize across domains with distinct dynamics. Recent large-scale Vision-Language Models (VLMs) offer a potential path toward “generalist” WMs, yet their world-modeling abilities have not been systematically evaluated. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that characterizes a world model in terms of perception (visual, spatial, temporal, quantitative, and motion) and prediction (mechanistic simulation, transitive inference, and compositional inference). Guided by this framework, we introduce a large-scale, simulator-based dataset with controlled interventions and counterfactual simulations to assess whether VLMs possess an internal world model. Our findings show that although VLMs perform well on low-level visual perception, they struggle with spatiotemporal perception and fail to capture core real-world dynamics, remaining far from human-level competency.