Towards Spatial Supersensing in Video
Abstract
We frame spatial supersensing in video as an overarching goal for multimodal intelligence and argue that progress requires a shift from brute-force long-context processing to predictive sensing. Using a four-level taxonomy (semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling), we audit existing benchmarks and show that they focus heavily on the first tier, offer only partial coverage of streaming and spatial cognition, and almost never test true world modeling. To ground these gaps, we introduce VSI-Super, a two-part benchmark for continual spatial sensing: VSO (long-horizon spatial observation and recall) and VSC (continual counting under changing viewpoints and scenes). These tasks admit arbitrarily long video inputs and are specifically constructed so that simply scaling token counts or context length is not sufficient. Within the current paradigm, we improve spatial cognition by curating VSI-590K and training a new family of video MLLMs that achieve a 30% absolute improvement on VSI-Bench without sacrificing general semantic perception. Yet these models still underperform on VSI-Super, exposing a paradigm gap. We then prototype predictive sensing: a self-supervised next-latent-frame predictor whose prediction error ("surprise") drives long-horizon memory and event segmentation. On VSI-Super, this approach substantially outperforms leading video MLLMs, demonstrating that advancing spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
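To make the predictive-sensing idea concrete, the sketch below illustrates surprise-driven event segmentation with a next-latent-frame predictor: when the prediction error between the predicted and observed next-frame latent exceeds a threshold, a new event boundary is declared and memory can be consolidated. This is a minimal illustration under assumed components, not the paper's implementation; the MLP predictor, the `segment_by_surprise` helper, and the fixed threshold are all hypothetical choices.

```python
# Minimal sketch (assumptions, not the paper's implementation): surprise-driven
# event segmentation using a self-supervised next-latent-frame predictor.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentFramePredictor(nn.Module):
    """Hypothetical stand-in: predicts the next frame latent from the current one."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        return self.net(z_t)


def segment_by_surprise(latents: torch.Tensor, predictor: nn.Module, threshold: float = 1.0):
    """Return frame indices where prediction error ("surprise") exceeds a threshold.

    latents: (T, D) sequence of per-frame latent features.
    High-surprise transitions are treated as event boundaries; each segment's
    latents could then be consolidated into a long-horizon memory.
    """
    boundaries = []
    with torch.no_grad():
        for t in range(latents.shape[0] - 1):
            pred_next = predictor(latents[t])
            surprise = F.mse_loss(pred_next, latents[t + 1])
            if surprise.item() > threshold:  # poorly predicted frame -> new event
                boundaries.append(t + 1)
    return boundaries


# Usage on a toy latent sequence (random features purely for illustration).
latents = torch.randn(100, 256)
predictor = LatentFramePredictor(dim=256)
print(segment_by_surprise(latents, predictor, threshold=1.5))
```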