Efficient Multimodal Spatial Reasoning via Dynamic and Asymmetric Routing
Abstract
Recently, visualization-of-thought (VoT) has unlocked new opportunities for complex spatial reasoning in multimodal large language models (MLLMs) by complementing verbal reasoning with visual thinking. However, the autoregressive accumulation of lengthy and redundant tokens substantially increases computation and memory costs. In this paper, we present DARE, an efficient framework for multimodal spatial reasoning that adaptively prunes multimodal tokens across network depths, reasoning hops, and modalities. First, DARE employs an intra- and inter-hop-aware differentiable retention mechanism to dynamically estimate token importance both within each reasoning step and across successive hops. Second, recognizing that deeper network layers encode visual cues into the verbal stream, DARE introduces an asymmetric compression strategy that prunes tokens according to modality-specific redundancy and semantic importance. Furthermore, DARE incorporates a progressive KV-cache retention policy aligned with cross-modal fusion dynamics, further reducing memory overhead during autoregressive reasoning. Our method delivers substantial reductions in computation and memory footprint, averaging a 40.37\% reduction in FLOPs and a 46.07\% reduction in KV-cache usage, while consistently preserving or even improving reasoning performance across seven multimodal spatial reasoning benchmarks and generalizing to broader multimodal reasoning tasks, establishing a scalable and robust recipe for efficient multimodal reasoning.
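The abstract describes DARE's mechanisms only at a high level, without implementation details. The following is a minimal, hypothetical sketch of the general idea of importance-scored token pruning with matching KV-cache retention; the function names (retention_scores, prune_tokens), the attention-style scoring, and the top-k selection are illustrative assumptions, not the paper's actual method.

\begin{verbatim}
# Minimal sketch (assumed, not DARE's implementation) of importance-based
# multimodal token pruning with matching top-k KV-cache retention.
import torch
import torch.nn.functional as F


def retention_scores(hidden: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Score each token's importance via scaled dot-product with a query
    summarizing the current reasoning hop (hidden: [T, D], query: [D])."""
    logits = hidden @ query / hidden.shape[-1] ** 0.5   # [T]
    return F.softmax(logits, dim=-1)                    # differentiable scores


def prune_tokens(hidden: torch.Tensor, kv: torch.Tensor,
                 scores: torch.Tensor, keep_ratio: float):
    """Keep the top `keep_ratio` fraction of tokens and their KV entries."""
    k = max(1, int(keep_ratio * hidden.shape[0]))
    keep = scores.topk(k).indices.sort().values         # preserve token order
    return hidden[keep], kv[:, keep], keep


if __name__ == "__main__":
    T, D = 256, 64                        # toy sequence length / hidden size
    hidden = torch.randn(T, D)            # mixed visual + verbal token states
    kv = torch.randn(2, T, D)             # toy per-layer key/value cache
    query = hidden.mean(dim=0)            # crude hop-level summary query

    scores = retention_scores(hidden, query)
    # A single keep_ratio is used here; modality-specific (asymmetric) ratios
    # would instead prune visual and verbal tokens at different rates.
    hidden_kept, kv_kept, kept_idx = prune_tokens(hidden, kv, scores, 0.5)
    print(hidden_kept.shape, kv_kept.shape, kept_idx.shape)
\end{verbatim}

In this toy setting, pruning the KV cache alongside the hidden states is what reduces memory during autoregressive decoding; the paper's progressive retention policy and hop-aware scoring would replace the crude mean-pooled query and fixed ratio used above.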