Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
Abstract
Multimodal models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capabilities. To address this, we introduce a group matching score that better leverages group structure and uncovers substantial hidden competence in both contrastive vision–language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden competence into higher scores under the original evaluation metric, closing much of the reported gap. With this adjustment, GPT-4.1 becomes the first system to surpass estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative self-training algorithm that bootstraps model performance without any external supervision. TTM delivers further non-trivial improvements: for example, SigLIP-B16 with TTM surpasses GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM is broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains exceeding 85.7% on challenging datasets such as Whatsup. Across 16 datasets and variants, our experiments consistently demonstrate that TTM unlocks hidden compositional reasoning ability and advances the frontier of multimodal evaluation.
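To make the group matching idea concrete, the sketch below shows one plausible reading of a group matching score, not the paper's actual implementation: for a Winoground-style group of two images and two captions whose ground-truth pairing is the identity, a group counts as solved only when the highest-scoring one-to-one assignment between images and captions is that ground-truth pairing. The function name `group_match_score`, the similarity values, and the use of a raw similarity matrix are all illustrative assumptions.

```python
# Hypothetical sketch of a group matching score (illustrative, not the
# paper's implementation). A Winoground-style group pairs n images with
# n captions; the ground truth is the identity pairing (image i <-> caption i).
from itertools import permutations

import numpy as np


def group_match_score(sim_matrix: np.ndarray) -> float:
    """Return 1.0 if the best one-to-one image-caption assignment under the
    similarity matrix (rows = images, columns = captions) is exactly the
    ground-truth identity pairing, else 0.0."""
    n = sim_matrix.shape[0]
    best = max(permutations(range(n)),
               key=lambda p: sum(sim_matrix[i, p[i]] for i in range(n)))
    return float(best == tuple(range(n)))


# Toy 2x2 group: in isolation, image 0 slightly prefers the wrong caption
# (0.60 > 0.58), so a per-item accuracy metric marks it wrong, yet the best
# joint matching is still the correct pairing (0.58 + 0.75 > 0.60 + 0.30).
sim = np.array([[0.58, 0.60],
                [0.30, 0.75]])
print(group_match_score(sim))  # 1.0
```

Under this reading, the group-level score can credit competence that per-item metrics miss, which is the effect the abstract attributes to the group matching score; the metric defined in the paper may differ in its details.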