

Poster in Workshop: World Models: Understanding, Modelling and Scaling

Object-Centric Representations Generalize Better Compositionally with Less Compute

Ferdinand Kapl · Amir Mohammad Karimi Mamaghan · Max Horn · Carsten Marr · Stefan Bauer · Andrea Dittadi

Keywords: [ object-centric learning ] [ visual question answering ] [ compositional generalization ]


Abstract:

Compositional generalization—the ability to reason about novel combinations of familiar concepts—is fundamental to human cognition and a critical challenge for machine learning. Object-centric representation learning has been proposed as a promising approach for achieving this capability. However, systematic evaluation of these methods in visually complex settings remains limited. In this work, we introduce a benchmark to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. Using CLEVRTex-style images, we create multiple training splits with partial coverage of object property combinations and generate question-answer pairs involving object properties to assess compositional generalization on a held-out test set. We compare visual representations from a strong pretrained baseline, DINOv2, with DINOSAURv2, which extracts object-centric representations from the same DINOv2 backbone, controlling for differences in feature map sizes and compute budgets. Our key findings reveal that object-centric approaches (1) converge faster on in-distribution data but underperform slightly when non-object-centric models are given a significant compute advantage, and (2) exhibit superior compositional generalization, outperforming DINOv2 on unseen combinations of object properties while requiring approximately eight times less downstream compute.
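A minimal sketch of how such a compositional split might be constructed, assuming CLEVRTex-style shape and material properties (the specific property inventories, coverage ratio, and function names below are hypothetical, not taken from the paper): enumerate all shape-material combinations, assign a subset to training, and hold out the rest for the test set, while ensuring every individual property value still appears in training so that only the combinations, not the properties themselves, are novel at test time.

    from itertools import product
    import random

    # Hypothetical property inventories in the spirit of CLEVRTex; the
    # paper's actual property sets and coverage ratios are not given here.
    SHAPES = ["cube", "sphere", "cylinder", "monkey"]
    MATERIALS = ["metal", "marble", "brick", "fabric"]

    def make_compositional_split(coverage=0.5, seed=0):
        """Partition all (shape, material) combinations so training covers
        only a fraction of them; the held-out combinations probe
        compositional generalization. Resample until every individual
        shape and material still occurs in the training split."""
        rng = random.Random(seed)
        combos = list(product(SHAPES, MATERIALS))
        n_train = int(coverage * len(combos))
        while True:
            rng.shuffle(combos)
            train, test = combos[:n_train], combos[n_train:]
            if ({s for s, _ in train} == set(SHAPES)
                    and {m for _, m in train} == set(MATERIALS)):
                return train, test

    train_combos, test_combos = make_compositional_split()
    print(f"train: {len(train_combos)} combos, test: {len(test_combos)} combos")

Consistent with the abstract's setup, question-answer pairs about object properties would then be generated only over the combinations present in each split, so downstream models never see the held-out combinations during training.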
