

Poster in Workshop: World Models: Understanding, Modelling and Scaling

Object-Centric Representations Generalize Better Compositionally with Less Compute

Ferdinand Kapl · Amir Mohammad Karimi Mamaghan · Max Horn · Carsten Marr · Stefan Bauer · Andrea Dittadi

Keywords: [ object-centric learning ] [ visual question answering ] [ compositional generalization ]


Abstract:

Compositional generalization—the ability to reason about novel combinations of familiar concepts—is fundamental to human cognition and a critical challenge for machine learning. Object-centric representation learning has been proposed as a promising approach for achieving this capability. However, systematic evaluation of these methods in visually complex settings remains limited. In this work, we introduce a benchmark to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. Using CLEVRTex-style images, we create multiple training splits with partial coverage of object property combinations and generate question-answer pairs involving object properties to assess compositional generalization on a held-out test set. We compare visual representations from a strong pretrained baseline, DINOv2, with DINOSAURv2, which extracts object-centric representations from the same DINOv2 backbone, controlling for differences in feature map sizes and compute budgets. Our key findings reveal that object-centric approaches (1) converge faster on in-distribution data but underperform slightly when non-object-centric models are given a significant compute advantage, and (2) exhibit superior compositional generalization, outperforming DINOv2 on unseen combinations of object properties while requiring approximately eight times less downstream compute.
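A minimal sketch of how such a compositional split might be constructed, assuming CLEVRTex-style shape and material properties (the specific property inventories, coverage ratio, and function names below are hypothetical, not taken from the paper): enumerate all shape-material combinations, assign a subset to training, and hold out the rest for the test set, while ensuring every individual property value still appears in training so that only the combinations, not the properties themselves, are novel at test time.

    from itertools import product
    import random

    # Hypothetical property inventories in the spirit of CLEVRTex; the
    # paper's actual property sets and coverage ratios are not given here.
    SHAPES = ["cube", "sphere", "cylinder", "monkey"]
    MATERIALS = ["metal", "marble", "brick", "fabric"]

    def make_compositional_split(coverage=0.5, seed=0):
        """Partition all (shape, material) combinations so training covers
        only a fraction of them; the held-out combinations probe
        compositional generalization. Resample until every individual
        shape and material still occurs in the training split."""
        rng = random.Random(seed)
        combos = list(product(SHAPES, MATERIALS))
        n_train = int(coverage * len(combos))
        while True:
            rng.shuffle(combos)
            train, test = combos[:n_train], combos[n_train:]
            if ({s for s, _ in train} == set(SHAPES)
                    and {m for _, m in train} == set(MATERIALS)):
                return train, test

    train_combos, test_combos = make_compositional_split()
    print(f"train: {len(train_combos)} combos, test: {len(test_combos)} combos")

Consistent with the abstract's setup, question-answer pairs about object properties would then be generated only over the combinations present in each split, so downstream models never see the held-out combinations during training.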
