Skip to yearly menu bar Skip to main content


Poster
in
Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

The Independent Compositional Subspace Hypothesis for the Structure of CLIP's Last Layer

Max Wolff · Wieland Brendel · Stuart Wolff


Abstract:

In this paper, we propose a hypothesis which posits that CLIP disentangles compositional visual attributes into orthogonal, independent subspaces which CLIP uses to build compositional representations of images. Our hypothesis suggests that CLIP learns compositional techniques that are similar to humans'. We find five core compositional attributes predicted by the hypothesis: color, size, counting, camera view, and pattern. We empirically test their properties and find that they code for their respective compositional attribute type and are essentially orthogonal to one another, as well as the subject of the image.

Chat is not available.