

Spotlight in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

Shashank Shekhar · Florian Bordes · Pascal Vincent · Ari Morcos

Keywords: [ self supervised learning ] [ ViT ] [ Contrastive Modelling ] [ Joint Embedding Learning ] [ Reconstruction Based Learning ] [ Masked Image Modelling ] [ Representation Similarity ] [ vision transformer ]


Abstract:

Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of their representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network and are primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that it re-organizes the information to be more similar to that of pre-trained joint-embedding models.
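
The abstract does not say which measure is used to compare features across models. As an illustration only, the sketch below implements linear Centered Kernel Alignment (CKA), one common representation-similarity score; choosing CKA here is an assumption, not a statement of the authors' method. The function names (`center_columns`, `cross_gram_sq_norm`, `linear_cka`) are hypothetical, and the code uses only the Python standard library so that it stays self-contained. Each input is a list of rows, one row per image and one column per feature, with the same images in the same order for both models.

```python
import math


def center_columns(features):
    """Subtract the per-feature mean so every column has zero mean."""
    n = len(features)
    dims = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in features]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two row-aligned feature matrices."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(row_a[i] * row_b[j] for row_a, row_b in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two sets of features computed on the same inputs.

    Assumes neither matrix is constant across inputs; otherwise the
    denominator is zero and a ZeroDivisionError is raised.
    """
    x = center_columns(x)
    y = center_columns(y)
    numerator = cross_gram_sq_norm(x, y)
    denominator = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return numerator / denominator


# Toy features for four inputs (made-up numbers, purely illustrative).
feats_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
# The same features rotated by 90 degrees: a different basis, same geometry.
feats_b = [[-v, u] for u, v in feats_a]
score = linear_cka(feats_a, feats_b)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so `score` above equals 1 up to floating-point error even though the two matrices differ entry by entry. That invariance is what makes a score of this kind usable across models whose feature dimensions and bases do not line up.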
