Spotlight in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations

Shashank Shekhar ⋅ Florian Bordes ⋅ Pascal Vincent ⋅ Ari Morcos

Keywords: vision transformer, Representation Similarity, Masked Image Modelling, Reconstruction Based Learning, Joint Embedding Learning, Contrastive Modelling, ViT, self supervised learning

2023 Spotlight

Project Page [OpenReview]

Abstract

Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of their representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network, primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that it re-organizes the information to be more similar to pre-trained joint-embedding models.
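
The abstract compares how similar the features of different models are, but does not say which similarity measure is used. As an illustration only, the sketch below computes linear Centered Kernel Alignment (CKA), one widely used representation-similarity metric; its use here is an assumption, not a statement of the authors' method. The function name `linear_cka` and the random stand-in activations are hypothetical. Each input is a matrix of layer activations with one row per example, and the two matrices must cover the same examples in the same order.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    x has shape (n_examples, d1) and y has shape (n_examples, d2).
    The score lies in [0, 1]; 1 means the two representations match
    up to an orthogonal transform and isotropic scaling.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of examples")

    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    # Random stand-ins for activations from two hypothetical models.
    rng = np.random.default_rng(0)
    feats_a = rng.normal(size=(100, 16))
    feats_b = rng.normal(size=(100, 32))
    print("CKA(a, a):", linear_cka(feats_a, feats_a))
    print("CKA(a, b):", linear_cka(feats_a, feats_b))
```

In practice, the inputs would be activations extracted from matching or different layers of two pre-trained models on a shared evaluation batch, and the pairwise scores could be arranged into a layer-by-layer similarity matrix.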
