

Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Modality-Aware Adaptation of Contrastive Language-Image Models

Alexander Long · Thalaiyasingam Ajanthan · Anton van den Hengel

Keywords: [ vision-language ] [ CLIP ] [ foundation models ] [ few-shot learning ] [ CLIP adapter ]


Abstract:

Despite their high levels of robustness, Contrastive Language-Image Models (CLIP) still require some form of downstream adaptation when applied to tasks sufficiently out-of-domain with respect to their training set. Recent methods propose lightweight adapters on top of the model features and show strong performance, primarily in the few-shot setting. All such approaches, however, require per-task hyperparameter tuning, which necessitates access to a validation set and limits their applicability in practice. As an alternative, we propose Modality Aware Tangent-space Retrieval (MATeR), a training-free, interpretable adapter that outperforms all recent methods when per-task hyperparameter tuning is prohibited. MATeR considers the manifold formed by CLIP embeddings when incorporating out-of-domain few-shot class information, and its predictions are invariant to the modality gap; it is the first approach that uses the geometric structure of the CLIP latent space to inform downstream task adaptation. Additionally, we demonstrate that a variant of MATeR significantly increases zero-shot accuracy with only a handful of unlabelled images, far fewer than the number of classes.
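The abstract does not spell out MATeR's construction, but the "modality gap" it refers to is a well-documented property of contrastive language-image models: image and text embeddings, although both L2-normalised, cluster in two separated regions of the unit sphere, and the gap is commonly summarised as the distance between the two modality centroids. The sketch below illustrates that phenomenon and what gap-invariance would mean, using synthetic stand-ins for CLIP features; the embedding dimension, cluster spread, and the centroid-centring step are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d = 512              # e.g. the CLIP ViT-B/32 embedding dimension (assumption)
n_img, n_txt = 64, 64

# Synthetic stand-ins for L2-normalised CLIP embeddings: each modality is
# clustered around its own direction on the unit sphere, mimicking the
# modality gap observed in contrastive language-image models.
img_centre = F.normalize(torch.randn(d), dim=0)
txt_centre = F.normalize(torch.randn(d), dim=0)
img_feats = F.normalize(img_centre + 0.3 * torch.randn(n_img, d), dim=-1)
txt_feats = F.normalize(txt_centre + 0.3 * torch.randn(n_txt, d), dim=-1)

# The gap is commonly measured as the distance between modality centroids.
gap = (img_feats.mean(0) - txt_feats.mean(0)).norm()
print(f"modality gap (centroid distance): {gap:.3f}")

# Centring each modality on its own centroid (then re-normalising) is one
# simple way to collapse the gap; an adapter whose predictions are invariant
# to the modality gap should behave the same before and after such a shift.
img_centred = F.normalize(img_feats - img_feats.mean(0), dim=-1)
txt_centred = F.normalize(txt_feats - txt_feats.mean(0), dim=-1)
gap_after = (img_centred.mean(0) - txt_centred.mean(0)).norm()
print(f"modality gap after centring:      {gap_after:.3f}")
```

With real CLIP features the same centroid-distance measurement applies; the synthetic vectors here only stand in for model outputs so the example runs without downloading a checkpoint.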
