

Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data

Colin Conwell · Jacob Prince · Christopher Hamblin · George Alvarez

Keywords: [ CLIP ] [ visual representation ] [ neural regression ] [ language-alignment ] [ SimCLR ] [ Multimodality ]


Abstract:

One of the core algorithmic forces driving the development of modern foundation models is the use of contrastive language alignment to facilitate more robust visual representation learning. The clear benefits conferred by CLIP-style multimodal objective functions in computer vision have generated a frenzy of interest in the application of these models to a long-debated question in cognitive neuroscience: to what extent does language shape perceptual representation in the human mind? In this work, we explore this question in two distinct domains: the prediction of brain activity in the human ventral visual system (as measured by high-resolution fMRI), and the prediction of visually evoked affect in human image assessment (as measured by self-report). In both of these cases, we leverage popular open-source foundation models (e.g. OpenAI's CLIP) in conjunction with empirically controlled alternatives (e.g. Meta AI's SLIP models) to better isolate the effects of language alignment while holding architecture and dataset constant. These controlled experiments offer mixed evidence regarding the influence of language on perceptual representation: specifically, when architecture and dataset are held constant, we find no evidence that language alignment improves the brain predictivity of vision models, but we do find strong evidence that it increases the predictivity of behavioral image assessments. We offer these examples as a case study in the urgency of injecting greater empirical control into the development and evaluation of foundation models, whose emergent properties may be attributable to a variety of sources that only systematic model comparison can fully disentangle.
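
The abstract does not spell out the regression procedure, so the following is only a minimal sketch of the kind of controlled comparison it describes: image features from two backbones that share architecture and training data but differ in whether a contrastive language-alignment objective was used (e.g. SimCLR-only vs. CLIP-style), each scored by how well cross-validated ridge regression maps those features onto fMRI voxel responses or behavioral ratings. The file names, alpha grid, and cross-validation settings below are hypothetical placeholders, not the authors' pipeline.

```python
# Sketch of a controlled "neural regression" comparison (assumed setup, not the paper's code).
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def predictivity(features: np.ndarray, targets: np.ndarray, n_splits: int = 5) -> float:
    """Mean cross-validated Pearson r between predicted and observed targets."""
    model = make_pipeline(
        StandardScaler(),
        RidgeCV(alphas=np.logspace(-2, 5, 8)),  # regularization strength chosen by internal CV
    )
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    preds = cross_val_predict(model, features, targets, cv=cv)
    # Average correlation across target dimensions (voxels or rating dimensions).
    rs = [pearsonr(preds[:, j], targets[:, j])[0] for j in range(targets.shape[1])]
    return float(np.mean(rs))


if __name__ == "__main__":
    # Hypothetical precomputed arrays: one row per stimulus image.
    targets = np.load("voxel_responses.npy")        # (n_images, n_voxels) fMRI responses
    feats_simclr = np.load("simclr_features.npy")   # vision-only objective
    feats_clip = np.load("clip_features.npy")       # language-aligned objective

    print("SimCLR predictivity:", predictivity(feats_simclr, targets))
    print("CLIP predictivity:  ", predictivity(feats_clip, targets))
```

Holding the feature-extraction and regression machinery fixed while swapping only the training objective is what lets a difference in predictivity be attributed to language alignment rather than to architecture or dataset.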
