Poster in Workshop: ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond

Can Vision Models Process Physiological Signals? Exploring Visual Tokenization as a Representation Interface

Westby ⋅ Li Meng ⋅ Anis Yazidi ⋅ Ali Ramezani-Kebrya


Abstract

Multimodal foundation models increasingly use unified representation interfaces, but the tradeoffs between architectural unification and domain-specific tokenization remain unclear. We explore visual tokenization as a representation interface for emotion recognition by converting physiological signals and text into images processed by frozen Vision Transformers. This parameter-efficient approach, which trains only 0.85% of weights, enables multimodal learning without modality-specific encoders. We identify when frozen pretrained features succeed and when they require domain adaptation. We also discuss design considerations for balancing unified processing with domain-appropriate inductive biases.
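
As a rough illustration of the idea of visual tokenization, the sketch below rasterizes a 1-D signal into a small image and cuts it into the non-overlapping patches a Vision Transformer would embed as tokens. The abstract does not say how signals are rendered, so the line-plot rendering, the function names `signal_to_image` and `patchify`, and the image and patch sizes are all assumptions made for this example. The frozen ViT itself is not included.

```python
import math
from typing import List

Image = List[List[float]]


def signal_to_image(signal: List[float], height: int = 16) -> Image:
    """Rasterize a 1-D signal as a line plot (assumed rendering, not the paper's).

    Each sample becomes one image column with a single lit pixel whose row
    encodes the min-max normalized amplitude (top row = maximum value).
    """
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat signal
    image = [[0.0] * len(signal) for _ in range(height)]
    for col, value in enumerate(signal):
        row = round((hi - value) / span * (height - 1))
        image[row][col] = 1.0
    return image


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch x patch tiles."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Hypothetical input: one period of a sine wave standing in for a physiological trace.
signal = [math.sin(2 * math.pi * t / 32) for t in range(32)]
image = signal_to_image(signal, height=16)   # 16 x 32 image
tokens = patchify(image, patch=8)            # 2 x 4 grid of patches
print(len(tokens), len(tokens[0]))           # prints "8 64"
```

In a full pipeline, each flattened patch would be passed through the ViT's patch-embedding layer and the frozen transformer blocks, with only a small set of added parameters trained on top.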
