Can Vision Models Process Physiological Signals? Exploring Visual Tokenization as a Representation Interface
Westby ⋅ Li Meng ⋅ Anis Yazidi ⋅ Ali Ramezani-Kebrya
Abstract
Multimodal foundation models increasingly use unified representation interfaces, but the tradeoffs between architectural unification and domain-specific tokenization remain unclear. We explore visual tokenization as a representation interface for emotion recognition by converting physiological signals and text into images processed by frozen Vision Transformers. This parameter-efficient approach, which trains only 0.85% of weights, enables multimodal learning without modality-specific encoders. We identify when frozen pretrained features succeed and when they require domain adaptation. We also discuss design considerations for balancing unified processing with domain-appropriate inductive biases.
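As a rough illustration of what "visual tokenization" of a physiological signal could look like, the sketch below folds a 1D signal into a small grayscale image and splits it into non-overlapping patches, the unit a Vision Transformer consumes as tokens. The abstract does not specify the rendering scheme, so the function names (`signal_to_image`, `image_to_patches`), the 16x16 image size, the 4x4 patch size, and the synthetic sine-wave input are all assumptions made for illustration, not the authors' method.

```python
import math


def signal_to_image(signal, size=16):
    """Fold a 1D signal into a size x size grayscale image (values 0..255).

    Hypothetical rendering: nearest-neighbor resampling to size*size samples,
    min-max scaling to 8-bit intensities, then row-major reshaping.
    """
    n = size * size
    resampled = [signal[min(len(signal) - 1, i * len(signal) // n)] for i in range(n)]
    lo, hi = min(resampled), max(resampled)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    pixels = [round(255 * (v - lo) / span) for v in resampled]
    return [pixels[r * size:(r + 1) * size] for r in range(size)]


def image_to_patches(image, patch=4):
    """Split a square image into flattened, non-overlapping patch x patch tokens."""
    size = len(image)
    patches = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            patches.append(
                [image[top + dr][left + dc] for dr in range(patch) for dc in range(patch)]
            )
    return patches


# Synthetic stand-in for a physiological recording (assumption: 512 samples).
signal = [math.sin(2 * math.pi * i / 64) for i in range(512)]
image = signal_to_image(signal)    # 16 rows of 16 pixel intensities
patches = image_to_patches(image)  # 16 tokens, each with 16 pixel values
```

In a setup like the one the abstract describes, such patches would be fed to a frozen pretrained Vision Transformer, with only a small set of added parameters trained on top.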