Poster in Workshop: ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond

Depth Over Specialization in Small Multimodal Transformers

Jakub Mroz ⋅ Henry Ndubuaku

Abstract

Shared encoders are commonly used in large-scale multimodal contrastive models, but it is less clear whether their advantages persist in small, parameter-constrained regimes. We investigate this question through a focused empirical study on a naturally aligned text, image, and speech dataset, training multimodal contrastive models under strict transformer parameter budgets. Across a range of small model configurations, we observe that allocating transformer depth to a single shared encoder often yields better retrieval performance than splitting the same capacity across modality-specific encoders. We further find that merging modality-specific encoders into a shared encoder can substantially reduce transformer parameters while preserving comparable performance on several modality pairs. Finally, in trimodal training, we observe an empirical trade-off in which adding a third modality improves weaker modality pairs (e.g., image–audio) while degrading stronger ones under fixed capacity. These results suggest that, in tightly constrained settings, depth allocated to shared representations can be an effective default for parameter-efficient multimodal learning.
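
To make the parameter-budget comparison concrete, the following is a minimal sketch of the accounting involved. It is not the authors' configuration: the width `d_model = 256`, the feed-forward multiplier of 4, the depths, and all function names are illustrative assumptions, and the per-layer count uses the common approximation of 12 · d_model² weights (attention plus feed-forward, ignoring biases, layer norms, and embeddings).

```python
def layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weight count of one transformer encoder layer.

    Assumption: biases, LayerNorm parameters, and embeddings are ignored.
    """
    attention = 4 * d_model * d_model                 # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model   # up- and down-projection
    return attention + feed_forward


def encoder_params(depth: int, d_model: int) -> int:
    """Approximate transformer weights in a stack of `depth` identical layers."""
    return depth * layer_params(d_model)


D_MODEL = 256        # hypothetical width
N_MODALITIES = 3     # text, image, speech

# Case 1: same budget, different allocation.
# One shared encoder of depth 6 versus three modality-specific encoders of depth 2.
shared_deep = encoder_params(6, D_MODEL)
split_shallow = N_MODALITIES * encoder_params(2, D_MODEL)

# Case 2: merging encoders.
# Three modality-specific encoders of depth 6 versus one shared encoder of depth 6.
separate_deep = N_MODALITIES * encoder_params(6, D_MODEL)
merge_reduction = separate_deep / shared_deep

print(f"shared, depth 6:        {shared_deep:,} weights")
print(f"3 x specific, depth 2:  {split_shallow:,} weights")
print(f"3 x specific, depth 6:  {separate_deep:,} weights")
print(f"reduction from merging: {merge_reduction:.1f}x")
```

Under these assumptions, the first case spends the same number of transformer weights either way, but every input passes through three times as many layers in the shared model. The second case shows why merging reduces transformer parameters by a factor equal to the number of modalities when depth and width are held fixed. Whether retrieval quality is preserved is the empirical question the abstract addresses; this arithmetic says nothing about it.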
