Depth Over Specialization in Small Multimodal Transformers
Abstract
Shared encoders are commonly used in large-scale multimodal contrastive models, but it is less clear whether their advantages persist in small, parameter-constrained regimes. We investigate this question through a focused empirical study on a naturally aligned text, image, and speech dataset, training multimodal contrastive models under strict transformer parameter budgets. Across a range of small model configurations, we observe that allocating transformer depth to a single shared encoder often yields better retrieval performance than splitting the same capacity across modality-specific encoders. We further find that merging modality-specific encoders into a shared encoder can substantially reduce transformer parameters while preserving comparable performance on several modality pairs. Finally, in trimodal training under fixed capacity, we observe an empirical trade-off: adding a third modality improves weaker modality pairs (e.g., image–audio) but degrades stronger ones. These results suggest that, in tightly constrained settings, depth allocated to shared representations can be an effective default for parameter-efficient multimodal learning.
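To make the budget comparison concrete, the following is a minimal sketch of the parameter arithmetic behind "the same capacity": a fixed number of transformer layers is either stacked into one shared encoder or divided evenly among modality-specific encoders. The layer dimensions (`D_MODEL`, `D_FF`), the layer count (`TOTAL_LAYERS`), and the helper names are illustrative assumptions rather than the configurations used in the study, and the count covers only transformer layers, ignoring any modality-specific input embeddings or projection heads.

```python
# Illustrative sketch only: all sizes below are assumed, not taken from the paper.

def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weights and biases in one standard transformer encoder layer."""
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two-layer feed-forward block: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    return attention + feed_forward + layer_norms


def encoder_params(depth: int, d_model: int, d_ff: int) -> int:
    """Parameters in a stack of `depth` identical transformer layers."""
    return depth * transformer_layer_params(d_model, d_ff)


# Hypothetical budget: six layers in total, to be allocated one of two ways.
D_MODEL, D_FF = 256, 1024
TOTAL_LAYERS = 6
MODALITIES = ("text", "image", "speech")

# Option A: one shared encoder receives all of the depth.
shared_depth = TOTAL_LAYERS
shared_total = encoder_params(shared_depth, D_MODEL, D_FF)

# Option B: the same layers are split evenly across modality-specific encoders.
split_depth = TOTAL_LAYERS // len(MODALITIES)
split_params = {m: encoder_params(split_depth, D_MODEL, D_FF) for m in MODALITIES}
split_total = sum(split_params.values())

print(f"shared encoder: depth {shared_depth}, {shared_total:,} parameters")
print(f"split encoders: depth {split_depth} each, {split_total:,} parameters in total")
```

Under these assumptions both allocations spend the same number of transformer parameters, but every input passes through six layers in the shared case and only two in the split case. That difference in effective depth per input is the quantity the comparison above varies.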