Poster in Workshop: ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond

TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

Andre Viveiros ⋅ Patrick Fernandes ⋅ Saul Santos ⋅ Sonal Sannigrahi ⋅ Emmanouil Zaranis ⋅ Nuno Guerreiro ⋅ Amin Farajian ⋅ Graham Neubig ⋅ Andre Martins


Abstract

Despite rapid progress in vision-language models (VLMs), most existing approaches remain English-centric and often rely on undisclosed training data or recipes, which limits their effectiveness and reproducibility in multilingual settings. In this work, we present a systematic empirical study of how best to incorporate multilinguality across training data, encoder choices, and language models. Our results show that high-quality multilingual vision-language data substantially improves cross-lingual generalization, enabling effective transfer both from high-resource to underrepresented languages and in the opposite direction. We further find that initializing from language models with strong multilingual priors is often more effective than initializing from general-purpose language models. Guided by these findings, we design TowerVision, a family of open-source multilingual VLMs built on the multilingual text-only model Tower+. TowerVision-9B achieves competitive performance across a range of multimodal multilingual benchmarks, with particular strength in culturally grounded tasks and multimodal translation. Notably, our models outperform existing approaches trained on substantially larger datasets, as shown on ALM-Bench and Multi30K. Along with the models, we release VisionBlocks, a high-quality, curated vision-language dataset.
