TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
Abstract
Despite rapid progress in vision-language models (VLMs), most existing approaches remain English-centric and often rely on undisclosed training data or recipes, which limits their effectiveness and reproducibility in multilingual settings. In this work, we present a systematic empirical study of how best to incorporate multilinguality across training data, encoder choices, and language models. Our results show that high-quality multilingual vision-language data substantially improves cross-lingual generalization, enabling effective transfer both from high-resource to underrepresented languages and in the opposite direction. We further find that initializing from language models with strong multilingual priors is often more effective than initializing from general-purpose language models. Guided by these findings, we design TowerVision, a family of open-source multilingual VLMs built on the multilingual text-only model Tower+. TowerVision-9B achieves competitive performance across a range of multilingual multimodal benchmarks, with particular strength in culturally grounded tasks and multimodal translation. Notably, our models outperform existing approaches trained on substantially larger datasets, as shown on ALM-Bench and Multi30K. Along with the models, we release VisionBlocks, a high-quality, curated vision-language dataset.