

Poster in Workshop: Will Synthetic Data Finally Solve the Data Access Problem?

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Jiarui Zhang · Ollie Liu · Tianyu Yu · Jinyi Hu · Willie Neiswanger


Abstract:

Multimodal large language models (MLLMs) have made rapid progress in recent years, yet they continue to struggle with low-level visual perception (LLVP), particularly the ability to accurately describe the geometric details of an image. In this paper, we first demonstrate this limitation by introducing Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately transcribe 2D geometric information from an image. We then conduct a comprehensive empirical study exploring strategies for improving LLVP performance through the use of synthetic high-fidelity visual description data. Our findings highlight the benefits of certain model architectures and training techniques, including the use of CNN-based visual encoders and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks that they fail to learn from scratch. Lastly, we develop Euclid, a family of models specifically optimized for strong low-level geometric perception. Although trained on synthetic multimodal data, Euclid generalizes well to novel real-world geometric shapes. For instance, Euclid outperforms the best closed-source model on our benchmark by up to 58.56% on certain Geoperception tasks and by 10.65% on average across all tasks.
