
Multi-modal Learning: A Look Back and the Road Ahead

Divyam Madaan · Sumit Chopra · Kyunghyun Cho

2025 Blog Track Poster

Abstract:

Advancements in language models have spurred increasing interest in multi-modal AI — models that process and understand information across multiple forms of data, such as text, images, and audio. While the goal is to emulate the human-like ability to handle diverse information, a key question remains: do human-defined modalities align with machine perception? If not, how does this misalignment affect AI performance? In this blog post, we examine these questions by reflecting on the community's progress in developing multi-modal benchmarks and architectures, and we highlight their limitations. By reevaluating our definitions and assumptions, we propose ways to better handle multi-modal data by building models that analyze and combine modality contributions both independently and jointly with other modalities.