

Oral in Workshop: Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions

Unveiling and Mitigating Short-Cut Learning in Multimodal In-Context Learning

Yanshu Li

Keywords: [ In-context Learning ] [ Multimodal ]


Abstract:

The performance of Large Vision-Language Models (LVLMs) in In-Context Learning (ICL) is heavily influenced by the quality of ICL sequences, particularly in tasks requiring cross-modal reasoning and open-ended generation. To address this challenge, we interpret multimodal ICL from the novel perspective of task mapping. We systematically model the local and global relationships within in-context demonstrations (ICDs) and show the central role their coherence plays in enhancing LVLM performance. Motivated by these findings, we propose Ta-ICL, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By integrating task mapping into the autoregressive process, Ta-ICL achieves bidirectional enhancement between sequence configuration and task reasoning. Through extensive experiments, we demonstrate that Ta-ICL effectively improves multimodal ICL across various LVLMs and tasks. Our results highlight the potential of task mapping to be widely applied in enhancing multimodal reasoning, paving the way for robust and generalizable multimodal ICL frameworks.
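To make the idea of task-aware attention for sequence configuration concrete, below is a minimal sketch of how candidate ICDs might be scored against a query under a learned task embedding. This is an illustration under stated assumptions, not the authors' implementation: the module name `TaskAwareScorer`, the task-embedding table, the feature dimensions, and the top-k selection step are all hypothetical choices for exposition.

```python
import torch
import torch.nn as nn

class TaskAwareScorer(nn.Module):
    """Illustrative (hypothetical) scorer: ranks candidate in-context
    demonstrations (ICDs) against a query using attention conditioned
    on a learned task embedding."""

    def __init__(self, dim: int = 512, num_tasks: int = 8):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, dim)  # assumed task vocabulary
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, query_feat, cand_feats, task_id):
        # query_feat: (B, dim)    fused image+text feature of the query
        # cand_feats: (B, N, dim) features of N candidate demonstrations
        # task_id:    (B,)        index of the inferred task
        task = self.task_embed(task_id).unsqueeze(1)        # (B, 1, dim)
        q = query_feat.unsqueeze(1) + task                  # task-conditioned query
        attended, _ = self.attn(q, cand_feats, cand_feats)  # (B, 1, dim)
        scores = self.score_head(cand_feats + attended)     # (B, N, 1)
        return scores.squeeze(-1)                           # per-candidate relevance

# Usage: rank candidates and keep the top-k as the ICL sequence
scorer = TaskAwareScorer()
q = torch.randn(2, 512)
cands = torch.randn(2, 16, 512)
task_id = torch.tensor([0, 3])
selected = scorer(q, cands, task_id).topk(4, dim=-1).indices  # 4 chosen ICDs per query
```

In this sketch the task embedding stands in for the task-mapping signal; the paper's actual model integrates that signal into an autoregressive configuration process rather than a single scoring pass.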
