Diagnosing the Curse: A Scale-Consistent and All-Phase Metric for Modality Bias in MLLMs
Jinlin He ⋅ Chenfei Liao ⋅ Xu Zheng ⋅ Mengyu Jin ⋅ Xuming Hu
Abstract
Quantifying modality bias in multimodal large language models (MLLMs) is key to diagnosing how these models reason across input modalities. However, we find that existing attention-based metrics suffer from two problems: **the scaling paradox** and **failure of the aggregation strategy**. First, as image resolution increases, the number of visual tokens grows quadratically, which mathematically induces a denominator-driven drift in per-token attention metrics and causes standard metrics to spuriously report extreme text dominance. Second, the existing sparse layer-wise aggregation strategy masks the true picture of modality bias and therefore fails to measure it correctly. To resolve both problems, we propose **Depth-wise Stratified Modality Dominance (DSMD)**. By conditioning attention analysis on input token-count quantiles, DSMD decouples reasoning preference from the number of tokens. It further incorporates an accuracy-weighted aggregation to pinpoint the layers that drive correct predictions. Experiments on Qwen2.5-VL with input resolutions from $112^2$ to $896^2$ demonstrate that DSMD eliminates the spurious divergence observed in baselines and correctly reflects the saturation of visual benefit.
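The denominator-driven drift can be illustrated with a small numerical sketch. It is not the paper's metric or the DSMD method: the function names, the 28-pixel effective patch size, the fixed 50/50 split of attention mass, and the 32-token text prompt are all assumptions chosen for illustration. The point is only that when the total attention mass given to each modality is held constant, a per-token ratio still grows with the visual token count.

```python
def visual_token_count(resolution: int, patch: int = 28) -> int:
    """Hypothetical token count: one visual token per patch x patch pixel block."""
    side = resolution // patch
    return side * side


def per_token_text_dominance(vis_mass: float, n_vis: int, n_text: int) -> float:
    """Mean attention per text token divided by mean attention per visual token.

    vis_mass is the total attention mass on visual tokens; the remainder
    (1 - vis_mass) is assumed to fall on text tokens.
    """
    text_mass = 1.0 - vis_mass
    return (text_mass / n_text) / (vis_mass / n_vis)


def drift_table(vis_mass: float = 0.5, n_text: int = 32):
    """Per-token ratio at each resolution while total attention mass stays fixed."""
    rows = []
    for res in (112, 224, 448, 896):
        n_vis = visual_token_count(res)
        rows.append((res, n_vis, per_token_text_dominance(vis_mass, n_vis, n_text)))
    return rows


if __name__ == "__main__":
    for res, n_vis, ratio in drift_table():
        print(f"{res:>4}px  visual tokens={n_vis:>5}  per-token text/visual ratio={ratio:g}")
```

Under these assumptions the model attends to both modalities equally at every resolution, yet the per-token ratio rises from 0.5 at $112^2$ to 32 at $896^2$ purely because the visual token count in the denominator grows.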