Poster in Workshop: ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond

INDEX-PRESERVING LIGHTWEIGHT TOKEN PRUNING FOR EFFICIENT DOCUMENT UNDERSTANDING IN VISION-LANGUAGE MODELS

Jaemin Son ⋅ ⋅ Inyong Yun


Abstract

Recent progress in vision-language models (VLMs) has led to strong accuracy on document understanding tasks such as parsing and key information extraction, but processing high-resolution document images remains computationally expensive. We propose a lightweight pre-encoder token pruning framework that removes non-informative background patches using a binary text-region classifier with a max-pooling refinement step. The framework preserves token indices to maintain the spatial correspondence required for layout-sensitive recognition. Experiments on real-world document benchmarks show a 40–60% reduction in FLOPs while maintaining comparable accuracy.
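As a rough illustration of the idea described in the abstract, the sketch below prunes patch tokens using a binary text-region mask, applies a max-pooling refinement, and returns the original patch indices alongside the surviving tokens. It is not the authors' implementation: the function names `max_pool_refine` and `prune_tokens`, the 3×3 kernel, the stride-1 "same" padding, and the row-major index layout are all assumptions made for the example, and the classifier that produces the mask is taken as given.

```python
def max_pool_refine(mask, k=3):
    """Apply a k x k max-pool (stride 1, 'same' padding) to a binary patch mask.

    Assumption: the refinement step keeps any patch whose neighbourhood
    contains a predicted text patch. The kernel size is hypothetical.
    """
    h, w = len(mask), len(mask[0])
    r = k // 2
    refined = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Maximum over the window, clipped at the grid borders.
            refined[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            )
    return refined


def prune_tokens(patches, mask, k=3):
    """Drop background patches and keep the original indices of the rest.

    patches: flat list of patch tokens in row-major order.
    mask:    2D list of 0/1 text-region predictions, one per patch.
    Returns (kept_tokens, kept_indices); kept_indices are positions in the
    unpruned grid, so position information can still be looked up by the
    original index rather than by position in the shortened sequence.
    """
    refined = max_pool_refine(mask, k)
    w = len(refined[0])
    kept_indices = [
        i * w + j
        for i, row in enumerate(refined)
        for j, keep in enumerate(row)
        if keep
    ]
    kept_tokens = [patches[idx] for idx in kept_indices]
    return kept_tokens, kept_indices


# Toy 4 x 4 grid with a single predicted text patch in the top-left corner.
patches = [f"p{n}" for n in range(16)]
mask = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
tokens, indices = prune_tokens(patches, mask)
print(indices)  # [0, 1, 4, 5]
print(tokens)   # ['p0', 'p1', 'p4', 'p5']
```

In this toy case 4 of 16 patches survive, and each one carries its index in the original grid instead of being renumbered 0 to 3. That is the property the abstract refers to as preserving spatial correspondence for layout-sensitive recognition.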
