INDEX-PRESERVING LIGHTWEIGHT TOKEN PRUNING FOR EFFICIENT DOCUMENT UNDERSTANDING IN VISION-LANGUAGE MODELS
Jaemin Son ⋅ ⋅ Inyong Yun
Abstract
Recent progress in vision-language models (VLMs) has led to strong accuracy on document understanding tasks such as parsing and key information extraction, but processing high-resolution document images remains computationally expensive. We propose a lightweight pre-encoder token pruning framework that removes non-informative background patches using a binary text-region classifier with a max-pooling refinement step. The framework preserves token indices to maintain the spatial correspondence required for layout-sensitive recognition. Experiments on real-world document benchmarks show a 40–60% reduction in FLOPs while maintaining comparable accuracy.
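The pruning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The per-patch text scores, the 0.5 threshold, the 3x3 stride-1 pooling window, and the function names `max_pool_refine` and `prune_tokens` are all assumptions made for illustration. The abstract does not specify the classifier or the pooling configuration. The sketch thresholds the scores into a binary mask, dilates the mask with max pooling so that patches bordering text are retained, and returns the original flat indices of the kept patches so that positional information can still be looked up by each token's original grid position.

```python
from typing import List, Tuple


def max_pool_refine(mask: List[List[int]], kernel: int = 3) -> List[List[int]]:
    """Stride-1 max pooling with 'same' padding over a binary mask.

    On a 0/1 mask this acts as a dilation: a patch is kept if any patch
    inside its window was classified as text. The window size is an assumption.
    """
    rows, cols = len(mask), len(mask[0])
    r = kernel // 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - r), min(rows, i + r + 1))
                for x in range(max(0, j - r), min(cols, j + r + 1))
            )
    return out


def prune_tokens(
    scores: List[List[float]], threshold: float = 0.5, kernel: int = 3
) -> Tuple[List[int], List[List[int]]]:
    """Return (kept_indices, refined_mask) for a grid of text-region scores.

    `scores` stands in for the output of a binary text-region classifier
    (hypothetical here). `kept_indices` are row-major indices into the
    ORIGINAL patch grid, so spatial positions survive the pruning.
    """
    cols = len(scores[0])
    mask = [[1 if s >= threshold else 0 for s in row] for row in scores]
    refined = max_pool_refine(mask, kernel)
    kept = [
        i * cols + j
        for i, row in enumerate(refined)
        for j, keep in enumerate(row)
        if keep
    ]
    return kept, refined


# Toy 4x4 patch grid with a single confident text patch in the top-left corner.
demo_scores = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
demo_kept, demo_mask = prune_tokens(demo_scores)
```

In this toy grid, only the top-left patch and its immediate neighbours survive. Because the kept tokens carry their original indices rather than being renumbered, an encoder could in principle gather position embeddings for exactly those indices, which is the spatial correspondence the abstract refers to.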