In-batch Ensemble Drafting: Robust Speculative Decoding for LVLMs
Abstract
Despite the success of Speculative Decoding (SD) in LLM inference acceleration, it remains largely unexplored for Large Vision Language Models (LVLMs), an advanced class of LLMs that handle multimodal prompts consisting of text and image tokens. To bridge this gap, we first conduct a comprehensive benchmarking study focusing on the effectiveness of various drafting methods. We observe that each drafting method has its own advantages, and none consistently outperforms the others. Motivated by this observation, we propose In-batch Ensemble Drafting (IbED), a simple yet effective SD method for LVLMs. IbED leverages multiple drafting methods via batch inference without incurring much additional latency and, compared to multimodal drafting, consistently improves block efficiency by 6% on average (up to 23%) across a wide range of datasets.
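The sketch below illustrates the core idea described in the abstract: several drafting methods propose candidate continuations for the same prefix (in practice sharing one batched forward pass), and the target model's verification step accepts the longest agreeing candidate. This is a minimal, simplified illustration; the function names (`draft_batched`, `verify`), the greedy acceptance rule, and the toy models are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch of in-batch ensemble drafting (illustrative only; names and
# the toy models below are assumptions, not the paper's implementation).

from typing import Callable, List, Sequence

Token = int

def draft_batched(drafters: Sequence[Callable[[List[Token], int], List[Token]]],
                  prefix: List[Token], k: int) -> List[List[Token]]:
    """Run every drafting method on the same prefix.

    In a real system the drafters would share one batched forward pass on the
    accelerator, so the ensemble adds little latency; here we loop for clarity.
    """
    return [drafter(prefix, k) for drafter in drafters]

def verify(target: Callable[[List[Token]], Token],
           prefix: List[Token], candidates: List[List[Token]]) -> List[Token]:
    """Accept the longest candidate prefix matching the target model's own
    next-token choices (speculative decoding acceptance, simplified to greedy
    decoding), then append one token from the target to guarantee progress."""
    best: List[Token] = []
    for cand in candidates:
        accepted: List[Token] = []
        ctx = list(prefix)
        for tok in cand:
            if target(ctx) == tok:
                accepted.append(tok)
                ctx.append(tok)
            else:
                break
        if len(accepted) > len(best):
            best = accepted
    return best + [target(prefix + best)]

if __name__ == "__main__":
    # Toy target model: next token is (last token + 1) mod 50.
    target = lambda ctx: (ctx[-1] + 1) % 50
    # Two toy drafters: one mimics the target, one is noisier.
    good = lambda ctx, k: [(ctx[-1] + i + 1) % 50 for i in range(k)]
    noisy = lambda ctx, k: [(ctx[-1] + 2 * (i + 1)) % 50 for i in range(k)]

    prefix = [3]
    candidates = draft_batched([good, noisy], prefix, k=4)
    print(prefix + verify(target, prefix, candidates))  # -> [3, 4, 5, 6, 7, 8]
```

In this toy run, the noisier drafter contributes nothing, while the better drafter's entire block is accepted; the ensemble thus inherits the best behavior available per step, which is the intuition behind the block-efficiency gains reported in the abstract.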