RESA: Bringing Back What Sparse Attention Ignores with Residual Estimation
Weihao Yang · Hao Huang · Ningke Li · Shihao Wang · Darong Yang · Yanqi Pan · Wen Xia · Shiyi Li · Xiangyu Zou
Abstract
Large Language Models (LLMs) have gained significant attention. The KV cache, stored to avoid the quadratic complexity of attention, becomes a bottleneck as long-context demands grow. Sparse attention (SA) addresses this by selecting only critical KVs for attention, but it may degrade model quality in less sparse scenarios. To improve quality, rather than selecting more KVs, this paper takes another perspective: estimating the contributions of the remaining KVs, which we call Residual Estimation. We find that attention logits (before softmax) exhibit substantial redundancy due to their inherent low-rank nature. Performing Singular Value Decomposition (SVD) on the logits matrix during prefilling, we find that the principal singular value dominates the spectrum and scales linearly with sequence length. These observations imply that growing the sequence length yields replication-like growth of the logits with significant redundancy. However, performing SVD at every decoding step is impractical due to its heavy cost. To this end, we propose RESA, a training-free framework that compensates SA's output with an estimated low-rank prior of the logits. RESA introduces (i) a Prior Estimator that derives a prior distribution from a typical query as a rank-1 approximation at the end of prefilling, and (ii) an Online Aggregator that fuses the prior with SA at each decoding step via lightweight scaling and merging. We further show that RESA's effectiveness stems from the prior acting as an attention bias for knowledge injection. Extensive experiments show that, without extra overhead, RESA improves model quality by up to 26\% across various tasks under the same KV budget compared to state-of-the-art methods. Moreover, RESA maintains the same quality while reducing the KV budget by up to 33.2\% and improving attention throughput by 1.23$\times$.
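To make the low-rank observation concrete, the sketch below measures how much of the spectrum the principal singular value of the pre-softmax logits holds. It is a diagnostic under our own conventions (the function name and return values are ours, not the paper's); it is meant to be run on real prefill Q and K extracted from a model, where the paper reports the dominant share and the roughly linear growth of $\sigma_1$ with sequence length.

```python
import torch

def spectral_dominance(Q: torch.Tensor, K: torch.Tensor) -> tuple[float, float]:
    """Share of the spectrum held by the principal singular value of the
    pre-softmax logits L = Q K^T / sqrt(d), plus sigma_1 itself.

    Q, K: (n, d) prefill queries/keys for one attention head.
    """
    d = Q.shape[-1]
    logits = (Q @ K.transpose(-2, -1)) / d**0.5   # pre-softmax attention logits
    S = torch.linalg.svdvals(logits)              # singular values, descending
    return (S[0] / S.sum()).item(), S[0].item()
```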
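The following sketch illustrates one plausible realization of RESA's two components. It assumes the "typical query" is the mean prefill query, that the prior branch is query-independent at decode time, and that the two branches are fused with a standard two-block softmax merge; function names, shapes, and the exact fusion rule are our assumptions, not the paper's released implementation.

```python
import torch

def prior_estimator(Q_prefill, K_res, V_res, scale):
    """End of prefilling: build a rank-1 prior over the residual (ignored) KVs.

    Assumptions: the 'typical query' is the mean prefill query, and the prior
    branch is fixed across decode steps. Returns softmax summary statistics
    (max, normalizer, unnormalized output) so the residual KVs themselves
    need not be kept around.
    """
    q_typ = Q_prefill.mean(dim=0)              # (d,) typical query direction
    prior_logits = (K_res @ q_typ) * scale     # (r,) rank-1 estimate of ignored logits
    m_p = prior_logits.max()                   # max-shift for numerical stability
    w = torch.exp(prior_logits - m_p)          # stable exponentials
    return m_p, w.sum(), w @ V_res             # (m_p, Z_p, unnormalized output o_p)

def online_aggregator(q, K_sel, V_sel, m_p, z_p, o_p, scale):
    """Each decode step: fuse exact sparse attention over the selected KVs
    with the prior branch via a two-block softmax merge (scale, then add)."""
    sel_logits = (K_sel @ q) * scale           # (m,) exact logits on selected KVs
    m_s = sel_logits.max()
    w_s = torch.exp(sel_logits - m_s)
    m = torch.maximum(m_s, m_p)                # shared max across both branches
    a_s, a_p = torch.exp(m_s - m), torch.exp(m_p - m)  # lightweight per-branch scaling
    z = a_s * w_s.sum() + a_p * z_p            # merged softmax normalizer
    return (a_s * (w_s @ V_sel) + a_p * o_p) / z       # compensated attention output

# Toy usage with hypothetical sizes (head dim 64, 512 residual / 128 selected KVs):
d, r, m = 64, 512, 128
scale = d ** -0.5
stats = prior_estimator(torch.randn(1024, d), torch.randn(r, d), torch.randn(r, d), scale)
out = online_aggregator(torch.randn(d), torch.randn(m, d), torch.randn(m, d), *stats, scale)
```

The merge reuses the same max-shifted rescaling trick that FlashAttention uses to combine partial softmax blocks, so each decode step adds only O(1) extra work beyond sparse attention, and the residual KVs can be discarded once their summary statistics are stored.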