SparseD: Sparse Attention for Diffusion Language Models
Zeqing Wang · Gongfan Fang · Xinyin Ma · Xingyi Yang · Xinchao Wang
Abstract
While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to attention's quadratic complexity with respect to context length, since all query–key pairs are computed. Intuitively, this complexity can be reduced by restricting computation to sparse attention patterns that retain only the most important query–key pairs. Such methods are widely used in ARs, where the attention mechanism exhibits clear and fixed sparse patterns. In DLMs, our analysis also reveals the presence of sparse patterns and further highlights three unique observations: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render well-studied fixed sparse attention methods from ARs largely incompatible with DLMs: their fixed patterns fail to capture the head-specific patterns in DLMs, and applying sparse attention in the early steps degrades generation quality. To address these challenges, we propose **SparseD**, a novel sparse attention method for DLMs. Leveraging the observations above, SparseD pre-computes head-specific sparse patterns by selecting the most important query–key pairs once, and reuses them across subsequent denoising steps. This approach captures head-specific patterns without incurring the high latency of recomputing sparse patterns at each denoising step. Meanwhile, SparseD skips sparse attention and uses full attention in the early steps to preserve generation quality. Together, these designs establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps. Anonymous code is available at https://anonymous.4open.science/r/SparseD-8C76/.
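To make the abstract's recipe concrete, below is a minimal, hypothetical sketch of the idea it describes: full attention during the early denoising steps, a one-time selection of the most important query–key pairs per head, and reuse of that pattern for all remaining steps. Names such as `warmup_steps`, `top_frac`, and `select_head_patterns` are illustrative assumptions, not the authors' implementation, which additionally relies on FlashAttention-style kernels for efficiency.

```python
# Hypothetical sketch of the SparseD procedure described above (not the authors' code).
import torch

def full_attention(q, k, v):
    # q, k, v: [batch, heads, seq, dim]; returns outputs and raw attention scores.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v, scores

def select_head_patterns(scores, top_frac=0.1):
    # For each head and each query, keep the top `top_frac` fraction of keys by score.
    # Returns a boolean mask [batch, heads, seq, seq]; this is the head-specific pattern.
    seq = scores.shape[-1]
    k_keep = max(1, int(top_frac * seq))
    idx = scores.topk(k_keep, dim=-1).indices
    return torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, idx, True)

def sparse_attention(q, k, v, mask):
    # Dense emulation of sparse attention: masked-out pairs get -inf before softmax.
    # (A real implementation would skip these pairs entirely to gain speed.)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

def denoise_with_sparse_attention(qkv_fn, num_steps=1024, warmup_steps=32, top_frac=0.1):
    # Early (warmup) steps use full attention to protect generation quality; the
    # sparse pattern is computed once at the end of warmup and reused afterwards.
    mask, out = None, None
    for t in range(num_steps):
        q, k, v = qkv_fn(t)
        if t < warmup_steps or mask is None:
            out, scores = full_attention(q, k, v)
            if t == warmup_steps - 1:
                mask = select_head_patterns(scores, top_frac)
        else:
            out = sparse_attention(q, k, v, mask)
        # ... feed `out` into the rest of the DLM denoising update ...
    return out
```

In this sketch, sparsity is emulated with a dense mask, so it illustrates the pattern-selection and reuse logic rather than the actual speedup; the reported gains require kernels that skip the pruned query–key pairs.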