SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
Jintao Zhang · Haoxu Wang · Kai Jiang · Shuo Yang · Kaiwen Zheng · Haocheng Xi · Ziteng Wang · Hongzhou Zhu · Min Zhao · Ion Stoica · Joseph E Gonzalez · Jun Zhu · Jianfei Chen
Abstract
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to long sequence lengths and the quadratic complexity of attention. Interestingly, we find that attention weights can be decoupled into two matrices: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (**S**parse-**L**inear **A**ttention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying $\mathcal{O}(N^2)$ attention to critical weights, $\mathcal{O}(N)$ attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a $\textbf{20x}$ reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by $\textbf{95}$\% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a $\textbf{13.7x}$ speedup in attention computation and a $\textbf{2.2x}$ end-to-end speedup in video generation on Wan2.1-1.3B.
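To make the critical/marginal/negligible decomposition concrete, the following is a minimal PyTorch sketch of the idea described in the abstract. It is an illustration only, not the paper's fused GPU kernel: the block size, the quantile thresholds `crit_frac` and `neg_frac`, the feature map `phi`, and the final additive fusion of the two branches are all hypothetical choices made for the example.

```python
# Illustrative sketch of sparse-linear attention (NOT the authors' fused kernel).
# Critical blocks get exact O(N^2) attention, the rest is approximated with an
# O(N) kernelized (linear) attention; negligible blocks contribute no sparse work.
import torch
import torch.nn.functional as F

def sla_attention_sketch(q, k, v, block=64, crit_frac=0.05, neg_frac=0.50):
    # q, k, v: (heads, N, d) with N divisible by `block`
    h, n, d = q.shape
    nb = n // block

    # 1) Estimate block-level attention mass from mean-pooled queries/keys.
    qb = q.reshape(h, nb, block, d).mean(2)                    # (h, nb, d)
    kb = k.reshape(h, nb, block, d).mean(2)                    # (h, nb, d)
    score = torch.softmax(qb @ kb.transpose(-1, -2) / d**0.5, dim=-1)

    # 2) Classify blocks: top crit_frac -> critical (exact path),
    #    bottom neg_frac -> negligible (skipped), the rest -> marginal (linear path).
    flat = score.flatten(-2)
    hi = torch.quantile(flat, 1 - crit_frac, dim=-1, keepdim=True).unsqueeze(-1)
    critical = score >= hi                                     # (h, nb, nb) bool

    # 3) Sparse path: exact attention restricted to critical blocks
    #    (emulated with a block mask here; a real kernel skips the masked work).
    mask = critical.repeat_interleave(block, 1).repeat_interleave(block, 2)
    logits = (q @ k.transpose(-1, -2)) / d**0.5
    logits = logits.masked_fill(~mask, float("-inf"))
    sparse_out = torch.nan_to_num(torch.softmax(logits, dim=-1)) @ v

    # 4) Linear path: O(N) kernelized attention covering the marginal region,
    #    approximated here by applying it globally with a positive feature map.
    phi = lambda x: F.elu(x) + 1
    qf, kf = phi(q), phi(k)
    kv = kf.transpose(-1, -2) @ v                              # (h, d, d)
    z = qf @ kf.sum(1, keepdim=True).transpose(-1, -2)         # (h, N, 1)
    linear_out = (qf @ kv) / (z + 1e-6)

    # 5) Fuse the two branches (the paper performs this inside one GPU kernel
    #    and fine-tunes the model with both forward and backward support).
    return sparse_out + linear_out
```

In this sketch the block-level scores stand in for the paper's classification of attention weights; the actual method fuses both branches in a single kernel and fine-tunes the DiT so that the cheap linear branch absorbs the low-rank remainder of the attention map.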