Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
S2-ATTENTION: HARDWARE-AWARE CONTEXT SHARDING AMONG ATTENTION HEADS
Xihui Lin · Yunan Zhang · Suyu Ge · Liliang Ren · Barun Patra · Vishrav Chaudhary · Hao Peng · Xia Song
Sparse attention, which selectively attends to a subset of tokens in the context, has been an established approach to enhancing the efficiency of Transformers. However, its theoretical reduction in FLOPs has rarely translated into wall-clock speedup over its dense attention counterparts, mainly due to the lack of hardware-level optimizations like FlashAttention (Dao, 2023). Meanwhile, it remains unclear whether sparse attention can maintain model quality at the scale of today's large language models (LLMs), and how this can be achieved. This paper presents Sparsely-Sharded Attention (S2-ATTENTION), an optimized Triton kernel library providing a variety of customizable sparse attention implementations for both training and inference. S2-ATTENTION allows customizing attention patterns at the per-head, per-context-range level. The fresh insights from S2-ATTENTION inspire a novel sparse attention architecture, the Head-Heterogenous Strided Transformer (HHST), that meets several desiderata we find crucial for achieving both practical efficiency gains and strong accuracy on downstream tasks. For higher sparsity, HHST shards the context heterogeneously across attention heads, where each head attends to a different subset of tokens while the heads collectively cover the whole context. We evaluate HHST by pretraining models at the 1.3B and 7B scales. For attention computation, HHST with S2-ATTENTION achieves 8.8× and 15.9× wall-clock attention speedups, as well as 2.8× and 2.5× reductions in training time, compared to a dense attention baseline implemented with FlashAttention-2. Moreover, HHST's downstream task performance is on par with dense attention, and it achieves perfect retrieval accuracy at a 128K context length at the 7B scale. At inference, our 7B HHST achieves a 4.5× speed-up over its dense counterpart in vLLM. S2-ATTENTION is released with easy-to-customize APIs for direct use in Megatron and vLLM.
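
To make the head-heterogeneous sharding idea concrete, below is a minimal PyTorch sketch of per-head strided context sharding: each head keeps a local window plus its own strided shard of earlier tokens, so the heads jointly cover the full context. This is only an illustration of the pattern described in the abstract under assumed parameters; the function names (build_strided_head_masks, sharded_attention) and the local_window setting are hypothetical and do not reflect the S2-ATTENTION Triton API, which skips masked-out blocks in the kernel rather than materializing dense masks.

import torch

def build_strided_head_masks(seq_len: int, num_heads: int, local_window: int = 64):
    # Hypothetical illustration: head h attends to a local causal window plus the
    # keys whose index is congruent to h modulo num_heads, so all heads together
    # cover every position in the context.
    q = torch.arange(seq_len).view(-1, 1)   # query positions
    k = torch.arange(seq_len).view(1, -1)   # key positions
    causal = k <= q
    local = (q - k) < local_window
    masks = []
    for h in range(num_heads):
        shard = (k % num_heads) == h        # this head's strided shard of the context
        masks.append(causal & (local | shard))
    return torch.stack(masks)               # (num_heads, seq_len, seq_len), boolean

def sharded_attention(qry, key, val, masks):
    # Naive masked attention per head; a fused kernel (e.g. in Triton) would skip
    # the masked-out key blocks entirely, which is where the wall-clock gain comes from.
    scores = torch.einsum("bhqd,bhkd->bhqk", qry, key) / key.shape[-1] ** 0.5
    scores = scores.masked_fill(~masks.unsqueeze(0), float("-inf"))
    return torch.einsum("bhqk,bhkd->bhqd", scores.softmax(dim=-1), val)

if __name__ == "__main__":
    B, H, T, D = 1, 8, 256, 64
    qkv = [torch.randn(B, H, T, D) for _ in range(3)]
    out = sharded_attention(*qkv, build_strided_head_masks(T, H))
    print(out.shape)  # torch.Size([1, 8, 256, 64])

The dense mask here is only for readability; the efficiency argument in the abstract rests on a kernel that never computes the masked blocks in the first place.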