Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Daniel Haziza · Timothy Chou · Dhruv Choudhary · Jesse Cai · Luca Wehrstedt · Francisco Massa · Jiecao Yu · Geonhwa Jeong · Supriya Rao · Patrick Labatut
Abstract:
In this paper, we demonstrate how to apply 2:4 sparsity, a hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed-Forward Networks (FFNs) in both the forward and backward passes. We also discuss the benefits of combining 2:4 sparsity with fp8 quantization to maximize efficiency gains. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
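The sketch below illustrates the core idea in the abstract: Squared-ReLU activations are naturally sparse, so pruning each group of four activation values down to its two largest-magnitude entries (the 2:4 pattern supported by GPU sparse tensor cores) can discard little or no information. This is a minimal, illustrative PyTorch example, not the authors' implementation; the function names (`squared_relu`, `prune_2_to_4`) and the toy FFN shapes are assumptions, and the final matmul is left dense where a hardware 2:4-sparse kernel would be used in practice.

```python
# Illustrative sketch only (not the paper's code): enforcing a 2:4 sparsity
# pattern on Squared-ReLU activations inside a toy FFN forward pass.
import torch


def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU (relu(x)**2) produces many exact zeros, which is the
    # intrinsic activation sparsity the paper exploits.
    return torch.relu(x) ** 2


def prune_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every group of 4 along the
    last dimension (assumed divisible by 4); zero out the rest."""
    *lead, d = x.shape
    groups = x.reshape(*lead, d // 4, 4)
    # Indices of the top-2 magnitudes within each group of 4.
    topk = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, topk, True)
    return (groups * mask).reshape(*lead, d)


# Toy FFN: prune the activation after Squared-ReLU to 2:4 before the second
# projection. The matmul below is dense; a 2:4 sparse kernel would replace it.
x = torch.randn(8, 1024)
w1 = torch.randn(1024, 4096)
w2 = torch.randn(4096, 1024)

act = squared_relu(x @ w1)
act_24 = prune_2_to_4(act)   # 2:4-sparse activation tensor
out = act_24 @ w2
```

Because the pruning only removes the smallest-magnitude entries in each group of four, and Squared-ReLU already drives most entries to exactly zero, the 2:4 constraint is often satisfied almost for free, which is why the paper reports no accuracy loss.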