Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front
Alessandro Pierro · Steven Abreu · Jonathan Timcheck · Philipp Stratmann · Sumit Shrestha
Abstract:
Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and constant time per token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. In this paper, we investigate the effectiveness of unstructured sparsity, in both weights and activations, at reducing the computational demand of linear RNNs, as well as its combination with quantization. We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, with $2\times$ less compute and $36\%$ less memory at iso-accuracy, and that quantizing a sparse-and-wide network leads to lower performance degradation than quantizing a dense one. When quantized to fixed-point arithmetic and deployed on the Intel Loihi 2 neuromorphic chip, sparse models demonstrate $42\times$ lower latency and $149\times$ lower energy consumption compared to an iso-accuracy dense model on an edge GPU, providing hardware validation of the theoretical gains of unstructured sparsity.
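To make the combination of unstructured sparsity and fixed-point quantization concrete, here is a minimal sketch on a toy linear RNN. It is not the authors' code or training pipeline: the magnitude-pruning rule, the activation threshold, the 8-bit fixed-point format, and all layer sizes are illustrative assumptions chosen only to show where weight sparsity, activation sparsity, and quantization enter the recurrence.

```python
# Minimal sketch (not the paper's implementation): unstructured magnitude pruning,
# activation thresholding, and 8-bit fixed-point quantization on a toy linear RNN.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, T = 16, 64, 8, 32  # illustrative sizes

# Dense parameters of a linear RNN: h_t = A h_{t-1} + B x_t, y_t = C h_t
A = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
B = rng.normal(scale=0.1, size=(d_hidden, d_in))
C = rng.normal(scale=0.1, size=(d_out, d_hidden))

def prune_unstructured(W, sparsity):
    """Zero the smallest-magnitude entries so a `sparsity` fraction is exactly zero."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= threshold, W, 0.0)

def quantize_fixed_point(W, frac_bits=7):
    """Round to a signed 8-bit fixed-point grid (1 sign bit + frac_bits fractional bits)."""
    scale = 2.0 ** frac_bits
    return np.clip(np.round(W * scale), -128, 127) / scale

# Sparse-and-quantized variant: high unstructured weight sparsity, then quantization.
A_sq = quantize_fixed_point(prune_unstructured(A, sparsity=0.9))
B_sq = quantize_fixed_point(prune_unstructured(B, sparsity=0.9))

def run_rnn(A, B, C, x, act_threshold=0.0):
    """Roll out the linear RNN; thresholding the state induces activation sparsity."""
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        h = np.where(np.abs(h) > act_threshold, h, 0.0)  # sparse activations
        ys.append(C @ h)
    return np.stack(ys)

x = rng.normal(size=(T, d_in))
y_dense = run_rnn(A, B, C, x)
y_sparse = run_rnn(A_sq, B_sq, C, x, act_threshold=0.01)
print("recurrent weight sparsity:", float(np.mean(A_sq == 0)))
print("mean output deviation:   ", float(np.mean(np.abs(y_dense - y_sparse))))
```

In this toy rollout, zeroed weights and thresholded states are what a sparsity-aware accelerator such as Loihi 2 can skip entirely, which is the source of the compute and memory savings the abstract quantifies; the actual models, sparsity schedules, and deployment details are described in the paper.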