Outrageously Large Context Windows via RACE Attention -- A Family of Non-Linear Attention Mechanisms That Can Be Computed in Strictly Linear Time
Sahil Joshi · Agniva Chowdhury · Amar Kanakamedala · Ekam Singh · Evan Tu · Anshumali Shrivastava
Abstract
Softmax Attention has quadratic time complexity in sequence length, which makes it prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2 and FlashAttention-3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a multi-head attention layer once the context exceeds $\sim4$ million tokens on an NVIDIA GH200 (96 GB). We introduce RACE Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding dimension. RACE Attention replaces the exponential kernel with a sharpened angular similarity and approximates attention outputs via randomized projections and soft Locality-Sensitive Hashing (LSH). Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines while reducing wall-clock time and memory. In a controlled scale test, it processes up to 12 million tokens in a single forward-backward pass on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon® Gold 5220R CPU, well beyond the practical limits of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware.
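To make the shape of a linear-time, soft-LSH-style attention computation concrete, the following is a minimal sketch, not the authors' implementation. It assumes SimHash-style signed random projections, softmax-based soft bucket assignments, and illustrative hyperparameters (`num_hashes`, `num_buckets`, `temperature`); causal masking and multi-head structure are omitted.

```python
import torch

def soft_lsh_attention(Q, K, V, num_hashes=8, num_buckets=16, temperature=4.0):
    """Illustrative linear-time attention via soft locality-sensitive hashing.

    Q, K, V: tensors of shape (seq_len, dim). All steps cost
    O(seq_len * num_buckets * dim), i.e. linear in sequence length.
    """
    seq_len, dim = Q.shape
    out = torch.zeros_like(V)
    for _ in range(num_hashes):
        # Random projection defines bucket scores (soft SimHash-style hashing).
        proj = torch.randn(dim, num_buckets, device=Q.device) / dim ** 0.5
        # Soft bucket memberships for keys and queries.
        k_mem = torch.softmax(temperature * (K @ proj), dim=-1)  # (seq_len, buckets)
        q_mem = torch.softmax(temperature * (Q @ proj), dim=-1)  # (seq_len, buckets)
        # Aggregate values and total membership mass per bucket.
        bucket_vals = k_mem.T @ V        # (buckets, dim)
        bucket_mass = k_mem.sum(dim=0)   # (buckets,)
        # Each query reads from its soft buckets, normalized by retrieved mass.
        num = q_mem @ bucket_vals                       # (seq_len, dim)
        den = (q_mem @ bucket_mass).unsqueeze(-1) + 1e-6
        out += num / den
    return out / num_hashes

# Example usage: output has the same shape as V.
Q, K, V = (torch.randn(1024, 64) for _ in range(3))
print(soft_lsh_attention(Q, K, V).shape)  # torch.Size([1024, 64])
```

Keys are summarized into a fixed number of buckets per hash table, so a query attends to bucket-level aggregates rather than to every key, which is what removes the quadratic pairwise score matrix.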