Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing
Abstract
We present the Llamba model series, a family of highly efficient recurrent language models distilled from the Llama-3.x family into the Mamba architecture. The series includes Llamba-1B, Llamba-4B, and Llamba-8B, delivering high inference throughput while maintaining competitive benchmark performance. Beyond its computational advantages, Llamba demonstrates the effectiveness of the MOHAWK distillation framework, achieving strong performance while being distilled with less than 0.1\% of the data typically used to train models of similar size. We also provide an optimized implementation of the Llamba models for deployment on resource-constrained devices, such as smartphones and edge platforms, offering a practical and memory-efficient alternative to traditional Transformer architectures. Overall, these models set new standards for the speed, memory efficiency, and accessibility of language models.