Poster in the Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression
Boyko Borisov · Xiaozhe Yao · Nezihe Merve Gürel · Ana Klimovic
Abstract:
Sparse Mixture of Experts (SMoE) models have emerged as an efficient architecture for large language models. While recent community efforts have focused on merging multiple models to create SMoEs, deploying these merged models remains challenging due to their substantial memory requirements. In this paper, we present DeltaMoE, a training-free delta compression pipeline that enables efficient deployment of SMoE models through structured sparsity and quantization. Our evaluation shows that DeltaMoE achieves up to a $2.34\times$ compression ratio and a $2.57\times$ throughput improvement. DeltaMoE also scales with the number of experts, making it particularly suitable for large SMoE models.
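As a rough illustration of the delta-compression idea described in the abstract (not the authors' implementation), the sketch below represents each merged expert as a shared base weight plus a sparsified, quantized delta. The 2-out-of-4 structured-sparsity pattern and per-row int8 quantization used here are assumptions for illustration only.

```python
import numpy as np

def compress_delta(expert_w, base_w, group=4, keep=2):
    """Illustrative delta compression: keep the `keep` largest-magnitude
    entries in every `group`-sized block of the delta (structured sparsity),
    then quantize the survivors to int8 with a per-row scale."""
    delta = expert_w - base_w
    blocks = delta.reshape(-1, group)
    # Structured sparsity: zero out all but the top-`keep` entries per block.
    drop_idx = np.argsort(np.abs(blocks), axis=1)[:, :-keep]
    np.put_along_axis(blocks, drop_idx, 0.0, axis=1)
    sparse_delta = blocks.reshape(delta.shape)
    # Symmetric int8 quantization with a per-row scale.
    scale = np.abs(sparse_delta).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0
    q = np.round(sparse_delta / scale).astype(np.int8)
    return q, scale

def reconstruct(base_w, q, scale):
    """Dequantize the delta and add it back onto the shared base weight."""
    return base_w + q.astype(np.float32) * scale

# Toy usage: one shared base matrix and one merged-expert matrix.
rng = np.random.default_rng(0)
base = rng.standard_normal((8, 16)).astype(np.float32)
expert = base + 0.05 * rng.standard_normal((8, 16)).astype(np.float32)
q, scale = compress_delta(expert, base)
approx = reconstruct(base, q, scale)
print("max reconstruction error:", np.abs(approx - expert).max())
```

Because every expert shares the same base weights, only the compressed deltas need to be stored per expert, which is where the memory savings and the scalability with expert count come from.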