

Poster in Workshop: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

Efficient Transformers via MPO-Based Low-Rank Factorization and Pruning

Sam Mikhak · Venkata Sai Gummidi · Praneeth Medepalli · Kevin Zhu


Abstract:

We explore the use of matrix product operators (MPOs) to compress transformer-based architectures. By factorizing full-rank weight matrices into tensor-train products, MPOs reduce both memory footprint and computational cost, which is critical for deployment on resource-constrained devices. Our experiments on speaker identification using the LibriSpeech train-clean-360 subset show that MPO-based models, and even their pruned variants, maintain high performance with far fewer parameters than full-rank transformers. We detail the mathematical principles underlying low-rank factorization and unstructured pruning and discuss next steps for extending this approach to more complex tasks such as automatic speech recognition (ASR).
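
As an illustration of the general idea described in the abstract, the sketch below factorizes a dense weight matrix into a two-core MPO (tensor-train) form via truncated SVD and applies unstructured magnitude pruning to the resulting cores. The reshape dimensions, bond rank, and sparsity level are illustrative assumptions and do not reflect the paper's experimental settings or its exact decomposition scheme.

```python
# Minimal sketch: two-core MPO (tensor-train) factorization of a weight matrix
# via truncated SVD, plus unstructured magnitude pruning of the cores.
# Shapes, rank, and sparsity below are illustrative assumptions only.
import numpy as np

def mpo_factorize(W, out_shape, in_shape, rank):
    """Factorize W (d_out x d_in) into two MPO cores with bond dimension `rank`.

    out_shape = (m1, m2) with m1*m2 == d_out; in_shape = (n1, n2) with n1*n2 == d_in.
    """
    m1, m2 = out_shape
    n1, n2 = in_shape
    # Reshape to a 4-way tensor and group the indices as (m1, n1) x (m2, n2).
    T = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    r = min(rank, len(S))
    core1 = U[:, :r].reshape(m1, n1, r)                   # shape (m1, n1, r)
    core2 = (np.diag(S[:r]) @ Vt[:r]).reshape(r, m2, n2)  # shape (r, m2, n2)
    return core1, core2

def mpo_reconstruct(core1, core2):
    """Contract the two cores back into a dense (d_out x d_in) matrix."""
    m1, n1, _ = core1.shape
    _, m2, n2 = core2.shape
    T = np.einsum('air,rbj->abij', core1, core2)          # shape (m1, m2, n1, n2)
    return T.reshape(m1 * m2, n1 * n2)

def magnitude_prune(cores, sparsity=0.5):
    """Unstructured pruning: zero out the smallest-magnitude core entries."""
    flat = np.concatenate([c.ravel() for c in cores])
    threshold = np.quantile(np.abs(flat), sparsity)
    return [np.where(np.abs(c) >= threshold, c, 0.0) for c in cores]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64))                     # toy full-rank weight
    cores = mpo_factorize(W, out_shape=(8, 8), in_shape=(8, 8), rank=16)
    W_lowrank = mpo_reconstruct(*cores)
    W_pruned = mpo_reconstruct(*magnitude_prune(cores, sparsity=0.5))
    print(f"params: full={W.size}, MPO={sum(c.size for c in cores)}")
    print(f"relative error (low-rank): {np.linalg.norm(W - W_lowrank) / np.linalg.norm(W):.3f}")
    print(f"relative error (pruned):   {np.linalg.norm(W - W_pruned) / np.linalg.norm(W):.3f}")
```

In this toy setting, the two cores hold m1*n1*r + r*m2*n2 parameters instead of d_out*d_in, which is where the memory and compute savings come from; pruning then zeroes a fixed fraction of the remaining core entries.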
