

Poster in Workshop: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

A UNIFIED FRAMEWORK FOR SHAPE PRESERVING COMPRESSION OF LARGE LANGUAGE MODELS

Lawrence Liu · Inesh Chakrabarti · Yixiao Li · Mengdi Wang · Tuo Zhao · Lin Yang


Abstract:

Large language models (LLMs) exhibit remarkable performance across a wide range of natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWA (Normalized Weight and Activation Compression), a unified framework for zero-shot shape-preserving compression algorithms. We compressed Llama-2 7B/13B/70B and Llama-3 8B models using two popular forms of shape-preserving compression: vector quantization, NoWA-VQ (NoWA for Vector Quantization), and unstructured/structured pruning, NoWA-P (NoWA for Pruning). We found that NoWA-VQ significantly outperforms state-of-the-art zero-shot VQ methods, and that NoWA-P performs competitively against state-of-the-art pruning methods.
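
To illustrate what "shape-preserving" compression means in practice, the following is a minimal sketch of activation-aware unstructured pruning that zeroes out low-importance weights while leaving every tensor's shape unchanged. This is an illustrative example only, not the authors' NoWA algorithm; the per-channel activation scale, the normalization step, and all names here are assumptions for exposition.

```python
# Illustrative sketch of shape-preserving compression (NOT the authors' NoWA method).
# Assumes a per-input-channel activation scale obtained from calibration data.
import torch

def shape_preserving_prune(weight: torch.Tensor,
                           act_scale: torch.Tensor,
                           sparsity: float = 0.5) -> torch.Tensor:
    """Zero out low-importance entries; the returned tensor keeps weight.shape."""
    # Weigh each column by a typical activation magnitude so importance reflects
    # the entry's contribution to the layer output, not just its raw magnitude.
    normalized = weight * act_scale.unsqueeze(0)            # shape: (out, in)
    k = int(sparsity * normalized.numel())
    threshold = normalized.abs().flatten().kthvalue(k).values
    mask = normalized.abs() > threshold                      # same shape as weight
    return weight * mask                                     # shape preserved

# Usage: W has shape (out_features, in_features); act_scale has shape (in_features,)
W = torch.randn(4096, 4096)
act_scale = torch.rand(4096) + 0.5
W_pruned = shape_preserving_prune(W, act_scale, sparsity=0.5)
assert W_pruned.shape == W.shape
```

Because the compressed weight has the same shape as the original, it can be dropped into the existing model without any architectural changes; the same property holds for vector-quantized weights that are dequantized back to the original tensor shape at load time.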
