

Poster

Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models

Yingtao Zhang · Haoli Bai · Haokun Lin · Jialin Zhao · LU HOU · Carlo Vittorio Cannistraci

Halle B #225
Fri 10 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

With the rapid growth of large language models (LLMs), their demands on memory and computation are increasing. Recent efforts on post-training pruning of LLMs aim to reduce the model size and computation requirements, yet the performance is still sub-optimal. In this paper, we present a plug-and-play solution for post-training pruning of LLMs. The proposed solution has two innovative components: 1) Relative Importance and Activations (RIA), a new pruning metric that jointly and efficiently considers the weights and activations of LLMs, and 2) Channel Permutation, a new approach that maximally preserves important weights under N:M sparsity. The two proposed components can be readily combined to further enhance the N:M semi-structured pruning of LLMs. Our empirical experiments show that RIA alone can already surpass all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA models ranging from 7B to 65B parameters. Furthermore, N:M semi-structured pruning with channel permutation can even outperform the original LLaMA2-70B on zero-shot tasks, together with practical speed-ups on specific hardware. Our code is available at: https://github.com/biomedical-cybernetics/Relative-importance-and-activation-pruning
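The abstract gives only a high-level description of the two components; the exact RIA formula and the channel-permutation procedure are detailed in the paper and the linked repository. As an illustrative sketch only, the snippet below shows how a generic weight-times-activation importance score could be turned into a 2:4 semi-structured mask. The scoring function `importance_score`, the helper `nm_mask`, and the square-root rescaling are assumptions made for illustration, not the authors' exact method.

```python
import torch

def importance_score(W: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    """Illustrative weight-and-activation score (an assumption, not the exact RIA metric).

    W:        (out_features, in_features) weight matrix of a linear layer
    act_norm: (in_features,) per-input-channel activation norm from calibration data
    """
    # Weight magnitude relative to its row and column sums ...
    row_sum = W.abs().sum(dim=1, keepdim=True)   # (out, 1)
    col_sum = W.abs().sum(dim=0, keepdim=True)   # (1, in)
    relative = W.abs() / row_sum + W.abs() / col_sum
    # ... rescaled by the strength of the corresponding input activations.
    return relative * act_norm.sqrt()

def nm_mask(score: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n highest-scoring weights in every group of m consecutive input channels."""
    out_features, in_features = score.shape
    groups = score.reshape(out_features, in_features // m, m)
    keep = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(out_features, in_features)

# Toy usage: prune a random linear layer to 2:4 sparsity.
W = torch.randn(8, 16)
act_norm = torch.rand(16)   # stand-in for calibration-set activation norms
mask = nm_mask(importance_score(W, act_norm))
W_pruned = W * mask
```

Channel Permutation, as described in the abstract, would additionally reorder the input channels of `W` (and `act_norm` consistently) before the N:M mask is computed, so that important weights are spread across groups of m instead of competing within the same group; see the linked repository for the authors' implementation.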
