Poster in Workshop: Modular, Collaborative and Decentralized Deep Learning
ReM: Sparsify and MoEfy Models with Post-Hoc ReLU Modulation
Wenbo Zhang · Xiang Ren
The expanding scale of foundation models poses significant computational challenges for both training and inference. A promising solution is to exploit contextual sparsity by converting monolithic modules into selectively computed ones, such as Mixture of Experts (MoE). However, existing conversion methods rely on predicting the activation sparsity of the original modules, which applies only to MLP modules and suffers from redundancy or performance degradation when the predictions are inaccurate. Moreover, models with non-ReLU activations must either undergo a costly ReLUfication process or accept lower activation sparsity. We propose that, instead of inducing sparsity in the original module and training a router to predict it, sparsity can be created directly by the router; this approach does not depend on specific properties of the main module and supports arbitrary granularity. We introduce ReM (ReLU Modulation), which trains a ReLU-gated modulator that scales the hidden states (or outputs) of the original module to sparsify it. To obtain structured sparsity that enables parallelization, the weights of this modulator can be clustered to convert it into an MoE router. On BERT-base, ReM reduced inference FLOPs by up to 93%, substantially improving upon prior methods, while maintaining comparable accuracy, and achieved these gains with over 99% lower retraining cost than previous methods.
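The following is a minimal PyTorch sketch of the two ideas described above: a ReLU-gated modulator that multiplicatively sparsifies a module's hidden states, and a clustering of the modulator's weights into an MoE-style router. It is an illustration under assumed design choices, not the authors' implementation; all names (ReLUModulatedMLP, cluster_modulator_into_router, d_model, d_hidden, n_experts) and the use of k-means for clustering are assumptions.

```python
import torch
import torch.nn as nn


class ReLUModulatedMLP(nn.Module):
    """MLP whose hidden states are scaled by a ReLU-gated modulator.

    The modulator output passes through ReLU, so many entries are exactly
    zero; multiplying the hidden states by it zeroes the corresponding
    hidden units, creating contextual sparsity regardless of the MLP's own
    activation function (e.g., GELU). This is an illustrative sketch.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()                      # non-ReLU activation is fine
        self.down = nn.Linear(d_hidden, d_model)
        # Lightweight modulator: a single linear layer gated by ReLU.
        self.modulator = nn.Linear(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.relu(self.modulator(x))      # sparse, non-negative scales
        h = self.act(self.up(x)) * gate           # zeroed units need not be computed
        return self.down(h)


def cluster_modulator_into_router(modulator: nn.Linear, n_experts: int):
    """Group hidden units by k-means over the modulator's weight rows.

    The assignment partitions the hidden dimension into experts, and the
    cluster centroids can initialize an MoE router (assumed k-means variant
    of the structured-sparsity step).
    """
    from sklearn.cluster import KMeans

    weights = modulator.weight.detach().cpu().numpy()     # (d_hidden, d_model)
    km = KMeans(n_clusters=n_experts, n_init=10).fit(weights)
    expert_assignment = torch.as_tensor(km.labels_)        # expert id per hidden unit
    router_init = torch.as_tensor(km.cluster_centers_)     # (n_experts, d_model)
    return expert_assignment, router_init
```

For example, after training a ReLUModulatedMLP, calling cluster_modulator_into_router(mlp.modulator, n_experts=8) would yield a per-unit expert assignment and centroid weights that could seed a router, so that only the experts selected by the router need be computed at inference.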