Poster
in
Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADAPTIVE FOUNDATION MODELS
Attention Is All You Need For Mixture-of-Depths Routing
Advait Gadhikar · Souptik Kumar Majumdar · Niclas Popp · Piyapat Saranrittichai · Martin Rapp · Lukas Schott
Keywords: [ Parameter-free Routing ] [ Mixture of Depths ] [ Attention Routing ]
Abstract:
Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically focus computation on the most relevant parts of the input, thereby enabling large-parameter models to be deployed efficiently during both inference and training. However, conventional MoD models employ additional network layers specifically for routing, which are difficult to train and add complexity to the model. In this paper, we introduce a novel attention-based routing mechanism *A-MoD* that leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing, *A-MoD* allows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pre-trained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to $2$\% higher accuracy on ImageNet compared to standard routing and isoFLOP ViT baselines.
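The sketch below illustrates the general idea of parameter-free, attention-based MoD routing described above: token importance is derived from the preceding layer's attention map, and only the top-scoring tokens are processed by the block while the rest pass through the residual path. It is a minimal illustration under assumed tensor shapes and an assumed score aggregation (mean attention received per token); the helper names `amod_route` and `amod_block` are hypothetical and do not correspond to the authors' reference implementation.

```python
import torch

def amod_route(x, attn_map, capacity_ratio=0.5):
    """Hypothetical attention-based routing helper.

    Assumes `attn_map` is the preceding layer's attention map with shape
    (batch, heads, tokens, tokens) and `x` has shape (batch, tokens, dim).
    """
    # Score each token by the attention it receives, averaged over heads
    # and query positions (one plausible parameter-free aggregation).
    scores = attn_map.mean(dim=1).mean(dim=1)        # (batch, tokens)
    k = max(1, int(capacity_ratio * x.shape[1]))     # number of tokens to process
    topk = scores.topk(k, dim=-1).indices            # (batch, k)
    return topk, scores

def amod_block(x, attn_map, block, capacity_ratio=0.5):
    """Apply `block` (e.g., a transformer layer) only to the routed tokens;
    the remaining tokens skip the block entirely."""
    topk, _ = amod_route(x, attn_map, capacity_ratio)
    batch_idx = torch.arange(x.shape[0], device=x.device).unsqueeze(-1)  # (batch, 1)
    routed = x[batch_idx, topk]                      # (batch, k, dim)
    out = x.clone()
    out[batch_idx, topk] = block(routed)             # process only selected tokens
    return out
```

Because the routing scores are read off an attention map that the model already computes, no extra router parameters need to be trained, which is what allows the mechanism to be retrofitted onto pre-trained transformers.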