Poster

Synergistic Patch Pruning for Vision Transformer: Unifying Intra- & Inter-Layer Patch Importance

Yuyao Zhang ⋅ Lan Wei ⋅ Nikolaos Freris

2024 Poster

[ Poster] [ OpenReview]

Abstract

The Vision Transformer (ViT) has emerged as a powerful architecture for various computer vision tasks. Nonetheless, this comes with substantially heavier computational costs than Convolutional Neural Networks (CNNs). The attention mechanism in ViTs, which integrates information from different image patches to the class token ([CLS]), renders traditional structured pruning methods used in CNNs unsuitable. To overcome this issue, we propose SynergisTic pAtch pRuning (STAR) that unifies intra-layer and inter-layer patch importance scoring. Specifically, our approach combines a) online evaluation of intra-layer importance for the [CLS] and b) offline evaluation of the inter-layer importance of each patch. The two importance scores are fused by minimizing a weighted average of Kullback-Leibler (KL) Divergences and patches are successively pruned at each layer by maintaining only the top-k most important ones. Unlike prior art that relies on manual selection of the pruning rates at each layer, we propose an automated method for selecting them based on offline-derived metrics. We also propose a variant that uses these rates as weighted percentile parameters (for the layer-wise normalized scores), thus leading to an alternate adaptive rate selection technique that is input-based. Extensive experiments demonstrate the significant acceleration of the inference with minimal performance degradation. For instance, on the ImageNet dataset, the pruned DeiT-Small reaches a throughput of 4,256 images/s, which is over 66\% higher than the much smaller (unpruned) DeiT-Tiny model, while having a substantially higher accuracy (+6.8\% Top-1 and +3.1\% Top-5).

Video

Chat is not available.