Visual Prompt-Agnostic Evolution
Junze Wang · Lei Fan · Dezheng Zhang · Weipeng Jing · Donglin Di · Yang Song · Sidong Liu · Cong Cong
Abstract
Visual Prompt Tuning (VPT) enables effective adaptation of a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A closer layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to a cross-layer mismatch. These issues contribute to slower convergence and degraded final performance. To address these challenges, we propose the Prompt-Agnostic Evolution ($\mathtt{PAE}$) method, which strengthens visual prompt tuning by explicitly modeling the dynamics of learnable prompts. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we further employ a shared Koopman operator, which imposes a global linear transformation rather than uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments demonstrate that applying $\mathtt{PAE}$ to VPT variants not only accelerates convergence, with an average 1.41$\times$ speedup, but also yields 1–3% gains on 25 datasets spanning multiple downstream tasks. Beyond performance, $\mathtt{PAE}$ remains prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes, providing a practical and scalable solution for advancing prompt tuning.
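The abstract describes two mechanisms that a small sketch can make concrete: a shared Koopman operator that evolves prompts linearly across layers, and a Lyapunov-inspired penalty that limits how much prompt energy can grow with depth. The PyTorch sketch below is a minimal illustration of one possible reading; the class name `SharedKoopmanPromptEvolver`, the tensor shapes, and the exact penalty form are assumptions for illustration and do not reproduce the authors' $\mathtt{PAE}$ implementation.

```python
# Illustrative sketch only: module names, shapes, and the loss form are
# assumptions based on the abstract, not the authors' released code.
import torch
import torch.nn as nn


class SharedKoopmanPromptEvolver(nn.Module):
    """Evolves layer-0 prompts through depth with one shared linear operator K,
    so per-layer prompts are coordinated rather than learned independently."""

    def __init__(self, num_prompts: int, dim: int, depth: int):
        super().__init__()
        self.depth = depth
        # Learnable initial prompts (the abstract suggests a task-aware,
        # frequency-informed initialization; a plain random init is used here).
        self.p0 = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Shared Koopman-style operator applied identically at every layer.
        self.K = nn.Parameter(torch.eye(dim))

    def forward(self):
        prompts, p = [], self.p0
        for _ in range(self.depth):
            prompts.append(p)
            p = p @ self.K  # global linear evolution across layers
        return prompts  # list of (num_prompts, dim) tensors, one per ViT layer

    def stability_penalty(self):
        """Lyapunov-inspired regularizer (assumed form): penalize growth of
        prompt energy from one layer to the next to limit error amplification."""
        penalty, p = 0.0, self.p0
        for _ in range(self.depth - 1):
            p_next = p @ self.K
            penalty = penalty + torch.relu(p_next.norm() - p.norm()) ** 2
            p = p_next
        return penalty / (self.depth - 1)


# Usage: add the penalty to the task loss when tuning prompts on a frozen ViT.
evolver = SharedKoopmanPromptEvolver(num_prompts=10, dim=768, depth=12)
layer_prompts = evolver()                      # prepend to each layer's tokens
reg_loss = 1e-3 * evolver.stability_penalty()  # weight is an arbitrary choice
```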