FastVGGT: Fast Visual Geometry Transformer
You Shen · Zhipeng Zhang · Yansong Qu · Xiawu Zheng · Jiayi Ji · Shengchuan Zhang · Liujuan Cao
Abstract
Scaling visual geometry transformers to long image sequences poses a significant computational and memory challenge. In this work, we diagnose this issue in the state-of-the-art model VGGT and trace the primary bottleneck to its Global Attention layer. Our analysis reveals a ``token collapse'' phenomenon, where many tokens attend to nearly identical regions, resulting in substantial redundant computation. Motivated by this finding, we propose FastVGGT, a training-free framework that strategically merges these redundant tokens. Instead of merging tokens uniformly, FastVGGT employs a tailored, three-part token partitioning strategy: it preserves initial-frame tokens as a stable global reference, retains salient tokens to maintain fine details, and uses region-based random sampling to ensure spatially balanced coverage. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, on sequences of 1000 images, FastVGGT achieves a 4$\times$ speedup over the original VGGT while simultaneously mitigating error accumulation, demonstrating its efficiency and robustness in long-sequence scenarios. For further details, please visit our project page: https://fastvggt.github.io/.
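To make the three-part partitioning concrete, below is a minimal PyTorch sketch, not the released FastVGGT code: the function name `partition_tokens`, the saliency score, the keep ratio, and the use of contiguous index chunks as a stand-in for spatial regions are all illustrative assumptions.

```python
import torch

def partition_tokens(frame_ids, saliency, num_regions=64,
                     samples_per_region=1, keep_ratio=0.1, seed=0):
    """Hypothetical sketch of a three-part token partitioning for
    training-free token reduction:
      1) keep all tokens from the initial frame as a stable global reference,
      2) keep the most salient tokens to preserve fine details,
      3) randomly sample a few tokens per region for balanced coverage.
    Returns a boolean mask over tokens; unmasked tokens would then be
    merged into their nearest kept token before Global Attention.
    """
    n = frame_ids.shape[0]
    keep = torch.zeros(n, dtype=torch.bool)

    # (1) Initial-frame tokens act as stable anchors.
    keep |= (frame_ids == frame_ids.min())

    # (2) Salient tokens: top-k by an attention-derived score (assumed).
    k = max(1, int(n * keep_ratio))
    keep[saliency.topk(k).indices] = True

    # (3) Region-based random sampling: split the remaining token indices
    # into contiguous chunks (a crude proxy for spatial regions) and keep
    # a few random tokens from each chunk.
    g = torch.Generator().manual_seed(seed)
    rest = torch.nonzero(~keep).squeeze(1)
    for region in rest.chunk(num_regions):
        if region.numel() == 0:
            continue
        idx = torch.randperm(region.numel(), generator=g)[:samples_per_region]
        keep[region[idx]] = True
    return keep

# Example: 4 frames of 256 tokens each, with a random saliency score.
frame_ids = torch.arange(4).repeat_interleave(256)
saliency = torch.rand(frame_ids.shape[0])
mask = partition_tokens(frame_ids, saliency)
print(f"kept {mask.sum().item()} of {mask.numel()} tokens")
```

In a full pipeline, the kept tokens would serve as merge destinations while the remaining tokens are folded into them, reducing the quadratic cost of Global Attention without retraining.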