PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
Abstract
Recent 3D feedforward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, because they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, point cloud reconstruction, and point tracking, all without post-processing. Training a geometry transformer for dynamic scenes from scratch, however, demands large-scale dynamic datasets and substantial computational resources, both of which are often impractical to obtain. To overcome this, we propose an efficient fine-tuning strategy that allows PAGE-4D to generalize to dynamic scenarios using only limited dynamic data and compute. In particular, we design a dynamics-aware aggregator that disentangles dynamic from static content for downstream scene understanding tasks: it first predicts a dynamics-aware mask, which then guides a dynamics-aware global attention mechanism. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. The source code and pretrained model weights are provided in the supplementary material.
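To make the mask-guided attention idea concrete, the sketch below shows one plausible way a predicted per-token dynamics mask could bias global attention so that static and dynamic tokens interact less. This is a minimal illustration under our own assumptions, not the authors' implementation: the module name, the lightweight mask head, and the additive attention bias are all hypothetical choices for exposition.

```python
# Minimal sketch of a dynamics-aware masked global attention block.
# Assumptions (illustrative, not from the paper): the mask is a per-token
# probability of being "dynamic", used as an additive bias that down-weights
# attention between tokens whose dynamics probabilities disagree.
import torch
import torch.nn as nn


class DynamicsAwareAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Lightweight head predicting a per-token dynamics probability (the "mask").
        self.mask_head = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1)
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim) -- tokens gathered across all frames ("global" attention).
        B, N, C = x.shape
        dyn_prob = torch.sigmoid(self.mask_head(x))               # (B, N, 1), 1 = dynamic

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                      # each (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, heads, N, N)

        # Penalize attention across the static/dynamic boundary: the bias grows
        # when query and key disagree on their predicted dynamics probability.
        disagreement = (dyn_prob - dyn_prob.transpose(1, 2)).abs()  # (B, N, N)
        attn = attn - 4.0 * disagreement.unsqueeze(1)               # broadcast over heads

        out = attn.softmax(dim=-1) @ v                              # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out), dyn_prob


if __name__ == "__main__":
    block = DynamicsAwareAttention(dim=64, num_heads=4)
    tokens = torch.randn(2, 128, 64)
    fused, mask = block(tokens)
    print(fused.shape, mask.shape)  # torch.Size([2, 128, 64]) torch.Size([2, 128, 1])
```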