UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
Abstract
Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or on fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework that reconstructs a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint, consistent estimation of 3D geometry, 3D motion, and camera pose in a single feedforward pass. Our core insight is that differentiably rendering different signals from a single, holistic representation yields significant advantages at training time, in the form of a self-supervised image-synthesis loss and tightly coupled motion and depth losses. This approach mitigates data scarcity, allowing UFO-4D to jointly estimate geometry, motion, and camera pose while outperforming prior work by up to a factor of three. The explicit 4D representation also enables high-fidelity spatio-temporal interpolation.
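The training signal the abstract describes (rendering a dynamic Gaussian representation and supervising it with the input images themselves) can be caricatured with a toy 2D sketch. This is not the paper's method: the rasterizer, the per-Gaussian parameters, and the linear motion model below are all simplified illustrative assumptions, standing in for a differentiable 3D Gaussian Splatting renderer and a learned motion field.

```python
# Toy sketch (NOT the paper's implementation): 2D Gaussians with
# per-Gaussian velocities, rendered at two timestamps and supervised
# by a photometric image-synthesis loss against the two input frames.
import numpy as np

H, W, N = 32, 32, 8
rng = np.random.default_rng(0)

# Parameters a feedforward network would predict (illustrative):
means = rng.uniform(4, 28, size=(N, 2))    # Gaussian centers at t=0
vels = rng.uniform(-1, 1, size=(N, 2))     # stand-in for 3D motion
sigmas = rng.uniform(1.0, 2.0, size=N)     # isotropic scales
colors = rng.uniform(0, 1, size=N)         # grayscale appearance

def render(t, velocities):
    """Additive splat of Gaussians advected to time t (toy rasterizer)."""
    ys, xs = np.mgrid[0:H, 0:W]
    img = np.zeros((H, W))
    for m, v, s, c in zip(means, velocities, sigmas, colors):
        cx, cy = m + t * v                 # advect center by velocity
        img += c * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * s**2))
    return np.clip(img, 0.0, 1.0)

def photometric_loss(pred, target):
    """Self-supervised image-synthesis loss: mean squared error."""
    return float(np.mean((pred - target) ** 2))

# The two input frames play the role of targets at t=0 and t=1.
frame1 = render(1.0, vels)

# Correct motion reproduces the second frame; zeroed motion does not,
# so the rendering loss directly penalizes wrong motion estimates.
loss_true = photometric_loss(render(1.0, vels), frame1)
loss_static = photometric_loss(render(1.0, np.zeros_like(vels)), frame1)
print(loss_true, loss_static)
```

The point of the sketch is the coupling the abstract claims: because geometry, motion, and appearance all feed one differentiable render, a plain photometric loss on the two input views propagates gradients to every component at once, without per-task ground truth.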