Poster in Workshop: Deep Generative Model in Machine Learning: Theory, Principle and Efficacy

Multi-view Geometry-Aware Diffusion Transformer for Indoor Novel View Synthesis

Xueyang Kang · Zhengkang Xiang · Zezheng Zhang · Kourosh Khoshelham

Keywords: [ Novel View Synthesis ] [ Epipolar Attention ] [ Multiview ] [ Inpainting ] [ Diffusion Transformer ] [ Plücker Raymap ] [ Geometry ]


Abstract:

Recent progress in novel view synthesis of indoor scenes using diffusion models has attracted significant attention, particularly for generating views at desired target poses from a single source image. Existing methods can produce plausible views near the input view, but they often fail to extrapolate views far beyond the input perspective. Moreover, achieving a multiview-consistent diffusion model typically requires training computationally heavy 3D priors, which limits scalability to long-range generation. In this paper, we present a transformer-based latent diffusion model that leverages view-geometry constraints: the input-view feature map is explicitly warped to the target view to serve as the denoising target, and the model is conditioned on a combination of an epipolar-weighted source-image feature map, a Plücker raymap, and camera poses. This approach enables single-shot extrapolation of semantically and geometrically consistent novel views over long-range trajectories. We evaluate our model on two indoor datasets, ScanNet and RealEstate10K, using a diverse set of metrics for view quality and consistency. Experimental results demonstrate the superiority of our approach over existing models and showcase its potential for scalable, semantically and geometrically consistent novel view synthesis in video generation applications.
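The abstract gives no implementation details, but the Plücker raymap conditioning it mentions is a standard 6-channel ray encoding: each pixel's camera ray with origin o and unit direction d is represented by the pair (d, o × d), which identifies the ray independently of which point on it is chosen as the origin. Below is a minimal sketch of how such a map can be built for a pinhole camera; the function name, the camera-to-world pose convention, and the tensor layout are illustrative assumptions, not taken from the paper.

```python
import torch

def plucker_raymap(K: torch.Tensor, c2w: torch.Tensor,
                   height: int, width: int) -> torch.Tensor:
    """Build a 6-channel Plucker ray map for a pinhole camera (illustrative).

    K:   (3, 3) camera intrinsics
    c2w: (4, 4) camera-to-world pose
    Returns a (6, height, width) tensor holding (direction, moment) per pixel.
    """
    # Pixel grid sampled at pixel centers, in (row, col) order
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)  # (H, W, 3)

    # Back-project pixels to camera-frame directions, rotate to world frame
    dirs = pix @ torch.linalg.inv(K).T                     # K^{-1} p per pixel
    dirs = dirs @ c2w[:3, :3].T                            # rotate by R
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)          # unit directions d

    # All rays share the camera center as origin; the moment is m = o x d
    origin = c2w[:3, 3].expand_as(dirs)                    # (H, W, 3)
    moment = torch.cross(origin, dirs, dim=-1)             # (H, W, 3)

    # Stack (d, m) into a 6-channel map, channels first
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)
```

The (d, o × d) encoding is a natural per-pixel conditioning signal because, unlike a raw camera pose, it varies spatially across the image, so a diffusion transformer can attend to the viewing geometry at pixel granularity.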
