Uniform Discrete Diffusion with Metric Path for Video Generation
Abstract
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform Discrete diffusion with Metric path (UDM), a simple yet powerful framework that bridges the gap with continuous methods and enables scalable video generation. At its core, UDM formulates video synthesis as iterative refinement over discrete spatio-temporal tokens. It rests on two key designs: a Linearized Metric-Path and a Resolution-dependent Timestep Shifting mechanism. Together, these designs enable UDM to scale efficiently to high-resolution image synthesis and long-duration video generation (up to 32k tokens), while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies multiple tasks, including video interpolation and image-to-video synthesis, within a single model. Extensive experiments on challenging video and image generation benchmarks show that UDM consistently outperforms prior discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.