Virtual Poster presentation / poster accept

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Minghui HU ⋅ Chuanxia Zheng ⋅ Zuopeng Yang ⋅ Tat-Jen Cham ⋅ Heliang Zheng ⋅ Chaoyue Wang ⋅ Dacheng Tao ⋅ Ponnuthurai Suganthan

Keywords: multi-modal, image generation, image caption, generative models


Abstract

Recently developed discrete diffusion models perform extraordinarily well in generation tasks, especially text-to-image generation, showing great potential for modeling multimodal signals. In this paper, we leverage these properties and present a unified multimodal generation model that can perform text-based, image-based, and even simultaneous vision-language generation using a single model. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified Markov transition matrix and a unified objective. Moreover, we design a multimodal mutual attention module to highlight the inter-modal linkages, which are vital for multimodal generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions in various generation tasks.
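
As a rough illustration of what a single Markov transition matrix over two modalities could look like, the sketch below builds one row-stochastic matrix over a joint vocabulary of image tokens, text tokens, and a shared mask token. The abstract does not give the matrix's form, so everything here is an assumption: the block layout (tokens are resampled only within their own modality or absorbed into the mask), the parameters `alpha` and `gamma`, and the function names `unified_transition_matrix` and `corrupt` are all hypothetical and are not taken from the paper.

```python
import random


def unified_transition_matrix(num_image, num_text, alpha, gamma):
    """Build a row-stochastic matrix Q with Q[i][j] = P(x_t = j | x_{t-1} = i).

    Assumed layout (illustrative only): indices [0, num_image) are image
    tokens, the next num_text indices are text tokens, and the last index
    is a shared absorbing [MASK] token.
    """
    if alpha < 0.0 or gamma < 0.0 or alpha + gamma > 1.0:
        raise ValueError("alpha and gamma must be non-negative and sum to at most 1")
    size = num_image + num_text + 1
    mask = size - 1
    q = [[0.0] * size for _ in range(size)]
    blocks = [(0, num_image), (num_image, num_image + num_text)]
    for start, stop in blocks:
        # Probability mass left after "keep" and "mask" is spread uniformly
        # over the tokens of the same modality only.
        beta = (1.0 - alpha - gamma) / (stop - start)
        for i in range(start, stop):
            for j in range(start, stop):
                q[i][j] = beta
            q[i][i] += alpha      # extra mass for keeping the token unchanged
            q[i][mask] = gamma    # mass for replacing the token with [MASK]
    q[mask][mask] = 1.0           # [MASK] is absorbing
    return q


def corrupt(tokens, q, rng):
    """Apply one forward (noising) step to a mixed image/text token sequence."""
    states = range(len(q))
    return [rng.choices(states, weights=q[tok])[0] for tok in tokens]


if __name__ == "__main__":
    q = unified_transition_matrix(num_image=3, num_text=2, alpha=0.8, gamma=0.1)
    rng = random.Random(0)
    # Tokens 0-2 are image tokens, 3-4 are text tokens, 5 is [MASK].
    print(corrupt([0, 1, 2, 3, 4], q, rng))
```

Because the off-diagonal blocks are zero in this sketch, an image token can never turn into a text token during noising (and vice versa), while both modalities share the same mask state and the same schedule parameters.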
