$\alpha$-DPO: Robust Preference Alignment for Diffusion Models via $\alpha$ Divergence
Yang Li · Songlin Yang · Wei Wang · Xiaoxuan Han · Jing Dong
Abstract
Diffusion models have demonstrated remarkable success in high-fidelity image generation, yet aligning them with human preferences remains challenging. Direct Preference Optimization (DPO) offers a promising framework, but its effectiveness is critically hindered by noisy data arising from mislabeled preference pairs and inconsistent individual preferences. We theoretically show that existing DPO objectives are equivalent to minimizing the forward Kullback–Leibler (KL) divergence, whose mass-covering nature makes them intrinsically sensitive to such noise. To address this limitation, we propose $\alpha$-DPO, which reformulates preference alignment through the lens of $\alpha$-divergence. This formulation promotes mode-seeking behavior and bounds the influence of outliers, thereby enhancing robustness. Furthermore, we introduce a dynamic scheduling mechanism that adaptively adjusts $\alpha$ according to the observed preference distribution, providing data-aware noise tolerance during training. Extensive experiments on synthetic and real-world datasets validate that $\alpha$-DPO consistently outperforms existing baselines, achieving superior robustness and preference alignment.
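For reference (the paper's exact parameterization may differ), a standard Amari-type $\alpha$-divergence between distributions $p$ and $q$ is
$$D_{\alpha}(p\,\|\,q) \;=\; \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right),$$
which recovers the forward KL divergence $\mathrm{KL}(p\,\|\,q)$ as $\alpha \to 1$ and the reverse, mode-seeking KL divergence $\mathrm{KL}(q\,\|\,p)$ as $\alpha \to 0$; intermediate values of $\alpha$ interpolate between mass-covering and mode-seeking behavior, which is the degree of freedom the abstract's dynamic scheduling of $\alpha$ exploits.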