TangoFlux: Text to Audio Generation with CLAP-Ranked Preference Optimization
Abstract
We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1kHz audio in 3.7 seconds on a A40 GPU. A key challenge in aligning TTA models lies in creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We show that the audio preference dataset generated using CRPO outperforms the static alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. https://tangoflux56.github.io/TangoFlux/ holds the model-generated audio samples for comparison.