Aurelius: Relation Aware Text-to-Audio Generation At Scale
Abstract
We present Aurelius, a new framework that enables research on relation-aware text-to-audio (TTA) generation at scale. Because suitable audio event and relation corpora are lacking, Aurelius contributes two large-scale corpora: AudioEventSet, an audio event corpus, and AudioRelSet, a relation corpus. AudioEventSet comprises 110 event categories that cover commonly heard audio events as broadly as possible; each event is unique, realistic, and of high quality. AudioRelSet consists of 100 relations, comprehensively covering relations that are present in the physical world or can be neatly described by text. Because the two corpora provide audio events and relations independently, they can be combined with our pair generation strategy to create a massive number of pairs, supporting the investigation of relation-aware TTA at scale. We comprehensively benchmark existing TTA models from both general and relation-aware evaluation perspectives. We further provide an in-depth investigation of scaling up existing TTA models' relation-aware generation, either by training from scratch or by leveraging cross-domain general TTA knowledge. The introduced corpora and the findings of this investigation may facilitate future research on relation-aware TTA generation.
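To illustrate why keeping events and relations independent yields many pairs, the following is a minimal sketch of one way such a combination could work. It is not the pair generation strategy of this work: the function `make_pairs`, the example events, and the relation templates are all hypothetical, and the sketch assumes every relation is binary and applies to every ordered pair of distinct events.

```python
from itertools import permutations


def make_pairs(events, relations):
    """Yield (text, event_a, relation_name, event_b) for every ordered
    pair of distinct events and every relation template."""
    for a, b in permutations(events, 2):
        for name, template in relations.items():
            yield template.format(a=a, b=b), a, name, b


# Hypothetical stand-ins; the actual corpora are not reproduced here.
events = ["dog barking", "door slamming", "rain falling"]
relations = {
    "before": "{a}, followed by {b}",
    "simultaneous": "{a} while {b}",
}

pairs = list(make_pairs(events, relations))
print(len(pairs))   # 3 * 2 ordered event pairs, times 2 relations
print(pairs[0][0])  # text description of the first generated pair
```

Under the same simplifying assumptions, 110 event categories and 100 relations would give 110 × 109 × 100 = 1,199,000 distinct text descriptions, before counting multiple audio clips per category.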