Poster

PolyVoice: Language Models for Speech to Speech Translation

Qianqian Dong ⋅ Zhiying Huang ⋅ Qiao Tian ⋅ Chen Xu ⋅ Tom Ko ⋅ Yunlong Zhao ⋅ Siyuan Feng ⋅ Tang Li ⋅ Kexin Wang ⋅ Xuxin Cheng ⋅ Fengpeng Yue ⋅ Ye Bai ⋅ Xi Chen ⋅ Lu Lu ⋅ Zejun Ma ⋅ Yuping Wang ⋅ Mingxuan Wang ⋅ Yuxuan Wang

2024 Poster


Abstract

With the huge success of GPT models in natural language processing, there is growing interest in applying language modeling approaches to speech tasks. Currently, the dominant architecture in speech-to-speech translation (S2ST) remains the encoder-decoder paradigm, creating a need to investigate the impact of language modeling approaches in this area. In this study, we introduce PolyVoice, a language-model-based framework designed for S2ST systems. Our framework comprises three decoder-only language models: a translation language model, a duration language model, and a speech synthesis language model. These language models employ different types of prompts to extract learned information effectively. By utilizing unsupervised semantic units, our framework can transfer semantic information across these models, making it applicable even to unwritten languages. We evaluate our system on Chinese → English and English → Spanish language pairs. Experimental results demonstrate that PolyVoice outperforms the state-of-the-art encoder-decoder model, producing voice-cloned speech with high translation and audio quality. Speech samples are available at https://polyvoice.github.io.
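
The three-model structure described in the abstract can be pictured as a chain of functions over discrete unit sequences. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `S2STPipeline`, `translate`, `predict_durations`, and `synthesize` are hypothetical, the models are replaced by toy stand-in functions, and expanding each target unit by repeating it for its predicted duration is an assumed way of connecting the duration model to the synthesis model.

```python
from dataclasses import dataclass
from typing import Callable, List

Units = List[int]  # a sequence of discrete unit IDs (hypothetical representation)


@dataclass
class S2STPipeline:
    """Illustrative chain of three stages; each callable stands in for one model."""

    translate: Callable[[Units], Units]              # source units -> target units
    predict_durations: Callable[[Units], List[int]]  # target units -> repeats per unit
    synthesize: Callable[[Units, Units], Units]      # (expanded units, speaker prompt) -> output tokens

    def run(self, source_units: Units, speaker_prompt: Units) -> Units:
        # Stage 1: map source-language units to target-language units.
        target_units = self.translate(source_units)

        # Stage 2: predict one duration per target unit.
        durations = self.predict_durations(target_units)
        if len(durations) != len(target_units):
            raise ValueError("expected exactly one duration per target unit")

        # Assumption: durations are applied by repeating each unit.
        expanded = [u for u, d in zip(target_units, durations) for _ in range(d)]

        # Stage 3: generate output tokens conditioned on a speaker prompt.
        return self.synthesize(expanded, speaker_prompt)


# Toy stand-ins so the sketch runs end to end; real systems would use trained models.
toy = S2STPipeline(
    translate=lambda units: [u + 100 for u in units],
    predict_durations=lambda units: [2] * len(units),
    synthesize=lambda units, prompt: list(prompt) + list(units),
)

result = toy.run(source_units=[1, 2], speaker_prompt=[9])
print(result)
```

Passing the stages in as plain callables keeps the sketch independent of any particular model library; only the data flow between the stages is illustrated.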
