In-Person Poster Presentation / Poster Accept

wav2tok: Deep Sequence Tokenizer for Audio Retrieval

Adhiraj Banerjee · Vipul Arora

MH1-2-3-4 #157

Keywords: [ Unsupervised and Self-supervised learning ] [ audio search ] [ sequence representation learning ] [ music retrieval ]


Abstract:

Search over audio sequences is a fundamental problem. In this paper, we propose a method to extract concise discrete representations of audio that can be used for efficient retrieval. Our motivation comes from orthography, which represents the speech of a given language in a concise and distinct discrete form. The proposed method, wav2tok, learns such representations for any kind of audio, speech or non-speech, from pairs of similar audio. wav2tok compresses the query and target sequences into shorter token sequences that are faster to match. The learning method makes use of the connectionist temporal classification (CTC) loss and the expectation-maximization (EM) algorithm, which are commonly used for supervised automatic speech recognition and for learning discrete latent variables, respectively. Experiments show the consistent performance of wav2tok across two audio retrieval tasks: music search (query by humming) and speech search via audio query, where it outperforms state-of-the-art baselines.
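The retrieval idea in the abstract, compressing frame sequences into short token strings that are cheap to match, can be sketched as follows. This is a hypothetical toy illustration, not the authors' implementation: the codebook, the 1-D "frames", and all function names are assumptions made for the example. Frames are assigned to their nearest codebook token, consecutive repeats are collapsed (in the spirit of CTC decoding), and the resulting short token sequences are compared by edit distance instead of matching the raw frames.

```python
def tokenize(frames, codebook):
    """Assign each frame (here a scalar, for illustration) to the index of
    its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(f - codebook[i]))
            for f in frames]

def collapse(tokens):
    """Collapse runs of repeated tokens, as in CTC-style decoding, so the
    token sequence is much shorter than the frame sequence."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

# Toy 1-D codebook of "acoustic" centroids (an assumption for illustration).
codebook = [0.0, 1.0, 2.0]

query  = [0.1, 0.1, 0.9, 1.1, 1.0, 2.1]   # raw query frames
target = [0.0, 0.2, 1.0, 1.9, 2.0, 2.2]   # raw target frames

q_tok = collapse(tokenize(query, codebook))    # -> [0, 1, 2]
t_tok = collapse(tokenize(target, codebook))   # -> [0, 1, 2]
print(q_tok, t_tok, edit_distance(q_tok, t_tok))  # distance 0: a match
```

Matching the six raw frames would require a frame-level alignment; after tokenization each sequence shrinks to three tokens, so the comparison is over much shorter strings. In wav2tok itself the tokenizer is learned (via the CTC loss and the EM algorithm) rather than fixed by a hand-picked codebook as in this sketch.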
