Poster

Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Eliya Segev ⋅ Maya Alroy ⋅ Ronen Katsir ⋅ Noam Wies ⋅ Ayana Shenhav ⋅ Yael Ben-Oren ⋅ David Zar ⋅ Oren Tadmor ⋅ Jacob Bitterman ⋅ Amnon Shashua ⋅ Tal Rosenwein

2024 Poster


Abstract

Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It learns the alignments between the input and output sequences by marginalizing over the perfect alignments (those that yield the ground truth), at the expense of the imperfect ones. This dichotomy, and in particular the equal treatment of all perfect alignments, results in a lack of controllability over the predicted alignments. This controllability is essential for capturing properties that hold significance in real-world applications. Here we propose Align With Purpose (AWP), a general Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion. We do so by complementing the CTC loss with an additional loss term that prioritizes alignments according to a desired property. AWP does not require any intervention in the CTC loss function, and it makes it possible to differentiate between both perfect and imperfect alignments for a variety of properties. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of the training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: token emission time for latency optimization and word error rate (WER). For the former, we report a latency improvement of up to 590 ms with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% in WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on this scale of data. Notably, our method can be easily implemented using only a few lines of code and can be extended to other alignment-free loss functions and to domains other than ASR.
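
The idea of complementing the CTC loss with a term that prioritizes alignments can be illustrated with a toy sketch. The abstract does not give the exact form of the additional loss, so everything below is an assumption made for illustration: the hinge on alignment log-probabilities, the `margin` and `weight` parameters, the hypothetical `shift_left` property function (which prefers earlier token emission), and all function names. The CTC loss itself is taken as a precomputed number and is left untouched.

```python
def alignment_log_prob(log_probs, alignment):
    """Log-probability of one frame-level alignment.

    log_probs: list of T frames, each a list of per-token log-probabilities.
    alignment: list of T token indices (one per frame).
    """
    return sum(frame[tok] for frame, tok in zip(log_probs, alignment))


def shift_left(alignment, blank=0):
    """Hypothetical property function: emit every token one frame earlier.

    Only applied when the first frame is blank, so no token is dropped.
    """
    if alignment[0] != blank:
        return list(alignment)
    return alignment[1:] + [blank]


def property_hinge_loss(log_probs, sampled, improve, margin=0.0):
    """Assumed form of the extra term: penalize each sampled alignment
    that is scored higher than its 'improved' counterpart."""
    losses = []
    for a in sampled:
        better = improve(a)
        gap = alignment_log_prob(log_probs, a) - alignment_log_prob(log_probs, better)
        losses.append(max(0.0, gap + margin))
    return sum(losses) / len(losses)


def total_loss(ctc_loss, extra_loss, weight=1.0):
    """The CTC loss value is used as-is; the property term is simply added."""
    return ctc_loss + weight * extra_loss
```

In this sketch the property is swapped by passing a different `improve` function, which is one way to read the "Plug-and-Play" claim; how alignments are sampled and how properties are encoded in the actual framework is not stated in the abstract.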
