Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression
Abstract
We propose a method for audio-visual knowledge distillation. Existing methods typically distill from latent embeddings or from outputs. The former requires matching feature dimensions, if not identical architectures, between teacher and student models, while the latter supports any teacher-student pairing but tends to be less performant. In contrast, we do not explicitly distill from the latent embeddings or outputs, but from the pairwise relationships between embeddings across samples for each modality; these relationships are realized as a kernel, which is the crux of our method, ``Kernelized Token Distillation (KTD)''. Specifically, we tokenize and embed the input for a given modality and compute the Gram matrix across tokens, from which we distill. As the audio and visual modalities afford different information for a task, we adaptively modulate distillation by measuring the entropy of each modality, leading to an Entropy-Monitored Kernelized Token Distillation (EM-KTD) scheme. Our method allows flexibility in the complexity of the kernel function used to model relationships across tokens, which are selectively distilled to ensure high-fidelity supervision for the student. We evaluate EM-KTD on VGGSound and AVS-Bench, where the student uses 94\% fewer parameters than the teacher while preserving 96.9\% of the teacher's performance on audio-visual event recognition and 96.5\% on audio-visual segmentation.
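To make the abstract's description concrete, the sketch below illustrates the general idea of a kernelized token distillation loss with entropy-based modulation. It is a minimal illustration, not the paper's implementation: the linear (dot-product) kernel, the mean-squared Gram-matrix matching, the specific entropy-based down-weighting of uncertain samples, and all function names (`gram_matrix`, `em_ktd_loss`, etc.) are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F


def gram_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Pairwise Gram matrix over L2-normalized token embeddings.

    tokens: (batch, num_tokens, dim) token embeddings for one modality.
    Returns: (batch, num_tokens, num_tokens) token-token similarities
    (a linear kernel; other kernels could be substituted here).
    """
    tokens = F.normalize(tokens, dim=-1)
    return tokens @ tokens.transpose(-2, -1)


def modality_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample Shannon entropy of a modality's predictive distribution."""
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)  # shape: (batch,)


def em_ktd_loss(student_tokens: torch.Tensor,
                teacher_tokens: torch.Tensor,
                teacher_logits: torch.Tensor) -> torch.Tensor:
    """One possible entropy-modulated kernelized token distillation loss
    for a single modality (an assumed form, not the paper's exact loss).

    Matches student and teacher token Gram matrices, down-weighting samples
    for which the teacher's prediction in this modality has high entropy.
    """
    g_s = gram_matrix(student_tokens)
    g_t = gram_matrix(teacher_tokens).detach()
    per_sample = (g_s - g_t).pow(2).mean(dim=(-2, -1))      # (batch,)

    # Normalize entropy by its maximum, log(num_classes), to get a [0, 1] weight.
    h = modality_entropy(teacher_logits.detach())            # (batch,)
    max_h = torch.log(torch.tensor(float(teacher_logits.shape[-1])))
    weight = 1.0 - h / max_h

    return (weight * per_sample).mean()
```

In use, such a loss would be computed separately for the audio and visual token streams and added to the student's task loss; the entropy-derived weights are what realize the "entropy-monitored" modulation described above.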