Efficient Multimodal Generation via Redundancy-Aware Mixture-of-Experts
Abstract
Multimodal foundation models are increasingly explored under diverse generation paradigms beyond classic next-token prediction. In this work, we study how autoregressive multimodal generation can be extended efficiently by exploiting latent capacity that pre-trained models already hold in the form of redundant parameters. We address the problem of augmenting pre-trained text-only LLMs with multimodal generative capabilities under two constraints: (C1) preserving the original language generation performance, and (C2) maintaining a small parameter and data budget. Rather than introducing modality-specific modules, we leverage expert redundancy in Mixture-of-Experts (MoE) architectures as a source of latent capacity for learning a new modality. To prevent catastrophic forgetting, we apply Partial Low-Rank Adaptation (PLoRA) exclusively to tokens of the new modality, leaving text pathways unchanged. Through continual multimodal fine-tuning, our approach enables high-fidelity text-to-image generation while preserving the original language performance. Further analysis shows reduced expert redundancy and the emergence of modality-specific and modality-agnostic experts, indicating implicit representation specialization within an autoregressive framework. These results suggest that redundancy-aware MoE models can support data- and parameter-efficient multimodal generation, and they provide insight into how autoregressive objectives can serve as a strong foundation for next-generation multimodal models.
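The idea of applying a low-rank update only to tokens of the new modality can be illustrated with a minimal sketch. It is not the implementation used in this work. It assumes the standard LoRA form y = Wx + s·B(Ax) on a single linear layer, with a per-token boolean mask marking new-modality (image) tokens. The names `plora_linear`, `is_image`, and `scale` are hypothetical, and MoE routing is omitted.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def plora_linear(
    tokens: List[Vector],
    is_image: List[bool],
    W: Matrix,
    A: Matrix,
    B: Matrix,
    scale: float = 1.0,
) -> List[Vector]:
    """Hypothetical sketch: frozen base projection plus a low-rank
    update that is applied only to new-modality tokens.

    W: frozen base weight (d_out x d_in), shared by all tokens.
    A: low-rank down-projection (r x d_in), trainable (assumed).
    B: low-rank up-projection (d_out x r), trainable (assumed).
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(W, x)  # base path: identical for text and image tokens
        if img:
            # Low-rank path, used only when the token belongs to the new modality.
            delta = matvec(B, matvec(A, x))
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: one text token followed by one image token.
W = [[1.0, 0.0], [0.0, 1.0]]   # identity stands in for a frozen weight
A = [[1.0, 1.0]]               # rank r = 1
B = [[0.5], [-0.5]]
out = plora_linear([[1.0, 2.0], [3.0, 4.0]], [False, True], W, A, B)
print(out)  # [[1.0, 2.0], [6.5, 0.5]]
```

In this sketch the text token's output equals the frozen layer's output exactly, whatever values A and B take, which is the property that leaves text pathways unchanged. Only the image token receives the low-rank correction.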