Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics
Boxuan Zhang · Weipu Zhang · Zhaohan Feng · Wei Xiao · Jian Sun · Jie Chen · Gang Wang
Abstract
A fundamental challenge in multi-task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit significant heterogeneity in both observations and dynamics. Model-based RL (MBRL) offers a promising path to sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, leading to poor reconstruction and prediction accuracy. We introduce Mixture-of-World Models (MoW), a scalable architecture that integrates three key components: i) modular VAEs for task-adaptive visual compression; ii) a hybrid Transformer-based dynamics model combining task-conditioned experts with a shared backbone; and iii) a gradient-based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, \textbf{a single MoW agent} (trained once over the $26$ Atari games) achieves a mean human-normalized score of $\mathbf{110.4\%}$, competitive with the $\mathbf{114.2\%}$ achieved by the recent STORM, an ensemble of $26$ task-specific models, while requiring $50\%$ fewer parameters. On Meta-World, MoW attains a $\mathbf{74.5\%}$ average success rate within $300$k steps, establishing a new state of the art. These results demonstrate that MoW provides a scalable and parameter-efficient foundation for generalist world models. Our code is available in the supplementary materials.
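To make the hybrid dynamics design concrete, the sketch below illustrates one plausible reading of the abstract: a shared Transformer backbone whose output is routed through one of a small number of task-conditioned expert heads, with a fixed task-to-expert lookup standing in for the gradient-based clustering. All names (\texttt{MoWDynamics}, \texttt{task\_to\_expert}) and hyperparameters are hypothetical illustrations, not the authors' implementation.

\begin{verbatim}
# Minimal sketch of a mixture-of-experts dynamics model with a shared
# Transformer backbone. Class/parameter names are illustrative only;
# this is NOT the paper's released code.
import torch
import torch.nn as nn

class MoWDynamics(nn.Module):
    def __init__(self, latent_dim=256, n_heads=8, n_layers=4,
                 n_experts=6, n_tasks=26):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        # Shared backbone: captures dynamics common to all tasks.
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Task-conditioned experts: one small head per task cluster.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                          nn.Linear(latent_dim, latent_dim))
            for _ in range(n_experts))
        # Task -> expert assignment; in the paper this would come from
        # gradient-based task clustering (here: a random fixed table).
        self.register_buffer(
            "task_to_expert", torch.randint(0, n_experts, (n_tasks,)))

    def forward(self, latents, task_ids):
        # latents: (B, T, latent_dim) latent tokens; task_ids: (B,)
        h = self.backbone(latents)
        expert_ids = self.task_to_expert[task_ids]
        out = torch.empty_like(h)
        # Route each sequence to its cluster's expert head.
        for e in expert_ids.unique():
            mask = expert_ids == e
            out[mask] = self.experts[int(e)](h[mask])
        return out

# Usage: a batch mixing sequences from different games.
model = MoWDynamics()
z = torch.randn(4, 16, 256)                     # (B=4, T=16, D=256)
pred = model(z, torch.tensor([0, 3, 3, 25]))    # per-sequence task ids
\end{verbatim}

The design choice this sketch highlights is the parameter split: the backbone is shared across all tasks, so adding a task costs only a (possibly shared) expert head rather than a full task-specific world model, which is consistent with the reported parameter savings over the per-game ensemble.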