Heterogeneous Agent Q-weighted Policy Optimization
Abstract
Multi-agent reinforcement learning (MARL) confronts a fundamental tension between stability and expressiveness. Stability requires avoiding divergence under non-stationary updates, while expressiveness demands capturing the multimodal strategies needed for heterogeneous coordination. Existing methods sacrifice one for the other: value-decomposition and trust-region approaches ensure stability but assume restrictive unimodal policies, while expressive generative models lack optimization guarantees. To address this challenge, we introduce Heterogeneous Agent Q-weighted Policy Optimization (HAQO), a framework unifying sequential advantage-aware updates, Q-weighted variational surrogates, and entropy regularization. Our analysis establishes monotone improvement guarantees under bounded critic bias, extending trust-region theory to diffusion-based policies with intractable log-likelihoods. HAQO achieves higher returns and lower variance than policy-gradient baselines across diverse benchmarks. Ablation studies confirm that sequential updates ensure stability, expressive policies enable multimodality, and entropy regularization prevents collapse. HAQO thus reconciles stability and expressiveness in MARL with theoretical rigor and practical effectiveness.
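As a rough illustration of the kind of per-agent objective such a framework optimizes, one plausible form of a Q-weighted, entropy-regularized variational surrogate is sketched below; the exponential weight w, temperature β, entropy coefficient α, and the ELBO notation are illustrative assumptions on our part, not definitions stated in this abstract.

\[
\max_{\theta_i}\;\; \mathbb{E}_{(s,a_i)\sim\mathcal{D}}\!\Big[\, w\big(Q^{\pi}(s,a_i)\big)\, \mathcal{L}^{\mathrm{ELBO}}_{\theta_i}(a_i \mid s) \Big]
\;+\; \alpha\, \mathbb{E}_{s\sim\mathcal{D}}\!\Big[ \mathcal{H}\big(\pi_{\theta_i}(\cdot \mid s)\big) \Big],
\qquad w(Q) \propto \exp(\beta\, Q),
\]

where \(\mathcal{L}^{\mathrm{ELBO}}_{\theta_i}(a_i \mid s) \le \log \pi_{\theta_i}(a_i \mid s)\) stands in for the intractable log-likelihood of a diffusion policy, and, in the spirit of sequential advantage-aware updates, agents \(i = 1,\dots,n\) would be updated one at a time against advantages that reflect the agents already updated.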