Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
One Must Imagine Experts Happy: Rebalancing Neural Routers via Constrained Optimization
Kushal Thaman
Mixture-of-Experts (MoE) models hinge on balanced token routing to avoid expert overload and training collapse. Many existing methods employ auxiliary losses to enforce load balance across experts, but these losses introduce interfering gradients that hamper optimization and complicate expert parallelism in distributed training. Recently proposed auxiliary-loss-free strategies still rely on heuristic bias updates and suffer occasional load fluctuations. We recast MoE load balancing as a formal constrained optimization problem for the first time, imposing expert load equality through a Lagrangian formulation. In our approach, per-expert bias terms, updated through a dual ascent rule with a damping term, serve as Lagrange multipliers that adjust routing without injecting additional gradient interference. We also improve on DeepSeek's MoE strategy (Dai et al., 2024) by replacing hard top-K routing with a sparsemax-based gating function, and we introduce an adaptive-K mechanism that dynamically selects the number of experts per token by enforcing explicit capacity constraints to prevent expert overload. Because the biases are updated externally, our method delivers robust load balance while injecting no additional gradients into the gating network. Our results on toy models demonstrate a Pareto-optimal trade-off between load balance and model performance relative to previous auxiliary-loss-based and auxiliary-loss-free strategies.
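The abstract describes the mechanism only at a high level. The minimal Python sketch below illustrates one plausible reading of it: per-expert biases are updated by a damped dual-ascent rule outside the gating computation and added to the router scores before a sparsemax gate. All function names, shapes, hyperparameters, and the exact form of the damping are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax of a 1-D score vector (Martins & Astudillo, 2016):
    a sparse alternative to softmax that zeros out low-scoring experts."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum          # experts kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max      # threshold for the simplex projection
    return np.maximum(z - tau, 0.0)

def dual_ascent_bias_update(bias, expert_load, target_load, lr=0.1, damping=0.9):
    """One damped dual-ascent step on per-expert biases, viewed as Lagrange
    multipliers for the equal-load constraint: overloaded experts have their
    bias lowered, underloaded experts raised. The decay-style damping here is
    an assumption about how the paper's damping term enters the update."""
    violation = expert_load - target_load        # positive when an expert is overloaded
    return damping * bias - lr * violation

# Toy routing loop (hypothetical sizes and skew; not the paper's experimental setup).
rng = np.random.default_rng(0)
n_tokens, n_experts = 512, 8
skew = 2.0 * rng.normal(size=n_experts)          # persistent per-expert preference causing imbalance
bias = np.zeros(n_experts)
target = n_tokens / n_experts                    # equal-load target per expert

for step in range(50):
    router_logits = rng.normal(size=(n_tokens, n_experts)) + skew
    # Biases shift the routing scores but live outside the gating network,
    # so no extra gradients would flow through them in a real training setup.
    gates = np.stack([sparsemax(row + bias) for row in router_logits])
    expert_load = (gates > 0).sum(axis=0)        # tokens with nonzero weight per expert
    bias = dual_ascent_bias_update(bias, expert_load, target)

print("final per-expert load:", expert_load)
```

In this sketch the sparsemax gate also determines how many experts each token uses (its support size), which is one natural way an adaptive-K mechanism could arise; the paper's capacity-constrained variant is not reproduced here.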