To Use or Not to Use Muon: How Simplicity Bias in Optimizers Matters
Abstract
Among recently introduced optimizers, Muon has perhaps become the most popular, owing to its superior training speed. While many papers focus on the benefits of Muon, this paper asks whether that speedup comes with downsides. We explore the biases induced by optimizing with Muon, providing a theoretical analysis and examining its consequences for the learning trajectories and the solutions learned. While the theory explains the benefits Muon brings, it also guides our intuition in constructing a couple of examples in which Muon-optimized models may be disadvantaged by the loss of a simplicity bias. More broadly, this paper should serve as a reminder: when developing new optimizers, it is essential to consider the biases they introduce, as these biases can fundamentally change a model’s behavior, for better or for worse.