Poster in Workshop: Modular, Collaborative and Decentralized Deep Learning
Momentum Look-Ahead for Asynchronous Distributed Low-Communication Training
Thalaiyasingam Ajanthan · Sameera Ramasinghe · Gil Avraham · Yan Zuo · Alexander Long
Distributed Low-Communication (DiLoCo) training enables large-scale model training across geographically distributed datacenters by reducing the communication overhead of the data-parallel setting. Asynchronous DiLoCo further relaxes the requirement to synchronize model updates, eliminating bottlenecks caused by slow devices or interconnects. However, asynchronous updates introduce stale (delayed) gradients, since model updates and gradient computation are no longer synchronized. To alleviate staleness, we introduce a look-ahead-based delay correction mechanism that extrapolates along the negative momentum direction. Our experiments on language modelling tasks with decoder-only architectures demonstrate that our approach consistently outperforms asynchronous and synchronous DiLoCo methods in both homogeneous and heterogeneous settings.
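To make the idea concrete, below is a minimal sketch (not the authors' exact algorithm) of a server-side outer update in an asynchronous DiLoCo-style setup, where worker deltas may arrive stale and the server compensates by extrapolating the parameters it broadcasts along the negative momentum direction. The class name and hyperparameters (`outer_lr`, `beta`, `lookahead_coeff`) are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of momentum look-ahead delay correction for
# asynchronous outer updates; names and defaults are assumptions.
import numpy as np


class AsyncOuterOptimizer:
    def __init__(self, params, outer_lr=0.7, beta=0.9, lookahead_coeff=1.0):
        self.params = params.copy()            # global (server) parameters
        self.momentum = np.zeros_like(params)  # outer momentum buffer
        self.outer_lr = outer_lr
        self.beta = beta
        self.lookahead_coeff = lookahead_coeff

    def broadcast(self):
        """Parameters sent to a worker: extrapolate along the negative
        momentum direction so the worker starts from a look-ahead point,
        partially compensating for the staleness of its future delta."""
        return self.params - self.lookahead_coeff * self.outer_lr * self.momentum

    def apply_delta(self, delta):
        """Apply a (possibly stale) pseudo-gradient `delta` from a worker,
        i.e. the difference between the parameters it received and the
        parameters it produced after local training, using outer momentum."""
        self.momentum = self.beta * self.momentum + delta
        self.params = self.params - self.outer_lr * self.momentum
        return self.params


# Usage sketch: one asynchronous round with a single worker.
server = AsyncOuterOptimizer(np.zeros(4))
local_start = server.broadcast()              # look-ahead parameters
local_end = local_start - 0.1                 # stand-in for local training
server.apply_delta(local_start - local_end)   # pseudo-gradient update
```

In this sketch the correction lives entirely in `broadcast`: instead of handing out the current global parameters, the server hands out where the parameters are expected to be after the momentum carries them forward, so a delta computed against that point is less stale when it is eventually applied.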