arXiv:1810.03264 [cs.LG]

Toward Understanding the Impact of Staleness in Distributed Machine Learning

Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, Eric P. Xing

Published 2018-10-08 (Version 1)

Many distributed machine learning (ML) systems adopt non-synchronous execution in order to alleviate the network communication bottleneck, resulting in stale parameters that do not reflect the latest updates. Despite much development in large-scale ML, the effects of staleness on learning remain inconclusive, as it is challenging to directly monitor or control staleness in complex distributed environments. In this work, we study the convergence behaviors of a wide array of ML models and algorithms under delayed updates. Our extensive experiments reveal the rich diversity of the effects of staleness on the convergence of ML algorithms and offer insights into seemingly contradictory reports in the literature. The empirical findings also inspire a new convergence analysis of stochastic gradient descent in non-convex optimization under staleness, matching the best-known convergence rate of O(1/\sqrt{T}).
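To make the notion of delayed updates concrete, the following is a minimal sketch (not the authors' code) of stale-gradient SGD: each step applies a gradient computed from a parameter snapshot that is a fixed number of steps old, mimicking the staleness introduced by non-synchronous execution. The function name stale_sgd, the toy least-squares objective, and the specific staleness values are illustrative assumptions, not taken from the paper.

# Minimal sketch (assumed example, not the paper's implementation):
# SGD where the gradient at step t is evaluated at the parameters from
# step max(t - staleness, 0), imitating delayed updates in distributed training.
import numpy as np

def stale_sgd(grad_fn, w0, lr=0.05, staleness=0, steps=200):
    history = [w0.copy()]          # parameter snapshots, one per step
    w = w0.copy()
    for t in range(steps):
        w_stale = history[max(t - staleness, 0)]   # delayed snapshot
        w = w - lr * grad_fn(w_stale)               # update with stale gradient
        history.append(w.copy())
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy least-squares objective f(w) = 0.5 * ||Xw - y||^2 / n (illustrative)
    X = rng.normal(size=(100, 5))
    y = X @ np.ones(5) + 0.1 * rng.normal(size=100)
    grad = lambda w: X.T @ (X @ w - y) / len(y)

    for s in (0, 4, 16):
        w_final = stale_sgd(grad, np.zeros(5), staleness=s)
        loss = 0.5 * np.mean((X @ w_final - y) ** 2)
        print(f"staleness={s:2d}  final loss={loss:.4f}")

Increasing the staleness parameter in this sketch typically slows convergence at a fixed learning rate, which is one of the effects the paper studies systematically across models and algorithms.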

Related articles:
arXiv:1912.09789 [cs.LG] (Published 2019-12-20)
A Survey on Distributed Machine Learning
arXiv:1310.5426 [cs.LG] (Published 2013-10-21, updated 2013-10-25)
MLI: An API for Distributed Machine Learning
arXiv:1802.07389 [cs.LG] (Published 2018-02-21)
3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning