arXiv Analytics

arXiv:1512.01708 [cs.LG]

Variance Reduction for Distributed Stochastic Gradient Descent

Soham De, Gavin Taylor, Tom Goldstein

Published 2015-12-05 (Version 1)

Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling larger stepsizes and preserving linear convergence rates. However, current variance-reduced SGD methods either incur high memory usage or require a full pass over the (large) data set at the end of each epoch to compute the exact gradient of the objective function, which makes them impractical in distributed or parallel settings. In this paper, we propose a variance reduction method, called VR-lite, that requires neither full gradient computations nor extra storage. We explore distributed synchronous and asynchronous variants under both high and low communication latency. We find that our distributed algorithms scale linearly with the number of local workers and remain stable even at low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms consistently converge faster.
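To make the trade-off concrete, the following is a minimal sketch of a classical variance-reduced method (SVRG-style, not the paper's VR-lite) on a small least-squares problem; the problem setup and all variable names are illustrative assumptions. Note the full-gradient snapshot computed once per epoch, which is exactly the full pass over the data that the abstract says VR-lite avoids:

```python
import numpy as np

# Illustrative least-squares problem (assumed for this sketch, not from the paper):
# minimize f(w) = (1/2n) * sum_i (x_i^T w - y_i)^2
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_i(w, i):
    # stochastic gradient of the i-th term, 0.5 * (x_i^T w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    # exact gradient: requires a full pass over the data set
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
step = 0.01
for epoch in range(20):
    w_snap = w.copy()         # snapshot of the iterate
    mu = full_grad(w_snap)    # the per-epoch full pass that VR-lite eliminates
    for _ in range(n):
        i = rng.integers(n)
        # variance-reduced gradient estimate: unbiased, and its variance
        # shrinks as w approaches w_snap, permitting a larger fixed stepsize
        g = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= step * g

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

The correction term `grad_i(w_snap, i) - mu` has zero mean, so the estimate stays unbiased while its variance vanishes near the snapshot; this is what allows a constant stepsize and linear convergence, at the cost of either the full pass above (SVRG-style) or per-example gradient storage (SAGA-style) — the two costs the abstract identifies.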

Related articles: Most relevant | Search more
arXiv:1512.02970 [cs.LG] (Published 2015-12-09)
Scaling Up Distributed Stochastic Gradient Descent Using Variance Reduction
arXiv:2206.00529 [cs.LG] (Published 2022-06-01)
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top
arXiv:2411.10438 [cs.LG] (Published 2024-11-15)
MARS: Unleashing the Power of Variance Reduction for Training Large Models