arXiv Analytics

arXiv:1512.01708 [cs.LG]

Variance Reduction for Distributed Stochastic Gradient Descent

Soham De, Gavin Taylor, Tom Goldstein

Published 2015-12-05 (Version 1)

Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling larger stepsizes and preserving linear convergence rates. However, current variance-reduced SGD methods either incur high memory usage or require a full pass over the (large) data set at the end of each epoch to compute the exact gradient of the objective function, which makes them impractical in distributed or parallel settings. In this paper, we propose a variance reduction method, called VR-lite, that requires neither full gradient computations nor extra storage. We explore distributed synchronous and asynchronous variants under both high and low communication latency. We find that our distributed algorithms scale linearly with the number of local workers and remain stable even at low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms consistently converge faster.
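To make the trade-off concrete, the following is a minimal sketch of a classical variance-reduced method (SVRG-style, not the paper's VR-lite) on a small least-squares problem; the problem setup and all variable names are illustrative assumptions. Note the full-gradient snapshot computed once per epoch, which is exactly the full pass over the data that the abstract says VR-lite avoids:

```python
import numpy as np

# Illustrative least-squares problem (assumed for this sketch, not from the paper):
# minimize f(w) = (1/2n) * sum_i (x_i^T w - y_i)^2
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_i(w, i):
    # stochastic gradient of the i-th term, 0.5 * (x_i^T w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    # exact gradient: requires a full pass over the data set
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
step = 0.01
for epoch in range(20):
    w_snap = w.copy()         # snapshot of the iterate
    mu = full_grad(w_snap)    # the per-epoch full pass that VR-lite eliminates
    for _ in range(n):
        i = rng.integers(n)
        # variance-reduced gradient estimate: unbiased, and its variance
        # shrinks as w approaches w_snap, permitting a larger fixed stepsize
        g = grad_i(w, i) - grad_i(w_snap, i) + mu
        w -= step * g

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

The correction term `grad_i(w_snap, i) - mu` has zero mean, so the estimate stays unbiased while its variance vanishes near the snapshot; this is what allows a constant stepsize and linear convergence, at the cost of either the full pass above (SVRG-style) or per-example gradient storage (SAGA-style) — the two costs the abstract identifies.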

Related articles: Most relevant | Search more
arXiv:1512.02970 [cs.LG] (Published 2015-12-09)
Scaling Up Distributed Stochastic Gradient Descent Using Variance Reduction
arXiv:2206.00529 [cs.LG] (Published 2022-06-01)
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top
arXiv:2411.10438 [cs.LG] (Published 2024-11-15)
MARS: Unleashing the Power of Variance Reduction for Training Large Models