arXiv Analytics


arXiv:2307.12615 [math.OC]

Finite-sum optimization: Adaptivity to smoothness and loopless variance reduction

Bastien Batardière, Julien Chiquet, Joon Kwon

Published 2023-07-24, Version 1

For finite-sum optimization, variance-reduced (VR) gradient methods compute at each iteration the gradient of a single function (or of a mini-batch), yet achieve faster convergence than SGD thanks to a carefully crafted lower-variance stochastic gradient estimator that reuses past gradients. Another important line of research of the past decade in continuous optimization is adaptive algorithms such as AdaGrad, which dynamically adjust the (possibly coordinate-wise) learning rate based on past gradients and thereby adapt to the geometry of the objective function. Variants such as RMSprop and Adam demonstrate outstanding practical performance that has contributed to the success of deep learning. In this work, we present AdaVR, which combines the AdaGrad algorithm with variance-reduced gradient estimators such as SAGA or L-SVRG. We show that AdaVR inherits both the good convergence properties of VR methods and the adaptive nature of AdaGrad: in the case of $L$-smooth convex functions we establish a gradient complexity of $O(n+(L+\sqrt{nL})/\varepsilon)$ without prior knowledge of $L$. Numerical experiments demonstrate the superiority of AdaVR over state-of-the-art methods. Moreover, we empirically show that the RMSprop and Adam algorithms combined with variance-reduced gradient estimators achieve even faster convergence.
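To make the described combination concrete, the following is a minimal Python sketch of coupling the loopless L-SVRG variance-reduced estimator with a coordinate-wise AdaGrad step, in the spirit of the abstract. It is not the authors' AdaVR implementation: the function name adavr_lsvrg, the oracle grad_i, and all hyperparameter defaults are illustrative assumptions.

    # Sketch: L-SVRG variance-reduced estimator combined with an AdaGrad-style step.
    # Names and defaults are hypothetical, not taken from the paper.
    import numpy as np

    def adavr_lsvrg(grad_i, n, w0, eta=1.0, p=None, eps=1e-8, iters=10_000, rng=None):
        """grad_i(i, w): gradient of the i-th component function at w (assumed oracle)."""
        rng = np.random.default_rng() if rng is None else rng
        p = 1.0 / n if p is None else p          # probability of refreshing the reference point
        w, w_ref = w0.copy(), w0.copy()
        mu = np.mean([grad_i(j, w_ref) for j in range(n)], axis=0)  # full gradient at w_ref
        s = np.zeros_like(w)                     # AdaGrad accumulator of squared gradients

        for _ in range(iters):
            i = rng.integers(n)
            g = grad_i(i, w) - grad_i(i, w_ref) + mu   # L-SVRG variance-reduced estimator
            s += g * g                                  # coordinate-wise accumulation
            w -= eta * g / (np.sqrt(s) + eps)           # adaptive, coordinate-wise step
            if rng.random() < p:                        # loopless reference-point update
                w_ref = w.copy()
                mu = np.mean([grad_i(j, w_ref) for j in range(n)], axis=0)
        return w

The same estimator could be fed into RMSprop- or Adam-style updates by replacing the accumulator s with an exponential moving average, which is the empirical variant mentioned at the end of the abstract.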

Related articles:
arXiv:2210.05995 [math.OC] (Published 2022-10-12)
SGDA with shuffling: faster convergence for nonconvex-PŁ minimax optimization
arXiv:1908.09963 [math.OC] (Published 2019-08-27)
Deep-Learning Based Linear Average Consensus for Faster Convergence over Temporal Network
arXiv:2404.02378 [math.OC] (Published 2024-04-03)
Faster Convergence of Stochastic Accelerated Gradient Descent under Interpolation