arXiv Analytics

arXiv:2110.06914 [cs.LG]

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

Zhiyuan Li, Tianhao Wang, Sanjeev Arora

Published 2021-10-13, updated 2022-02-02 (Version 2)

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In this regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows, in principle, a complete characterization of the regularization effect of SGD around such a manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) support for arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires $O(\kappa\ln d)$ samples for learning a $\kappa$-sparse overparametrized linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\tilde{O}(\kappa^2)$ upper bound (HaoChen et al., 2020).
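A minimal numerical sketch of the application described above: label-noise SGD on the quadratically overparametrized linear model of Woodworth et al. (2020), where the predictor is $\langle x, u\odot u - v\odot v\rangle$ and fresh Gaussian noise is added to the label at every step. This is not the paper's code; the dimensions, noise scale, learning rate, and step count are illustrative assumptions, and how quickly this toy run recovers the sparse signal depends on those constants.

```python
# Toy sketch (assumed setup, not from the paper): label-noise SGD on the
# overparametrized linear model w = u*u - v*v from Woodworth et al. (2020).
import numpy as np

rng = np.random.default_rng(0)
d, n, kappa = 50, 30, 3            # ambient dim, samples, sparsity (toy sizes)
w_star = np.zeros(d)
w_star[:kappa] = 1.0               # ground-truth kappa-sparse signal
X = rng.standard_normal((n, d))
y = X @ w_star                     # clean labels; noise is injected per step

u = np.full(d, 1.0)                # large initialization (kernel regime at init)
v = np.full(d, 1.0)
eta, sigma, steps = 1e-3, 0.5, 200_000

for t in range(steps):
    i = rng.integers(n)
    noisy_y = y[i] + sigma * rng.standard_normal()   # fresh label noise each step
    resid = X[i] @ (u * u - v * v) - noisy_y
    # gradient of 0.5 * resid^2 with respect to (u, v)
    u -= eta * resid * X[i] * (2 * u)
    v -= eta * resid * X[i] * (-2 * v)

w = u * u - v * v
print("relative recovery error:", np.linalg.norm(w - w_star) / np.linalg.norm(w_star))
```

The per-step label noise plays the role of the noise covariance in the limiting SDE: along the manifold of interpolating solutions it drives the parameters toward flatter (smaller $\mathrm{tr}[\nabla^2 L]$) minimizers, which for this parametrization correspond to sparse solutions.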

Comments: 47 pages, 2 figures
Categories: cs.LG, stat.ML
Related articles:
arXiv:2210.07082 [cs.LG] (Published 2022-10-13)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
arXiv:2011.03687 [cs.LG] (Published 2020-11-07)
When Optimizing $f$-divergence is Robust with Label Noise
arXiv:2009.12966 [cs.LG] (Published 2020-09-27)
Analysis of label noise in graph-based semi-supervised learning