{ "id": "2110.06914", "version": "v2", "published": "2021-10-13T17:50:46.000Z", "updated": "2022-02-02T12:09:41.000Z", "title": "What Happens after SGD Reaches Zero Loss? --A Mathematical Framework", "authors": [ "Zhiyuan Li", "Tianhao Wang", "Sanjeev Arora" ], "comment": "47 pages, 2 figures", "categories": [ "cs.LG", "stat.ML" ], "abstract": "Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\\mathrm{tr}[\\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization of the regularization effect of SGD around such a manifold -- i.e., the \"implicit bias\" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\\eta^{-1.6}$ steps, and (2) support for arbitrary noise covariance. As an application, we show that, with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires $O(\\kappa\\ln d)$ samples for learning a $\\kappa$-sparse overparametrized linear model in $\\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\\Omega(d)$ samples. This upper bound is minimax optimal and improves on the previous $\\tilde{O}(\\kappa^2)$ upper bound (HaoChen et al., 2020).", "revisions": [ { "version": "v2", "updated": "2022-02-02T12:09:41.000Z" } ], "analyses": { "keywords": [ "sgd reaches zero loss", "mathematical framework", "implicit bias", "loss function", "label noise" ], "note": { "typesetting": "TeX", "pages": 47, "language": "en", "license": "arXiv", "status": "editable" } } }