{
  "id": "2110.06914",
  "version": "v2",
  "published": "2021-10-13T17:50:46.000Z",
  "updated": "2022-02-02T12:09:41.000Z",
  "title": "What Happens after SGD Reaches Zero Loss? --A Mathematical Framework",
  "authors": [
    "Zhiyuan Li",
    "Tianhao Wang",
    "Sanjeev Arora"
  ],
  "comment": "47 pages, 2 figures",
  "categories": [
    "cs.LG",
    "stat.ML"
  ],
  "abstract": "Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\\eta$, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, $\\mathrm{tr}[\\nabla^2 L]$. The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization of the regularization effect of SGD around such a manifold -- i.e., the \"implicit bias\" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for $\\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\\eta^{-1.6}$ steps, and (2) support for arbitrary noise covariance. As an application, we show that, with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires $O(\\kappa\\ln d)$ samples for learning a $\\kappa$-sparse overparametrized linear model in $\\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized in the kernel regime requires $\\Omega(d)$ samples. This upper bound is minimax optimal and improves the previous $\\tilde{O}(\\kappa^2)$ upper bound (HaoChen et al., 2020).",
  "revisions": [
    {
      "version": "v2",
      "updated": "2022-02-02T12:09:41.000Z"
    }
  ],
  "analyses": {
    "keywords": [
      "sgd reaches zero loss",
      "mathematical framework",
      "implicit bias",
      "loss function",
      "label noise"
    ],
    "note": {
      "typesetting": "TeX",
      "pages": 47,
      "language": "en",
      "license": "arXiv",
      "status": "editable"
    }
  }
}