{ "id": "1806.01796", "version": "v1", "published": "2018-06-05T16:37:19.000Z", "updated": "2018-06-05T16:37:19.000Z", "title": "Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate", "authors": [ "Mor Shpigel Nacson", "Nathan Srebro", "Daniel Soudry" ], "comment": "7 pages in main paper, 10 pages of proofs in appendix, 2 figures, 1 table", "categories": [ "stat.ML", "cs.LG" ], "abstract": "Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed learning rate --- in the special case of linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous proofs with an exact asymptotic convergence of SGD required a learning rate that asymptotically vanishes to zero, or averaging of the SGD iterates. Furthermore, if the loss function has an exponential tail (e.g., logistic regression), then we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$ --- similarly to gradient descent. These results suggest an explanation to the similar behavior observed in deep networks when trained with SGD.", "revisions": [ { "version": "v1", "updated": "2018-06-05T16:37:19.000Z" } ], "analyses": { "keywords": [ "stochastic gradient descent", "fixed learning rate", "separable data", "exact convergence", "smooth monotone loss functions" ], "note": { "typesetting": "TeX", "pages": 7, "language": "en", "license": "arXiv", "status": "editable" } } }