arXiv Analytics

arXiv:2502.00885 [stat.ML]

Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise

Thanh Dang, Melih Barsbey, A K M Rokonuzzaman Sonet, Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu

Published 2025-02-02, Version 1

Understanding the generalization properties of optimization algorithms under heavy-tailed noise has attracted growing attention. However, existing theoretical results mainly focus on stochastic gradient descent (SGD), and analyses of heavy-tailed optimizers beyond SGD are still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noise. We first consider the continuous-time limit of SGDm, i.e., a Lévy-driven stochastic differential equation (SDE), and establish quantitative Wasserstein algorithmic stability bounds for a class of potentially non-convex loss functions. Our bounds reveal a remarkable observation: for quadratic loss functions, we show that SGDm admits a worse generalization bound in the presence of heavy-tailed noise, indicating that the interaction of momentum and heavy tails can be harmful for generalization. We then extend our analysis to the discrete-time setting and develop a uniform-in-time discretization error bound, which, to our knowledge, is the first result of its kind for SDEs with degenerate noise. This result shows that, with appropriately chosen step-sizes, the discrete dynamics retain the generalization properties of the limiting SDE. We illustrate our theory on both synthetic quadratic problems and neural networks.
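
The following is a minimal sketch, not code from the paper, of the setting the abstract describes: SGD with heavy-ball momentum on a synthetic quadratic loss, where the stochastic gradient is corrupted by symmetric alpha-stable noise (stability index alpha < 2) to model heavy tails. The function name, parameter values, and the use of scipy.stats.levy_stable are illustrative assumptions.

import numpy as np
from scipy.stats import levy_stable

def sgdm_heavy_tailed(x0, A, lr=0.01, momentum=0.9, alpha=1.8,
                      noise_scale=0.1, n_steps=1000, seed=0):
    """Illustrative SGDm on the quadratic loss f(x) = 0.5 * x^T A x,
    with the gradient corrupted by symmetric alpha-stable noise.
    All settings here are assumptions for illustration only."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(n_steps):
        grad = A @ x  # exact gradient of the quadratic loss
        # symmetric alpha-stable noise; alpha < 2 gives heavy tails (infinite variance)
        noise = levy_stable.rvs(alpha, 0.0, scale=noise_scale,
                                size=x.shape, random_state=rng)
        g = grad + noise
        v = momentum * v + g   # heavy-ball momentum buffer
        x = x - lr * v         # parameter update
    return x

if __name__ == "__main__":
    d = 10
    A = np.diag(np.linspace(0.5, 2.0, d))  # positive-definite quadratic curvature
    x_final = sgdm_heavy_tailed(np.ones(d), A)
    print("final loss:", 0.5 * x_final @ A @ x_final)

In this toy model, alpha = 2 recovers Gaussian noise, while smaller alpha yields heavier tails with infinite variance, which is the regime in which the abstract's quadratic-loss comparison between SGDm and SGD applies.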

Related articles:
arXiv:2403.02051 [stat.ML] (Published 2024-03-04, updated 2025-05-12)
Privacy of SGD under Gaussian or Heavy-Tailed Noise: Guarantees without Gradient Clipping
arXiv:2403.13868 [stat.ML] (Published 2024-03-20)
Analysing heavy-tail properties of Stochastic Gradient Descent by means of Stochastic Recurrence Equations
arXiv:2502.06719 [stat.ML] (Published 2025-02-10)
Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent