arXiv Analytics

arXiv:2409.19437 [cs.LG]

Strongly-polynomial time and validation analysis of policy gradient methods

Caleb Ju, Guanghui Lan

Published 2024-09-28, updated 2024-10-23 (Version 2)

This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDPs) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly-polynomial time. To the best of our knowledge, this is the first time that such strong convergence properties have been established for policy gradient methods. Moreover, in the stochastic setting, where only stochastic estimates of policy gradients are available, we show that the advantage gap function provides close approximations of the optimality gap for each individual state and exhibits a sublinear rate of convergence at every state. The advantage gap function can be easily estimated in the stochastic case, and when coupled with easily computable upper bounds on policy values, it provides a convenient way to validate the solutions generated by policy gradient methods. Therefore, our developments offer a principled and computable measure of optimality for RL, whereas current practice tends to rely on algorithm-to-algorithm or baseline comparisons with no certificate of optimality.
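The abstract does not spell out the exact definition of the advantage gap function, but a natural reading ties it to the classical advantage function A^pi(s, a) = Q^pi(s, a) - V^pi(s): the per-state quantity max_a A^pi(s, a) is non-negative and vanishes at a state exactly when the policy acts optimally there, so driving it below a tolerance at every state certifies near-optimality. The sketch below illustrates that idea on a small random MDP, using plain policy iteration in place of the paper's policy gradient methods and step size rules; the function names, the exact form of the gap, and the stopping rule are illustrative assumptions, not the authors' definitions.

# A minimal sketch (not the paper's algorithm): exact policy evaluation on a
# small finite MDP, with the per-state quantity
#   gap(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s' | s, a) V_pi(s') ] - V_pi(s)
# used as a computable termination criterion. It is non-negative and is zero
# at every state iff the policy is optimal.
import numpy as np

def evaluate_policy(P, r, pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), pi]          # (S, S) transition matrix under pi
    r_pi = r[np.arange(n_states), pi]          # (S,) one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def advantage_gap(P, r, V, gamma):
    """Per-state gap max_a Q_pi(s, a) - V_pi(s); zero everywhere iff pi is optimal."""
    Q = r + gamma * P @ V                      # (S, A) state-action values
    return Q.max(axis=1) - V

def solve_mdp(P, r, gamma, tol=1e-8, max_iters=1000):
    """Greedy policy improvement, stopping once the advantage gap is small at every state."""
    n_states = P.shape[0]
    pi = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        V = evaluate_policy(P, r, pi, gamma)
        gap = advantage_gap(P, r, V, gamma)
        if gap.max() < tol:                    # computable certificate of near-optimality
            break
        pi = (r + gamma * P @ V).argmax(axis=1)
    return pi, V, gap

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))  # random transition kernel P[s, a, s']
    r = rng.uniform(size=(S, A))                # random rewards r[s, a]
    pi, V, gap = solve_mdp(P, r, gamma)
    print("policy:", pi, "max advantage gap:", gap.max())

In the stochastic setting discussed in the abstract, exact policy evaluation would be replaced by sample-based estimates of Q and V, which is where the ability to estimate the gap from data and to pair it with computable upper bounds on policy values becomes the validation tool the authors describe.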

Related articles:
arXiv:1912.05104 [cs.LG] (Published 2019-12-11)
Entropy Regularization with Discounted Future State Distribution in Policy Gradient Methods
arXiv:1810.02525 [cs.LG] (Published 2018-10-05)
Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods
arXiv:1904.06260 [cs.LG] (Published 2019-04-12)
Similarities between policy gradient methods (PGM) in Reinforcement learning (RL) and supervised learning (SL)