arXiv:2007.10297 [cs.LG]

A Short Note on Soft-max and Policy Gradients in Bandits Problems

Neil Walton

Published 2020-07-20 (Version 1)

This is a short communication on a Lyapunov function argument for softmax in bandit problems. A number of excellent recent papers use differential equations to analyze policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument yielding a regret bound for the soft-max ordinary differential equation in bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, regret bounds can be proved in the stochastic case \cite{DW20}. We close by summarizing some ideas and open issues in deriving stochastic regret bounds for policy gradients.
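As an illustrative sketch only (not the paper's analysis), the soft-max ordinary differential equation for a bandit problem can be simulated by Euler discretization: parameters evolve along the exact gradient of the expected reward under a softmax policy, and regret accumulates as the gap between the best arm's mean and the policy's expected reward. All names, step sizes, and reward means below are hypothetical choices for the demonstration.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over arm preferences.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run_softmax_ode(mu, steps=20000, dt=0.01):
    """Euler-discretize d(theta)/dt = grad_theta E_p[reward] for a
    K-armed bandit with known mean rewards mu (an idealized setting)."""
    theta = np.zeros(len(mu))
    regret = 0.0
    best = mu.max()
    for _ in range(steps):
        p = softmax(theta)
        # Gradient of the expected reward p . mu with respect to theta_i
        # under the softmax parameterization: p_i * (mu_i - p . mu).
        grad = p * (mu - p @ mu)
        regret += (best - p @ mu) * dt  # instantaneous regret, integrated
        theta += dt * grad
    return softmax(theta), regret

# Hypothetical 3-armed bandit; arm 2 is optimal.
mu = np.array([0.2, 0.5, 0.9])
p_final, cum_regret = run_softmax_ode(mu)
```

Under this gradient flow the policy concentrates on the best arm, and the integrated regret stays bounded as the horizon grows, which is the flavor of the Lyapunov-function bound discussed in the note.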

Related articles:
arXiv:1202.3750 [cs.LG] (Published 2012-02-14)
Fractional Moments on Bandit Problems
arXiv:2211.16110 [cs.LG] (Published 2022-11-29)
PAC-Bayes Bounds for Bandit Problems: A Survey and Experimental Comparison
arXiv:1306.0811 [cs.LG] (Published 2013-06-04, updated 2013-11-04)
A Gang of Bandits