arXiv:2007.10297 [cs.LG]

A Short Note on Soft-max and Policy Gradients in Bandits Problems

Neil Walton

Published 2020-07-20 (Version 1)

This is a short communication on a Lyapunov function argument for softmax in bandit problems. A number of excellent recent papers use differential equations to analyze policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument yielding a regret bound for the soft-max ordinary differential equation in bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, regret bounds can be proved in the stochastic case \cite{DW20}. We close by summarizing some ideas and open issues in deriving stochastic regret bounds for policy gradients.
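As an illustrative sketch only (not the paper's analysis), the soft-max ordinary differential equation for a bandit problem can be simulated by Euler discretization: parameters evolve along the exact gradient of the expected reward under a softmax policy, and regret accumulates as the gap between the best arm's mean and the policy's expected reward. All names, step sizes, and reward means below are hypothetical choices for the demonstration.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over arm preferences.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run_softmax_ode(mu, steps=20000, dt=0.01):
    """Euler-discretize d(theta)/dt = grad_theta E_p[reward] for a
    K-armed bandit with known mean rewards mu (an idealized setting)."""
    theta = np.zeros(len(mu))
    regret = 0.0
    best = mu.max()
    for _ in range(steps):
        p = softmax(theta)
        # Gradient of the expected reward p . mu with respect to theta_i
        # under the softmax parameterization: p_i * (mu_i - p . mu).
        grad = p * (mu - p @ mu)
        regret += (best - p @ mu) * dt  # instantaneous regret, integrated
        theta += dt * grad
    return softmax(theta), regret

# Hypothetical 3-armed bandit; arm 2 is optimal.
mu = np.array([0.2, 0.5, 0.9])
p_final, cum_regret = run_softmax_ode(mu)
```

Under this gradient flow the policy concentrates on the best arm, and the integrated regret stays bounded as the horizon grows, which is the flavor of the Lyapunov-function bound discussed in the note.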

Related articles:
arXiv:1202.3750 [cs.LG] (Published 2012-02-14)
Fractional Moments on Bandit Problems
arXiv:2211.16110 [cs.LG] (Published 2022-11-29)
PAC-Bayes Bounds for Bandit Problems: A Survey and Experimental Comparison
arXiv:1306.0811 [cs.LG] (Published 2013-06-04, updated 2013-11-04)
A Gang of Bandits