arXiv Analytics

arXiv:1004.5229 [cs.LG]

Optimism in Reinforcement Learning and Kullback-Leibler Divergence

Sarah Filippi, Olivier Cappé, Aurélien Garivier

Published 2010-04-29, updated 2010-10-13 (version 3)

We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focussing on so-called optimistic strategies. In MDPs, optimism can be implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner (2009), which follows this strategy, has recently been shown to guarantee near-optimal regret bounds. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm, termed KL-UCRL, for solving KL-optimistic extended value iteration. Using recent deviation bounds on the KL divergence, we prove that KL-UCRL provides the same guarantees as UCRL2 in terms of regret. However, numerical experiments on classical benchmarks show a significantly improved behavior, particularly when the MDP has reduced connectivity. To support this observation, we provide elements of comparison between the two algorithms based on geometric considerations.
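The inner step of KL-optimistic extended value iteration is, for each state-action pair, a linear maximization over candidate transition distributions lying within a KL ball around the empirical estimate. The sketch below illustrates that inner problem with a generic constrained solver (SciPy's SLSQP); the function name kl_optimistic_value, the toy numbers, and the solver choice are illustrative assumptions, not the paper's specialised solver.

```python
import numpy as np
from scipy.optimize import minimize

def kl_optimistic_value(p_hat, V, eps):
    """Maximize q.V over distributions q with KL(p_hat || q) <= eps.

    Generic numerical sketch of the KL-constrained linear maximization;
    the paper derives a faster dedicated procedure for this step.
    p_hat : empirical transition probabilities (1-D array summing to 1)
    V     : current value estimates of the successor states
    eps   : KL radius of the confidence region
    """
    n = len(p_hat)
    tiny = 1e-12  # keep q away from 0 so KL(p_hat || q) stays finite

    def kl(q):
        q = np.clip(q, tiny, 1.0)
        mask = p_hat > 0
        return np.sum(p_hat[mask] * np.log(p_hat[mask] / q[mask]))

    cons = [
        {"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},  # q is a distribution
        {"type": "ineq", "fun": lambda q: eps - kl(q)},       # KL(p_hat || q) <= eps
    ]
    res = minimize(lambda q: -np.dot(q, V),   # maximize q.V
                   x0=p_hat.copy(),           # start at the empirical estimate
                   bounds=[(tiny, 1.0)] * n,
                   constraints=cons,
                   method="SLSQP")
    q_star = np.clip(res.x, 0.0, None)
    q_star /= q_star.sum()
    return q_star, float(np.dot(q_star, V))

# Toy example: optimism shifts probability mass toward the best successor state.
p_hat = np.array([0.6, 0.3, 0.1])
V = np.array([0.0, 1.0, 5.0])
q_star, optimistic_value = kl_optimistic_value(p_hat, V, eps=0.05)
print(q_star, optimistic_value)
```

Starting the solver at the empirical estimate p_hat guarantees a feasible initial point (its KL divergence to itself is zero), which is why the KL-shaped confidence region is convenient to optimize over.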

Comments: Accepted and presented at the 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton 2010), Monticello, Illinois, United States.
Categories: cs.LG, math.ST, stat.ML, stat.TH
Related articles:
arXiv:1301.0601 [cs.LG] (Published 2012-12-12)
Reinforcement Learning with Partially Known World Dynamics
arXiv:1306.6189 [cs.LG] (Published 2013-06-26)
Scaling Up Robust MDPs by Reinforcement Learning
arXiv:2007.06168 [cs.LG] (Published 2020-07-13)
Model Fusion with Kullback-Leibler Divergence