arXiv:1903.08894 [cs.LG]
Towards Characterizing Divergence in Deep Q-Learning
Joshua Achiam, Ethan Knight, Pieter Abbeel
Published 2019-03-21 (Version 1)
Deep Q-Learning (DQL), a family of temporal-difference algorithms for control, employs three techniques collectively known in reinforcement learning as the "deadly triad": bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that together these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well-understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point of our analysis is to consider when the leading-order approximation to the deep Q-learning update is or is not a contraction in the sup norm. Based on this analysis, we develop an algorithm that permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions). We demonstrate that our algorithm performs at or above state-of-the-art on standard MuJoCo benchmarks from the OpenAI Gym.
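
As a rough sketch of the linearization the abstract refers to (the notation below is ours and may differ from the paper's): suppose the parameters are updated by a generic DQL step
\[
\theta' \;=\; \theta + \alpha\, \mathbb{E}_{(s,a)\sim\rho}\big[\big(\mathcal{T}Q_\theta(s,a) - Q_\theta(s,a)\big)\,\nabla_\theta Q_\theta(s,a)\big],
\]
where \(\mathcal{T}\) denotes a Bellman backup and \(\rho\) is the data distribution. A first-order Taylor expansion of the post-update Q-values then gives
\[
Q_{\theta'}(\bar s,\bar a) \;\approx\; Q_\theta(\bar s,\bar a) + \alpha\, \mathbb{E}_{(s,a)\sim\rho}\Big[\, k_\theta\big((\bar s,\bar a),(s,a)\big)\,\big(\mathcal{T}Q_\theta(s,a) - Q_\theta(s,a)\big) \Big],
\qquad
k_\theta\big((\bar s,\bar a),(s,a)\big) \;=\; \nabla_\theta Q_\theta(\bar s,\bar a)^\top \nabla_\theta Q_\theta(s,a).
\]
Roughly speaking, this leading-order update resembles a partial Bellman backup, and thus a sup-norm contraction, when the kernel term \(k_\theta\) is close to diagonal over the sampled state-action pairs; large off-diagonal entries mean that an update driven by one pair can push the Q-values of other pairs away from their targets, which is the divergence mechanism the abstract alludes to.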