{ "id": "1805.00869", "version": "v1", "published": "2018-05-02T15:40:24.000Z", "updated": "2018-05-02T15:40:24.000Z", "title": "Approximate Temporal Difference Learning is a Gradient Descent for Reversible Policies", "authors": [ "Yann Ollivier" ], "categories": [ "cs.LG", "math.OC", "stat.ML" ], "abstract": "In reinforcement learning, temporal difference (TD) is the most direct algorithm to learn the value function of a policy. For large or infinite state spaces, exact representations of the value function are usually not available, and it must be approximated by a function in some parametric family. However, with \\emph{nonlinear} parametric approximations (such as neural networks), TD is not guaranteed to converge to a good approximation of the true value function within the family, and is known to diverge even in relatively simple cases. TD lacks an interpretation as a stochastic gradient descent of an error between the true and approximate value functions, which would provide such guarantees. We prove that approximate TD is a gradient descent provided the current policy is \\emph{reversible}. This holds even with nonlinear approximations. A policy with transition probabilities $P(s,s')$ between states is reversible if there exists a function $\\mu$ over states such that $\\frac{P(s,s')}{P(s',s)}=\\frac{\\mu(s')}{\\mu(s)}$. In particular, every move can be undone with some probability. This condition is restrictive; it is satisfied, for instance, for a navigation problem in any unoriented graph. In this case, approximate TD is exactly a gradient descent of the \\emph{Dirichlet norm}, the norm of the difference of \\emph{gradients} between the true and approximate value functions. The Dirichlet norm also controls the bias of approximate policy gradient. These results hold even with no decay factor ($\\gamma=1$) and do not rely on contractivity of the Bellman operator, thus proving stability of TD even with $\\gamma=1$ for reversible policies.", "revisions": [ { "version": "v1", "updated": "2018-05-02T15:40:24.000Z" } ], "analyses": { "keywords": [ "gradient descent", "approximate temporal difference learning", "reversible policies", "approximate value functions", "approximate td" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }