arXiv Analytics


arXiv:2205.13589 [cs.LG]

Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Miao Lu, Yifei Min, Zhaoran Wang, Zhuoran Yang

Published 2022-05-26 (Version 1)

We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which precludes the direct application of existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
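
To illustrate the pessimism principle the abstract refers to (a schematic sketch only, not the paper's exact \texttt{P3O} objective), a pessimistic algorithm selects the policy whose worst-case estimated value over a data-driven confidence region is largest:
$$\hat{\pi} \in \operatorname*{arg\,max}_{\pi \in \Pi} \; \min_{f \in \mathcal{C}_n(\pi)} \hat{J}(\pi; f),$$
where $\Pi$ denotes the policy class, $\mathcal{C}_n(\pi)$ a confidence region built from the $n$ offline trajectories, and $\hat{J}(\pi; f)$ the value of $\pi$ estimated under the candidate $f$; these symbols are generic placeholders rather than the paper's own notation. In \texttt{P3O}, per the abstract, the confidence regions are constructed via proximal causal inference and formulated as a minimax estimation problem to account for confounding by the latent state.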

Related articles:
arXiv:2111.06784 [cs.LG] (Published 2021-11-12, updated 2022-03-02)
A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes
arXiv:2206.06426 [cs.LG] (Published 2022-06-13)
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward
arXiv:2505.11153 [cs.LG] (Published 2025-05-16)
Bi-directional Recurrence Improves Transformer in Partially Observable Markov Decision Processes