
arXiv:2006.06982 [stat.ML]

Confidence Interval for Off-Policy Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales

Published 2020-06-12, Version 1

This study addresses off-policy evaluation (OPE) from dependent samples obtained via a bandit algorithm. The goal of OPE is to evaluate a new policy using historical data generated by the behavior policies that the bandit algorithm produces. Because the bandit algorithm updates the policy based on past observations, the samples are not independent and identically distributed (i.i.d.). However, several existing OPE methods ignore this issue and assume that the samples are i.i.d. In this study, we address the problem by constructing an estimator from a standardized martingale difference sequence. To standardize the sequence, we consider using evaluation data or sample splitting with two-step estimation. This technique yields an asymptotically normal estimator without restricting the class of behavior policies. In an experiment, the proposed estimator performs better than existing methods, which assume that the behavior policy converges to a time-invariant policy.
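The idea of standardizing a martingale difference sequence can be illustrated with a minimal Python sketch. It is not the paper's estimator; it is one simple variance-standardized inverse-probability-weighted (IPW) construction under stated assumptions: a context-free multi-armed bandit, known behavior probabilities at every round, and per-arm reward moments (`arm_mean`, `arm_second_moment`) estimated beforehand from separate evaluation data. The function name `standardized_ipw_ci` and all parameter names are hypothetical.

```python
import numpy as np


def standardized_ipw_ci(actions, rewards, behavior_probs, eval_policy,
                        arm_mean, arm_second_moment, z=1.96):
    """Variance-standardized IPW estimate of a policy value with a normal CI.

    Assumptions (not taken from the paper): no contexts, behavior_probs[t, a]
    is the known probability of arm a at round t, and arm_mean /
    arm_second_moment come from separate evaluation data.
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    eval_policy = np.asarray(eval_policy, dtype=float)
    arm_mean = np.asarray(arm_mean, dtype=float)
    arm_second_moment = np.asarray(arm_second_moment, dtype=float)

    T = len(rewards)
    rounds = np.arange(T)

    # IPW score per round; its conditional mean given the past is the
    # policy value, so (score - value) is a martingale difference sequence.
    scores = eval_policy[actions] * rewards / behavior_probs[rounds, actions]

    # First step: plug-in value and conditional variance of each score.
    theta0 = eval_policy @ arm_mean
    second = (eval_policy ** 2 * arm_second_moment / behavior_probs).sum(axis=1)
    cond_var = np.maximum(second - theta0 ** 2, 1e-12)  # guard against zero

    # Second step: weight each score by the inverse conditional std, so the
    # standardized differences have unit conditional variance.
    inv_sd = 1.0 / np.sqrt(cond_var)
    estimate = np.sum(inv_sd * scores) / np.sum(inv_sd)

    # Martingale CLT heuristic: sum((score - value) * inv_sd) / sqrt(T)
    # is approximately N(0, 1), which gives this half-width.
    half_width = z * np.sqrt(T) / np.sum(inv_sd)
    return estimate, (estimate - half_width, estimate + half_width)
```

Because each term is rescaled by its own conditional standard deviation, the sketch does not require the behavior probabilities to settle down over time, which is the point the abstract makes about not restricting the class of behavior policies. In practice the behavior probabilities must stay bounded away from zero for the weights to be well behaved.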

Categories: stat.ML, cs.LG, econ.EM, stat.ME

Keywords: off-policy evaluation, dependent samples, confidence interval, estimator performs better, standardized martingale difference sequence

Related articles:

arXiv:2212.06355 [stat.ML] (Published 2022-12-13)

A Review of Off-Policy Evaluation in Reinforcement Learning

Masatoshi Uehara, Chengchun Shi, Nathan Kallus

arXiv:2306.04836 [stat.ML] (Published 2023-06-07)

$K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control

Michael Giegrich, Roel Oomen, Christoph Reisinger

arXiv:2502.08993 [stat.ML] (Published 2025-02-13)

Off-Policy Evaluation for Recommendations with Missing-Not-At-Random Rewards

Tatsuki Takahashi, Chihiro Maru, Hiroko Shoji