arXiv Analytics

arXiv:1807.09089 [stat.ML]

Decision Variance in Online Learning

Sattar Vakili, Alexis Boukouvalas

Published 2018-07-24Version 1

Online learning has classically focused on the expected behaviour of learning policies. Recently, risk-averse online learning has gained much attention. In this paper, a risk-averse multi-armed bandit problem is studied in which the performance of policies is measured by the mean-variance of the rewards. The variance of the rewards depends on the variance of the underlying processes as well as on the variance of the player's decisions. The performance of two existing policies is analyzed, and new fundamental limitations on risk-averse learning are established. In particular, it is shown that although an $\mathcal{O}(\log T)$ distribution-dependent regret in time $T$ is achievable (similar to the risk-neutral setting), the worst-case (i.e., minimax) regret is lower bounded by $\Omega(T)$ (in contrast to the $\Omega(\sqrt{T})$ lower bound in the risk-neutral setting). The lower bound results are even stronger in the sense that they are proven for the case of online learning with full feedback.
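The abstract's key observation — that the variance of the reward sequence includes a contribution from the player's decisions, not just from the arms themselves — can be illustrated with a small simulation. The sketch below is not the paper's algorithm; it simply compares the empirical mean-variance (here taken as variance minus a risk coefficient times the mean, a common parameterization in the mean-variance bandit literature; the paper's exact objective may differ) of a policy that commits to one arm versus a policy that alternates between two arms with different means.

```python
import random
import statistics

def empirical_mean_variance(rewards, rho=1.0):
    """Empirical mean-variance of a reward sequence: Var - rho * Mean.

    rho is an illustrative risk-tolerance coefficient (an assumption
    for this sketch, not taken from the paper)."""
    return statistics.pvariance(rewards) - rho * statistics.fmean(rewards)

random.seed(0)
T = 10_000
mu = [0.5, 0.4]   # two Gaussian arms with different means...
sigma = 0.1       # ...but identical per-arm standard deviation

# Policy A: always pull arm 0 -> reward variance comes only from the arm.
rewards_single = [random.gauss(mu[0], sigma) for _ in range(T)]

# Policy B: alternate between the arms -> the decision process itself adds
# variance, because the reward sequence mixes two distributions with
# different means: mixture variance = sigma^2 + (mu[0] - mu[1])^2 / 4.
rewards_mixed = [random.gauss(mu[t % 2], sigma) for t in range(T)]

print(statistics.pvariance(rewards_single))  # close to sigma^2 = 0.01
print(statistics.pvariance(rewards_mixed))   # close to 0.01 + 0.0025
print(empirical_mean_variance(rewards_single))
print(empirical_mean_variance(rewards_mixed))
```

The alternating policy pays a variance penalty of roughly $(\mu_0 - \mu_1)^2/4$ on top of the arms' own variance, which is the "decision variance" the title refers to: even arms with identical variances produce a higher-variance reward stream when the policy switches between them.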

Related articles:
arXiv:1311.0468 [stat.ML] (Published 2013-11-03)
Thompson Sampling for Online Learning with Linear Experts
arXiv:1810.02567 [stat.ML] (Published 2018-10-05)
Online Learning to Rank with Features
arXiv:1208.3728 [stat.ML] (Published 2012-08-18, updated 2014-05-24)
Online Learning with Predictable Sequences