arXiv Analytics

arXiv:2103.04550 [cs.LG]

No Discounted-Regret Learning in Adversarial Bandits with Delays

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet

Published 2021-03-08 (Version 1)

Consider a player who, in each round $t$ out of $T$ rounds, chooses an action and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that even when the players' algorithms lose their "no-regret" property because the delays are too large, the expected discounted ergodic distribution of play still converges to the set of coarse correlated equilibria (CCE) provided the algorithms have "no discounted-regret". For a zero-sum game, we show that no discounted-regret suffices for the discounted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm in $n$ dimensions achieves a regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and that the EXP3 algorithm with $K$ arms achieves a regret of $O\left(\sqrt{\ln K\left(KT+D\right)}\right)$, even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds rely on a novel doubling trick that provably retains the regret bounds of the case where $D$ and $T$ are known. Using these bounds, we show that EXP3 and FKM have no discounted-regret even for delays as large as $d_{t}=O\left(t\log t\right)$. Therefore, the CCE of a finite or convex unknown game can be approximated via simulation even when only delayed bandit feedback is available.
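
To make the feedback model concrete, here is a minimal Python sketch of EXP3 run under adversarial delays. The interfaces loss_fn and delay_fn, and the fixed-horizon learning rate, are assumptions made for illustration; they are not the paper's exact tuning, which targets $KT+D$ and invokes the doubling trick when $D$ and $T$ are unknown.

    import math
    import random
    from collections import defaultdict

    def exp3_delayed(K, T, loss_fn, delay_fn, seed=0):
        """Minimal sketch of EXP3 under delayed bandit feedback.

        Hypothetical interfaces, assumed for this illustration:
          loss_fn(t, arm) -> adversarial loss in [0, 1] for playing `arm` at round t
          delay_fn(t)     -> d_t, rounds until round t's loss is observed
        """
        rng = random.Random(seed)
        weights = [1.0] * K
        pending = defaultdict(list)  # arrival round -> [(arm, prob, loss)]
        # Fixed-horizon rate for illustration; the paper instead tunes
        # against K*T + D and restarts via a doubling trick when unknown.
        eta = math.sqrt(math.log(K) / (K * T))
        total_loss = 0.0
        for t in range(T):
            z = sum(weights)
            probs = [w / z for w in weights]
            arm = rng.choices(range(K), weights=probs)[0]
            loss = loss_fn(t, arm)
            total_loss += loss
            pending[t + delay_fn(t)].append((arm, probs[arm], loss))
            # Process feedback arriving now; the importance-weighted estimate
            # uses the probability the arm had at the round it was played.
            for a, p, l in pending.pop(t, ()):
                weights[a] *= math.exp(-eta * l / p)
        return total_loss

Roughly speaking, when $D$ and $T$ are unknown, the paper's doubling trick restarts the algorithm with a re-tuned learning rate each time a running estimate of $KT+D$ crosses the next doubling threshold, which provably preserves the known-parameter regret bound.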

Comments: This is an extended version of the preliminary conference paper "Online EXP3 Learning in Adversarial Bandits with Delayed Feedback", published at NeurIPS 2019
Categories: cs.LG, cs.GT
Related articles:
arXiv:1811.12253 [cs.LG] (Published 2018-10-23)
Unifying the stochastic and the adversarial Bandits with Knapsack
arXiv:1202.4473 [cs.LG] (Published 2012-02-20)
The best of both worlds: stochastic and adversarial bandits
arXiv:1704.04470 [cs.LG] (Published 2017-04-14)
Lean From Thy Neighbor: Stochastic & Adversarial Bandits in a Network