arXiv:1911.09458 [cs.LG]

Observe Before Play: Multi-armed Bandit with Pre-observations

Jinhang Zuo, Xiaoxi Zhang, Carlee Joe-Wong

Published 2019-11-21, Version 1

We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance of playing the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for $K$ arms with Bernoulli rewards, and prove a $T$-round regret upper bound of $O(K^2\log T)$. In the multi-player setting, collisions occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its $T$-round regret relative to an offline greedy strategy is upper bounded by $O(\frac{K^4}{M^2}\log T)$ for $K$ arms and $M$ players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, which achieve logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
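
The abstract does not spell out the OBP-UCB update rule, so the following is only a minimal single-player sketch of the observe-before-play idea under stated assumptions: rank arms by a standard UCB1 index, pay to pre-observe them in that order until a reward of 1 is revealed, and otherwise fall back to playing the next-ranked arm blindly. The function name `obp_ucb_sketch`, the fixed per-round observation cap `max_obs`, and the way costs are netted against rewards are illustrative assumptions, not the paper's actual algorithm (which instead derives the observation decision from the cost and the estimated means).

```python
import numpy as np

def obp_ucb_sketch(mu, T, cost, max_obs, seed=0):
    """Toy observe-before-play UCB loop (a sketch, not the paper's exact OBP-UCB).

    mu      : true Bernoulli means of the K arms (used only to simulate feedback)
    T       : number of rounds
    cost    : price paid for each pre-observation
    max_obs : cap on pre-observations per round (illustrative assumption)
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    assert max_obs < K, "leave at least one arm to fall back on"
    pulls = np.zeros(K)   # number of reward samples seen per arm
    means = np.zeros(K)   # empirical mean reward per arm
    total = 0.0           # cumulative reward net of observation costs

    for t in range(1, T + 1):
        # Standard UCB1 indices; never-sampled arms get +inf so they are tried first.
        with np.errstate(divide="ignore", invalid="ignore"):
            bonus = np.sqrt(2.0 * np.log(t) / pulls)
        ucb = np.where(pulls > 0, means + bonus, np.inf)
        order = np.argsort(-ucb)

        played, reward = None, None
        for k in order[:max_obs]:
            obs = rng.binomial(1, mu[k])   # pay to reveal arm k's reward this round
            total -= cost
            pulls[k] += 1
            means[k] += (obs - means[k]) / pulls[k]
            if obs == 1:                   # a reward of 1 was revealed: play this arm
                played, reward = k, 1
                break
        if played is None:                 # nothing good revealed: play the next-best arm blindly
            played = order[max_obs]
            reward = int(rng.binomial(1, mu[played]))
            pulls[played] += 1
            means[played] += (reward - means[played]) / pulls[played]

        total += reward
    return total
```

For example, `obp_ucb_sketch(mu=[0.2, 0.5, 0.8], T=10000, cost=0.05, max_obs=2)` illustrates the core dilemma from the abstract: pre-observing more arms raises the chance of collecting a reward of 1 in a round, but each peek subtracts `cost` from the total.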

Related articles:
arXiv:2211.06883 [cs.LG] (Published 2022-11-13)
Generalizing distribution of partial rewards for multi-armed bandits with temporally-partitioned rewards
arXiv:1911.05142 [cs.LG] (Published 2019-11-12)
Incentivized Exploration for Multi-Armed Bandits under Reward Drift
arXiv:1205.3181 [cs.LG] (Published 2012-05-14)
Multiple Identifications in Multi-Armed Bandits