Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
1142264 | Operations Research Letters | 2015 | 6 Pages | |
Abstract
We improve an upper bound on the expected regret of a UCB-type policy, LLR, for a bandit problem that repeats the following rounds: a player selects a maximal matching on a complete bipartite graph K_{M,N} and receives a reward for each component edge of the selected matching. Rewards are assumed to be generated, independently of previous rewards, according to an unknown but fixed distribution. Our upper bound is smaller than the best known result (Chen et al., 2013) by a factor of Θ(M^{2/3}).
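The setting described above can be sketched in code. The following is a minimal illustration, not the paper's actual LLR implementation: it runs a plain UCB1-style index per edge (LLR uses a confidence width scaled by the action size) with semi-bandit feedback, and finds the index-maximizing matching by brute force over permutations, which is feasible only for tiny M and N. The function name, Bernoulli reward model, and parameters are all illustrative assumptions.

```python
import itertools
import math
import random

def ucb_matching_bandit(M, N, true_means, horizon, seed=0):
    """Illustrative semi-bandit UCB sketch on K_{M,N}, M <= N.

    Each round: pick the matching maximizing the sum of per-edge
    UCB indices, then observe a Bernoulli reward on every chosen edge.
    (Hypothetical sketch; not the LLR policy from the paper.)
    """
    rng = random.Random(seed)
    counts = [[0] * N for _ in range(M)]   # pulls of edge (i, j)
    sums = [[0.0] * N for _ in range(M)]   # cumulative reward of edge (i, j)

    def index(i, j, t):
        if counts[i][j] == 0:
            return float("inf")            # force initial exploration
        mean = sums[i][j] / counts[i][j]
        # Plain UCB1 width; LLR scales this by the matching size.
        return mean + math.sqrt(2 * math.log(t) / counts[i][j])

    total = 0.0
    for t in range(1, horizon + 1):
        # Brute-force argmax over matchings: fine only for tiny M, N.
        best = max(itertools.permutations(range(N), M),
                   key=lambda p: sum(index(i, p[i], t) for i in range(M)))
        for i, j in enumerate(best):
            r = 1.0 if rng.random() < true_means[i][j] else 0.0
            counts[i][j] += 1
            sums[i][j] += r
            total += r
    return total

# Usage: a 2x3 bipartite graph with one clearly best matching.
means = [[0.9, 0.1, 0.1],
         [0.1, 0.9, 0.1]]
reward = ucb_matching_bandit(2, 3, means, horizon=2000)
```

For realistic graph sizes the argmax step would use a polynomial-time maximum-weight matching algorithm (e.g. the Hungarian method) instead of enumerating permutations.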
Related Topics
Physical Sciences and Engineering
Mathematics
Discrete Mathematics and Combinatorics
Authors
Ryo Watanabe, Atsuyoshi Nakamura, Mineichi Kudo