
An improved upper bound on the expected regret of UCB-type policies for a matching-selection bandit problem
Keywords: Online learning; Multi-armed bandit problem; Matching; Regret analysis; Combinatorial bandit