Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
435646 | Theoretical Computer Science | 2010 | 12 Pages |
Abstract
We consider an upper confidence bound algorithm for learning in Markov decision processes with deterministic transitions. For this algorithm we derive upper bounds on the online regret with respect to an (ε-)optimal policy that are logarithmic in the number of steps taken. We also present a corresponding lower bound. As an application, multi-armed bandits with switching cost are considered.
Related Topics
Physical Sciences and Engineering
Computer Science
Computational Theory and Mathematics