Article ID Journal Published Year Pages File Type
435646 Theoretical Computer Science 2010 12 Pages PDF
Abstract

We consider an upper confidence bound algorithm for learning in Markov decision processes with deterministic transitions. For this algorithm we derive upper bounds on the online regret with respect to an (ε-)optimal policy that are logarithmic in the number of steps taken. We also present a corresponding lower bound. As an application, multi-armed bandits with switching cost are considered.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics