کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
435830 689942 2008 17 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات
پیش نمایش صفحه اول مقاله
Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments
چکیده انگلیسی

The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions (“experts”), under partial observation: In each round t, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force the player a regret growing as in the time t. In this work, we propose the first algorithm for a countably infinite set of decisions, that achieves a regret upper bounded by , i.e. arbitrarily close to optimal order. To this aim, we build on the “follow the perturbed leader” principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high probability regret of the learner w.r.t. each decision. In the second part of the paper, we consider reactive problem settings, that is, situations where the learner’s decisions impact on the future behaviour of the adversary, and a strong strategy can draw benefit from well chosen past actions. We present a variant of our regret minimization algorithm which has still regret of order at most relative to such strong strategies, and even sublinear regret not exceeding w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Theoretical Computer Science - Volume 397, Issues 1–3, 20 May 2008, Pages 77-93