Tuning continual exploration in reinforcement learning: An optimality property of the Boltzmann strategy

Article code	Journal code	Publication year	English article	Full-text version
409064	679053	2008	14-page PDF	Free download

Keywords

Randomized strategy; Exploration and exploitation; Maximum entropy; Markov decision processes; Reinforcement learning

Related topics

Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence

Abstract

This paper presents a model that allows continual exploration to be tuned in an optimal way by integrating exploration and exploitation in a common framework. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action in that state. Then, the exploration/exploitation tradeoff is formulated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost while maintaining fixed degrees of exploration at the states. In other words, maximize exploitation for constant exploration. This formulation leads to a set of nonlinear iterative equations reminiscent of the value-iteration algorithm and demonstrates that the Boltzmann strategy based on the Q-value is optimal in this sense. Convergence of those equations to a local minimum is proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations reduce to the Bellman equations for finding the shortest path. Furthermore, if the graph of states is directed and acyclic, the nonlinear equations can easily be solved by a single backward pass from the destination state. Stochastic shortest-path problems and discounted problems are also studied, and links between our algorithm and the SARSA algorithm are examined. The theoretical results are confirmed by simple simulations showing that the proposed exploration strategy outperforms the ε-greedy strategy.
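
As a rough illustration of the idea in the abstract, the Python sketch below builds a Boltzmann distribution over the actions of a single state and searches for the inverse temperature that yields a prescribed entropy, that is, a fixed degree of exploration. It is not the paper's algorithm: the function names, the bisection search, and the example costs are assumptions made here for illustration, and the per-action costs stand in for the Q-values that the paper's iterative equations would supply.

```python
import math


def boltzmann(costs, theta):
    """Boltzmann (softmax) distribution over actions; lower cost gets higher probability."""
    m = min(costs)  # shift by the minimum cost for numerical stability
    weights = [math.exp(-theta * (c - m)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def theta_for_entropy(costs, target, lo=0.0, hi=1e3, iters=100):
    """Bisection on the inverse temperature theta (hypothetical helper).

    Relies on the entropy of the Boltzmann distribution decreasing as theta
    grows, so a target strictly between 0 and log(len(costs)) can be bracketed
    when the costs have a unique minimum.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(boltzmann(costs, mid)) > target:
            lo = mid  # still exploring too much: sharpen the distribution
        else:
            hi = mid  # exploring too little: flatten the distribution
    return 0.5 * (lo + hi)


# Illustrative costs for three admissible actions in one state (made-up values).
costs = [1.0, 2.0, 4.0]
target = 0.5 * math.log(len(costs))  # half of the maximum possible entropy
theta = theta_for_entropy(costs, target)
probs = boltzmann(costs, theta)
```

With these made-up costs, the resulting distribution favours the cheapest action while keeping some probability on the others, and its entropy matches the requested degree of exploration up to the tolerance of the bisection.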

Publisher

Database: Elsevier - ScienceDirect
Journal: Neurocomputing - Volume 71, Issues 13–15, August 2008, Pages 2507–2520

Authors

Youssef Achbany, François Fouss, Luh Yen, Alain Pirotte, Marco Saerens
