Article code: 405459
Journal code: 677641
Publication year: 2014
English article, full text: 19-page PDF, free download
English title of the ISI article
Policy oscillation is overshooting
Persian translation of the title
نوسان در سیاست، بیش از حد است
Keywords
Reinforcement learning; Approximate dynamic programming; Policy gradient; Natural gradient; Policy oscillation; Policy chattering
Related subjects
Engineering and Basic Sciences; Computer Engineering; Artificial Intelligence
English abstract

A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting, within the context of non-optimistic policy iteration, a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. Based on empirical findings, we offer a hypothesis that might explain the inferior performance levels and the associated policy degradation phenomenon, and which would partially support the suggested connection. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results by an order of magnitude.
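The abstract's central observation is that a greedy policy-improvement step can be viewed as a natural-gradient policy update taken with an unbounded step size. The following is a minimal illustrative sketch of that limiting behaviour, assuming a single state, a Gibbs (softmax) policy with tabular preferences, and invented advantage values; it is not the paper's constrained natural actor-critic algorithm.

# Illustrative sketch only: invented advantage values for one state with
# three actions; shows how a natural-gradient softmax update approaches
# the greedy (argmax) policy as the step size grows.
import numpy as np

def softmax(x):
    z = x - x.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

advantages = np.array([0.10, 0.50, 0.45])   # hypothetical A(s, a) estimates
theta = np.zeros_like(advantages)           # softmax policy preferences

for step_size in (1.0, 10.0, 1000.0):
    # For a Gibbs policy with tabular preferences, the natural gradient is
    # proportional to the advantage estimates, so one update is simply:
    updated = theta + step_size * advantages
    print(step_size, softmax(updated).round(3))

# As step_size grows, the policy concentrates on argmax_a A(s, a): the
# greedy improvement step is the limiting, maximally "overshooting" case.

Running the sketch prints increasingly peaked action distributions, which is the sense in which a greedy value-function step corresponds to an overshooting natural-gradient update.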

Publisher
Database: Elsevier - ScienceDirect
Journal: Neural Networks - Volume 52, April 2014, Pages 43–61
Authors
Paul Wagner