Article code | Journal code | Publication year | English article | Full-text version
---|---|---|---|---
485306 | 703324 | 2013 | 11-page PDF | Free download

The exploration–exploitation dilemma is a central theme in reinforcement learning. Under the standard tradeoff framework, a reinforcement learning agent must switch judiciously between exploration and exploitation, because the action estimated as best in the current learning state may not be the true best. We demonstrate that, under certain conditions, an agent can identify the best action even if it only ever exploits. Under these conditions the agent needs no explicit exploration phase, thereby resolving the exploration–exploitation dilemma. We also propose a value function over actions and an update rule for it. The proposed method, the "overtaking method," can be integrated with existing methods for the multi-armed bandit problem, UCB1 and UCB1-tuned, without compromising their features. The integrated models outperform the original models.
Journal: Procedia Computer Science - Volume 24, 2013, Pages 126-136
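The abstract does not spell out the overtaking method itself, but the UCB1 baseline it integrates with is standard. Below is a minimal Python sketch of UCB1 on a toy Bernoulli bandit, where each arm's score is its empirical mean reward plus an exploration bonus; the arm probabilities, variable names, and the helper `ucb1_select` are illustrative assumptions, not code from the paper.

```python
import math
import random

def ucb1_select(counts, values, t):
    """Pick an arm by the UCB1 rule: empirical mean plus
    an exploration bonus of sqrt(2 * ln t / n_i)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # play every arm once before applying the bound
    return max(
        range(len(counts)),
        key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )

# Toy run against three Bernoulli arms (success probabilities are made up).
probs = [0.2, 0.5, 0.8]
counts = [0] * len(probs)
values = [0.0] * len(probs)
for t in range(1, 1001):
    arm = ucb1_select(counts, values, t)
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
print(counts)  # pulls should concentrate on the highest-probability arm
```

Because the bonus term shrinks as an arm is sampled, pulls gradually concentrate on the arm with the highest empirical mean; this is the baseline behavior the paper's integrated models aim to improve.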