Article ID Journal Published Year Pages File Type
485830 Procedia Computer Science 2012 6 Pages PDF
Abstract

Q-Learning is based on value iteration and remains the most popular choice for solving Markov Decision Problems (MDPs) via reinforcement learning (RL), where the goal is to bypass the transition probabilities of the MDP. Approximate policy iteration (API) is another RL technique, not as widely used as Q-Learning, based on modified policy iteration. In this paper, we present and analyze an API algorithm for discounted reward based on (i) a classical temporal differences update for policy evaluation and (ii) simulation-based mean estimation for policy improvement. Further, we analyze for convergence API algorithms based on Q-factors for (i) discounted reward and (ii) for average reward MDPs. The average reward algorithm is based on relative value iteration; we also present results from some numerical experiments with it.

Related Topics
Physical Sciences and Engineering Computer Science Computer Science (General)