Article ID: 410664
Journal: Neurocomputing
Published Year: 2008
Pages: 4
File Type: PDF
Abstract

This letter addresses the problem of Bellman residual minimisation in reinforcement learning for the model-free batch case. We prove the simple, but not necessarily obvious, result that no unbiased estimate of the Bellman residual exists for a single trajectory of observations. We then take up the recent suggestion of Antos et al. [Learning near-optimal policies with Bellman-residual minimisation based fitted policy iteration and a single sample path, in: COLT, 2006, pp. 574–588] for approximate Bellman residual minimisation and discuss its properties with respect to consistency, bias, and optimality. Finally, we suggest a way to improve optimality.
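A brief sketch of the underlying issue (the notation here is assumed for illustration, not taken from the letter): write $T^{\pi}$ for the Bellman operator of policy $\pi$, $V$ for a value estimate, $\gamma$ for the discount factor, and $d = r + \gamma V(s') - V(s)$ for the single-sample temporal-difference error observed at state $s$. Then

\[
  \mathbb{E}\bigl[d^{2}\mid s\bigr]
  \;=\;
  \bigl((T^{\pi}V)(s) - V(s)\bigr)^{2}
  \;+\;
  \operatorname{Var}\bigl[\,r + \gamma V(s')\mid s\,\bigr],
\]

so the squared TD error overestimates the squared Bellman residual by a conditional-variance term. With only one successor sample per state, as in a single trajectory, this term cannot be separated from the residual itself, which is the intuition behind the non-existence of an unbiased estimate; removing it would require two independent successor samples per state (double sampling).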

Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence
Authors