Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
697381 | Automatica | 2008 | 9 | |
Abstract
We propose two algorithms for Q-learning that use the two-timescale stochastic approximation methodology. The first updates the Q-values of all feasible state–action pairs at each instant, while the second updates the Q-values of states with actions chosen according to the ‘current’ randomized policy updates. A proof of convergence of both algorithms is given. Finally, numerical experiments with the proposed algorithms on an application to routing in communication networks are presented for a few different settings.
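As a rough illustration of the synchronous variant described above, the sketch below couples a Q-value recursion on a slower step-size schedule with a randomized (softmax) policy iterate on a faster one. The tabular MDP setup, the softmax policy representation, and the specific step-size sequences `a_n` and `b_n` are illustrative assumptions and do not reproduce the authors' exact recursions.

```python
import numpy as np


def two_timescale_q_learning(P, R, gamma=0.9, n_iters=5000, seed=0):
    """Sketch of a synchronous two-timescale Q-learning scheme:
    Q-values of all state-action pairs move on a slower step-size
    schedule b_n, while a randomized (softmax) policy iterate moves
    on a faster schedule a_n.  All names and schedules here are
    assumptions for illustration, not the paper's exact updates."""
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    pi = np.full((S, A), 1.0 / A)  # randomized policy iterate
    for n in range(1, n_iters + 1):
        a_n = 1.0 / n                    # faster timescale (policy)
        b_n = 1.0 / (n * np.log(n + 2))  # slower timescale (Q-values)
        # Slower timescale: synchronous update of every feasible
        # (s, a) pair using one simulated transition per pair.
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])
                target = R[s, a] + gamma * Q[s_next].max()
                Q[s, a] += b_n * (target - Q[s, a])
        # Faster timescale: track the softmax (Boltzmann) policy
        # induced by the current Q-values.
        soft = np.exp(Q - Q.max(axis=1, keepdims=True))
        soft /= soft.sum(axis=1, keepdims=True)
        pi += a_n * (soft - pi)
    return Q, pi


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP for demonstration.
    S, A = 2, 2
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
    R = rng.random((S, A))
    Q, pi = two_timescale_q_learning(P, R)
    print("Q:\n", Q)
    print("policy:\n", pi)
```

With `a_n = 1/n` and `b_n = 1/(n log(n + 2))`, the ratio `b_n / a_n` tends to zero, which is the defining property of a two-timescale scheme: the Q-value recursion sees a quasi-static policy, while the policy tracks the current Q-values. The second algorithm in the abstract would instead update only the state–action pairs sampled under the current randomized policy, rather than sweeping all pairs as this sketch does.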
Related Topics
Physical Sciences and Engineering
Engineering
Control and Systems Engineering
Authors
Shalabh Bhatnagar, K. Mohan Babu