Article Code | Journal Code | Publication Year | English Article | Full-Text Version |
---|---|---|---|---|
752492 | 895434 | 2010 | 7-page PDF | Free download |

In this article, we develop the first actor–critic reinforcement learning algorithm with function approximation for a problem of control under multiple inequality constraints. We consider the infinite-horizon discounted-cost framework, in which both the objective and the constraint functions are policy-dependent expected discounted sums of certain sample-path functions. We apply the Lagrange multiplier method to handle the inequality constraints. Our algorithm uses multi-timescale stochastic approximation and incorporates a temporal difference (TD) critic and an actor that performs a gradient search in the space of policy parameters using efficient simultaneous perturbation stochastic approximation (SPSA) gradient estimates. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal policy.
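To make the construction concrete, the sketch below is a minimal, hypothetical Python rendering of a Lagrangian actor–critic of this flavor, not the paper's actual algorithm. It assumes a small toy discounted-cost MDP with a single constraint G(θ) ≤ α, a tabular softmax policy, one-hot critic features (a degenerate case of linear function approximation), a plain Monte Carlo estimate of the constraint value in place of a dedicated constraint critic, and step-size schedules chosen only to respect the timescale separation critic ≫ actor ≫ multiplier. The Lagrangian is taken to be L(θ, λ) = J(θ) + λ(G(θ) − α): the TD(0) critic evaluates the per-stage Lagrangian cost c + λg, the actor descends along a two-sided SPSA gradient estimate built from a single ±1 Bernoulli perturbation of all policy parameters, and the multiplier performs projected ascent on the estimated constraint violation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy constrained discounted-cost MDP (hypothetical, for illustration only) ---
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
ALPHA = 4.0                                            # constraint bound: G(theta) <= ALPHA
COST = rng.uniform(0.0, 1.0, (N_STATES, N_ACTIONS))    # single-stage cost c(s, a)
CCOST = rng.uniform(0.0, 1.0, (N_STATES, N_ACTIONS))   # single-stage constraint cost g(s, a)
P = rng.dirichlet(np.ones(N_STATES), (N_STATES, N_ACTIONS))  # transition kernel p(.|s, a)

def step(s, a):
    """Sample one transition; return next state, cost, and constraint cost."""
    return rng.choice(N_STATES, p=P[s, a]), COST[s, a], CCOST[s, a]

def features(s):
    """One-hot state features (a degenerate linear approximator, kept simple here)."""
    phi = np.zeros(N_STATES); phi[s] = 1.0
    return phi

def policy(theta, s):
    """Boltzmann (softmax) policy over actions, parameterized per (state, action)."""
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def td_critic(theta, lam, v, s, n_steps, a_k):
    """TD(0) critic for the per-stage Lagrangian cost c + lam * g under policy theta."""
    for _ in range(n_steps):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s_next, c, g = step(s, a)
        delta = (c + lam * g) + GAMMA * v @ features(s_next) - v @ features(s)
        v = v + a_k * delta * features(s)
        s = s_next
    return v, s

def constraint_return(theta, s0, horizon=100):
    """Monte Carlo estimate of the discounted constraint cost G from s0 under theta."""
    s, total, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s, _, g = step(s, a)
        total += disc * g
        disc *= GAMMA
    return total

# --- Three timescales: critic (fastest), SPSA actor, Lagrange multiplier (slowest) ---
theta, lam, s = np.zeros((N_STATES, N_ACTIONS)), 0.0, 0
v_plus, v_minus = np.zeros(N_STATES), np.zeros(N_STATES)
DELTA = 0.1                                            # SPSA perturbation magnitude

for k in range(1, 1001):
    a_k, b_k, c_k = 1.0 / k**0.55, 0.1 / k**0.75, 0.01 / k
    pert = rng.choice([-1.0, 1.0], size=theta.shape)   # common +/-1 Bernoulli perturbation
    v_plus, s = td_critic(theta + DELTA * pert, lam, v_plus, s, 20, a_k)
    v_minus, s = td_critic(theta - DELTA * pert, lam, v_minus, s, 20, a_k)
    # Actor: two-sided SPSA estimate of grad_theta L(theta, lam), using the value at state 0
    grad = (v_plus[0] - v_minus[0]) / (2.0 * DELTA * pert)
    theta = theta - b_k * grad                          # gradient descent in theta
    # Multiplier: projected ascent on the estimated constraint violation G(theta) - ALPHA
    lam = max(0.0, lam + c_k * (constraint_return(theta, 0) - ALPHA))

print("lambda:", round(lam, 3), "constraint estimate:", round(constraint_return(theta, 0), 3))
```

The three diminishing step-size sequences are what make this a multi-timescale scheme: the critic effectively sees a frozen policy, the actor sees converged value estimates, and the multiplier sees a policy that has locally minimized the Lagrangian for the current λ.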
Research highlights
► We consider the problem of finding an optimal control policy for a constrained discounted cost Markov decision process when the state and action spaces can be large.
► We present the first actor–critic algorithm with function approximation for this problem.
► Our algorithm is based on the Lagrange multiplier method and combines aspects of temporal difference learning and simultaneous perturbation stochastic approximation.
► We prove the convergence of our algorithm to a constrained locally optimal policy.
Journal: Systems & Control Letters - Volume 59, Issue 12, December 2010, Pages 760–766