Actor-Critic-Type Learning Algorithms for Markov Decision Processes

Abstract

Algorithms for learning the optimal policy of a Markov decision process (MDP) from simulated transitions are formulated and analyzed. They are variants of the well-known "actor-critic" (or "adaptive critic") algorithm from the artificial intelligence literature. Distributed asynchronous implementations are also considered. The analysis relies on two-time-scale stochastic approximation.
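
To make the two-time-scale idea concrete, the following is a minimal sketch of a generic tabular actor-critic loop driven by simulated transitions. It is not the algorithm analyzed in the paper: the two-state MDP, the discount factor, the softmax policy parameterization, the reward (rather than cost) formulation, and the step-size exponents are all assumptions chosen for illustration. It also assumes the common arrangement in which the critic uses the larger (faster) step size and the actor the smaller (slower) one, and it does not model a distributed implementation.

```python
import math
import random

# Hypothetical 2-state, 2-action MDP (illustrative only, not from the paper).
# P[s][a] lists next-state probabilities; R[s][a] is the immediate reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.8, 0.2], [0.1, 0.9]]]
R = [[0.0, 0.0],
     [1.0, 1.0]]
GAMMA = 0.9  # assumed discount factor


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic(steps=20000, seed=0):
    """Run one simulated trajectory; return value estimates and the policy."""
    rng = random.Random(seed)
    V = [0.0, 0.0]                    # critic: state-value estimates
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: action preferences per state
    s = 0
    for n in range(1, steps + 1):
        fast = 1.0 / n ** 0.6  # critic step size (faster time scale)
        slow = 1.0 / n         # actor step size; slow / fast -> 0 as n grows

        # Simulate one transition under the current randomized policy.
        pi = softmax(theta[s])
        a = rng.choices([0, 1], weights=pi)[0]
        s_next = rng.choices([0, 1], weights=P[s][a])[0]

        # Temporal-difference error computed from the critic's estimates.
        delta = R[s][a] + GAMMA * V[s_next] - V[s]

        # Critic update on the fast time scale (visited state only).
        V[s] += fast * delta

        # Actor update on the slow time scale (softmax preference gradient).
        for b in range(2):
            grad = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += slow * delta * grad

        s = s_next
    return V, [softmax(t) for t in theta]


if __name__ == "__main__":
    values, policy = actor_critic()
    print("value estimates:", values)
    print("policy:", policy)
```

Because the actor's step size vanishes relative to the critic's, the critic sees an almost stationary policy while the actor sees an almost converged critic. That separation is the intuition behind analyzing such schemes as two-time-scale stochastic approximation.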
