Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning

1 April 1999

journal article
Published by Institute for Operations Research and the Management Sciences (INFORMS) in Management Science

Vol. 45 (4), 560-574
https://doi.org/10.1287/mnsc.45.4.560

Abstract

A large class of problems of sequential decision making under uncertainty, of which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL), for computing near optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling, and dynamic channel allocation of cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks that are referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique). We present a detailed study of this algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.

Keywords

semi-Markov decision processes (SMDP), reinforcement learning, average reward, preventive maintenance
