Comparison of Expedient and Optimal Reinforcement Schemes for Learning Systems

Abstract
Stochastic automata models have been used successfully in the past for modeling learning systems. An automaton model with a variable structure reacts to inputs from a random environment by changing the probabilities of its actions. These changes are carried out by a reinforcement scheme in such a manner that the automaton evolves to a final structure that is satisfactory in some sense. Several reinforcement schemes have been proposed in the literature for updating the structure of automata [1–4]. Most of these are expedient schemes, which in the limit yield structures that perform better than a device choosing its actions with equal probabilities irrespective of the environment's response. A few schemes, called optimal schemes, have also been suggested recently; in the limit these lead to the continual selection of a single optimal action as the output of the automaton when it operates in a stationary environment [5–7]. The question naturally arises as to which of these schemes is to be preferred in practical applications. In view of the anticipated extensive use of learning schemes in multilevel decision-making systems, this question of optimality versus expediency takes on particular significance. Consequently, a comparison has to be made not merely of individual automaton schemes but also of the effectiveness of such schemes in situations involving several automata (e.g., stochastic games, multilevel systems).
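To illustrate the distinction the abstract draws, the following is a minimal sketch (not the paper's own formulation) of a variable-structure automaton updated by the classical linear reinforcement schemes: the linear reward-penalty scheme, which is expedient, and the linear reward-inaction scheme, which is optimal in a stationary environment. The penalty probabilities, learning parameters, and step counts below are illustrative assumptions.

```python
import random

def reinforce(p, i, penalised, a=0.05, b=0.05):
    """One update of the linear reward-penalty (L_R-P) scheme for action i.
    With b = 0 this reduces to linear reward-inaction (L_R-I), which in a
    stationary environment drives the probability of the optimal action
    toward one."""
    r = len(p)
    if not penalised:
        # Favourable response: move probability mass toward the chosen action.
        p = [(1.0 - a) * pj for pj in p]
        p[i] += a
    elif b > 0.0:
        # Unfavourable response: redistribute mass away from the chosen action.
        q = [b / (r - 1) + (1.0 - b) * pj for pj in p]
        q[i] = (1.0 - b) * p[i]
        p = q
    return p

def run(c, scheme_b, steps=20000, seed=0):
    """Simulate the automaton against a stationary random environment with
    penalty probabilities c; returns the final action-probability vector."""
    rng = random.Random(seed)
    p = [1.0 / len(c)] * len(c)        # start from the pure-chance structure
    for _ in range(steps):
        i = rng.choices(range(len(c)), weights=p)[0]
        penalised = rng.random() < c[i]
        p = reinforce(p, i, penalised, b=scheme_b)
    return p

if __name__ == "__main__":
    c = [0.2, 0.5, 0.7]                # assumed penalty probabilities; action 0 is best
    print("expedient L_R-P:", run(c, scheme_b=0.05))
    print("optimal   L_R-I:", run(c, scheme_b=0.0))
```

Under these assumptions, the reward-penalty run settles at action probabilities that favour the best action without converging to it, whereas the reward-inaction run concentrates nearly all probability on the single best action, which is the expedient-versus-optimal contrast the paper examines.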
