Machine learning and nonparametric bandit theory
- 1 July 1995
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Automatic Control
- Vol. 40 (7), 1199-1209
- https://doi.org/10.1109/9.400491
Abstract
In its most basic form, bandit theory is concerned with the design problem of sequentially choosing members from a given collection of random variables so that the regret, i.e., R_n = Σ_j (μ* − μ_j) E[T_n(j)], grows as slowly as possible with increasing n. Here μ_j is the expected value of the bandit arm (i.e., random variable) indexed by j, T_n(j) is the number of times arm j has been selected in the first n decision stages, and μ* = sup_j μ_j. The present paper contributes to the theory by considering the situation in which observations are dependent. To begin with, the dependency is presumed to depend only on past observations of the same arm, but later, we allow that it may be with respect to the entire past and that the set of arms is infinite. This brings queues and, more generally, controlled Markov processes into our purview. Thus our "black-box" methodology is suitable for the case when the only observables are cost values and, in particular, the probability structure and loss function are unknown to the designer. The conclusion of the analysis is that under lenient conditions, using algorithms prescribed herein, risk growth is commensurate with that in the simplest i.i.d. cases. Our methods represent an alternative to stochastic-approximation/perturbation-analysis ideas for tuning queues.
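To make the regret quantity concrete, the following is a minimal illustrative sketch, not the algorithm prescribed in the paper: a sample-mean "play the leader" rule with sparse forced exploration, applied to i.i.d. Gaussian arms (the paper's setting allows dependent observations and unknown probability structure). The arm means, reward distribution, and exploration schedule below are assumptions introduced only for this example; the script estimates R_n = Σ_j (μ* − μ_j) E[T_n(j)] by averaging pull counts over Monte Carlo runs.

```python
import random

def run_bandit(means, n, seed=0):
    """Run one trajectory of length n; return the per-arm pull counts T_n(j)."""
    rng = random.Random(seed)
    k = len(means)
    sums = [0.0] * k
    counts = [0] * k

    def sample(j):
        # i.i.d. Gaussian rewards are an assumption for this sketch;
        # the paper treats dependent observations (e.g., queue outputs).
        return rng.gauss(means[j], 1.0)

    # Pull each arm once to initialize the sample means.
    for j in range(k):
        sums[j] += sample(j)
        counts[j] += 1

    for t in range(k, n):
        if rng.random() < min(1.0, k / (t + 1)):
            # Forced exploration at a rate decaying like 1/t.
            j = rng.randrange(k)
        else:
            # Otherwise play the empirical leader (largest sample mean).
            j = max(range(k), key=lambda i: sums[i] / counts[i])
        sums[j] += sample(j)
        counts[j] += 1
    return counts

def expected_regret(means, n, runs=200):
    """Estimate R_n = sum_j (mu* - mu_j) * E[T_n(j)] by averaging pull counts."""
    mu_star = max(means)
    avg_counts = [0.0] * len(means)
    for r in range(runs):
        counts = run_bandit(means, n, seed=r)
        for j, c in enumerate(counts):
            avg_counts[j] += c / runs
    return sum((mu_star - means[j]) * avg_counts[j] for j in range(len(means)))

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(expected_regret([0.2, 0.5, 0.9], n), 2))
```

With the 1/t exploration schedule, the number of forced exploratory pulls grows only logarithmically in n, which is the kind of slow regret growth the abstract refers to; the paper's contribution is showing that comparable growth rates can be achieved even when observations are dependent and the loss structure is unknown.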