Sample mean based index policies byO(logn) regret for the multi-armed bandit problem

1 December 1995

journal article
Published by Cambridge University Press (CUP) in Advances in Applied Probability

Vol. 27 (4) , 1054-1078
https://doi.org/10.2307/1427934

Abstract

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whoseregretincreases slowly with time. In their seminal work on this problem, Lai and Robbins had obtained aO(logn) lower bound on the regret with a constant that depends on the Kullback–Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieved the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are also applicable much more generally. They achieve aO(logn) regret with a constant that is also based on the Kullback–Leibler number. This constant turns out to be optimal for one-parameter exponential families; however, in general it is derived from the optimal one via a ‘contraction' principle. Our results rely entirely on a few key lemmas from the theory of large deviations.

Keywords

This publication has 10 references indexed in Scilit:

Minimizing the learning loss in adaptive control of Markov chains under the weak accessibility condition
Journal of Applied Probability, 1991
Multi-armed bandit problems with multiple plays and switching cost
Stochastics and Stochastic Reports, 1990
Certainty equivalence control with forcing: revisited
Systems & Control Letters, 1989
Asymptotically efficient adaptive allocation schemes for controlled Markov chains: finite parameter space
IEEE Transactions on Automatic Control, 1989
Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost
IEEE Transactions on Automatic Control, 1988
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-Part I: I.I.D. rewards
IEEE Transactions on Automatic Control, 1987
Adaptive Treatment Allocation and the Multi-Armed Bandit Problem
The Annals of Statistics, 1987
Fundamentals of statistical exponential families with applications in statistical decision theory
Published by Institute of Mathematical Statistics ,1986
Asymptotically efficient adaptive allocation rules
Advances in Applied Mathematics, 1985
Entropy, Large Deviations, and Statistical Mechanics
Published by Springer Nature ,1985