Abstract
Given a finite number of different experiments with unknown probabilities p1, p2, ···, pk of success, the multi-armed bandit problem is concerned with maximising the expected number of successes in a sequence of trials. There are many policies which ensure that the proportion of successes converges to p = max(p1, p2, ···, pk) in the long run. This property is established for a class of decision procedures which rely on randomisation, at each stage, in selecting the experiment for the next trial. Further, it is suggested that some of these procedures might perform well over any finite sequence of trials.
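The convergence property described above can be illustrated with a minimal sketch of one randomised policy of this general kind. The sketch below uses a decaying-exploration (epsilon-decreasing) rule, which is an assumption for illustration only — the paper's own class of procedures is not specified in this abstract. Each arm is selected at random with a probability that shrinks over time, and otherwise the empirically best arm is pulled, so the long-run proportion of successes approaches max(p1, ···, pk).

```python
import math
import random


def run_bandit(ps, n_trials, seed=0):
    """Simulate a randomised multi-armed bandit policy.

    At trial t, with probability eps_t a uniformly random arm is
    explored; otherwise the arm with the best observed success rate
    is exploited. eps_t decays like 1/sqrt(t), so every arm is tried
    infinitely often yet exploration vanishes, and the proportion of
    successes converges to max(ps).

    NOTE: this epsilon-decreasing rule is a hypothetical stand-in,
    not the specific class of procedures analysed in the paper.
    """
    rng = random.Random(seed)
    k = len(ps)
    pulls = [0] * k   # times each arm was tried
    wins = [0] * k    # successes observed per arm
    successes = 0
    for t in range(1, n_trials + 1):
        eps = min(1.0, 5.0 / math.sqrt(t))  # decaying exploration rate
        if rng.random() < eps or 0 in pulls:
            arm = rng.randrange(k)  # explore: pick an arm at random
        else:
            # exploit: pick the arm with the highest observed rate
            arm = max(range(k), key=lambda i: wins[i] / pulls[i])
        reward = 1 if rng.random() < ps[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
        successes += reward
    return successes / n_trials
```

With two arms of success probabilities 0.2 and 0.8, the realised proportion of successes over a long run sits close to 0.8, the best arm's rate, as the abstract's convergence claim suggests.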