Extension of the multi-armed bandit problem

Abstract
There are N independent machines. Machine i is described by a sequence {Xi(s), Fi(s)} where xi(s) is the immediate reward and Fi(s) is the information available before i is operated for the sth time. At each time one operates exactly one machine; idle machines remain frozen. The problem is to schedule the operation of the machines so as to maximize the expected total discounted sequence of rewards. The main result is that to each machine is associated an index, and the optimal policy operates the machine with the largest current index.
Keywords

This publication has 4 references indexed in Scilit: