Instances of the multi-armed bandit problem include the task of iteratively allocating a fixed, limited set of resources between competing (alternative) choices in a way that minimizes the regret.
In contrast to general RL, the selected actions in bandit problems do not affect the reward distribution of the arms.
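As a concrete illustration of this setup, the sketch below simulates a Bernoulli bandit whose arm probabilities stay fixed no matter which arms are pulled, and measures regret against always playing the best arm; the arm means and the uniformly random allocation rule are illustrative assumptions, not taken from any particular reference.

```python
import random

class BernoulliBandit:
    """A stochastic bandit: each arm has a fixed success probability that
    is never affected by which arms the agent chooses to pull."""

    def __init__(self, arm_means, seed=0):
        self.arm_means = list(arm_means)
        self.rng = random.Random(seed)
        self.pulls = 0
        self.reward_sum = 0.0

    def pull(self, arm):
        reward = 1.0 if self.rng.random() < self.arm_means[arm] else 0.0
        self.pulls += 1
        self.reward_sum += reward
        return reward

    def expected_regret(self):
        """Gap between always playing the best arm (in expectation)
        and the reward actually collected so far."""
        return max(self.arm_means) * self.pulls - self.reward_sum


bandit = BernoulliBandit([0.2, 0.5, 0.7])   # illustrative arm means
for _ in range(1000):
    bandit.pull(random.randrange(3))        # uniformly random allocation, for illustration only
print(bandit.expected_regret())             # roughly (0.7 - 0.47) * 1000 in expectation
```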
In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization, like a science foundation or a pharmaceutical company.
Recognizing the importance of the problem, Herbert Robbins constructed convergent population selection strategies in 1952 in "Some Aspects of the Sequential Design of Experiments".
The agent attempts to balance these competing tasks in order to maximize their total value over the period of time considered.
In a generalization called the "restless bandit problem", the states of non-played arms can also evolve over time.[18]
Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining algorithms to minimize regret in both finite and infinite (asymptotic) time horizons for both stochastic[1] and non-stochastic[19] arm payoffs.
An important variation of the classical regret minimization problem in multi-armed bandits is the one of Best Arm Identification (BAI),[20] also known as pure exploration.
This problem is crucial in various applications, including clinical trials, adaptive routing, recommendation systems, and A/B testing.
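A standard fixed-confidence strategy for pure exploration is successive elimination, sketched below; the Hoeffding-style confidence radius and the stopping rule are common textbook choices used here for illustration, not the specific procedure of reference [20].

```python
import math
import random

def successive_elimination(pull, n_arms, delta=0.05, max_rounds=10_000):
    """Return the index of the (probably) best arm.

    `pull(arm)` must return a reward in [0, 1]. Arms whose upper confidence
    bound falls below the best lower confidence bound are eliminated."""
    active = set(range(n_arms))
    counts = [0] * n_arms
    means = [0.0] * n_arms

    for _ in range(max_rounds):
        if len(active) == 1:
            break
        for arm in list(active):                     # sample every surviving arm once
            r = pull(arm)
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]
        # Hoeffding-style confidence radius for each surviving arm
        rad = {a: math.sqrt(math.log(4 * n_arms * counts[a] ** 2 / delta) / (2 * counts[a]))
               for a in active}
        best_lcb = max(means[a] - rad[a] for a in active)
        active = {a for a in active if means[a] + rad[a] >= best_lcb}

    return max(active, key=lambda a: means[a])


rng = random.Random(0)
arm_means = [0.3, 0.5, 0.55, 0.8]                    # illustrative arm means
print(successive_elimination(lambda a: 1.0 if rng.random() < arm_means[a] else 0.0, 4))
```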
Simplifications of the policy and the main proof were then given by Katehakis and Robbins[22] for the case of normal populations with known variances.
The next notable progress was made by Burnetas and Katehakis in the paper "Optimal adaptive policies for sequential allocation problems",[23] where index-based policies with uniformly maximum convergence rate were constructed under more general conditions, including the case in which the distributions of outcomes from each population depend on a vector of unknown parameters.
Later in "Optimal adaptive policies for Markov decision processes"[24] Burnetas and Katehakis studied the much larger model of Markov Decision Processes under partial information, where the transition law and/or the expected one period rewards may depend on unknown parameters.
In this work, the authors constructed an explicit form for a class of adaptive policies with uniformly maximum convergence rate properties for the total expected finite horizon reward under sufficient assumptions of finite state-action spaces and irreducibility of the transition law.
A main feature of these policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations.
These inflations have recently been called the optimistic approach in the work of Tewari and Bartlett,[25] Ortner,[26] Filippi, Cappé, and Garivier,[27] and Honda and Takemura.[28]
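A widely known instance of such an optimistic index, for the simpler stochastic multi-armed bandit, is UCB1, in which the empirical mean of each arm is inflated by an exploration bonus; the sketch below illustrates the optimistic principle only and is not the Burnetas-Katehakis indices themselves.

```python
import math

class UCB1:
    """Optimistic index policy: play the arm whose empirical mean,
    inflated by an exploration bonus, is currently largest."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for arm, c in enumerate(self.counts):
            if c == 0:                        # play each arm once before using the index
                return arm
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```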
For Bernoulli multi-armed bandits, Pilarski et al.[29] studied computational methods of deriving fully optimal solutions (not merely asymptotically optimal ones) using dynamic programming in the paper "Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge".
Pilarski et al. later extended this work in "Delayed Reward Bernoulli Bandits: Optimal Policy and Predictive Meta-Algorithm PARDI"[30] to create a method of determining the optimal policy for Bernoulli bandits when rewards may be delayed rather than revealed immediately after a decision.
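In outline, such exact computations amount to dynamic programming over posterior counts; the memoized two-armed, finite-horizon sketch below (uniform Beta(1,1) priors, undiscounted rewards) is a minimal illustration of that idea, not the algorithm of either paper.

```python
from functools import lru_cache

def optimal_value(horizon):
    """Bayes-optimal expected total reward for a 2-armed Bernoulli bandit,
    computed by dynamic programming over Beta-posterior counts.
    State: (successes, failures) per arm under a Beta(1,1) prior."""

    @lru_cache(maxsize=None)
    def V(s1, f1, s2, f2, t):
        if t == 0:
            return 0.0
        p1 = (s1 + 1) / (s1 + f1 + 2)      # posterior mean of arm 1
        p2 = (s2 + 1) / (s2 + f2 + 2)      # posterior mean of arm 2
        # expected value of pulling each arm now, then continuing optimally
        v1 = p1 * (1 + V(s1 + 1, f1, s2, f2, t - 1)) + (1 - p1) * V(s1, f1 + 1, s2, f2, t - 1)
        v2 = p2 * (1 + V(s1, f1, s2 + 1, f2, t - 1)) + (1 - p2) * V(s1, f1, s2, f2 + 1, t - 1)
        return max(v1, v2)

    return V(0, 0, 0, 0, horizon)


print(optimal_value(20))   # expected reward of the optimal policy over 20 pulls
```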
When optimal solutions to multi-arm bandit tasks[31] are used to derive the value of animals' choices, the activity of neurons in the amygdala and ventral striatum encodes the values derived from these policies, and can be used to decode when the animals make exploratory versus exploitative choices.
This suggests that the optimal solutions to multi-arm bandit problems are biologically plausible, despite being computationally demanding.[32]
Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the four broad categories detailed below.
All of these strategies have in common a greedy behavior in which the best lever (based on previous observations) is always pulled, except when a (uniformly) random action is taken.[39]
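The simplest example of this behavior is the epsilon-greedy strategy, sketched below; the exploration rate of 0.1 is an arbitrary illustrative choice.

```python
import random

def epsilon_greedy_choice(values, epsilon=0.1, rng=random):
    """Pull the empirically best lever, except that with probability
    `epsilon` a uniformly random lever is pulled instead."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))                     # uniform exploration
    return max(range(len(values)), key=lambda a: values[a])   # greedy exploitation
```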
Many strategies exist that provide an approximate solution to the contextual bandit problem, and can be put into two broad categories detailed below.
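One common linear-payoff approach is in the spirit of LinUCB, which fits a ridge-regression model per arm and adds an optimistic bonus reflecting the model's uncertainty about the current context; the sketch below is a minimal disjoint-style version with an illustrative exploration parameter alpha, not a definitive implementation.

```python
import numpy as np

class LinUCBArm:
    """One ridge-regression model per arm, with an optimistic bonus based on
    the model's uncertainty in the current context vector."""

    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)          # X^T X + I (ridge-regularized Gram matrix)
        self.b = np.zeros(dim)        # X^T rewards
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                                  # ridge-regression estimate
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose(arms, x):
    """Pick the arm with the largest upper confidence bound for context x."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x))
```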
EXP3[57] is a popular algorithm for adversarial multi-armed bandits, suggested and analyzed in this setting by Auer et al. (2002b).[58]
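A minimal sketch of EXP3's exponential-weighting scheme is shown below; rewards are assumed to lie in [0, 1], and the mixing parameter gamma is an illustrative choice rather than a tuned value.

```python
import math
import random

def exp3(n_arms, reward_fn, horizon, gamma=0.1, seed=0):
    """Exponential-weighting bandit sketch: mix the weight distribution with
    uniform exploration, then update the pulled arm's weight using an
    importance-weighted reward estimate. Rewards must lie in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms

    for t in range(horizon):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, t)                      # the adversary may set rewards per round
        estimate = reward / probs[arm]                  # unbiased importance-weighted estimate
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        weights = [w / max(weights) for w in weights]   # rescale to avoid overflow; probabilities unchanged
    return weights
```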
In the original specification and in the above variants, the bandit problem is specified with a discrete and finite number of arms, often indicated by the variable K.
This framework refers to the multi-armed bandit problem in a non-stationary setting (i.e., in the presence of concept drift).
Garivier and Moulines derive some of the first results with respect to bandit problems where the underlying model can change during play.
The f-dsw TS algorithm exploits a discount factor on the reward history and an arm-related sliding window to counteract concept drift in non-stationary environments.
Another work by Burtini et al. introduces a weighted least squares Thompson sampling approach (WLS-TS), which proves beneficial in both the known and unknown non-stationary cases.
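As a rough illustration of this family of methods, the sketch below applies a discount factor to the Beta posteriors used by Thompson sampling so that old observations fade; it is a minimal sketch in the spirit of discounted Thompson sampling, not the exact f-dsw TS or WLS-TS algorithms.

```python
import random

class DiscountedThompsonSampling:
    """Thompson sampling for Bernoulli rewards with a discount factor, so that
    older observations fade and the policy can track concept drift."""

    def __init__(self, n_arms, discount=0.99, seed=0):
        self.alpha = [1.0] * n_arms     # Beta posterior: successes + 1
        self.beta = [1.0] * n_arms      # Beta posterior: failures + 1
        self.discount = discount
        self.rng = random.Random(seed)

    def select(self):
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        # decay all posteriors toward the Beta(1,1) prior, then add the new observation
        self.alpha = [1.0 + self.discount * (a - 1.0) for a in self.alpha]
        self.beta = [1.0 + self.discount * (b - 1.0) for b in self.beta]
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward
```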
The dueling bandit variant was introduced by Yue et al. (2012)[65] to model the exploration-versus-exploitation tradeoff for relative feedback.[74]
Following this work, several other researchers created algorithms to learn multiple models at the same time under bandit feedback.