Multi-Armed Bandits
The objective is to maximize total reward over a given number of time steps. Examples: choosing which slot-machine lever to pull, or a doctor choosing among experimental prescriptions.
Reward distribution
The reward distribution can be stationary or non-stationary. Does the reward of pulling lever 2 change over time? In an adversarial setting, for example, if you keep playing the same move, the opponent will learn to exploit it.
Estimating value
Through repeated plays you maximize your winnings by pulling the best levers. But how do you find the best levers?
Estimate the value of each action; the sample-average method is shown below.
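The sample-average estimate referred to above, reproduced from the chapter:

$Q_{t}(a) \doteq \dfrac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \dfrac{\sum_{i=1}^{t-1} R_{i} \cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}$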
Action selection
Greedy action selection: $A_{t} \doteq \underset{a}{\operatorname{argmax}}\, Q_{t}(a)$. With noisier or non-stationary rewards, more exploration is required, so use epsilon-greedy selection: behave greedily most of the time, but with small probability $\epsilon$ pick a random action instead (a minimal sketch of the selection step follows).
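A tiny sketch of just the selection step (the full loop appears below under Simple Bandit Algorithm); `Q` is assumed to be an array of current action-value estimates and `rng` a NumPy `Generator` such as `np.random.default_rng()`:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Return the greedy action with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: greedy action (ties broken by first index)
```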
Efficient sample averaging
Incremental Implementation of Estimating Action Values
New Estimate $\leftarrow$ Old Estimate $+$ StepSize $[$ Target $-$ Old Estimate $]$
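For the sample-average method this general rule specializes to the chapter's incremental form, which needs only constant memory and constant per-step computation:

$Q_{n+1} \doteq \dfrac{1}{n}\sum_{i=1}^{n} R_{i} = Q_{n} + \dfrac{1}{n}\bigl[R_{n} - Q_{n}\bigr]$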
Simple Bandit Algorithm
Initialize, for $a=1$ to $k:$
$Q(a) \leftarrow 0$
$N(a) \leftarrow 0$
Loop forever:
- $A \leftarrow \begin{cases} \arg\max_{a} Q(a) & \text{with probability } 1-\epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}$
- $R \leftarrow$ bandit $(A)$
- $N(A) \leftarrow N(A)+1$
- $Q(A) \leftarrow Q(A)+\frac{1}{N(A)}[R-Q(A)]$
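A minimal runnable Python sketch of the loop above. The 10-armed Gaussian testbed (`true_values`, unit-variance reward noise) is an assumption used for illustration; the selection and update steps follow the pseudocode.

```python
import numpy as np

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit with incremental sample-average estimates."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)   # assumed Gaussian testbed
    Q = np.zeros(k)                              # action-value estimates
    N = np.zeros(k, dtype=int)                   # action counts
    rewards = np.zeros(steps)

    for t in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            A = int(rng.integers(k))             # explore: random action
        else:
            A = int(np.argmax(Q))                # exploit: greedy action
        R = rng.normal(true_values[A], 1.0)      # bandit(A): noisy reward
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                # incremental sample average
        rewards[t] = R
    return Q, N, rewards

if __name__ == "__main__":
    Q, N, rewards = run_bandit()
    print("estimated values:", np.round(Q, 2))
    print("average reward:", rewards.mean())
```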
Exponential Recency-weighted Average
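(The heading above corresponds to the constant step-size case from the same chapter; a brief reminder.) Replacing the $\frac{1}{N(A)}$ step size with a constant $\alpha \in (0, 1]$ gives

$Q_{n+1} \doteq Q_{n} + \alpha\bigl[R_{n} - Q_{n}\bigr] = (1-\alpha)^{n} Q_{1} + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_{i},$

a weighted average in which recent rewards receive exponentially more weight than older ones, making it better suited to non-stationary problems than the plain sample average.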
Upper-Confidence-Bound Action Selection (UCB)
How should we select among the non-greedy actions? It would be better to weight the actions based on how many times they have been tried and on their estimated returns.
UCB is more difficult than epsilon-greedy to extend beyond bandits to more general RL problems (nonstationary problems, large state spaces).
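The UCB selection rule that the following paragraph describes, from the chapter:

$A_{t} \doteq \underset{a}{\operatorname{argmax}}\left[Q_{t}(a) + c\sqrt{\dfrac{\ln t}{N_{t}(a)}}\,\right]$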
The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$, with the $c > 0$ parameter determining the confidence level. Each time $a$ is selected the uncertainty is presumably reduced: $N_{t}(a)$ is incremented and, as it appears in the denominator of the uncertainty term, the term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases; as it appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means that the increase gets smaller over time but is unbounded; all actions will eventually be selected, but as time goes by there will be a longer wait, and thus a lower selection frequency, for actions with a lower value estimate or that have already been selected more times.
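A minimal Python sketch of UCB action selection with sample-average estimates, under the same assumed Gaussian testbed as above. Untried actions ($N_{t}(a) = 0$) are treated as maximally uncertain and selected first, following the chapter's convention; `c` controls the confidence level.

```python
import numpy as np

def ucb_bandit(k=10, steps=1000, c=2.0, seed=0):
    """Upper-Confidence-Bound action selection with incremental sample averages."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)   # assumed Gaussian testbed
    Q = np.zeros(k)                              # action-value estimates
    N = np.zeros(k, dtype=int)                   # action counts
    rewards = np.zeros(steps)

    for t in range(1, steps + 1):
        if np.any(N == 0):
            # untried actions are considered maximally uncertain: try them first
            A = int(np.argmin(N))
        else:
            ucb = Q + c * np.sqrt(np.log(t) / N)  # upper confidence bounds
            A = int(np.argmax(ucb))
        R = rng.normal(true_values[A], 1.0)       # noisy reward for action A
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                 # incremental sample average
        rewards[t - 1] = R
    return Q, N, rewards

if __name__ == "__main__":
    Q, N, rewards = ucb_bandit()
    print("pull counts:", N)
    print("average reward:", rewards.mean())
```

Note that every arm is pulled at least once before the confidence-bound comparison kicks in; thereafter, arms with fewer pulls keep a larger exploration bonus, exactly as described in the paragraph above.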
References
- Chapter 2, *Reinforcement Learning: An Introduction*, Sutton and Barto, 2nd Edition