Multi-Armed Bandits
The objective is to maximize total reward over a given number of time steps. Examples: choosing which slot-machine lever to pull, or a doctor choosing among experimental prescriptions.
Reward distribution
The reward distribution can be stationary or non-stationary. Does the reward of pulling lever 2 change over time? In an adversarial setting, for example, if you keep playing the same move, the opponent will learn to exploit it.
Estimating value
Through repeated plays you maximize your winnings by pulling the best levers. But how do you find the best levers?
Estimate the value of each action; the sample-average method is shown below.
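The sample-average estimate referred to above, reproduced from the chapter:

$Q_{t}(a) \doteq \dfrac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \dfrac{\sum_{i=1}^{t-1} R_{i} \cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_{i}=a}}$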
Action selection
Greedy action selection: $A_{t} \doteq \underset{a}{\operatorname{argmax}}\, Q_{t}(a)$. With noisier or non-stationary rewards, more exploration is required, so use epsilon-greedy selection: behave greedily most of the time, but with small probability $\epsilon$ pick a random action instead (a minimal sketch of the selection step follows).
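A tiny sketch of just the selection step (the full loop appears below under Simple Bandit Algorithm); `Q` is assumed to be an array of current action-value estimates and `rng` a NumPy `Generator` such as `np.random.default_rng()`:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Return the greedy action with probability 1 - epsilon, else a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: greedy action (ties broken by first index)
```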
Efficient sample averaging
Incremental Implementation of Estimating Action Values
New Estimate $\leftarrow$ Old Estimate $+$ StepSize $[$ Target $-$ Old Estimate $]$
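For the sample-average method this general rule specializes to the chapter's incremental form, which needs only constant memory and constant per-step computation:

$Q_{n+1} \doteq \dfrac{1}{n}\sum_{i=1}^{n} R_{i} = Q_{n} + \dfrac{1}{n}\bigl[R_{n} - Q_{n}\bigr]$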
Simple Bandit Algorithm
Initialize, for $a=1$ to $k:$
$Q(a) \leftarrow 0$
$N(a) \leftarrow 0$
Loop forever:
- $A \leftarrow \begin{cases} \arg\max_{a} Q(a) & \text{with probability } 1-\epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{cases}$
- $R \leftarrow$ bandit $(A)$
- $N(A) \leftarrow N(A)+1$
- $Q(A) \leftarrow Q(A)+\frac{1}{N(A)}[R-Q(A)]$
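A minimal runnable Python sketch of the loop above. The 10-armed Gaussian testbed (`true_values`, unit-variance reward noise) is an assumption used for illustration; the selection and update steps follow the pseudocode.

```python
import numpy as np

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit with incremental sample-average estimates."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)   # assumed Gaussian testbed
    Q = np.zeros(k)                              # action-value estimates
    N = np.zeros(k, dtype=int)                   # action counts
    rewards = np.zeros(steps)

    for t in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            A = int(rng.integers(k))             # explore: random action
        else:
            A = int(np.argmax(Q))                # exploit: greedy action
        R = rng.normal(true_values[A], 1.0)      # bandit(A): noisy reward
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                # incremental sample average
        rewards[t] = R
    return Q, N, rewards

if __name__ == "__main__":
    Q, N, rewards = run_bandit()
    print("estimated values:", np.round(Q, 2))
    print("average reward:", rewards.mean())
```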
Exponential Recency-weighted Average
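(The heading above corresponds to the constant step-size case from the same chapter; a brief reminder.) Replacing the $\frac{1}{N(A)}$ step size with a constant $\alpha \in (0, 1]$ gives

$Q_{n+1} \doteq Q_{n} + \alpha\bigl[R_{n} - Q_{n}\bigr] = (1-\alpha)^{n} Q_{1} + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_{i},$

a weighted average in which recent rewards receive exponentially more weight than older ones, making it better suited to non-stationary problems than the plain sample average.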
Upper-Confidence-Bound Action Selection (UCB)
How should we select among the non-greedy actions? It would be better to weight the actions based on how many times they have been tried and on their estimated returns.
UCB is more difficult than epsilon-greedy to extend beyond bandits to more general RL problems (nonstationary problems, large state spaces).
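The UCB selection rule that the following paragraph describes, from the chapter:

$A_{t} \doteq \underset{a}{\operatorname{argmax}}\left[Q_{t}(a) + c\sqrt{\dfrac{\ln t}{N_{t}(a)}}\,\right]$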
The idea of this upper confidence bound (UCB) action selection is that the square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$, with the $c > 0$ parameter determining the confidence level. Each time $a$ is selected the uncertainty is presumably reduced: $N_{t}(a)$ is incremented and, as it appears in the denominator of the uncertainty term, the term decreases. On the other hand, each time an action other than $a$ is selected, $t$ increases; as it appears in the numerator, the uncertainty estimate increases. The use of the natural logarithm means that the increase gets smaller over time but is unbounded; all actions will eventually be selected, but as time goes by there will be a longer wait, and thus a lower selection frequency, for actions with a lower value estimate or that have already been selected more times.
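A minimal Python sketch of UCB action selection with sample-average estimates, under the same assumed Gaussian testbed as above. Untried actions ($N_{t}(a) = 0$) are treated as maximally uncertain and selected first, following the chapter's convention; `c` controls the confidence level.

```python
import numpy as np

def ucb_bandit(k=10, steps=1000, c=2.0, seed=0):
    """Upper-Confidence-Bound action selection with incremental sample averages."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)   # assumed Gaussian testbed
    Q = np.zeros(k)                              # action-value estimates
    N = np.zeros(k, dtype=int)                   # action counts
    rewards = np.zeros(steps)

    for t in range(1, steps + 1):
        if np.any(N == 0):
            # untried actions are considered maximally uncertain: try them first
            A = int(np.argmin(N))
        else:
            ucb = Q + c * np.sqrt(np.log(t) / N)  # upper confidence bounds
            A = int(np.argmax(ucb))
        R = rng.normal(true_values[A], 1.0)       # noisy reward for action A
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                 # incremental sample average
        rewards[t - 1] = R
    return Q, N, rewards

if __name__ == "__main__":
    Q, N, rewards = ucb_bandit()
    print("pull counts:", N)
    print("average reward:", rewards.mean())
```

Note that every arm is pulled at least once before the confidence-bound comparison kicks in; thereafter, arms with fewer pulls keep a larger exploration bonus, exactly as described in the paragraph above.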
References
- Chapter 2, *Reinforcement Learning: An Introduction*, Sutton and Barto, 2nd Edition