Incremental Implementation of Estimating Action Values

When estimating action values by averaging rewards, the naive approach is to maintain a record of all rewards and recompute the average each time the estimate is needed. The memory and computation required then grow with the number of rewards seen. It is easy to devise incremental formulas that update the average with constant memory and constant per-step computation instead.

Let $Q_{n}$ denote the average of the first $n-1$ rewards. Given $Q_{n}$ and the $n$th reward, $R_{n}$, the new average of all $n$ rewards can be computed by

$$ \begin{aligned} Q_{n+1} &=\frac{1}{n} \sum_{i=1}^{n} R_{i} \\ &=\frac{1}{n}\left(R_{n}+\sum_{i=1}^{n-1} R_{i}\right) \\ &=\frac{1}{n}\left(R_{n}+(n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_{i}\right) \\ &=\frac{1}{n}\left(R_{n}+(n-1) Q_{n}\right) \\ &=\frac{1}{n}\left(R_{n}+n Q_{n}-Q_{n}\right) \\ &=Q_{n}+\frac{1}{n}\left[R_{n}-Q_{n}\right] \end{aligned} $$
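The derivation above can be checked with a minimal sketch (the function name `incremental_average` is hypothetical, chosen for illustration): the incremental update produces exactly the same value as the naive sum-and-divide average.

```python
def incremental_average(rewards):
    """Running average via Q_{n+1} = Q_n + (1/n)[R_n - Q_n]."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # constant memory, constant work per reward
    return q

rewards = [1.0, 0.0, 2.0, 5.0]
naive = sum(rewards) / len(rewards)
print(incremental_average(rewards), naive)  # both equal 2.0
```

Note that only the current estimate `q` and the count `n` need to be stored, not the reward history.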

This update is of a form that occurs frequently in reinforcement learning. The general form is

$$ \text { NewEstimate } \leftarrow \text { OldEstimate }+\text { StepSize }[\text { Target }-\text { OldEstimate }] $$

The expression $[\text{Target} - \text{OldEstimate}]$ is the error in the estimate; each update reduces this error by taking a step toward the Target.
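The general form can be sketched as a one-line update (the name `update` is hypothetical). With StepSize $= 1/n$ it reproduces the sample average above; with a constant step size, more recent targets are weighted more heavily, which the book discusses for nonstationary problems.

```python
def update(estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * [Target - OldEstimate]."""
    return estimate + step_size * (target - estimate)

# With a constant step size, the estimate moves a fixed fraction of the
# remaining error toward each new target.
q = 0.0
for reward in [1.0, 1.0, 1.0]:
    q = update(q, reward, step_size=0.5)
print(q)  # 0.875: halves the error on each of the three updates
```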


References

  1. Sutton, R. S., and Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd Edition, Section 2.4.