Incremental Implementation of Estimating Action Values

When estimating action values by averaging rewards, the naive approach is to maintain a record of all rewards and recompute the average each time the estimate is needed. The memory and computation required then grow with the number of rewards seen. It is easy to devise incremental formulas that update the average with constant memory and constant per-step computation instead.

Let $Q_{n}$ denote the average of the first $n-1$ rewards. Given $Q_{n}$ and the $n$th reward, $R_{n}$, the new average of all $n$ rewards can be computed by

$$ \begin{aligned} Q_{n+1} &=\frac{1}{n} \sum_{i=1}^{n} R_{i} \\ &=\frac{1}{n}\left(R_{n}+\sum_{i=1}^{n-1} R_{i}\right) \\ &=\frac{1}{n}\left(R_{n}+(n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_{i}\right) \\ &=\frac{1}{n}\left(R_{n}+(n-1) Q_{n}\right) \\ &=\frac{1}{n}\left(R_{n}+n Q_{n}-Q_{n}\right) \\ &=Q_{n}+\frac{1}{n}\left[R_{n}-Q_{n}\right] \end{aligned} $$
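The derivation above can be checked with a minimal sketch (the function name `incremental_average` is hypothetical, chosen for illustration): the incremental update produces exactly the same value as the naive sum-and-divide average.

```python
def incremental_average(rewards):
    """Running average via Q_{n+1} = Q_n + (1/n)[R_n - Q_n]."""
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # constant memory, constant work per reward
    return q

rewards = [1.0, 0.0, 2.0, 5.0]
naive = sum(rewards) / len(rewards)
print(incremental_average(rewards), naive)  # both equal 2.0
```

Note that only the current estimate `q` and the count `n` need to be stored, not the reward history.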

This update is of a form that occurs frequently in reinforcement learning. The general form is

$$ \text { NewEstimate } \leftarrow \text { OldEstimate }+\text { StepSize }[\text { Target }-\text { OldEstimate }] $$

The expression $[\text{Target} - \text{OldEstimate}]$ is the error in the estimate; each update reduces this error by taking a step toward the Target.
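The general form can be sketched as a one-line update (the name `update` is hypothetical). With StepSize $= 1/n$ it reproduces the sample average above; with a constant step size, more recent targets are weighted more heavily, which the book discusses for nonstationary problems.

```python
def update(estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * [Target - OldEstimate]."""
    return estimate + step_size * (target - estimate)

# With a constant step size, the estimate moves a fixed fraction of the
# remaining error toward each new target.
q = 0.0
for reward in [1.0, 1.0, 1.0]:
    q = update(q, reward, step_size=0.5)
print(q)  # 0.875: halves the error on each of the three updates
```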


References

  1. Sutton, R. S., and Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd Edition, Section 2.4.