Off-policy learning with approximation

The extension to function approximation turns out to be significantly different and harder for off-policy learning than it is for on-policy learning.

The tabular off-policy methods readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly as they do under on-policy training.

Recall that in off-policy learning we seek to learn a value function for a target policy $\pi$, given data due to a different behavior policy $b$.

In the control case, action values are learned, and both policies typically change during learning: $\pi$ is the greedy policy with respect to $\hat{q}$, and $b$ is something more exploratory, such as the $\varepsilon$-greedy policy with respect to $\hat{q}$.

The challenge of off-policy learning comes from:

  • The target of the update, a challenge already present in the tabular case. Importance sampling is used to deal with this.
  • The distribution of the updates, a challenge that arises only with function approximation, since the data follow the state distribution of $b$, not $\pi$.

Dealing with the first challenge, we have Semi-gradient off-policy TD(0):

$$ \mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha \rho_{t} \delta_{t} \nabla \hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$

where,

$$ \rho_{t} \doteq \rho_{t: t}=\frac{\pi\left(A_{t} \mid S_{t}\right)}{b\left(A_{t} \mid S_{t}\right)} $$

and $\delta_{t}$ is the TD error, defined according to whether the problem is episodic and discounted, or continuing and undiscounted using average reward:

$$ \delta_{t} \doteq R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$

$$ \delta_{t} \doteq R_{t+1}-\bar{R}_{t}+\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$
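The episodic, discounted update above can be sketched with linear function approximation, where $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and so $\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$. Everything about the environment here (the number of states and actions, the random features, the toy dynamics, and the two fixed policies) is an illustrative assumption, not part of the algorithm:

```python
import numpy as np

# Semi-gradient off-policy TD(0) with linear features: v_hat(s, w) = w . x(s).
# The environment, features, and policies below are toy assumptions.
n_states, n_actions, n_features = 5, 2, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(n_states, n_features))   # feature vector x(s) for each state

# Behavior policy b is uniform random (exploratory);
# target policy pi strongly prefers action 0.
b = np.full((n_states, n_actions), 1.0 / n_actions)
pi = np.zeros((n_states, n_actions))
pi[:, 0], pi[:, 1] = 0.9, 0.1

def step(s, a):
    # Toy dynamics: action 0 moves right, action 1 stays put;
    # reward 1 only on reaching the rightmost (terminal) state.
    s_next = min(s + 1, n_states - 1) if a == 0 else s
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

w = np.zeros(n_features)
alpha, gamma = 0.05, 0.9
s = 0
for t in range(1000):
    a = rng.choice(n_actions, p=b[s])         # act according to b
    r, s_next = step(s, a)
    rho = pi[s, a] / b[s, a]                  # importance sampling ratio rho_t
    terminal = (s_next == n_states - 1)
    v_next = 0.0 if terminal else X[s_next] @ w   # terminal state has value 0
    delta = r + gamma * v_next - X[s] @ w     # episodic, discounted TD error
    w += alpha * rho * delta * X[s]           # semi-gradient update of w
    s = 0 if terminal else s_next             # restart the episode at terminal
```

Note that $\rho_t$ simply rescales the usual semi-gradient TD(0) step: transitions that $\pi$ would never take ($\rho_t = 0$) contribute nothing, while transitions $\pi$ prefers more than $b$ does are weighted up.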