Off-policy learning with approximation

The extension to function approximation turns out to be significantly different and harder for off-policy learning than it is for on-policy learning.

The tabular off-policy methods readily extend to semi-gradient algorithms, but these algorithms do not converge as robustly as they do under on-policy training.

Recall that in off-policy learning we seek to learn a value function for a target policy $\pi$, given data due to a different behavior policy $b$.

In the control case, action values are learned, and both policies typically change during learning: $\pi$ is the greedy policy with respect to $\hat{q}$, and $b$ is something more exploratory, such as the $\varepsilon$-greedy policy with respect to $\hat{q}$.

The challenge of off-policy learning comes from:

  • The target of the update, a challenge already present in the tabular case. Importance sampling is used to deal with this.
  • The distribution of the updates, a challenge that arises only with function approximation, since the data follow the state distribution of $b$, not $\pi$.

Dealing with the first challenge, we have Semi-gradient off-policy TD(0):

$$ \mathbf{w}_{t+1} \doteq \mathbf{w}_{t}+\alpha \rho_{t} \delta_{t} \nabla \hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$

where,

$$ \rho_{t} \doteq \rho_{t: t}=\frac{\pi\left(A_{t} \mid S_{t}\right)}{b\left(A_{t} \mid S_{t}\right)} $$

and $\delta_{t}$ is the TD error, defined according to whether the problem is episodic and discounted, or continuing and undiscounted using average reward:

$$ \delta_{t} \doteq R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$

$$ \delta_{t} \doteq R_{t+1}-\bar{R}_{t}+\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right)-\hat{v}\left(S_{t}, \mathbf{w}_{t}\right) $$
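The episodic, discounted update above can be sketched with linear function approximation, where $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and so $\nabla \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)$. Everything about the environment here (the number of states and actions, the random features, the toy dynamics, and the two fixed policies) is an illustrative assumption, not part of the algorithm:

```python
import numpy as np

# Semi-gradient off-policy TD(0) with linear features: v_hat(s, w) = w . x(s).
# The environment, features, and policies below are toy assumptions.
n_states, n_actions, n_features = 5, 2, 4
rng = np.random.default_rng(0)

X = rng.normal(size=(n_states, n_features))   # feature vector x(s) for each state

# Behavior policy b is uniform random (exploratory);
# target policy pi strongly prefers action 0.
b = np.full((n_states, n_actions), 1.0 / n_actions)
pi = np.zeros((n_states, n_actions))
pi[:, 0], pi[:, 1] = 0.9, 0.1

def step(s, a):
    # Toy dynamics: action 0 moves right, action 1 stays put;
    # reward 1 only on reaching the rightmost (terminal) state.
    s_next = min(s + 1, n_states - 1) if a == 0 else s
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

w = np.zeros(n_features)
alpha, gamma = 0.05, 0.9
s = 0
for t in range(1000):
    a = rng.choice(n_actions, p=b[s])         # act according to b
    r, s_next = step(s, a)
    rho = pi[s, a] / b[s, a]                  # importance sampling ratio rho_t
    terminal = (s_next == n_states - 1)
    v_next = 0.0 if terminal else X[s_next] @ w   # terminal state has value 0
    delta = r + gamma * v_next - X[s] @ w     # episodic, discounted TD error
    w += alpha * rho * delta * X[s]           # semi-gradient update of w
    s = 0 if terminal else s_next             # restart the episode at terminal
```

Note that $\rho_t$ simply rescales the usual semi-gradient TD(0) step: transitions that $\pi$ would never take ($\rho_t = 0$) contribute nothing, while transitions $\pi$ prefers more than $b$ does are weighted up.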