PGT Actor-Critic

As seen in Temporal Difference Learning, the one-step return is often superior to the actual return in terms of its variance and computational congeniality, even though it introduces bias.

When the state-value function is used to assess actions, it is called a critic, and the overall policy-gradient method is termed an actor–critic method.

One-step actor–critic methods replace the full return of REINFORCE (Monte Carlo policy gradient) with the one-step return $G_{t:t+1}$, and use a learned state-value function as the baseline, as follows:

$$ \begin{aligned} \boldsymbol{\theta}_{t+1} & \doteq \boldsymbol{\theta}_{t}+\alpha\left(G_{t: t+1}-\hat{v}\left(S_{t}, \mathbf{w}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \\ &=\boldsymbol{\theta}_{t}+\alpha\left(R_{t+1}+\gamma \hat{v}\left(S_{t+1}, \mathbf{w}\right)-\hat{v}\left(S_{t}, \mathbf{w}\right)\right) \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \\ &=\boldsymbol{\theta}_{t}+\alpha \delta_{t} \frac{\nabla \pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)}{\pi\left(A_{t} \mid S_{t}, \boldsymbol{\theta}_{t}\right)} \end{aligned} $$
One-step actor–critic update, where $\delta_{t} \doteq R_{t+1}+\gamma \hat{v}(S_{t+1}, \mathbf{w})-\hat{v}(S_{t}, \mathbf{w})$ is the TD error computed from the learned critic.
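As a concrete sketch, the update above can be implemented with tabular (one-hot) features on a toy two-state chain. The environment, step sizes, and episode count below are illustrative assumptions, and the $\gamma^{t}$ factor of the full episodic algorithm is omitted to match the update as written.

```python
import numpy as np

# One-step actor-critic on a hypothetical 2-state chain (toy example):
# action 1 ("right") moves 0 -> 1, and from state 1 terminates with +1;
# action 0 ("left") slides back toward state 0 with no reward.
rng = np.random.default_rng(0)
theta = np.zeros((2, 2))  # softmax policy preferences, one row per state
w = np.zeros(2)           # tabular state-value weights (the critic)
alpha_theta, alpha_w, gamma = 0.1, 0.2, 0.9

def pi(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def step(s, a):
    if a == 0:
        return max(s - 1, 0), 0.0               # "left": back, no reward
    return (None, 1.0) if s == 1 else (1, 0.0)  # "right": advance; +1 at end

for _ in range(2000):
    s = 0
    while s is not None:
        probs = pi(s)
        a = rng.choice(2, p=probs)
        s_next, r = step(s, a)
        v_next = 0.0 if s_next is None else w[s_next]
        delta = r + gamma * v_next - w[s]  # TD error delta_t
        w[s] += alpha_w * delta            # critic update
        grad_log = -probs                  # gradient of log softmax
        grad_log[a] += 1.0                 # ... with respect to theta[s]
        theta[s] += alpha_theta * delta * grad_log  # actor update
        s = s_next
```

After training, the policy should put most of its probability on the rewarding "right" action in both states, and $\hat{v}$ should approach the discounted returns (roughly $0.9$ in state 0 and $1$ in state 1).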

Policy parameterization for continuous actions

In continuous action spaces, a Gaussian policy is common: the action is sampled as $a \sim \mathcal{N}\left(\mu(s), \sigma^{2}\right)$, where the mean $\mu(s)$ is some parameterized function of the state. For simplicity, let's consider a fixed variance $\sigma^{2}$ (it can be parameterized as well).
The gradient of the log of the policy is then

$$ \nabla_{\theta} \log \pi_{\theta}(s, a)=\frac{a-\mu(s)}{\sigma^{2}} \nabla_{\theta} \mu(s) $$

This gradient can be plugged directly into REINFORCE or an advantage actor–critic update.
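For instance, here is a minimal REINFORCE sketch on a one-step continuous-action bandit (a hypothetical task with reward $-(a-2)^2$, so the optimal mean action is 2). With a single state and $\mu(s)=\theta$ a learned scalar, $\nabla_{\theta}\mu(s)=1$ and the log-policy gradient reduces to $(a-\mu)/\sigma^{2}$.

```python
import numpy as np

# REINFORCE with a Gaussian policy on a one-step bandit (illustrative toy):
# reward is -(a - 2)^2, so the optimal mean action is 2. The policy is
# a ~ N(mu, sigma^2) with fixed sigma; mu is the only learned parameter.
rng = np.random.default_rng(1)
mu, sigma, alpha = 0.0, 0.5, 0.01

for _ in range(5000):
    a = rng.normal(mu, sigma)           # sample action from the policy
    reward = -(a - 2.0) ** 2            # one-step return G = reward
    grad_log_pi = (a - mu) / sigma**2   # gradient from the formula above
    mu += alpha * reward * grad_log_pi  # REINFORCE update
```

After training, `mu` should settle near the optimal action 2; a state-dependent $\mu(s)$ (e.g. a linear or neural-network function of features) follows the same pattern with the extra $\nabla_{\theta}\mu(s)$ factor from the chain rule.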