PGT Actor-Critic
As seen in Temporal Difference Learning, the one-step return is often superior to the full return: it has lower variance and is more computationally congenial, even though it introduces bias.
When the state-value function is used to assess actions, it is called a critic, and the overall policy-gradient method is termed an actor–critic method.
One-step actor–critic methods replace the full return of REINFORCE (Monte Carlo Policy Gradient) with the one-step return, and use a learned state-value function as the baseline, as follows:

$$\delta = R + \gamma\, \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w})$$

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\, \delta\, \nabla \hat{v}(S, \mathbf{w})$$

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\, \delta\, \nabla \ln \pi(A \mid S, \boldsymbol{\theta})$$
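The one-step actor–critic update can be sketched as follows. This is a minimal tabular illustration, not a reference implementation: the two-state chain MDP, the `step` dynamics, and all hyperparameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.05

w = np.zeros(n_states)                   # critic: tabular state values v(s)
theta = np.zeros((n_states, n_actions))  # actor: tabular action preferences

def pi(s):
    """Softmax policy over the action preferences of state s."""
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    """Toy dynamics (hypothetical): action 0 stays put with reward 0;
    action 1 switches state, yielding reward 1 when taken from state 1."""
    if a == 1:
        return 1 - s, float(s == 1)
    return s, 0.0

s = 0
for _ in range(2000):
    probs = pi(s)
    a = rng.choice(n_actions, p=probs)
    s_next, r = step(s, a)
    # One-step TD error: delta = R + gamma * v(S') - v(S)
    delta = r + gamma * w[s_next] - w[s]
    w[s] += alpha_w * delta                     # critic update
    grad_log = -probs
    grad_log[a] += 1.0                          # gradient of log softmax
    theta[s] += alpha_theta * delta * grad_log  # actor update
    s = s_next
```

Note that the same TD error $\delta$ drives both updates: the critic moves its value estimate toward the one-step target, and the actor increases the log-probability of actions with positive $\delta$.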
Policy parameterization for continuous actions
In continuous action spaces, a Gaussian policy is common: the mean is some function of the state, $\mu(s)$. For simplicity, let us consider a fixed variance $\sigma^{2}$ (it can be parameterized as well). The policy is then Gaussian, $a \sim \mathcal{N}\left(\mu(s), \sigma^{2}\right)$.
The gradient of the log of the policy is then

$$\nabla_{\theta} \ln \pi(a \mid s, \theta) = \frac{a - \mu(s)}{\sigma^{2}}\, \nabla_{\theta}\, \mu(s)$$
This can be used, for instance, in REINFORCE or advantage actor–critic methods.
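The Gaussian log-gradient above can be verified numerically. The sketch below assumes a linear mean, $\mu(s) = \theta^{\top} x(s)$, purely for illustration, and checks the closed-form gradient against finite differences:

```python
import numpy as np

sigma = 0.5  # fixed standard deviation (illustrative value)

def mu(theta, x):
    """Linear state-dependent mean (an assumed parameterization)."""
    return theta @ x

def log_pi(theta, x, a):
    """Log-density of a ~ N(mu(s), sigma^2)."""
    m = mu(theta, x)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (a - m) ** 2 / (2 * sigma**2)

def grad_log_pi(theta, x, a):
    """Closed form: (a - mu(s)) / sigma^2 * grad_theta mu(s); here grad mu = x."""
    return (a - mu(theta, x)) / sigma**2 * x

# Numerical check against central finite differences
theta = np.array([0.3, -0.2])
x = np.array([1.0, 2.0])
a, eps = 0.7, 1e-6
numerical = np.array([
    (log_pi(theta + eps * e, x, a) - log_pi(theta - eps * e, x, a)) / (2 * eps)
    for e in np.eye(2)
])
analytic = grad_log_pi(theta, x, a)
```

With a learned variance one would add $\sigma(s)$ as a second parameterized output and differentiate with respect to its parameters as well.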