Policy Gradient
Policy Objective Functions
Goal: given policy $\pi_{\theta}(s, a)$ with parameters $\theta$, find the best $\theta$. But how do we measure the quality of a policy $\pi_{\theta}$?
In episodic environments we can use the start value
$$J_{1}(\theta) = v_{\pi_{\theta}}(s_{1})$$
where $s_{1}$ is the start state.
In continuing environments we can use the average value
$$J_{avV}(\theta) = \sum_{s} \mu_{\pi}(s)\, v_{\pi_{\theta}}(s)$$
where $\mu_{\pi}(s)=p\left(S_{t}=s \mid \pi\right)$ is the probability of being in state $s$ in the long run. Think of it as the ratio of time spent in $s$ under policy $\pi$. (Also written as $d$ instead of $\mu$).
Or the average reward per time-step
$$J_{avR}(\theta) = \sum_{s} \mu_{\pi}(s) \sum_{a} \pi_{\theta}(s, a)\, R(s, a)$$
Policy Gradient
Policy-based reinforcement learning is an optimization problem: it finds the optimal policy directly, rather than learning a value function which then drives a policy. The goal is to find $\theta$ that maximises $J(\theta)$.
While there are other black-box optimization methods, policy gradient uses stochastic gradient ascent, which is often quite efficient (and easy to use with deep nets).
Let $J(\theta)$ be any policy objective function. Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the policy, w.r.t. parameters $\theta$:
$$\Delta \theta = \alpha \nabla_{\theta} J(\theta)$$
where $\nabla_{\theta} J(\theta)$ is the policy gradient
$$\nabla_{\theta} J(\theta) = \left(\frac{\partial J(\theta)}{\partial \theta_{1}}, \ldots, \frac{\partial J(\theta)}{\partial \theta_{n}}\right)^{\top}$$
and $\alpha$ is a step-size parameter.
We need to compute an estimate of the policy gradient. We assume policy $\pi_{\theta}$ is differentiable almost everywhere.
- E.g., $\pi_{\theta}$ is a linear function of the agent state, or a neural network
- Or we could have a parameterized class of controllers
Goal is to compute
$$\nabla_{\theta} J(\theta) = \nabla_{\theta}\, \mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$$
We will use Monte Carlo samples to compute this gradient.
So, how does $\mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$ depend on $\theta$?
Consider a one-step case (a contextual bandit) such that $J(\theta)=\mathbb{E}[R(S, A)]$. (Expectation is over $d$ (states) and $\pi$ (actions))
We cannot sample $R_{t+1}$ and then take a gradient: $R_{t+1}$ is just a number that does not depend on $\theta$. Instead, we use the score function trick, based on the identity $\nabla_{\theta} \pi_{\theta} = \pi_{\theta} \nabla_{\theta} \log \pi_{\theta}$:
$$\nabla_{\theta} \mathbb{E}[R(S, A)] = \mathbb{E}\left[R(S, A)\, \nabla_{\theta} \log \pi_{\theta}(A \mid S)\right]$$
The right-hand side gives an expected gradient that can be sampled. Our stochastic policy-gradient update is then
$$\theta_{t+1} = \theta_{t} + \alpha R_{t+1} \nabla_{\theta} \log \pi_{\theta_{t}}(A_{t} \mid S_{t})$$
In expectation, this follows the actual gradient, so this is a pure stochastic gradient algorithm.
Intuition of this gradient: increase probability for actions with high rewards.
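As a concrete illustration, here is a minimal sketch of this update for a made-up two-state, three-arm contextual bandit with a softmax policy (the reward table, noise level, step size, and iteration count are all illustrative assumptions, not from the text):

```python
import numpy as np

# Sketch of the score-function update for a contextual bandit.
# The reward table below is invented for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 3
true_reward = np.array([[1.0, 0.0, 0.2],   # state 0: action 0 is best
                        [0.0, 0.1, 1.0]])  # state 1: action 2 is best
theta = np.zeros((n_states, n_actions))    # one logit per (state, action)
alpha = 0.1                                # step size

def policy(s):
    """Softmax policy pi_theta(. | s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

for _ in range(5000):
    s = rng.integers(n_states)                  # observe a context
    p = policy(s)
    a = rng.choice(n_actions, p=p)              # sample A ~ pi_theta(. | S)
    r = true_reward[s, a] + rng.normal(0, 0.1)  # sample a noisy reward
    # grad log pi(a | s) for a softmax policy is one-hot(a) - p
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha * r * grad_log            # stochastic gradient ascent

print(policy(0).argmax(), policy(1).argmax())   # best arm in each state
```

Because the update weights each score by the sampled reward, probability mass gradually concentrates on the high-reward arm in each context.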
Policy Gradient Theorem
The policy gradient theorem provides the gradient of expected cumulative reward with respect to policy parameters:
$$\nabla J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, Q(s_{t}, a_{t})\right]$$
where:
- $\nabla J(\theta)$ is the gradient of expected return
- $\pi_\theta(a_t \mid s_t)$ is the policy (probability of action $a_t$ in state $s_t$)
- $Q(s_t, a_t)$ is the action-value function
Key insight: Instead of using actual returns $G_t$ (as in REINFORCE - Score Function Estimator), we can substitute Q-values since they represent expected returns. This dramatically reduces variance and enables actor-critic methods where we learn Q-values separately to update the policy more efficiently than using noisy episode returns.
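A minimal sketch of the resulting actor update, assuming a tabular softmax policy and a fixed critic estimate $Q$ (the toy sizes, values, and step size are assumptions for illustration):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def actor_step(theta, q_values, s, a, alpha=0.01):
    """One actor update: theta[s] += alpha * Q(s, a) * grad log pi(a | s)."""
    p = softmax(theta[s])
    grad_log = -p              # grad log softmax is one-hot(a) - pi(. | s)
    grad_log[a] += 1.0
    theta[s] += alpha * q_values[s, a] * grad_log
    return theta

theta = np.zeros((1, 2))       # toy sizes: 1 state, 2 actions (assumed)
q = np.array([[0.0, 1.0]])     # a critic's estimate: action 1 looks better
rng = np.random.default_rng(0)
for _ in range(200):
    a = rng.choice(2, p=softmax(theta[0]))
    theta = actor_step(theta, q, 0, a)
print(softmax(theta[0]))       # probability mass shifts toward action 1
```

In a full actor-critic method, `q_values` would itself be learned (e.g., by TD learning) while the actor is updated; here it is held fixed to isolate the actor step.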
Policy gradients on trajectories
- Policy gradients do not need to know the dynamics.
- Kind of surprising; shouldn't we know how the policy influences the states?
Consider trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$. Applying the score function trick to the whole trajectory,
$$\nabla_{\theta} \mathbb{E}[G(\zeta)] = \mathbb{E}\left[G(\zeta)\, \nabla_{\theta} \log p(\zeta)\right], \qquad \log p(\zeta) = \log p(S_{0}) + \sum_{t}\left(\log \pi_{\theta}(A_{t} \mid S_{t}) + \log p(S_{t+1} \mid S_{t}, A_{t})\right)$$
The gradient of the transition probabilities with respect to the policy parameters is zero: the transitions are conditioned on the actions, not on the policy parameters. Therefore,
$$\nabla_{\theta} \mathbb{E}[G(\zeta)] = \mathbb{E}\left[G(\zeta) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right]$$
The parameters only affect your decisions, not the dynamics; this is why the dynamics do not show up in the gradient.
Consider trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$,
but some of these rewards do not depend on some of these actions. In particular, $\sum_{t=0}^{k} R_{t+1}$ does not depend on actions $A_{k+1}, A_{k+2}, \ldots,$ so we can weight each action only by the rewards that follow it:
$$\nabla_{\theta} \mathbb{E}[G(\zeta)] = \mathbb{E}\left[\sum_{t} G_{t}\, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right], \qquad G_{t} = \sum_{k \geq t} R_{k+1}$$
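The reward-to-go weights $G_t$ can be computed with a single backward pass over the sampled rewards; a small sketch (the example reward sequence and the optional discount are made up for illustration):

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum_{k >= t} gamma^(k - t) * R_{k+1}, computed right to left."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

R = [1.0, 0.0, 2.0]           # an invented sample of R_1, R_2, R_3
print(rewards_to_go(R))       # [3. 2. 2.]
```

Each $G_t$ then multiplies the corresponding $\nabla_\theta \log \pi_\theta(A_t \mid S_t)$ term in the gradient estimate.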
Baselines: reduce variance
Let's assume a baseline $b$. Its contribution to the expected gradient vanishes:
$$\mathbb{E}\left[b\, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right] = \mathbb{E}\left[b \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid S_{t})\right] = \mathbb{E}\left[b\, \nabla_{\theta} \sum_{a} \pi_{\theta}(a \mid S_{t})\right] = \mathbb{E}\left[b\, \nabla_{\theta} 1\right] = 0$$
This holds only if $b$ does not depend on the action, though it can be a function of state.
This implies we can subtract a baseline to reduce variance. A good baseline is $v_{\pi}\left(S_{t}\right)$. So,
$$\Delta \theta_{t} = \alpha\left(G_{t} - v_{\pi}(S_{t})\right) \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})$$
Typically, we estimate $v_{w}(s)$ explicitly, and sample
$$\Delta \theta_{t} = \alpha\left(G_{t} - v_{w}(S_{t})\right) \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})$$
where $G_{t}$ may itself be a bootstrapped return estimate.
For instance, $G_{t}^{(1)}=R_{t+1}+\gamma v_{w}\left(S_{t+1}\right)$
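A minimal sketch of one such update with a learned tabular baseline $v_w$ and the bootstrapped target $G_t^{(1)}$ (the tabular representation, step sizes, and toy transition are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def one_step_update(theta, v, s, a, r, s_next,
                    alpha=0.1, beta=0.1, gamma=0.99):
    """Actor-critic step: target G_t^(1) = R + gamma * v_w(S'),
    baseline v_w(S)."""
    target = r + gamma * v[s_next]            # bootstrapped return G_t^(1)
    advantage = target - v[s]                 # baseline-subtracted signal
    grad_log = -softmax(theta[s])             # one-hot(a) - pi(. | s)
    grad_log[a] += 1.0
    theta[s] += alpha * advantage * grad_log  # actor: policy gradient step
    v[s] += beta * advantage                  # critic: TD(0) value update
    return theta, v

theta = np.zeros((2, 2))                      # toy tabular actor (assumed)
v = np.zeros(2)                               # toy tabular critic (assumed)
theta, v = one_step_update(theta, v, s=0, a=1, r=1.0, s_next=1)
print(v[0], theta[0])
```

The same `advantage` term drives both the policy step and the value update, which is the usual structure of one-step actor-critic methods.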
Advantages of Policy-Based RL
Advantages:
- Good convergence properties
- Easily extended to high-dimensional or continuous action spaces (big advantage)
- Value-based methods struggle with continuous action spaces because they need to evaluate Q-values across all possible actions to find the maximum, which becomes computationally intractable due to the infinite number of possible actions.
- Policy gradient methods bypass this by directly outputting continuous actions through parameterized policies, enabling finer control and more nuanced action selection.
- Can learn stochastic policies, which can help overcome local optima (an advantage that is not as widely appreciated)
- Policy gradient methods are more suitable in this regard because they can learn stochastic policies that naturally handle environment uncertainty.
- Value-based methods typically produce deterministic policies, which may struggle to adapt to the inherent randomness of stochastic environments where the same action can lead to different outcomes. Note: value-based methods can also yield a stochastic policy (e.g., using a softmax over Q-values), but not natively.
- Sometimes policies are simple while values and models are complex
- E.g., a rich domain in which the optimal behaviour is always to go left
- Policy parameterization is a good way of injecting prior knowledge
Disadvantages:
- Susceptible to local optima (especially with non-linear function approximation)
- Obtained knowledge is specific and does not always generalize well
- Ignores a lot of information in the data (when used in isolation)
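To illustrate the continuous-action advantage above: with a Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(a; \mu, \sigma^2)$, the score function is $\nabla_{\mu} \log \pi = (a - \mu)/\sigma^2$, so the same stochastic update applies with no maximisation over actions. A hedged sketch (the quadratic reward and all constants are made up):

```python
import numpy as np

# REINFORCE with a Gaussian policy: pi(a) = N(a; mu, sigma^2), so
# grad_mu log pi(a) = (a - mu) / sigma^2. The quadratic reward, step
# size, and iteration count are invented for illustration.
mu, sigma, alpha = 0.0, 1.0, 0.02
rng = np.random.default_rng(0)
for _ in range(5000):
    a = rng.normal(mu, sigma)            # sample a continuous action
    r = -(a - 2.0) ** 2                  # made-up reward, peaked at a = 2
    mu += alpha * r * (a - mu) / sigma**2
print(mu)                                # the mean drifts toward 2
```

Even though every reward is negative, the score-weighted update moves the mean toward actions whose rewards are less bad, so $\mu$ climbs toward the reward peak; a value-based method would instead need an argmax over the continuous action space.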