Policy Gradient
Policy Objective Functions
Goal: given policy $\pi_{\theta}(s, a)$ with parameters $\theta$, find the best $\theta$. But how do we measure the quality of a policy $\pi_{\theta}$?
In episodic environments we can use the start value
$$J_{1}(\theta) = v_{\pi_{\theta}}(s_{1})$$
where $s_{1}$ is the start state.
In continuing environments we can use the average value
$$J_{avV}(\theta) = \sum_{s} \mu_{\pi}(s)\, v_{\pi_{\theta}}(s)$$
where $\mu_{\pi}(s)=p\left(S_{t}=s \mid \pi\right)$ is the probability of being in state $s$ in the long run. Think of it as the ratio of time spent in $s$ under policy $\pi$. (Also written as $d$ instead of $\mu$).
Or the average reward per time-step
$$J_{avR}(\theta) = \sum_{s} \mu_{\pi}(s) \sum_{a} \pi_{\theta}(s, a)\, R(s, a)$$
Policy Gradient
Policy-based reinforcement learning is an optimization problem: it finds the optimal policy directly, rather than learning a value function which then drives a policy. The goal is to find $\theta$ that maximises $J(\theta)$.
While there are other black-box optimization methods, policy gradient uses stochastic gradient ascent, which is often quite efficient (and easy to use with deep nets).
Let $J(\theta)$ be any policy objective function. Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the policy, w.r.t. parameters $\theta$:
$$\Delta \theta = \alpha \nabla_{\theta} J(\theta)$$
where $\nabla_{\theta} J(\theta)$ is the policy gradient
$$\nabla_{\theta} J(\theta) = \left(\frac{\partial J(\theta)}{\partial \theta_{1}}, \ldots, \frac{\partial J(\theta)}{\partial \theta_{n}}\right)^{\top}$$
and $\alpha$ is a step-size parameter.
We need to compute an estimate of the policy gradient. We assume policy $\pi_{\theta}$ is differentiable almost everywhere.
- E.g., $\pi_{\theta}$ is a linear function of the agent state, or a neural network
- Or we could have a parameterized class of controllers
Goal is to compute
$$\nabla_{\theta} J(\theta) = \nabla_{\theta}\, \mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$$
We will use Monte Carlo samples to compute this gradient.
So, how does $\mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$ depend on $\theta$?
Consider a one-step case (a contextual bandit) such that $J(\theta)=\mathbb{E}[R(S, A)]$. (Expectation is over $d$ (states) and $\pi$ (actions))
We cannot sample $R_{t+1}$ and then take a gradient: $R_{t+1}$ is just a number that does not depend on $\theta$. Instead, we use the score function trick, based on the identity $\nabla_{\theta} \pi_{\theta} = \pi_{\theta} \nabla_{\theta} \log \pi_{\theta}$:
$$\nabla_{\theta} \mathbb{E}[R(S, A)] = \mathbb{E}\left[R(S, A)\, \nabla_{\theta} \log \pi_{\theta}(A \mid S)\right]$$
The right-hand side gives an expected gradient that can be sampled. Our stochastic policy-gradient update is then
$$\theta_{t+1} = \theta_{t} + \alpha R_{t+1} \nabla_{\theta} \log \pi_{\theta_{t}}(A_{t} \mid S_{t})$$
In expectation, this follows the actual gradient, so this is a pure stochastic gradient algorithm.
Intuition of this gradient: increase probability for actions with high rewards.
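As a concrete illustration, here is a minimal sketch of this update for a made-up two-state, three-arm contextual bandit with a softmax policy (the reward table, noise level, step size, and iteration count are all illustrative assumptions, not from the text):

```python
import numpy as np

# Sketch of the score-function update for a contextual bandit.
# The reward table below is invented for illustration.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 3
true_reward = np.array([[1.0, 0.0, 0.2],   # state 0: action 0 is best
                        [0.0, 0.1, 1.0]])  # state 1: action 2 is best
theta = np.zeros((n_states, n_actions))    # one logit per (state, action)
alpha = 0.1                                # step size

def policy(s):
    """Softmax policy pi_theta(. | s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

for _ in range(5000):
    s = rng.integers(n_states)                  # observe a context
    p = policy(s)
    a = rng.choice(n_actions, p=p)              # sample A ~ pi_theta(. | S)
    r = true_reward[s, a] + rng.normal(0, 0.1)  # sample a noisy reward
    # grad log pi(a | s) for a softmax policy is one-hot(a) - p
    grad_log = -p
    grad_log[a] += 1.0
    theta[s] += alpha * r * grad_log            # stochastic gradient ascent

print(policy(0).argmax(), policy(1).argmax())   # best arm in each state
```

Because the update weights each score by the sampled reward, probability mass gradually concentrates on the high-reward arm in each context.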
Policy Gradient Theorem
The policy gradient theorem provides the gradient of expected cumulative reward with respect to policy parameters:
$$\nabla J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, Q(s_{t}, a_{t})\right]$$
where:
- $\nabla J(\theta)$ is the gradient of expected return
- $\pi_\theta(a_t \mid s_t)$ is the policy (probability of action $a_t$ in state $s_t$)
- $Q(s_t, a_t)$ is the action-value function
Key insight: Instead of using actual returns $G_t$ (as in REINFORCE - Score Function Estimator), we can substitute Q-values since they represent expected returns. This dramatically reduces variance and enables actor-critic methods where we learn Q-values separately to update the policy more efficiently than using noisy episode returns.
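A minimal sketch of the resulting actor update, assuming a tabular softmax policy and a fixed critic estimate $Q$ (the toy sizes, values, and step size are assumptions for illustration):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def actor_step(theta, q_values, s, a, alpha=0.01):
    """One actor update: theta[s] += alpha * Q(s, a) * grad log pi(a | s)."""
    p = softmax(theta[s])
    grad_log = -p              # grad log softmax is one-hot(a) - pi(. | s)
    grad_log[a] += 1.0
    theta[s] += alpha * q_values[s, a] * grad_log
    return theta

theta = np.zeros((1, 2))       # toy sizes: 1 state, 2 actions (assumed)
q = np.array([[0.0, 1.0]])     # a critic's estimate: action 1 looks better
rng = np.random.default_rng(0)
for _ in range(200):
    a = rng.choice(2, p=softmax(theta[0]))
    theta = actor_step(theta, q, 0, a)
print(softmax(theta[0]))       # probability mass shifts toward action 1
```

In a full actor-critic method, `q_values` would itself be learned (e.g., by TD learning) while the actor is updated; here it is held fixed to isolate the actor step.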
Policy gradients on trajectories
- Policy gradients do not need to know the dynamics.
- Kind of surprising; shouldn't we know how the policy influences the states?
Consider trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$. Applying the score function trick to the whole trajectory,
$$\nabla_{\theta} \mathbb{E}[G(\zeta)] = \mathbb{E}\left[G(\zeta)\, \nabla_{\theta} \log p(\zeta)\right], \qquad \log p(\zeta) = \log p(S_{0}) + \sum_{t}\left(\log \pi_{\theta}(A_{t} \mid S_{t}) + \log p(S_{t+1} \mid S_{t}, A_{t})\right)$$
The gradient of the transition probabilities with respect to the policy parameters is zero: the transitions are conditioned on the actions, not on the policy parameters. Therefore,
$$\nabla_{\theta} \mathbb{E}[G(\zeta)] = \mathbb{E}\left[G(\zeta) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right]$$
The parameters only affect your decisions, not the dynamics; this is why the dynamics do not show up in the gradient.
Consider trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$,
but some of these rewards do not depend on some of these actions. In particular, $\sum_{t=0}^{k} R_{t+1}$ does not depend on actions $A_{k+1}, A_{k+2}, \ldots,$ so we can weight each action only by the rewards that follow it:
$$\nabla_{\theta} \mathbb{E}[G(\zeta)] = \mathbb{E}\left[\sum_{t} G_{t}\, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right], \qquad G_{t} = \sum_{k \geq t} R_{k+1}$$
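The reward-to-go weights $G_t$ can be computed with a single backward pass over the sampled rewards; a small sketch (the example reward sequence and the optional discount are made up for illustration):

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum_{k >= t} gamma^(k - t) * R_{k+1}, computed right to left."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

R = [1.0, 0.0, 2.0]           # an invented sample of R_1, R_2, R_3
print(rewards_to_go(R))       # [3. 2. 2.]
```

Each $G_t$ then multiplies the corresponding $\nabla_\theta \log \pi_\theta(A_t \mid S_t)$ term in the gradient estimate.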
Baselines: reduce variance
Let's assume a baseline $b$. Its contribution to the expected gradient vanishes:
$$\mathbb{E}\left[b\, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right] = \mathbb{E}\left[b \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid S_{t})\right] = \mathbb{E}\left[b\, \nabla_{\theta} \sum_{a} \pi_{\theta}(a \mid S_{t})\right] = \mathbb{E}\left[b\, \nabla_{\theta} 1\right] = 0$$
This holds only if $b$ does not depend on the action, though it can be a function of state.
This implies we can subtract a baseline to reduce variance. A good baseline is $v_{\pi}\left(S_{t}\right)$. So,
$$\Delta \theta_{t} = \alpha\left(G_{t} - v_{\pi}(S_{t})\right) \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})$$
Typically, we estimate $v_{w}(s)$ explicitly, and sample
$$\Delta \theta_{t} = \alpha\left(G_{t} - v_{w}(S_{t})\right) \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})$$
where $G_{t}$ may itself be a bootstrapped return estimate.
For instance, $G_{t}^{(1)}=R_{t+1}+\gamma v_{w}\left(S_{t+1}\right)$
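A minimal sketch of one such update with a learned tabular baseline $v_w$ and the bootstrapped target $G_t^{(1)}$ (the tabular representation, step sizes, and toy transition are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def one_step_update(theta, v, s, a, r, s_next,
                    alpha=0.1, beta=0.1, gamma=0.99):
    """Actor-critic step: target G_t^(1) = R + gamma * v_w(S'),
    baseline v_w(S)."""
    target = r + gamma * v[s_next]            # bootstrapped return G_t^(1)
    advantage = target - v[s]                 # baseline-subtracted signal
    grad_log = -softmax(theta[s])             # one-hot(a) - pi(. | s)
    grad_log[a] += 1.0
    theta[s] += alpha * advantage * grad_log  # actor: policy gradient step
    v[s] += beta * advantage                  # critic: TD(0) value update
    return theta, v

theta = np.zeros((2, 2))                      # toy tabular actor (assumed)
v = np.zeros(2)                               # toy tabular critic (assumed)
theta, v = one_step_update(theta, v, s=0, a=1, r=1.0, s_next=1)
print(v[0], theta[0])
```

The same `advantage` term drives both the policy step and the value update, which is the usual structure of one-step actor-critic methods.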
Advantages of Policy-Based RL
Advantages:
- Good convergence properties
- Easily extended to high-dimensional or continuous action spaces (big advantage)
- Value-based methods struggle with continuous action spaces because they need to evaluate Q-values across all possible actions to find the maximum, which becomes computationally intractable due to the infinite number of possible actions.
- Policy gradient methods bypass this by directly outputting continuous actions through parameterized policies, enabling finer control and more nuanced action selection.
- Can learn stochastic policies, which can help overcome local optima (an advantage that is not as widely appreciated)
- Policy gradient methods are more suitable in this regard because they can learn stochastic policies that naturally handle environment uncertainty.
- Value-based methods typically produce deterministic policies, which may struggle to adapt to the inherent randomness of stochastic environments where the same action can lead to different outcomes. Note: value-based methods can also yield a stochastic policy (e.g., using a softmax over Q-values), but not natively.
- Sometimes policies are simple while values and models are complex
- E.g., a rich domain in which the optimal behaviour is always to go left
- Policy parameterization is a good way of injecting prior knowledge
Disadvantages:
- Susceptible to local optima (especially with non-linear function approximation)
- Obtained knowledge is specific and does not always generalize well
- Ignores a lot of information in the data (when used in isolation)
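To illustrate the continuous-action advantage above: with a Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(a; \mu, \sigma^2)$, the score function is $\nabla_{\mu} \log \pi = (a - \mu)/\sigma^2$, so the same stochastic update applies with no maximisation over actions. A hedged sketch (the quadratic reward and all constants are made up):

```python
import numpy as np

# REINFORCE with a Gaussian policy: pi(a) = N(a; mu, sigma^2), so
# grad_mu log pi(a) = (a - mu) / sigma^2. The quadratic reward, step
# size, and iteration count are invented for illustration.
mu, sigma, alpha = 0.0, 1.0, 0.02
rng = np.random.default_rng(0)
for _ in range(5000):
    a = rng.normal(mu, sigma)            # sample a continuous action
    r = -(a - 2.0) ** 2                  # made-up reward, peaked at a = 2
    mu += alpha * r * (a - mu) / sigma**2
print(mu)                                # the mean drifts toward 2
```

Even though every reward is negative, the score-weighted update moves the mean toward actions whose rewards are less bad, so $\mu$ climbs toward the reward peak; a value-based method would instead need an argmax over the continuous action space.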