Policy Gradient
Advantages of Policy-Based RL
Advantages:
- Good convergence properties
- Easily extended to high-dimensional or continuous action spaces (big advantage)
- Value-based methods struggle with continuous action spaces because they need to evaluate Q-values across all possible actions to find the maximum, which becomes computationally intractable when there are infinitely many possible actions.
- Policy gradient methods bypass this by directly outputting continuous actions through parameterized policies, enabling finer control and more nuanced action selection (see the sketch after this list).
- Can learn stochastic policies, which can help overcome local minima (an advantage that is not as widely appreciated)
- Policy gradient methods are more suitable in this regard because they can learn stochastic policies that naturally handle environment uncertainty.
- Value-based methods typically produce deterministic policies, which may struggle to adapt to the inherent randomness of stochastic environments, where the same action can lead to different outcomes. Note: value-based methods could also provide a stochastic policy (e.g., using a softmax over Q-values), but not natively.
- Sometimes policies are simple while values and models are complex
- E.g., a rich domain where the optimal action is always to go left
- Parameterizing the policy is a good way of injecting prior knowledge
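For the continuous-action point above, here is a minimal sketch of a linear-Gaussian policy in Python (the names and the fixed standard deviation are illustrative assumptions, not from the notes): it emits an action directly, with no maximisation over actions, and its score function $\nabla_{\theta} \log \pi_{\theta}(a \mid s)$ is available in closed form for the gradient updates derived later.

```python
import numpy as np

# Sketch: a linear-Gaussian policy for a continuous action space.
# The policy outputs an action directly; no max over actions is needed.

def gaussian_policy_sample(theta, state, sigma=0.5):
    """Sample a ~ N(theta^T state, sigma^2) and return the score function."""
    mean = theta @ state
    action = np.random.normal(mean, sigma)
    # Score function: grad_theta log N(a; theta^T s, sigma^2)
    #               = (a - mean) / sigma^2 * s
    grad_log_pi = (action - mean) / sigma**2 * state
    return action, grad_log_pi

state = np.array([0.2, -1.0, 0.5])   # illustrative state features
theta = np.zeros(3)
action, score = gaussian_policy_sample(theta, state)
```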
Disadvantages:
- Susceptible to local optima (especially with non-linear function approximation)
- The obtained knowledge is specific to the task and does not always generalize well
- Ignores a lot of information in the data (when used in isolation)
Policy Objective Functions
Goal: given a policy $\pi_{\theta}(s, a)$ with parameters $\theta$, find the best $\theta$.
But how do we measure the quality of a policy $\pi_{\theta}$?
In episodic environments we can use the start value:
$$J_{1}(\theta)=v_{\pi_{\theta}}(s_{0})=\mathbb{E}\left[G_{0} \mid S_{0}=s_{0}, \pi_{\theta}\right]$$
In continuing environments we can use the average value:
$$J_{\text{av}V}(\theta)=\sum_{s} \mu_{\pi_{\theta}}(s)\, v_{\pi_{\theta}}(s)$$
where $\mu_{\pi}(s)=p\left(S_{t}=s \mid \pi\right)$ is the probability of being in state $s$ in the long run. Think of it as the ratio of time spent in $s$ under policy $\pi$. (Also written as $d$ instead of $\mu$.)
Or the average reward per time-step:
$$J_{\text{av}R}(\theta)=\sum_{s} \mu_{\pi_{\theta}}(s) \sum_{a} \pi_{\theta}(s, a)\, \mathbb{E}\left[R_{t+1} \mid S_{t}=s, A_{t}=a\right]$$
Policy Optimization
Policy-based reinforcement learning is an optimization problem: find $\theta$ that maximises $J(\theta)$.
Some approaches do not use gradients:
- Hill climbing
- Genetic algorithms
We will focus on stochastic gradient ascent, which is often quite efficient (and easy to use with deep nets)
Policy Gradient
Let $J(\theta)$ be any policy objective function.
Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the objective w.r.t. the parameters $\theta$:
$$\Delta \theta=\alpha \nabla_{\theta} J(\theta)$$
where $\nabla_{\theta} J(\theta)$ is the policy gradient and $\alpha$ is a step-size parameter.
We need to compute an estimate of the policy gradient. We assume policy $\pi_{\theta}$ is differentiable almost everywhere.
- E.g., $\pi_{\theta}$ is a linear function of the agent state, or a neural network
- Or we could have a parameterized class of controllers
The goal is to compute
$$\nabla_{\theta} J(\theta)=\nabla_{\theta}\, \mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$$
We will use Monte Carlo samples to compute this gradient.
So, how does $\mathbb{E}_{\mu}\left[v_{\pi_{\theta}}(S)\right]$ depend on $\theta$?
Consider a one-step case (a contextual bandit) such that $J(\theta)=\mathbb{E}[R(S, A)]$. (The expectation is over $d$ (states) and $\pi$ (actions).)
We cannot sample $R_{t+1}$ and then take a gradient: $R_{t+1}$ is just a number that does not depend on $\theta$. Instead, we use the score function trick:
$$\nabla_{\theta} \mathbb{E}[R(S, A)]=\sum_{s} d(s) \sum_{a} R(s, a)\, \nabla_{\theta} \pi_{\theta}(a \mid s)=\sum_{s} d(s) \sum_{a} \pi_{\theta}(a \mid s)\, R(s, a)\, \frac{\nabla_{\theta} \pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)}=\mathbb{E}\left[R(S, A)\, \nabla_{\theta} \log \pi_{\theta}(A \mid S)\right]$$
The right-hand side is an expectation of a gradient that can be sampled. Our stochastic policy-gradient update is then
$$\theta_{t+1}=\theta_{t}+\alpha R_{t+1} \nabla_{\theta} \log \pi_{\theta_{t}}(A_{t} \mid S_{t})$$
In expectation, this update follows the actual gradient, so this is a pure stochastic gradient algorithm.
Intuition of this gradient: increase the probability of actions that led to high rewards.
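As a concrete sketch of this one-step update, consider a hypothetical two-context, two-armed bandit with a tabular softmax policy (the policy form is derived in the next example; the reward means and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two contexts, two arms; the true mean rewards are made up for illustration.
true_means = np.array([[1.0, 0.0],   # in context 0, arm 0 is better
                       [0.0, 1.0]])  # in context 1, arm 1 is better

h = np.zeros((2, 2))   # action preferences h(s, a); softmax policy over them
alpha = 0.1

for _ in range(5000):
    s = rng.integers(2)                          # observe a random context
    pi = np.exp(h[s]) / np.exp(h[s]).sum()       # pi_theta(. | s)
    a = rng.choice(2, p=pi)
    r = true_means[s, a] + rng.normal(0.0, 0.1)  # sampled reward R_{t+1}
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                        # one_hot(a) - pi
    h[s] += alpha * r * grad_log_pi              # theta += alpha R grad log pi

# The learned policy should put most probability on the better arm per context.
print(np.round(np.exp(h) / np.exp(h).sum(axis=1, keepdims=True), 2))
```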
Example: Softmax Policy
Consider as an example a softmax policy on action preferences $h(s, a)$, parameterized by $\theta$. The probability of an action is proportional to the exponentiated preference:
$$\pi_{\theta}(s, a)=\frac{e^{h(s, a)}}{\sum_{b} e^{h(s, b)}}$$
The gradient of the log probability is
$$\nabla_{\theta} \log \pi_{\theta}(s, a)=\nabla_{\theta} h(s, a)-\sum_{b} \pi_{\theta}(s, b)\, \nabla_{\theta} h(s, b)=\nabla_{\theta} h(s, a)-\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} h(s, \cdot)\right]$$
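In the tabular case where $\theta$ consists of the preferences themselves, this reduces to $\text{one\_hot}(a)-\pi$. A quick finite-difference sanity check of that identity (illustrative, not from the notes):

```python
import numpy as np

# Check numerically that, with theta = h itself,
# grad_theta log pi(a) = one_hot(a) - pi.
h = np.array([0.5, -1.0, 2.0])
a = 1

def log_pi(h, a):
    return h[a] - np.log(np.exp(h).sum())

pi = np.exp(h) / np.exp(h).sum()
analytic = -pi
analytic[a] += 1.0

eps = 1e-6
numeric = np.array([(log_pi(h + eps * np.eye(3)[i], a)
                     - log_pi(h - eps * np.eye(3)[i], a)) / (2 * eps)
                    for i in range(3)])
print(np.allclose(analytic, numeric))  # True
```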
Policy Gradient Theorem
The policy gradient theorem provides the gradient of the expected cumulative reward with respect to the policy parameters:
$$\nabla J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, Q(s_{t}, a_{t})\right]$$
where:
- $\nabla J(\theta)$ is the gradient of expected return
- $\pi_\theta(a_t \mid s_t)$ is the policy (probability of action $a_t$ in state $s_t$)
- $Q(s_t, a_t)$ is the action-value function
Key insight: instead of using actual returns $G_t$ (as in REINFORCE, the score-function estimator), we can substitute Q-values, since they are the expected returns. This reduces variance and enables actor-critic methods, in which we learn Q-values separately and use them to update the policy more efficiently than noisy episode returns would allow.
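A minimal tabular sketch of one such actor-critic step, assuming a SARSA-style critic target (the shapes and step sizes are illustrative, not from the notes):

```python
import numpy as np

# Sketch: the critic's Q estimate stands in for the sampled return G_t.
n_states, n_actions = 4, 2
h = np.zeros((n_states, n_actions))   # actor: softmax action preferences
Q = np.zeros((n_states, n_actions))   # critic: action-value estimates
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.99

def actor_critic_step(s, a, r, s2, a2):
    # Critic: move Q(s, a) toward the bootstrapped target r + gamma Q(s', a').
    Q[s, a] += alpha_critic * (r + gamma * Q[s2, a2] - Q[s, a])
    # Actor: ascend Q(s, a) * grad log pi(a | s); softmax score = one_hot - pi.
    pi = np.exp(h[s]) / np.exp(h[s]).sum()
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    h[s] += alpha_actor * Q[s, a] * grad_log_pi
```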
Policy gradients on trajectories
- Policy gradients do not need to know the dynamics.
- Kind of surprising; shouldn't we know how the policy influences the states?
Consider a trajectory $\zeta=S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, S_{2}, \ldots$ with return $G(\zeta)$ and probability
$$p(\zeta)=p(S_{0}) \prod_{t \geq 0} \pi_{\theta}(A_{t} \mid S_{t})\, p(S_{t+1} \mid S_{t}, A_{t})$$
The gradient of the transition probabilities with respect to the policy parameters is 0, as the transitions are conditioned on the actions, not on the policy parameters. Therefore,
$$\nabla_{\theta} \log p(\zeta)=\sum_{t} \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})$$
So:
$$\nabla_{\theta} J(\theta)=\nabla_{\theta}\, \mathbb{E}[G(\zeta)]=\mathbb{E}\left[G(\zeta) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right]$$
The parameters only affect the decisions, not the dynamics; this is why the dynamics do not appear in the gradient.
However, some of these rewards do not depend on some of these actions. In particular, the early rewards $\sum_{t=0}^{k} R_{t+1}$ do not depend on the later actions $A_{k+1}, A_{k+2}, \ldots$, so
$$\nabla_{\theta} J(\theta)=\mathbb{E}\left[\sum_{t} G_{t}\, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right], \quad \text{where } G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots$$
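In code, this causality argument is the familiar "reward-to-go" computation; a small sketch (the convention that index $t$ of `rewards` holds $R_{t+1}$ is an assumption):

```python
import numpy as np

# Reward R_{t+1} is only credited to actions taken at times <= t,
# so each action's weight is the return from its own time step onward.
def rewards_to_go(rewards, gamma=1.0):
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = R_{t+1} + gamma G_{t+1}
        g[t] = running
    return g

print(rewards_to_go([1.0, 0.0, 2.0]))  # [3. 2. 2.]
```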
Baselines: reduce variance
Let's assume a baseline $b$:
$$\mathbb{E}\left[b\, \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right]=\sum_{a} \pi_{\theta}(a \mid S_{t})\, b\, \frac{\nabla_{\theta} \pi_{\theta}(a \mid S_{t})}{\pi_{\theta}(a \mid S_{t})}=b\, \nabla_{\theta} \sum_{a} \pi_{\theta}(a \mid S_{t})=b\, \nabla_{\theta} 1=0$$
This holds only if $b$ does not depend on the action, though it can be a function of state.
This implies we can subtract a baseline to reduce variance without biasing the gradient. A good baseline is $v_{\pi}\left(S_{t}\right)$. So,
$$\nabla_{\theta} J(\theta)=\mathbb{E}\left[\sum_{t}\left(G_{t}-v_{\pi}(S_{t})\right) \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})\right]$$
Typically, we estimate $v_{w}(s)$ explicitly and sample the update
$$\Delta \theta_{t}=\alpha\left(G_{t}-v_{w}(S_{t})\right) \nabla_{\theta} \log \pi_{\theta}(A_{t} \mid S_{t})$$
where $G_t$ may itself be a sampled or bootstrapped return. For instance, $G_{t}^{(1)}=R_{t+1}+\gamma v_{w}\left(S_{t+1}\right)$.
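Putting the last two pieces together, here is a sketch of the sampled update with a learned state-value baseline and the one-step target $G_t^{(1)}$ (tabular softmax actor; shapes and step sizes are illustrative assumptions):

```python
import numpy as np

# Sketch: actor update weighted by the advantage estimate G_t^(1) - v_w(S_t),
# with the baseline v_w trained toward the same one-step target.
n_states, n_actions = 4, 2
h = np.zeros((n_states, n_actions))   # actor parameters (preferences)
v = np.zeros(n_states)                # baseline v_w(s)
alpha, beta, gamma = 0.1, 0.2, 0.99

def baseline_update(s, a, r, s2):
    g1 = r + gamma * v[s2]            # G_t^(1) = R_{t+1} + gamma v_w(S_{t+1})
    delta = g1 - v[s]                 # advantage estimate G_t^(1) - v_w(S_t)
    v[s] += beta * delta              # fit the baseline toward the target
    pi = np.exp(h[s]) / np.exp(h[s]).sum()
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0             # softmax score: one_hot(a) - pi
    h[s] += alpha * delta * grad_log_pi
```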