PPO - Proximal Policy Optimization
URL: [1707.06347] Proximal Policy Optimization Algorithms
RL suffers from high sensitivity to hyperparameters and from noisy estimates that can lead the agent to learn a bad policy from which it may never recover.
PPO is a popular, stable deep RL algorithm. It improves upon TRPO - Trust-Region Policy Optimization.
PPO benefits:
- Simpler to implement
- Sample efficient (empirically)
- Ease of tuning
Policy gradient methods are typically less sample efficient than Q-learning methods because they learn online and don't use a replay buffer to store past experiences.
The regular policy gradient loss is given as:

$$L^{PG}(\theta) = \hat{E}_{t}\left[\log \pi_{\theta}(a_{t} \mid s_{t}) \, \hat{A}_{t}\right]$$
The advantage function tries to estimate the relative value of the selected action in the selected state. It requires:
- Discounted rewards
- discounted sum of future rewards $\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$
- Baseline estimate (value function)
- estimate of the discounted return from this point onward
- state -> value function neural net -> return estimate
- Advantage estimate = discounted rewards - baseline estimate
- Positive advantage (better-than-average return) -> gradient is positive -> increase the probability of those actions
- Negative advantage -> gradient is negative -> reduce the likelihood of those actions
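The advantage estimation above can be sketched in NumPy; the trajectory rewards and value-net outputs below are made-up numbers for illustration:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Backward pass computing G_t = sum_k gamma^k * r_{t+k}
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, values, gamma=0.99):
    # Advantage estimate = discounted rewards - baseline (value-net) estimate
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)

# Hypothetical trajectory: rewards from the env, values from the value net
rewards = [1.0, 0.0, 1.0]
values = [1.5, 0.5, 0.8]
adv = advantage_estimates(rewards, values, gamma=0.9)
```

A positive entry in `adv` means the action did better than the baseline predicted; a negative entry means it did worse.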
The problem with this regular objective is that if you keep running gradient updates on a single batch of experience, the advantage estimate eventually goes wrong and you end up destroying your policy.
A constrained-optimization solution was provided by TRPO - Trust-Region Policy Optimization - with "trust regions": don't move too far from the old policy. This is enforced by a KL-divergence constraint, and the policy objective becomes:

$$\underset{\theta}{\text{maximize}} \;\; \hat{E}_{t}\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})} \hat{A}_{t}\right] \quad \text{subject to} \quad \hat{E}_{t}\left[\text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_{t}),\, \pi_{\theta}(\cdot \mid s_{t})\right]\right] \le \delta$$
This constraint adds overhead and sometimes leads to some undesirable training behaviour.
PPO cleverly modifies the TRPO objective. First, the TRPO surrogate loss can be written as

$$L^{CPI}(\theta) = \hat{E}_{t}\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})} \hat{A}_{t}\right] = \hat{E}_{t}\left[r_{t}(\theta) \hat{A}_{t}\right]$$

which leads to the PPO objective:

$$L^{CLIP}(\theta) = \hat{E}_{t}\left[\min\left(r_{t}(\theta) \hat{A}_{t},\; \text{clip}\left(r_{t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{t}\right)\right]$$
- $\theta$ is the policy parameter
- $\hat{E}_{t}$ denotes the empirical expectation over timesteps
- $r_{t}(\theta)$ is the ratio of the action's probability under the new and old policies
- $\hat{A}_{t}$ is the estimated advantage at time $t$
- $\varepsilon$ is a hyperparameter, usually 0.1 or 0.2
This effectively takes the minimum of the normal policy gradient objective and a clipped version of it.
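A minimal NumPy sketch of the clipped objective, element-wise over timesteps (the ratios and advantages below are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # min( r_t * A_t , clip(r_t, 1-eps, 1+eps) * A_t ), per timestep
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

ratio = np.array([1.5, 0.5])   # new/old policy probability ratios
adv = np.array([1.0, -1.0])    # one good action, one bad action

obj = ppo_clip_objective(ratio, adv)
# Good action (A > 0): ratio is clipped at 1 + eps = 1.2, objective flattens at 1.2
# Bad action (A < 0): ratio already below 1 - eps = 0.8, objective flattens at -0.8
```

Because of the min, pushing the ratio further outside the clip range never increases the objective, so its gradient there is zero.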
Things to understand:
- the motivation is that the advantage estimate is noisy, so we don't want to destroy our policy based on a single estimate
- the advantage can be positive or negative, so the min behaves differently in the two cases, as shown in the diagram
- when the action was good, i.e. $A > 0$, the objective flattens out once $r_{t} > 1+\varepsilon$ so as not to overdo the gradient update
- when the action was bad, i.e. $A < 0$, the objective flattens out once $r_{t} < 1-\varepsilon$ so we don't keep reducing the action's likelihood too much
- the difference with TRPO: PPO shows there is no need for the KL-divergence constraint; its simpler clipped objective is enough
The final loss function that is (approximately) maximized each iteration is:

$$L_{t}^{CLIP+VF+S}(\theta) = \hat{E}_{t}\left[L_{t}^{CLIP}(\theta) - c_{1} L_{t}^{VF}(\theta) + c_{2} S[\pi_{\theta}](s_{t})\right]$$

The second term, the squared-error value loss $L_{t}^{VF} = (V_{\theta}(s_{t}) - V_{t}^{\text{targ}})^{2}$, updates the baseline (value function) network. The last term is an entropy bonus that encourages exploration.
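Putting the three terms together, a sketch of the full objective in NumPy (the coefficients $c_1$, $c_2$ and all inputs below are illustrative; in a real implementation these come from network outputs over minibatches):

```python
import numpy as np

def ppo_total_objective(ratio, adv, values, returns, action_probs,
                        eps=0.2, c1=0.5, c2=0.01):
    # L^{CLIP+VF+S} = mean( L^CLIP - c1 * L^VF + c2 * entropy )
    l_clip = np.minimum(ratio * adv,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    l_vf = (values - returns) ** 2                                   # value-net loss
    entropy = -np.sum(action_probs * np.log(action_probs), axis=-1)  # exploration bonus
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)

# One timestep, two discrete actions with a uniform policy
total = ppo_total_objective(
    ratio=np.array([1.0]),
    adv=np.array([1.0]),
    values=np.array([0.5]),
    returns=np.array([1.0]),
    action_probs=np.array([[0.5, 0.5]]),
)
```

Maximizing this single scalar updates the policy (via $L^{CLIP}$), the value network (via $L^{VF}$, subtracted so it is minimized), and keeps the policy from collapsing too early (via the entropy term).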