PPO - Proximal Policy Optimization
URL: [1707.06347] Proximal Policy Optimization Algorithms
RL suffers from high sensitivity to hyperparameters and from noisy estimates that can lead the agent to learn a bad policy from which it may never recover.
PPO is a popular, stable deep RL algorithm. It improves upon TRPO - Trust-Region Policy Optimization.
PPO benefits:
- Simpler to implement
- Sample efficient (empirically)
- Ease of tuning
Policy gradient methods are typically less sample efficient than Q-learning methods because they learn online and don't use a replay buffer to store past experiences.
The regular policy gradient loss is given as:

$$L^{PG}(\theta) = \hat{E}_{t}\left[\log \pi_{\theta}(a_{t} \mid s_{t}) \, \hat{A}_{t}\right]$$
The advantage function tries to estimate the relative value of the selected action in the selected state. It requires:
- Discounted rewards
- discounted sum of future rewards $\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$
- Baseline estimate (value function)
- estimate of the discounted return from this point onward
- state -> value function neural net -> return estimate
- Advantage estimate = discounted rewards - baseline estimate
- Positive advantage (better-than-average return) -> gradient is positive -> increase the probability of those actions
- Negative advantage -> gradient is negative -> reduce the likelihood of those actions
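The advantage estimation above can be sketched in NumPy; the trajectory rewards and value-net outputs below are made-up numbers for illustration:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Backward pass computing G_t = sum_k gamma^k * r_{t+k}
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, values, gamma=0.99):
    # Advantage estimate = discounted rewards - baseline (value-net) estimate
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)

# Hypothetical trajectory: rewards from the env, values from the value net
rewards = [1.0, 0.0, 1.0]
values = [1.5, 0.5, 0.8]
adv = advantage_estimates(rewards, values, gamma=0.9)
```

A positive entry in `adv` means the action did better than the baseline predicted; a negative entry means it did worse.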
The problem with this regular objective is that if you keep running gradient updates on a single batch of experience, the advantage estimate eventually goes wrong and you end up destroying your policy.
A constrained-optimization solution was provided by TRPO - Trust-Region Policy Optimization - with "trust regions": don't move too far from the old policy. This is enforced by a KL-divergence constraint, and the policy objective becomes:

$$\underset{\theta}{\text{maximize}} \;\; \hat{E}_{t}\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})} \hat{A}_{t}\right] \quad \text{subject to} \quad \hat{E}_{t}\left[\text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_{t}),\, \pi_{\theta}(\cdot \mid s_{t})\right]\right] \le \delta$$
This constraint adds overhead and sometimes leads to some undesirable training behaviour.
PPO cleverly modifies the TRPO objective. First, the TRPO surrogate loss can be written as

$$L^{CPI}(\theta) = \hat{E}_{t}\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})} \hat{A}_{t}\right] = \hat{E}_{t}\left[r_{t}(\theta) \hat{A}_{t}\right]$$

which leads to the PPO objective:

$$L^{CLIP}(\theta) = \hat{E}_{t}\left[\min\left(r_{t}(\theta) \hat{A}_{t},\; \text{clip}\left(r_{t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{t}\right)\right]$$
- $\theta$ is the policy parameter
- $\hat{E}_{t}$ denotes the empirical expectation over timesteps
- $r_{t}(\theta)$ is the ratio of the action's probability under the new and old policies
- $\hat{A}_{t}$ is the estimated advantage at time $t$
- $\varepsilon$ is a hyperparameter, usually 0.1 or 0.2
This effectively takes the minimum of the normal policy gradient objective and a clipped version of it.
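A minimal NumPy sketch of the clipped objective, element-wise over timesteps (the ratios and advantages below are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # min( r_t * A_t , clip(r_t, 1-eps, 1+eps) * A_t ), per timestep
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

ratio = np.array([1.5, 0.5])   # new/old policy probability ratios
adv = np.array([1.0, -1.0])    # one good action, one bad action

obj = ppo_clip_objective(ratio, adv)
# Good action (A > 0): ratio is clipped at 1 + eps = 1.2, objective flattens at 1.2
# Bad action (A < 0): ratio already below 1 - eps = 0.8, objective flattens at -0.8
```

Because of the min, pushing the ratio further outside the clip range never increases the objective, so its gradient there is zero.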
Things to understand:
- the motivation is that the advantage estimate is noisy, so we don't want to destroy our policy based on a single estimate
- the advantage can be positive or negative, so the min behaves differently in the two cases, as shown in the diagram
- when the action was good, i.e. $A > 0$, the objective flattens out once $r_{t} > 1+\varepsilon$ so as not to overdo the gradient update
- when the action was bad, i.e. $A < 0$, the objective flattens out once $r_{t} < 1-\varepsilon$ so we don't keep reducing the action's likelihood too much
- the difference with TRPO: PPO shows there is no need for the KL-divergence constraint; its simpler clipped objective is enough
The final loss function that is (approximately) maximized each iteration is:

$$L_{t}^{CLIP+VF+S}(\theta) = \hat{E}_{t}\left[L_{t}^{CLIP}(\theta) - c_{1} L_{t}^{VF}(\theta) + c_{2} S[\pi_{\theta}](s_{t})\right]$$

The second term, the squared-error value loss $L_{t}^{VF} = (V_{\theta}(s_{t}) - V_{t}^{\text{targ}})^{2}$, updates the baseline (value function) network. The last term is an entropy bonus that encourages exploration.
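Putting the three terms together, a sketch of the full objective in NumPy (the coefficients $c_1$, $c_2$ and all inputs below are illustrative; in a real implementation these come from network outputs over minibatches):

```python
import numpy as np

def ppo_total_objective(ratio, adv, values, returns, action_probs,
                        eps=0.2, c1=0.5, c2=0.01):
    # L^{CLIP+VF+S} = mean( L^CLIP - c1 * L^VF + c2 * entropy )
    l_clip = np.minimum(ratio * adv,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    l_vf = (values - returns) ** 2                                   # value-net loss
    entropy = -np.sum(action_probs * np.log(action_probs), axis=-1)  # exploration bonus
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)

# One timestep, two discrete actions with a uniform policy
total = ppo_total_objective(
    ratio=np.array([1.0]),
    adv=np.array([1.0]),
    values=np.array([0.5]),
    returns=np.array([1.0]),
    action_probs=np.array([[0.5, 0.5]]),
)
```

Maximizing this single scalar updates the policy (via $L^{CLIP}$), the value network (via $L^{VF}$, subtracted so it is minimized), and keeps the policy from collapsing too early (via the entropy term).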