Group Relative Policy Optimization (GRPO)

PPO - Proximal Policy Optimization is an actor-critic method that requires a separate value model to act as a variance-reducing baseline (Control Variates), which becomes cumbersome with LLMs because you'd need to train another model of comparable size to act as the value model.

Problems with the value model:

  • Substantial resources required, as it needs to be trained alongside the policy, roughly doubling the GPU requirements.
  • Training is difficult at the token level because the reward is delayed and only available at the end of the sequence.

GRPO, introduced in the DeepSeekMath paper, gets rid of the value model and instead estimates the baseline by sampling a batch of trajectories per prompt, getting their rewards, and z-normalizing the rewards with the batch statistics, i.e. subtract the mean and divide by the standard deviation. This batch is called a "group", hence Group Relative Policy Optimization. I would've just gone with ZPO.

The GRPO Objective

The main differences from the PPO formulation are twofold.

First, the advantage is not the usual baseline subtraction with a value model but group normalization. The following is the resulting simple formulation. Note that there is also a "process-reward model" variation where you assign rewards after each reasoning step instead of only at the end.

$$ \hat{A}_{i, t}=\tilde{r}_i=\frac{r_i-\operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})} $$
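A minimal sketch of this normalization in Python. Names are illustrative, and the `eps` term is my own addition to guard against a zero standard deviation, which the formula above doesn't specify:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Z-normalize a group's rewards to get per-trajectory advantages.

    Every token t of trajectory i shares the same advantage r~_i.
    `eps` (an assumption) guards against division by zero when all
    rewards in the group are identical.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of 4 sampled trajectories for one prompt,
# two of which got reward 1 and two reward 0.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```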

Next is the KL penalty. While the original PPO formulation gets rid of the KL divergence term of TRPO - Trust-Region Policy Optimization, in general RLHF settings a KL term is added to the reward, see RLHF - Reinforcement Learning with Human Feedback > KL Divergence. GRPO gets rid of this reward modification and instead adds a KL term to the overall loss as a separate regularizer, formulated as

$$ \mathbb{D}_{K L}\left[\pi_\theta \| \pi_{r e f}\right]=\frac{\pi_{r e f}\left(o_{i, t} \mid q, o_{i,<t}\right)}{\pi_\theta\left(o_{i, t} \mid q, o_{i,<t}\right)}-\log \frac{\pi_{r e f}\left(o_{i, t} \mid q, o_{i,<t}\right)}{\pi_\theta\left(o_{i, t} \mid q, o_{i,<t}\right)}-1, $$
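A sketch of this per-token estimator, assuming you already have log-probabilities of the sampled tokens under both the policy and the reference model (the function name is mine):

```python
import numpy as np

def kl_estimate(logp_theta, logp_ref):
    """Per-token estimate of D_KL[pi_theta || pi_ref] per the formula above.

    Computes ratio - log(ratio) - 1 with ratio = pi_ref / pi_theta,
    evaluated on the sampled tokens. The expression is always >= 0 and
    has lower variance than the naive -log(ratio) estimator.
    """
    log_ratio = np.asarray(logp_ref) - np.asarray(logp_theta)
    return np.exp(log_ratio) - log_ratio - 1.0
```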

The diversity of the sampled trajectories comes just from sampling at a temperature of around 1.0; there is no explicit machinery to enforce it. People have tried rejection sampling (removing near duplicates), top-p sampling instead of temperature, and prompting for it.
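A crude sketch of the near-duplicate rejection idea, using word-level Jaccard similarity as an illustrative stand-in for whatever similarity measure one actually uses; the threshold and helper name are assumptions:

```python
def reject_near_duplicates(completions, threshold=0.9):
    """Greedily keep completions whose word-level Jaccard similarity to
    every already-kept completion is below `threshold`.

    A stand-in for the rejection-sampling idea mentioned above; real
    pipelines might use n-gram or embedding similarity instead.
    """
    kept = []
    for c in completions:
        words = set(c.split())
        if all(
            len(words & set(k.split())) / max(1, len(words | set(k.split()))) < threshold
            for k in kept
        ):
            kept.append(c)
    return kept

# Exact duplicates of an already-kept completion are dropped.
kept = reject_near_duplicates(["the answer is 4", "the answer is 4", "we get 4 by adding"])
```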

Aside: Extra Details on DeepSeekMath

  • Initializing from a code model seems to work better for math reasoning. Curriculum matters: mixing doesn't perform as well.

  • Training on math also improves other benchmarks like MMLU and BBH.

  • Training on arXiv papers brings no notable improvement and can even cause deterioration.

  • Pretraining

    • 500B total tokens
    • Hyperparams: Adam(0.9, 0.95), l2=0.1, lr=5.3e-4, warmup=2000 steps, decay to 0.1$\times$lr after 90%.
  • SFT (instruction tuning)

    • Construct an instruction-tuning dataset spanning different fields and complexity levels, in CoT, PoT, and tool-integrated reasoning formats. Total training size: 776k examples.
    • Examples are randomly concatenated up to a maximum context length of 4k tokens.
    • Train for 500 steps with batch size of 256 and constant lr=5e-5.
  • GRPO

    • Group size of 64! (though people mostly use smaller sizes like 8 to 16).
    • Temperature of 0.7 to 1.0 to encourage diversity.

Iterative data pipeline for building pretraining corpus

They attribute part of their success to their high-quality 120B-token math corpus (previous corpora were 7 and 9 times smaller). Here's the iterative pipeline, which they ran 4 times:

  1. Train a document classifier (fastText-based embeddings) with OpenWebMath pages as positives and diverse web pages as negatives, 500,000 of each.
  2. Use this classifier to extract more math pages from Common Crawl (CC). Take only the top N scoring pages.
  3. Group the entire CC by domain name and calculate what percentage of webpages under each domain is math-related. If it is more than 10%, treat it as a math domain.
  4. Manually annotate the URLs associated with these math domains (to verify they are indeed math).
  5. Then re-train the classifier with enlarged dataset and repeat.
  6. They do 10-word sub-string matching against benchmarks to avoid contamination.
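Step 3 above can be sketched like this, assuming pages have already been scored by the classifier; the data shape and names are illustrative:

```python
from collections import defaultdict
from urllib.parse import urlparse

def math_domains(pages, min_fraction=0.10):
    """Flag a domain as math-related if more than `min_fraction` of its
    pages were classified as math.

    `pages` is an iterable of (url, is_math) pairs — a simplified stand-in
    for the classifier's output over Common Crawl.
    """
    counts = defaultdict(lambda: [0, 0])  # domain -> [math pages, total pages]
    for url, is_math in pages:
        domain = urlparse(url).netloc
        counts[domain][0] += int(is_math)
        counts[domain][1] += 1
    return {d for d, (m, t) in counts.items() if m / t > min_fraction}

flagged = math_domains([
    ("http://a.com/1", True),
    ("http://a.com/2", False),
    ("http://b.com/1", False),
])
```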

My thoughts

  • The biggest challenge I've found is ensuring the groups are diverse enough that rewards are well spread and the mean and std are not too biased. Using a different temperature for each trajectory helps.
  • Model collapse: there is also a possibility of collapse due to the lack of diversity enforcement, since if the rewards are very similar, most advantages become 0, killing the gradients.
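A tiny numeric illustration of this failure mode, using the group-normalization formula from the objective section (the `eps` guard is my own addition):

```python
import numpy as np

# A degenerate group where every trajectory got the same reward.
r = np.array([1.0, 1.0, 1.0, 1.0])
adv = (r - r.mean()) / (r.std() + 1e-8)  # eps avoids division by zero

# Every advantage is (numerically) zero, so the policy gradient for this
# group vanishes: the update carries no learning signal at all.
```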