Group Relative Policy Optimization (GRPO): The Reinforcement Learning Algorithm Behind DeepSeek

  • Author: Reuben Magala
  • Published On: 1/29/2025
  • Category: Reinforcement Learning

Introduced in 2024 in the paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models from DeepSeek-AI, Tsinghua University and Peking University, Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that builds upon Proximal Policy Optimization (PPO), an algorithm introduced back in 2017 by OpenAI for reinforcement learning tasks that later proved useful for LLMs through reinforcement learning from human feedback. GRPO is a simpler and more efficient algorithm that was initially designed to improve mathematical reasoning capabilities while reducing memory consumption in large language models, and in this article we will explore how it all comes together.

Before we proceed, for those not familiar with reinforcement learning: in RL we have something called the policy. In terms of LLMs, the policy is the model's learned strategy that maps input text (states) to output text (actions). In simple terms, the policy is like a brain that tells the model what to do given a certain state. Most methods in RL are geared towards optimizing this policy, i.e. finding a policy that gives us the best actions for a given state.

What did GRPO actually solve?

Earlier RL algorithms like PPO relied heavily on an external evaluator known as the critic, i.e. a separate value-function model that estimates the value (the total sum of rewards the actor expects to collect by taking actions from a given state). In our case the LLM generating text is the actor, and the critic, in short, provides feedback on how good a particular response is. However, this method requires more memory and computational resources, since the critic is a separate model. On top of that, the values estimated by the critic in PPO are just approximations, not exact calculated values. The critic network learns to estimate the expected return (all future rewards) from a given state, but this estimate is often noisy and uncertain, which makes training the critic complex and error-prone. Put in terms of training costs, GRPO reduces them: you no longer need significant computational resources to evaluate responses or to train a critic. And since the critic's values in PPO are only estimates, LLMs trained with this method may struggle to generalize across reasoning domains.
So in short:
• Earlier RL algorithms like PPO relied on a separate critic model to estimate the value of an action, providing feedback to the actor (LLM), but this approach required significant memory and computation resources.
• The critic's value estimates in PPO were approximations, often noisy and uncertain, making training complex and error-prone.
• GRPO reduces training costs by eliminating the need for a separate critic model, leading to lower computational requirements and better generalization across reasoning domains (the short sketch below shows how the two baselines differ).
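To make the contrast concrete, here is a tiny sketch in Python with made-up numbers: PPO subtracts a learned value estimate from the reward, while GRPO subtracts the mean reward of a group of sampled responses.

```python
import numpy as np

# Hypothetical rewards for 4 responses sampled for the same prompt.
rewards = np.array([0.8, 0.2, 0.5, 0.9])

# PPO-style: the baseline is a learned critic's value estimate.
# (Real PPO uses GAE over per-token value estimates; this is simplified.)
value_estimate = 0.5                       # output of a separate critic network
ppo_advantages = rewards - value_estimate  # [ 0.3 -0.3  0.   0.4]

# GRPO-style: the baseline comes from the group itself, no critic needed.
grpo_advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(ppo_advantages)
print(grpo_advantages)
```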

Advantages of GRPO

No value function estimator (critic) required: this reduces memory and computational requirements.
Group-Based Advantage Calculation: Remember, LLMs basically predict the next token, and to evaluate how good a generated response is, GRPO does not rely on a critic. Instead, we compare the reward (remember, a reward in RL is a numerical value given to an agent after it takes an action, signifying how well that action performed in a particular state; in our case the LLM generating text is the action, and the reward tells it how good or bad that text is) against the average score of a group of responses sampled for the same prompt, as we will see later. This lets GRPO speed up and scale up reward estimation by using group-based advantages.
Efficient Training: In previous methods, the KL divergence term was added directly to the reward. You can think of the KL divergence term as a way to measure how much a new policy's probability distribution differs from an older, established policy; it helps the RL agent transition smoothly between behaviors while avoiding drastic changes that could lead to instability. GRPO instead integrates this term directly into the loss function, which helps stabilize training and improve performance in the long run (the sketch below illustrates the difference).
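As a rough illustration (not an exact implementation of either method, clipping omitted here; see the later sections), the contrast looks roughly like this, with all numbers made up:

```python
# Toy per-token quantities for one sampled response (all values made up).
beta = 0.04                  # KL penalty coefficient
kl = [0.02, 0.05, 0.01]      # per-token KL(pi_theta || pi_ref) estimates
ratio = [1.01, 0.98, 1.03]   # per-token pi_theta / pi_theta_old
advantage = 0.5              # group-relative advantage for this response
reward = 1.0                 # sequence-level reward from the reward function

# PPO-style RLHF (common practice): fold the KL penalty into the reward signal,
# then let the critic / GAE machinery turn shaped rewards into advantages.
shaped_reward = reward - beta * sum(kl)

# GRPO: leave the reward untouched (it only feeds the group baseline) and
# subtract the KL term directly inside the loss, token by token.
grpo_loss = -sum(r * advantage - beta * k for r, k in zip(ratio, kl)) / len(ratio)

print(shaped_reward, grpo_loss)
```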

How GRPO works on a simpler level

• For each input question, the agent generates several outputs (a group) using the current policy.
• Each of these outputs is scored relative to the other outputs in the same group, i.e. each output is given a reward based on how well it performs compared to the others rather than being scored individually (e.g. which of the outputs is the best solution to a math question, which algorithm runs faster on a LeetCode-style coding question, etc.).
• The average of these rewards is used as a baseline to compute the advantages, i.e. we use this average to calculate the relative advantage of each response in the group compared to the others. This lets us get rid of the critic and instead use the entire group to compute advantages and rewards, as sketched in the code below.
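Here is a minimal, framework-free sketch of that loop. The helpers generate_response and score_response are hypothetical stand-ins for the policy model and the reward function:

```python
import random
import statistics

def generate_response(question: str) -> str:
    """Stand-in for sampling one completion from the current policy (hypothetical)."""
    return f"candidate answer {random.randint(0, 9)} to: {question}"

def score_response(response: str) -> float:
    """Stand-in for a rule-based or learned reward function (hypothetical)."""
    return random.random()

def grpo_group_step(question: str, group_size: int = 4):
    # 1. Sample a group of responses for the same question.
    group = [generate_response(question) for _ in range(group_size)]
    # 2. Score every response in the group.
    rewards = [score_response(resp) for resp in group]
    # 3. Use the group mean (and std) as the baseline instead of a critic.
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) + 1e-8
    advantages = [(r - mean_r) / std_r for r in rewards]
    return list(zip(group, rewards, advantages))

for resp, r, adv in grpo_group_step("Should schools switch to a four-day week?"):
    print(f"reward={r:.2f}  advantage={adv:+.2f}  {resp}")
```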

Understanding how GRPO works on a deeper level

This section is more of a mathematical section, so if you are not interested in it you can jump straight to the section that explains GRPO using a real-life example.

[Figures grpo-obj-1 to grpo-obj-9: the GRPO objective broken down term by term]
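For reference, the objective these figures break down is the GRPO objective from the DeepSeekMath paper (reproduced here from the paper's notation; consult the paper for the exact statement). For each question q drawn from the dataset P(Q), the old policy samples a group of G outputs {o_1, ..., o_G}, and the policy π_θ is trained to maximize

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
= \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[
\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\left(
\min\!\left(
\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}\,\hat{A}_{i,t},\;
\operatorname{clip}\!\left(
\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},\,
1-\varepsilon,\ 1+\varepsilon
\right)\hat{A}_{i,t}
\right)
- \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
\right)
\right]
```

where the KL term is estimated per token as

```latex
\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
= \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}
- \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}
- 1
```

and, when each output receives a single outcome reward r_i, every token of output o_i gets the same group-normalized advantage

```latex
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}\left(\{r_1, \dots, r_G\}\right)}{\operatorname{std}\left(\{r_1, \dots, r_G\}\right)}
```

In words: sample a group of outputs for the same question, standardize their rewards within the group to get advantages, apply the PPO-style clipped ratio per token, and subtract a KL penalty towards the reference policy directly in the objective.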

Understanding GRPO through a real-life example

Imagine you are coaching a team for a debate competition (in our case, we are really training an LLM to give the best responses to given queries). You want your team members to improve their debating capabilities over time. However, instead of just telling them who won, like the critic does in PPO, you follow a structured approach, similar to how GRPO optimizes a language model.

Step 1: The Debate Question (Query Selection)

Just like in training a language model, we need a prompt to respond to.
Example:
The debate topic is: "Should schools switch to a four-day school week?"
This is like selecting a query (q) from the training dataset P(Q) that the model will train on.

Step 2: Multiple Responses from the Team (Generating a Group of Responses)

Each student on the team provides their own answer or reasoning in response to the debate question.
Responses:
Student A: “A four day school week improves student focus and reduces burnout.”
Student B: “It may reduce costs for schools but could also lead to longer school days.”
Student C: “It’s a bad idea because students need consistent learning schedules.”
Student D: “Studies show that a four-day week can improve attendance rates.”
Similarly, in GRPO, the language model generates multiple responses to a given query.
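Concretely, generating a group of responses just means sampling several completions for the same prompt. A small sketch using the Hugging Face transformers API follows; the model name and sampling settings are placeholders for illustration, not what DeepSeek actually used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Should schools switch to a four-day school week? Argue your position."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a group of G completions for the same prompt (the "group" in GRPO).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=64,
    num_return_sequences=4,            # G = 4
    pad_token_id=tokenizer.eos_token_id,
)
group = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
for i, response in enumerate(group):
    print(f"Response {i + 1}: {response}")
```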

Step 3: Evaluating Responses (Reward Calculation)

Now, we measure how good each response is based on clarity, correctness, relevance and other factors we may put in place. This is like assigning a reward to each response.
Rewards Example (on a scale from 0 to 1):
Student A: 1.0 (Strong argument, well-explained)
Student B: 0.9 (Balanced perspective, but lacks depth)
Student C: 0.0 (Too vague, not well-supported)
Student D: 1.0 (Uses research-based evidence)
In GRPO, rewards could be based on accuracy, how well an answer is formatted, how fast or accurately code runs on a given problem, or language consistency. Each response within the group is assessed against predefined criteria (a rough sketch of such reward functions follows the list below), e.g.:

Accuracy Reward: Determined by the correctness of the final answer; e.g. in mathematical tasks, if the model's boxed answer is correct it receives a reward, otherwise it does not.
Format Reward: Evaluated by how well the response adheres to a specified format. DeepSeek enforced a structure where the model's reasoning process is enclosed within <think> tags and the final answer within <answer> tags; responses that followed this structure received additional rewards.
Language Consistency Reward: Assessed by the percentage of tokens in the target language, ensuring that the model's output remains in a single language without mixing.
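A rough sketch of what such rule-based reward functions might look like; the exact rules, weights, and tag handling here are my own assumptions for illustration, not DeepSeek's released implementation:

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the boxed final answer matches the ground truth, else 0.0 (assumed rule)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """Bonus if the response follows a <think>...</think> <answer>...</answer> layout."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 0.5 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def language_consistency_reward(response: str) -> float:
    """Fraction of whitespace-separated tokens that are ASCII, as a crude single-language proxy."""
    tokens = response.split()
    return sum(t.isascii() for t in tokens) / len(tokens) if tokens else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # The group-relative advantage is later computed from these scalar rewards.
    return (accuracy_reward(response, ground_truth)
            + format_reward(response)
            + language_consistency_reward(response))
```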

Step 4: Comparing Responses (Advantage Calculation)

Instead of evaluating responses independently, we compare them relative to the entire group of generated responses, as worked out below.

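Plugging in the rewards from Step 3, a quick calculation (plain Python, values rounded) shows how each student's score turns into a group-relative advantage:

```python
import statistics

rewards = {"A": 1.0, "B": 0.9, "C": 0.0, "D": 1.0}

mean_r = statistics.mean(rewards.values())           # 0.725
std_r = statistics.pstdev(rewards.values()) + 1e-8   # ~0.42 (population std)

advantages = {s: round((r - mean_r) / std_r, 2) for s, r in rewards.items()}
print(advantages)  # {'A': 0.65, 'B': 0.42, 'C': -1.72, 'D': 0.65}
```

Student C's below-average response gets a negative advantage and is discouraged, while A and D are reinforced, without ever consulting a separate critic.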

Step 5: Updating the Policy with Clipping (Controlled Learning)

After getting the scores, we want our team to improve without making drastic changes. Instead of forcing everyone to copy the best student, you encourage gradual improvement by adjusting their responses slightly, as sketched below.

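A minimal sketch of that clipping, which GRPO borrows from PPO's clipped-ratio objective (the epsilon and the toy numbers are illustrative):

```python
import math

def clipped_policy_term(new_logprob: float, old_logprob: float,
                        advantage: float, eps: float = 0.2) -> float:
    """Per-token clipped surrogate: caps how far one update can push the policy."""
    ratio = math.exp(new_logprob - old_logprob)         # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = max(1 - eps, min(ratio, 1 + eps)) * advantage
    return min(unclipped, clipped)                      # take the pessimistic value

# A good response (positive advantage) whose probability grew a lot in one step:
print(clipped_policy_term(new_logprob=-0.1, old_logprob=-0.5, advantage=0.65))
# ~0.78: the gain is capped at (1 + eps) * advantage instead of ~0.97
```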

Step 6: Penalizing Overconfident Changes (KL Divergence)

If a student suddenly starts dominating all debates with suspiciously similar responses, it might mean they're overfitting to a specific style. To prevent this, you introduce a penalty that ensures balanced learning. In GRPO, this is done using KL divergence, which prevents the model from drifting too far from a fixed reference policy.
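Per token, GRPO estimates this KL term with a simple non-negative estimator (the formula follows the form given in the DeepSeekMath paper; the numbers below are made up):

```python
import math

def kl_penalty(logprob_theta: float, logprob_ref: float) -> float:
    """Estimate KL(pi_theta || pi_ref) for one token as r - log(r) - 1, with r = pi_ref / pi_theta."""
    ratio = math.exp(logprob_ref - logprob_theta)
    return ratio - math.log(ratio) - 1.0   # always >= 0, zero when the policies agree

print(kl_penalty(logprob_theta=-1.2, logprob_ref=-1.2))  # 0.0: no drift, no penalty
print(kl_penalty(logprob_theta=-0.5, logprob_ref=-1.5))  # ~0.37: penalty grows with drift
```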

Conclusion

This algorithm drastically reduced DeepSeek's training costs and improved its reasoning abilities, context understanding, and performance on other AI tasks. That efficiency is a big part of why DeepSeek took the AI world by storm.
