How DeepSeek built an open-weights reasoning model using pure RL with GRPO, the R1-Zero experiment, distillation into smaller models, and what open-source reasoning means for the research community.

DeepSeek-R1 - Open Source Reasoning

The Night the Research Community Changed

January 20, 2025. DeepSeek releases DeepSeek-R1. Within 48 hours, the AI research community has a fully public, open-weights reasoning model that performs on par with OpenAI o1 across a range of math, coding, and reasoning benchmarks. The technical report - exhaustive, transparent, reproducible - describes exactly how they trained it. The weights are downloadable. The training algorithm is documented. Smaller distilled versions run on consumer hardware.

For context: OpenAI is estimated to have spent somewhere between $50M and $100M training GPT-4. DeepSeek reported roughly $5.6M in GPU cost for the final training run of DeepSeek-V3, the base model underlying R1 (a figure that excludes earlier research and ablation runs). They did it in China. They open-sourced it. And then, for good measure, they released a technical report that told the entire world exactly how they did it.

The timing was remarkable. For months, the research community had been speculating about what was inside o1. How do you train a model to reason? What exactly is the RL procedure? What reward signal do you use? OpenAI stayed quiet. DeepSeek answered all of it.

This lesson is about what DeepSeek-R1 actually is, how it was built, and why it matters both technically and strategically.

Why This Exists - The Closed vs. Open Split

The development of o1 created an uncomfortable situation for the AI research community. Here was arguably the most significant advance in LLM capability in two years, and it was essentially a black box. The system card gave hints. Papers from Google Brain and academic labs explored similar ideas. But no one had a reproducible, verifiable, open implementation of the core idea.

This matters for several reasons. First, it limits scientific progress - you can't build on, critique, or improve what you can't examine. Second, it creates a significant moat for OpenAI - if only they know how to train reasoning models, they maintain a durable competitive advantage. Third, it creates safety concerns - aligned AI development benefits from open scrutiny, and a key safety-relevant training paradigm being proprietary is worrying.

DeepSeek-R1 broke this dynamic. By publishing the full training procedure and releasing weights, they effectively open-sourced the reasoning model paradigm. Within weeks of release, other labs (Qwen, Mistral, smaller research groups) were replicating and extending the results.

Historical Context - DeepSeek's Approach

DeepSeek AI is a Chinese AI lab founded in 2023. They moved extremely fast. DeepSeek-Coder showed strong code capabilities. By mid-2024 they had released DeepSeek-V2, a powerful mixture-of-experts model, and DeepSeek-V3, the base model for R1, followed in December 2024. Then came R1.

The key intellectual insight behind R1 came from asking a bold question: can you get strong reasoning through pure reinforcement learning, without any supervised fine-tuning on reasoning demonstrations?

The standard assumption in the field was: you need human-demonstrated reasoning chains to bootstrap the model. You do SFT first to teach the model what reasoning looks like, then RL to refine it. This is what o1 likely does. It's what DeepSeek initially assumed as well.

The R1-Zero experiment challenged this assumption directly.

R1-Zero - Pure RL Without SFT Bootstrapping

R1-Zero is DeepSeek's experimental model trained with pure reinforcement learning, starting directly from the pre-trained base model with no supervised fine-tuning on reasoning demonstrations whatsoever.

The training procedure:

Start with a pre-trained base LLM (DeepSeek-V3 base)
Apply GRPO (Group Relative Policy Optimization) with rule-based rewards
Use no human-demonstrated reasoning chains

The rewards used for R1-Zero are purely rule-based:

Accuracy reward: 1.0 if the final answer is correct, 0.0 if wrong (verified with math/code checkers)
Format reward: small positive reward for using the <think>...</think> format correctly

No process reward model. No human annotations. No step-level supervision. Just: did you get the right answer?
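The two rule-based rewards above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's actual reward code: the helper names match the ones used in the training pseudocode in this lesson, the exact-match check is a stand-in for a real math or code verifier, the 0.1 format bonus is an assumed value, and the sketch assumes the response text contains both the `<think>` and `<answer>` tags.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only.
THINK_ANSWER_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def has_proper_format(response: str) -> bool:
    """True if the response is a <think> block followed by an <answer> block."""
    return THINK_ANSWER_PATTERN.match(response) is not None


def extract_answer(response: str) -> Optional[str]:
    """Return the stripped text inside <answer>...</answer>, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None


def verify_answer(answer: Optional[str], ground_truth: str) -> float:
    """Accuracy reward: 1.0 on exact match after removing whitespace, else 0.0."""
    if answer is None:
        return 0.0
    return 1.0 if "".join(answer.split()) == "".join(ground_truth.split()) else 0.0


def total_reward(response: str, ground_truth: str, format_bonus: float = 0.1) -> float:
    """Sum of the accuracy reward and the (assumed) small format bonus."""
    accuracy = verify_answer(extract_answer(response), ground_truth)
    bonus = format_bonus if has_proper_format(response) else 0.0
    return accuracy + bonus
```

Because both checks are deterministic rules, there is no learned reward model for the policy to exploit; the trade-off is that this only works for tasks whose answers can be verified mechanically.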

The result was shocking: reasoning behavior emerged spontaneously from pure RL. The model learned to:

Generate extended thinking sequences before answering
Re-examine its work ("Wait, let me reconsider this")
Try alternative approaches when stuck
Self-verify answers before committing

This is a profound result. It suggests that reasoning is not a behavior that needs to be explicitly demonstrated - it's a strategy that an LLM will discover through RL if given the right reward signal and enough capacity.

# Pseudocode for GRPO training (the algorithm used for R1-Zero).
# The model methods (generate, log_probability) and the helpers extract_answer,
# verify_answer, has_proper_format, and compute_kl_divergence are placeholders.

import torch


def grpo_training_step(
    policy_model,
    ref_model,  # frozen reference policy for the KL penalty
    problem_batch,
    optimizer,
    group_size: int = 8,
    epsilon: float = 0.2,  # PPO-style clipping range
    kl_beta: float = 0.01,  # weight of the KL divergence penalty
):
    """
    Group Relative Policy Optimization (GRPO) training step.

    GRPO is a simplified PPO variant that:
    1. Samples G responses per question (the "group")
    2. Uses within-group advantage normalization (no separate value network)
    3. Applies standard policy gradient with clipping

    Key advantage over PPO: no need for a separate value/critic network,
    which is expensive and hard to train for long-horizon reasoning tasks.
    """
    total_loss = 0.0

    for problem in problem_batch:
        prompt = format_problem(problem)

        # Step 1: Sample G responses from the current policy and record their
        # log-probabilities at sampling time (the "old" policy, no gradients)
        responses, old_log_probs = [], []
        with torch.no_grad():
            for _ in range(group_size):
                response = policy_model.generate(
                    prompt=prompt,
                    max_tokens=4096,
                    temperature=0.8,
                )
                responses.append(response)
                old_log_probs.append(policy_model.log_probability(response))

        # Step 2: Score all responses
        rewards = []
        for response in responses:
            # Extract answer from <think>...</think><answer>...</answer> format
            answer = extract_answer(response)

            # Rule-based reward: correct = 1.0, incorrect = 0.0
            accuracy_reward = verify_answer(answer, problem.ground_truth)

            # Format reward: small bonus for proper formatting
            format_reward = 0.1 if has_proper_format(response) else 0.0

            rewards.append(float(accuracy_reward) + format_reward)

        # Step 3: Normalize rewards within the group (this is the "relative" part)
        rewards_tensor = torch.tensor(rewards, dtype=torch.float32)
        mean_reward = rewards_tensor.mean()
        std_reward = rewards_tensor.std() + 1e-8
        advantages = (rewards_tensor - mean_reward) / std_reward

        # Step 4: Policy gradient with PPO-style clipping
        group_loss = 0.0
        for response, log_prob_old, advantage in zip(responses, old_log_probs, advantages):
            # Probability ratio: current policy / sampling-time policy.
            # With a single update per batch the ratio is 1 and the clip is
            # inactive; it matters once the same samples are reused for
            # several updates.
            log_prob_new = policy_model.log_probability(response)
            ratio = torch.exp(log_prob_new - log_prob_old)

            # Clipped surrogate objective
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
            policy_loss = -torch.min(
                ratio * advantage,
                clipped_ratio * advantage
            ).mean()

            # KL divergence penalty (keeps policy close to the frozen reference)
            kl_penalty = kl_beta * compute_kl_divergence(policy_model, ref_model, response)

            group_loss = group_loss + policy_loss + kl_penalty

        # Average over the G responses in the group
        total_loss = total_loss + group_loss / group_size

    # Update policy
    loss = total_loss / len(problem_batch)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy_model.parameters(), 1.0)
    optimizer.step()

    return loss.item()


def format_problem(problem) -> str:
    """Format problem for R1-Zero training - minimal prompting."""
    return f"""A conversation between User and Assistant. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer.
User: {problem.question}
Assistant: <think>"""
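The clipping in Step 4 of the training step above is easier to see on plain numbers. The sketch below is a scalar, dependency-free version of the clipped surrogate term; the function name is hypothetical and the ratios and advantages are made-up inputs.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-response PPO-style objective term (to be maximized).

    ratio is pi_new(response) / pi_old(response); advantage is the
    group-normalized advantage of that response.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good response (advantage > 0): the gain is capped once ratio exceeds 1 + epsilon.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad response (advantage < 0) whose probability went up: the full penalty is kept.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
# Bad response whose probability already dropped below 1 - epsilon: the term is capped.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The effect is asymmetric by design: the policy gets no extra credit for moving further than the clip range in a helpful direction, but it is never shielded from the cost of moving in a harmful one.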

What Emerged in R1-Zero

The DeepSeek-R1 paper documents a fascinating phenomenon in R1-Zero: as training progressed, the model's thinking sequences grew longer, and performance on math benchmarks improved (pass@1 on AIME 2024 rose from 15.6% to 71.0%). The model developed behaviors that were never explicitly demonstrated:

Self-reflection: phrases like "Wait, I made an error" appeared spontaneously
Alternative approaches: "Let me try a different method"
Verification: "Let me double-check this by plugging back in"

However, R1-Zero had significant problems:

Language mixing: it would sometimes switch between Chinese and English mid-thought
Formatting inconsistency: thinking sequences were sometimes hard to parse
Lower ceiling: without any SFT bootstrapping, the maximum capability was limited

This is why DeepSeek then built R1.

DeepSeek-R1 - Adding SFT Cold Start

Full R1 adds a "cold start" supervised fine-tuning phase before the RL training. The insight: SFT does not teach the model to reason (RL does that); it teaches the model the format and language of reasoning, which makes the RL training more stable and efficient.

The four-phase pipeline:

Cold start SFT: a few thousand high-quality chain-of-thought examples in the <think>...</think> format
Reasoning RL: GRPO on math and code problems with rule-based rewards
Rejection sampling + general SFT: use the RL checkpoint from phase 2 to generate responses, keep only the correct ones, then fine-tune on a mix of reasoning + general instruction following
Full RL: final RL round with broader reward signals including safety and helpfulness

The cold start is crucial. Without it (R1-Zero), the model is erratic in format and language. With it, the RL training converges faster and to a higher-quality solution.
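The rejection sampling step in phase 3 of the pipeline can be sketched as a simple filter: sample several candidates per problem and keep only those a verifier accepts. Everything below is a toy stand-in under stated assumptions: `generate` and `is_correct` are hypothetical callables, and the toy generator only answers addition questions so the example runs without a model.

```python
import random
from typing import Callable, List, Tuple


def rejection_sample(
    problems: List[Tuple[str, str]],         # (question, ground_truth) pairs
    generate: Callable[[str], str],          # hypothetical: question -> candidate response
    is_correct: Callable[[str, str], bool],  # hypothetical verifier
    samples_per_problem: int = 4,
) -> List[Tuple[str, str]]:
    """Return (question, response) pairs whose response passes the verifier."""
    kept = []
    for question, truth in problems:
        for _ in range(samples_per_problem):
            response = generate(question)
            if is_correct(response, truth):
                kept.append((question, response))
    return kept


def make_toy_generator(seed: int = 0) -> Callable[[str], str]:
    """Toy 'model' for questions like '2+3': sometimes right, sometimes off by one."""
    rng = random.Random(seed)

    def generate(question: str) -> str:
        a, b = (int(x) for x in question.split("+"))
        return str(a + b + rng.choice([0, 1]))

    return generate


problems = [("1+1", "2"), ("2+3", "5"), ("10+7", "17")]
sft_data = rejection_sample(problems, make_toy_generator(), lambda r, t: r == t)
print(f"kept {len(sft_data)} of {len(problems) * 4} samples")
```

The surviving pairs become supervised fine-tuning data, so the quality of the verifier directly bounds the quality of the resulting dataset.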

GRPO - The Training Algorithm

GRPO (Group Relative Policy Optimization) is DeepSeek's training algorithm, described in their DeepSeekMath paper (Shao et al., 2024). Understanding it is key to reproducing the R1 results.

Why Not Standard PPO?

Standard Proximal Policy Optimization (PPO) requires:

A critic/value network that estimates expected future reward for each state
Computing generalized advantage estimates (GAE) using the critic

For reasoning tasks, the "state" is a partial sequence of thinking tokens, and the "reward" only comes at the very end (when the answer is verified). The credit assignment problem is severe: the critic has to estimate, from an intermediate reasoning step, what the eventual reward will be. This is difficult and requires a large, separate value network.

GRPO eliminates the value network by using within-group normalization:

$A_i = \frac{r_i - \text{mean}(\{r_1, ..., r_G\})}{\text{std}(\{r_1, ..., r_G\})}$

where $r_i$ is the reward for response $i$, and the mean and std are computed across the group of $G$ responses generated for the same question. This gives a relative advantage - did this response do better or worse than other responses to the same question?
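A small worked example makes the normalization concrete. The sketch below uses only the standard library; the function name is hypothetical, and it uses the sample standard deviation (the default of `torch.Tensor.std`), which is an assumption since the formula above does not say which estimator is meant.

```python
from statistics import mean, stdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same question."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std; requires at least two rewards
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct response out of four: it gets a large positive advantage,
# the three wrong ones share a smaller negative one (about 1.5 and -0.5).
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

# All four responses correct: every advantage is 0, so this question
# contributes no policy gradient signal at all.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second case shows a practical consequence of the design: questions that are too easy or too hard for the current policy (all samples right, or all wrong) produce zero advantage and are effectively skipped.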

The GRPO objective is:
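In the sequence-level form given in the DeepSeek-R1 report, and with the same advantage $A_i$ as above, it can be written as follows. The symbols $q$ (a question), $o_i$ (the $i$-th sampled response), $\pi_\theta$ (current policy), $\pi_{\theta_{\text{old}}}$ (sampling policy), and $\pi_{\text{ref}}$ (frozen reference policy) follow the report's notation and are not defined elsewhere in this lesson; $\rho_i$ is shorthand introduced here for the probability ratio.

```latex
\mathcal{J}_{\text{GRPO}}(\theta) =
\mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)}
\left[ \frac{1}{G} \sum_{i=1}^{G}
\Big( \min\big( \rho_i A_i,\ \text{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, A_i \big)
- \beta\, \mathbb{D}_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \Big) \right],
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}

\mathbb{D}_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) =
\frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)}
- \log \frac{\pi_{\text{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1
```

The objective is maximized, which is why the training code minimizes its negative. Here $\varepsilon$ corresponds to `epsilon` and $\beta$ to `kl_beta` in the pseudocode earlier in this lesson.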
