
DeepSeek-R1 - Open Source Reasoning

The Night the Research Community Changed

January 20, 2025. DeepSeek releases DeepSeek-R1. Within 48 hours, the AI research community has a fully public, open-weights reasoning model that matches OpenAI o1 on most benchmarks. The technical report - exhaustive, transparent, reproducible - describes exactly how they trained it. The weights are downloadable. The training algorithm is documented. Smaller distilled versions run on consumer hardware.

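For context: OpenAI is estimated to have spent somewhere between $50M and $100M training GPT-4. DeepSeek reported approximately $5.6M in GPU costs for the final training run of DeepSeek-V3, the base model R1 is built on (a figure that excludes earlier research and experiments as well as the R1-specific RL stages). They did it in China. They open-sourced it. And then, for good measure, they released a technical report that told the entire world exactly how they did it.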

The timing was remarkable. For months, the research community had been speculating about what was inside o1. How do you train a model to reason? What exactly is the RL procedure? What reward signal do you use? OpenAI stayed quiet. DeepSeek answered all of it.

This lesson is about what DeepSeek-R1 actually is, how it was built, and why it matters both technically and strategically.


Why This Exists - The Closed vs. Open Split

The development of o1 created an uncomfortable situation for the AI research community. Here was arguably the most significant advance in LLM capability in two years, and it was essentially a black box. The system card gave hints. Papers from Google Brain and academic labs explored similar ideas. But no one had a reproducible, verifiable, open implementation of the core idea.

This matters for several reasons. First, it limits scientific progress - you can't build on, critique, or improve what you can't examine. Second, it creates a significant moat for OpenAI - if only they know how to train reasoning models, they maintain a durable competitive advantage. Third, it creates safety concerns - aligned AI development benefits from open scrutiny, and a key safety-relevant training paradigm being proprietary is worrying.

DeepSeek-R1 broke this dynamic. By publishing the full training procedure and releasing weights, they effectively open-sourced the reasoning model paradigm. Within weeks of release, other labs (Qwen, Mistral, smaller research groups) were replicating and extending the results.


Historical Context - DeepSeek's Approach

DeepSeek AI is a Chinese AI lab founded in 2023. They moved extremely fast. By mid-2024 they had released DeepSeek-V2, a powerful mixture-of-experts model. DeepSeek-Coder showed strong code capabilities. Then came R1.

The key intellectual insight behind R1 came from asking a bold question: can you get strong reasoning through pure reinforcement learning, without any supervised fine-tuning on reasoning demonstrations?

The standard assumption in the field was: you need human-demonstrated reasoning chains to bootstrap the model. You do SFT first to teach the model what reasoning looks like, then RL to refine it. This is what o1 likely does. It's what DeepSeek initially assumed as well.

The R1-Zero experiment challenged this assumption directly.


R1-Zero - Pure RL Without SFT Bootstrapping

R1-Zero is DeepSeek's experimental model trained with pure reinforcement learning, starting directly from the pre-trained base model with no supervised fine-tuning on reasoning demonstrations whatsoever.

The training procedure:

  1. Start with a pre-trained base LLM (DeepSeek-V3 base)
  2. Apply GRPO (Group Relative Policy Optimization) with rule-based rewards
  3. Use no human-demonstrated reasoning chains

The rewards used for R1-Zero are purely rule-based:

  • Accuracy reward: 1.0 if the final answer is correct, 0.0 if wrong (verified with math/code checkers)
  • Format reward: small positive reward for using the <think>...</think> format correctly

No process reward model. No human annotations. No step-level supervision. Just: did you get the right answer?
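The two rule-based rewards above can be sketched as a small scoring function. This is a minimal sketch, not DeepSeek's implementation: the `<answer>` tag convention, the 0.1 format bonus, the exact-string comparison (standing in for a real math or code checker), and the function names are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed output convention: <think>reasoning</think><answer>final answer</answer>
_FORMAT_RE = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def has_proper_format(response: str) -> bool:
    """True if the response is a think block followed by an answer block."""
    return _FORMAT_RE.match(response) is not None


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside <answer>...</answer>, or None if the format is wrong."""
    match = _FORMAT_RE.match(response)
    return match.group("answer").strip() if match else None


def rule_based_reward(response: str, ground_truth: str, format_bonus: float = 0.1) -> float:
    """Accuracy reward (1.0 or 0.0) plus a small format bonus (hypothetical value)."""
    answer = extract_answer(response)
    # Exact string match stands in for a real math or code verifier.
    accuracy = 1.0 if answer is not None and answer == ground_truth.strip() else 0.0
    fmt = format_bonus if has_proper_format(response) else 0.0
    return accuracy + fmt


good = "<think>2 + 2 is 4.</think><answer>4</answer>"
wrong = "<think>2 + 2 is 5.</think><answer>5</answer>"
unformatted = "The answer is 4."

for name, text in [("good", good), ("wrong", wrong), ("unformatted", unformatted)]:
    print(name, rule_based_reward(text, "4"))
```

Note that nothing here inspects the content of the thinking block: a wrong answer in the right format still earns only the format bonus, so the only way to earn the large reward is to get the answer right.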

The result was shocking: reasoning behavior emerged spontaneously from pure RL. The model learned to:

  • Generate extended thinking sequences before answering
  • Re-examine its work ("Wait, let me reconsider this")
  • Try alternative approaches when stuck
  • Self-verify answers before committing

This is a profound result. It suggests that reasoning is not a behavior that needs to be explicitly demonstrated - it's a strategy that an LLM will discover through RL if given the right reward signal and enough capacity.

# Pseudocode for GRPO training (the algorithm used for R1-Zero).
# Helpers such as extract_answer, verify_answer, has_proper_format, and
# compute_kl_divergence are assumed to exist and are not defined here.
import torch


def grpo_training_step(
    policy_model,
    problem_batch,
    optimizer,
    group_size: int = 8,
    epsilon: float = 0.2,   # PPO-style clipping
    kl_beta: float = 0.01,  # KL divergence penalty
):
    """
    Group Relative Policy Optimization (GRPO) training step.

    GRPO is a simplified PPO variant that:
    1. Samples G responses per question (the "group")
    2. Uses within-group advantage normalization (no separate value network)
    3. Applies standard policy gradient with clipping

    Key advantage over PPO: no need for a separate value/critic network,
    which is expensive and hard to train for long-horizon reasoning tasks.
    """
    total_loss = 0.0

    for problem in problem_batch:
        # Step 1: Sample G responses from current policy
        responses = []
        for _ in range(group_size):
            response = policy_model.generate(
                prompt=format_problem(problem),
                max_tokens=4096,
                temperature=0.8,
            )
            responses.append(response)

        # Step 2: Score all responses
        rewards = []
        for response in responses:
            # Extract answer from <think>...</think><answer>...</answer> format
            answer = extract_answer(response)

            # Rule-based reward: correct = 1.0, incorrect = 0.0
            accuracy_reward = verify_answer(answer, problem.ground_truth)

            # Format reward: small bonus for proper formatting
            format_reward = 0.1 if has_proper_format(response) else 0.0

            rewards.append(accuracy_reward + format_reward)

        # Step 3: Normalize rewards within the group (this is the "relative" part)
        rewards_tensor = torch.tensor(rewards, dtype=torch.float32)
        mean_reward = rewards_tensor.mean()
        std_reward = rewards_tensor.std() + 1e-8
        advantages = (rewards_tensor - mean_reward) / std_reward

        # Step 4: Policy gradient with PPO-style clipping
        for response, advantage in zip(responses, advantages):
            # Compute probability ratio: new policy / old policy
            log_prob_new = policy_model.log_probability(response)
            # With one gradient step per sampled batch, the old policy equals
            # the current one: the ratio evaluates to 1, but gradients still
            # flow through log_prob_new.
            log_prob_old = log_prob_new.detach()
            ratio = torch.exp(log_prob_new - log_prob_old)

            # Clipped surrogate objective
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
            policy_loss = -torch.min(
                ratio * advantage,
                clipped_ratio * advantage,
            ).mean()

            # KL divergence penalty (keeps policy close to the reference model)
            kl_penalty = kl_beta * compute_kl_divergence(policy_model, response)

            # Average over the group so the loss scale is independent of G
            total_loss = total_loss + (policy_loss + kl_penalty) / group_size

    # Update policy
    optimizer.zero_grad()
    (total_loss / len(problem_batch)).backward()
    torch.nn.utils.clip_grad_norm_(policy_model.parameters(), 1.0)
    optimizer.step()

    return total_loss.item() / len(problem_batch)


def format_problem(problem) -> str:
    """Format problem for R1-Zero training - minimal prompting."""
    return f"""A conversation between User and Assistant. The Assistant first thinks about the reasoning process in the mind and then provides the User with the answer.
User: {problem.question}
Assistant: <think>"""

What Emerged in R1-Zero

The R1 technical report documents a fascinating phenomenon in R1-Zero: as training progressed, the model's thinking sequences grew longer, and performance on math benchmarks improved. The model developed behaviors that were never explicitly demonstrated:

  • Self-reflection: phrases like "Wait, I made an error" appeared spontaneously
  • Alternative approaches: "Let me try a different method"
  • Verification: "Let me double-check this by plugging back in"

However, R1-Zero had significant problems:

  • Language mixing: it would sometimes switch between Chinese and English mid-thought
  • Formatting inconsistency: thinking sequences were sometimes hard to parse
  • Lower ceiling: without any SFT bootstrapping, the maximum capability was limited

This is why DeepSeek then built R1.


DeepSeek-R1 - Adding SFT Cold Start

Full R1 adds a "cold start" supervised fine-tuning phase before the RL training. The insight: SFT doesn't teach the model to reason (the RL will do that) but it teaches the model the format and language of reasoning, making the RL training more stable and efficient.

The four-phase pipeline:

  1. Cold start SFT: a few thousand high-quality chain-of-thought examples in the <think>...</think> format
  2. Reasoning RL: GRPO on math and code problems with rule-based rewards
  3. Rejection sampling + general SFT: use R1 checkpoint to generate good responses, keep only correct ones, then fine-tune on a mix of reasoning + general instruction following
  4. Full RL: final RL round with broader reward signals including safety and helpfulness

The cold start is crucial. Without it (R1-Zero), the model is erratic in format and language. With it, the RL training converges faster and to a higher-quality solution.


GRPO - The Training Algorithm

GRPO (Group Relative Policy Optimization) is DeepSeek's training algorithm, described in their DeepSeekMath paper (Shao et al., 2024). Understanding it is key to reproducing the R1 results.

Why Not Standard PPO?

Standard Proximal Policy Optimization (PPO) requires:

  1. A critic/value network that estimates expected future reward for each state
  2. Computing generalized advantage estimates (GAE) using the critic

For reasoning tasks, the "state" is a partial sequence of thinking tokens, and the "reward" only comes at the very end (when the answer is verified). The credit assignment problem is severe: the critic has to estimate, from an intermediate reasoning step, what the eventual reward will be. This is difficult and requires a large, separate value network.
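To make the critic dependency concrete, here is a minimal sketch of generalized advantage estimation for a single response whose only reward arrives on the last token. The value estimates are made-up numbers standing in for a critic's output, and `gae_advantages` is an illustrative name, not a library function.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation over one trajectory.

    rewards[t] is the reward received after token t; values[t] is the critic's
    estimate of future reward before token t. The value after the final token
    is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better things went than the critic predicted
        delta = rewards[t] + gamma * next_value - values[t]
        advantages[t] = delta + gamma * lam * next_advantage
        next_value = values[t]
        next_advantage = advantages[t]
    return advantages


# Five reasoning tokens, reward only at the end (answer verified correct).
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
# Hypothetical critic outputs: every per-token advantage depends on these.
values = [0.3, 0.35, 0.5, 0.7, 0.9]

print(gae_advantages(rewards, values))
# With lam = 1 and gamma = 1, GAE reduces to (final reward - value estimate).
print(gae_advantages(rewards, values, lam=1.0))
```

Every entry in the result depends on `values`, so an inaccurate critic distorts the learning signal for every token of the reasoning chain.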

GRPO eliminates the value network by using within-group normalization:

$$A_i = \frac{r_i - \text{mean}(\{r_1, ..., r_G\})}{\text{std}(\{r_1, ..., r_G\})}$$

where $r_i$ is the reward for response $i$, and the mean and std are computed across the group of $G$ responses generated for the same question. This gives a relative advantage: did this response do better or worse than other responses to the same question?
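As a worked example of the group-relative advantage, the sketch below normalizes rewards for one group of eight sampled responses. It uses the sample standard deviation, which matches the default behavior of `torch.std`; the choice between sample and population forms and the `1e-8` stabilizer are assumptions here, and `group_advantages` is an illustrative name.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r_i - mean) / (std + eps)."""
    mean_r = statistics.fmean(rewards)
    # Sample standard deviation (n - 1 denominator); needs at least 2 rewards.
    std_r = statistics.stdev(rewards)
    return [(r - mean_r) / (std_r + eps) for r in rewards]


# Eight responses to the same question: two correct, six wrong.
mixed = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
for reward, adv in zip(mixed, group_advantages(mixed)):
    print(f"reward={reward:.1f}  advantage={adv:+.3f}")

# If every response gets the same reward, all advantages are zero and the
# group contributes no policy-gradient signal.
print(group_advantages([1.0] * 8))
```

In the mixed group, the two correct responses get an advantage of about +1.62 and the six wrong ones about -0.54: the rarer a correct answer is within its group, the more strongly it is reinforced.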

The GRPO objective is:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \min\left( \frac{\pi_\theta(o_i | q)}{\pi_{\theta_{\text{old}}}(o_i | q)} A_i, \text{clip}\left(\frac{\pi_\theta(o_i | q)}{\pi_{\theta_{\text{old}}}(o_i | q)}, 1-\epsilon, 1+\epsilon\right) A_i \right) + \beta \cdot D_{\text{KL}}(\pi_\theta || \pi_{\text{ref}})$$

This is written as a loss to minimize, where $q$ is the question, $o_i$ is the $i$-th response, $\pi_\theta$ is the current policy, $\pi_{\theta_{\text{old}}}$ is the old policy (from the previous update), $\pi_{\text{ref}}$ is the reference policy (cold start checkpoint), and $\beta$ is the KL penalty coefficient.

The key insight: by sampling $G$ responses per question and normalizing advantages within the group, you get a meaningful advantage signal without a separate critic network. This reduces memory and compute requirements significantly.

```python
import torch
from typing import List, Optional


def compute_grpo_loss(
    log_probs_new: torch.Tensor,  # [G, seq_len] - log probs under new policy
    log_probs_old: torch.Tensor,  # [G, seq_len] - log probs under old policy
    log_probs_ref: torch.Tensor,  # [G, seq_len] - log probs under reference policy
    rewards: List[float],         # [G] - reward for each response
    epsilon: float = 0.2,
    beta: float = 0.01,
    mask: Optional[torch.Tensor] = None,  # [G, seq_len] - which tokens to include
) -> torch.Tensor:
    """
    Compute GRPO loss for a group of G responses to the same question.

    Args:
        log_probs_new: Token log probabilities under current policy
        log_probs_old: Token log probabilities under old policy (before this update)
        log_probs_ref: Token log probabilities under reference policy (cold start)
        rewards: Scalar reward for each complete response
        epsilon: PPO clipping parameter
        beta: KL divergence penalty coefficient
        mask: Which tokens to include in loss (1 = include, 0 = exclude)

    Returns:
        Scalar GRPO loss
    """
    # Compute group-normalized advantages
    rewards_tensor = torch.tensor(
        rewards, dtype=torch.float32, device=log_probs_new.device
    )
    mean_r = rewards_tensor.mean()
    std_r = rewards_tensor.std() + 1e-8
    advantages = (rewards_tensor - mean_r) / std_r  # [G]

    # Sum token log probs over the sequence to get full-sequence log probs.
    # This sketch uses a sequence-level ratio for simplicity; the DeepSeekMath
    # formulation applies the ratio and clipping per token.
    if mask is None:
        mask = torch.ones_like(log_probs_new)
    seq_log_probs_new = (log_probs_new * mask).sum(dim=-1)  # [G]
    seq_log_probs_old = (log_probs_old * mask).sum(dim=-1)  # [G]
    seq_log_probs_ref = (log_probs_ref * mask).sum(dim=-1)  # [G]

    # Probability ratio: new policy / old policy
    ratio = torch.exp(seq_log_probs_new - seq_log_probs_old)  # [G]

    # Clipped surrogate objective
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    policy_loss = -surrogate.mean()

    # KL divergence penalty: keep new policy close to reference.
    # Simple estimate via the mean log ratio on sampled responses; it can be
    # negative on a single batch.
    kl_div = (seq_log_probs_new - seq_log_probs_ref).mean()
    kl_loss = beta * kl_div

    total_loss = policy_loss + kl_loss
    return total_loss
```

---

## Distillation - Spreading Reasoning Capability

One of the most practically important contributions of R1 is the distillation results. DeepSeek used full R1 to generate training data, then fine-tuned smaller models on that data.
The result: small models (1.5B, 7B, 14B, 32B parameters) with reasoning capability far beyond what their size would suggest.

The distillation procedure:

1. Generate many problems in the reasoning domain (math, code, logic)
2. Use R1 to generate high-quality chain-of-thought solutions (with thinking tokens)
3. Filter to keep only correct solutions
4. Fine-tune smaller base models on this (problem, thinking, answer) dataset

This is **knowledge distillation through behavioral cloning** - the small model learns to imitate the reasoning behavior of R1, not by having R1's architecture, but by training on its outputs.

The results are impressive (figures as reported in the R1 paper):

| Model | AIME 2024 | MATH-500 | Codeforces rating |
|-------|-----------|----------|-------------------|
| R1-Distill-Qwen-1.5B | 28.9% | 83.9% | 954 |
| R1-Distill-Qwen-7B | 55.5% | 92.8% | 1189 |
| R1-Distill-Qwen-14B | 69.7% | 93.9% | 1481 |
| R1-Distill-Qwen-32B | 72.6% | 94.3% | 1691 |
| Full R1 (671B MoE) | 79.8% | 97.3% | 2029 |
| o1-mini (OpenAI) | 63.6% | 90.0% | 1820 |

A 7B distilled model beats o1-mini on MATH-500. A 14B model surpasses o1-mini on both AIME 2024 and MATH-500, though it still trails on Codeforces. This is extraordinary - and it means high-quality reasoning is now available on hardware that anyone can run.

```python
# Distillation data generation pipeline.
# verify_answer, teacher_model.generate_with_thinking, and
# student_model.compute_sft_loss are assumed helpers, not a real library API.

def generate_distillation_dataset(
    teacher_model,  # Full R1 or equivalent
    problems: list,
    n_samples_per_problem: int = 8,
    min_correct_per_problem: int = 1,
) -> list:
    """
    Generate a distillation dataset by:
    1. Sampling multiple solutions from the teacher
    2. Filtering to correct solutions only
    3. Collecting (problem, thinking, answer) triples

    Args:
        teacher_model: The large reasoning model to distill from
        problems: List of problems with ground truth answers
        n_samples_per_problem: How many solutions to sample
        min_correct_per_problem: Minimum correct solutions required

    Returns:
        List of training examples for distillation
    """
    distillation_data = []
    n_covered = 0  # problems with enough correct teacher solutions

    for problem in problems:
        correct_solutions = []

        for _ in range(n_samples_per_problem):
            # Generate thinking + answer from teacher
            thinking, answer = teacher_model.generate_with_thinking(
                problem=problem.question,
                max_thinking_tokens=8192,
                temperature=0.7,
            )

            # Verify correctness
            if verify_answer(answer, problem.ground_truth):
                correct_solutions.append({
                    "problem": problem.question,
                    "thinking": thinking,
                    "answer": answer,
                })

        # Only include problems where the teacher succeeded often enough
        if len(correct_solutions) >= min_correct_per_problem:
            # Add all correct solutions (more data from this problem)
            distillation_data.extend(correct_solutions)
            n_covered += 1

    print(f"Generated {len(distillation_data)} training examples")
    # Coverage counts problems, not examples, so it never exceeds 100%
    print(f"Coverage: {n_covered / max(len(problems), 1):.1%} of problems")

    return distillation_data


def train_distilled_model(
    student_model,
    distillation_data: list,
    optimizer,
    n_epochs: int = 3,
    batch_size: int = 16,
):
    """
    Fine-tune a small model on distillation data.
    This is standard supervised fine-tuning - just on teacher outputs.
    """
    for epoch in range(n_epochs):
        total_loss = 0.0
        n_batches = 0

        for batch_start in range(0, len(distillation_data), batch_size):
            batch = distillation_data[batch_start:batch_start + batch_size]

            # Format each example as the expected output format
            formatted_examples = []
            for example in batch:
                formatted = (
                    f"User: {example['problem']}\n"
                    f"Assistant: <think>\n{example['thinking']}\n</think>\n\n"
                    f"{example['answer']}"
                )
                formatted_examples.append(formatted)

            # Standard language modeling loss (predict next token)
            loss = student_model.compute_sft_loss(formatted_examples)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            n_batches += 1

        # Divide by the actual number of batches (the last one may be partial)
        avg_loss = total_loss / max(n_batches, 1)
        print(f"Epoch {epoch+1}: avg loss = {avg_loss:.4f}")
```

---

## Performance on Key Benchmarks

R1's performance on the key reasoning benchmarks that define the state of the art:

### AIME 2024 (American Invitational Mathematics Examination)

AIME is a 15-problem math competition for high school students. Each answer is an integer from 0 to 999, making verification easy. R1 achieved 79.8% on AIME 2024, compared to 74.4% for o1 (the R1 paper lists 79.2% for the later o1-1217 snapshot).

### MATH-500

A subset of 500 problems from the MATH benchmark (competition math from AMC to IMO level). R1: 97.3%, o1: 96.4%. Both are dramatically better than GPT-4o (74.6%).

### Codeforces

Competitive programming. R1 reached a Codeforces rating of 2029, placing it in roughly the top 4% of human participants. The rating reflects performance on contest problems across all difficulty levels, not a cherry-picked subset.

### GPQA Diamond

PhD-level science questions (chemistry, biology, physics) that require graduate expertise. R1: 71.5%, o1: 77.3% (the R1 paper lists 75.7% for o1-1217). Here o1 has an edge.

---

## The Open-Source Significance

The release of R1 with open weights changed the competitive landscape in several ways:

**Research acceleration**: within weeks, researchers worldwide were studying, extending, and improving on R1's training approach.
Papers appeared replicating R1-Zero on smaller models, analyzing what reasoning behaviors emerged, and proposing improvements to GRPO.

**Commercial accessibility**: organizations that cannot afford OpenAI o1/o3 API costs can run R1-Distill-7B or 14B locally. For high-volume reasoning tasks, the economics are dramatically different.

**Verifiability**: unlike o1, R1's reasoning traces are fully visible. The thinking tokens are in the output. This enables process-level evaluation, interpretability research, and debugging.

**Training reproducibility**: the technical report is detailed enough that groups with sufficient compute can replicate R1 training. Several labs have done so.

The strategic implications extend beyond AI research. The fact that a Chinese lab produced a model matching US frontier labs at a fraction of the cost, open-sourced it, and published the full methodology upended assumptions about AI development concentration.

---

## Cost Comparison

| Aspect | OpenAI o1 | DeepSeek-R1 |
|--------|-----------|-------------|
| Training cost | ~$100M (estimate for GPT-4, not o1 itself) | ~$5.6M (reported for the DeepSeek-V3 base model's final run) |
| Model weights available | No | Yes (open) |
| Technical report | System card only | Full methodology |
| API cost (per 1M output tokens) | ~$60 | ~$2.19 |
| Self-hosted option | No | Yes (671B MoE, requires 8+ H100-class GPUs) |
| Distilled small models | No | Yes (1.5B to 32B) |

The cost difference is striking. DeepSeek was able to match o1 capability at roughly 1/27th the API price per output token. Part of this is the underlying model (DeepSeek-V3 is an MoE that activates only about 37B of its 671B parameters per token), and part is hardware efficiency improvements the team developed. On the training side, GRPO is also cheaper than PPO because it needs no separate critic network.

---

:::danger Common Mistake: Assuming R1-Zero Works on Small Models
The R1-Zero experiment (pure RL without SFT) works because the base model (DeepSeek-V3, 671B parameters) already has strong pre-trained reasoning capability. Applying pure GRPO to a much smaller base model without an SFT cold start is far less reliable: small-scale replications exist, but the R1 report found that large-scale RL on a 32B base model fell well short of simply distilling from R1. For smaller models, prefer distillation or the SFT cold start + RL pipeline.
:::

:::warning GRPO Reward Hacking
Like all RL systems, GRPO-trained models can learn to hack the reward function. R1 uses rule-based rewards (math/code verification) which are relatively hack-resistant. If you try to train a reasoning model with LLM-based rewards (e.g., "ask GPT-4 if this reasoning is good"), the model will learn to generate reasoning that looks good to the judge, not reasoning that actually works. Use verifiable reward signals wherever possible.
:::

:::tip Using Distilled R1 Models in Production
For most engineering use cases, R1-Distill-Qwen-14B or R1-Distill-Qwen-32B is the right choice: strong reasoning capability, runs on accessible hardware (2–4 A100s for 32B), and the thinking tokens are visible so you can debug failures. Full R1 (671B MoE) requires 8+ H100-class GPUs and is appropriate only for the hardest problems or when you need the maximum capability. The 7B distilled model is excellent for experimentation, testing pipelines, and lower-stakes applications.
:::

---

## Interview Questions and Answers

**Q1: What is R1-Zero and why is it significant?**

R1-Zero is DeepSeek's model trained with pure reinforcement learning from a pre-trained base, with no supervised fine-tuning on reasoning demonstrations. Its significance is that it demonstrated reasoning behavior can emerge spontaneously from RL with verifiable rewards - the model learned to generate extended thinking, backtrack, and self-verify without ever being shown an example of these behaviors. This challenged the assumption that SFT bootstrapping is necessary for reasoning models. It also provided evidence that reasoning is, in some sense, a natural strategy for a capable model to discover when rewarded for correct answers.

**Q2: Explain GRPO and how it differs from PPO.**

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that eliminates the need for a separate value/critic network. Standard PPO requires a value network to estimate expected future rewards from intermediate states - for long-horizon reasoning tasks, this is expensive and hard to train accurately. GRPO instead samples G responses per question, computes rewards for all G, then normalizes advantages within the group (each response's advantage is how much better it did than the group average, scaled by the group's standard deviation). This within-group normalization provides a meaningful advantage signal without a critic. The policy update then uses standard PPO-style clipping to prevent large policy changes.

**Q3: What is the R1 distillation process and what does it achieve?**

R1 distillation uses the full R1 model (671B) as a teacher to generate high-quality (problem, thinking, answer) examples. A smaller student model is then fine-tuned on these examples using standard SFT. The student learns to imitate R1's reasoning style and structure without going through the expensive RL training process. The results show remarkable capability preservation: a 7B distilled model achieves 55.5% on AIME 2024, significantly better than what a 7B model trained normally would achieve, and competitive with much larger models. This makes strong reasoning accessible on consumer-grade hardware.

**Q4: Why is the open-source release of R1 important for the field?**

It's important for multiple reasons. Scientifically: it provided the first fully transparent, reproducible implementation of the o1-style reasoning paradigm, enabling the research community to study, critique, and extend it. Commercially: organizations can deploy R1 or its distilled variants without API costs, making strong reasoning economically accessible. For safety: R1's thinking tokens are visible, enabling process-level evaluation and interpretability research that is impossible with o1. Competitively: it demonstrated that the reasoning capability gap between frontier labs was not as large as assumed, and that efficient training can largely compensate for smaller compute budgets.

**Q5: What are the key limitations of DeepSeek-R1 compared to o3?**

R1 lags behind o3 on:

1. GPQA Diamond (PhD science questions) - o1 already outperforms R1 here, suggesting o3 likely widens the gap on scientific reasoning.
2. ARC-AGI - R1's performance on ARC-AGI tasks that require genuine novel pattern discovery is not as strong as o3's reported 87.5% (high-compute setting).
3. Safety and alignment - R1 was trained primarily on reasoning tasks; its broader safety alignment is not as thoroughly evaluated as OpenAI's models.
4. Multilingual reasoning - R1-Zero had language mixing issues, and while full R1 improved on this, it is optimized for English and Chinese and can still mix languages on queries in other languages.

**Q6: How would you set up a local reasoning pipeline using R1-Distill-Qwen-14B?**

Using vLLM for efficient serving:

1. Install vLLM and download the model weights from Hugging Face (`deepseek-ai/DeepSeek-R1-Distill-Qwen-14B`).
2. Start a vLLM server with `python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --max-model-len 32768`.
3. Set temperature to around 0.6 for reasoning tasks (DeepSeek's model card recommends 0.5–0.7; some sampling diversity is needed for self-consistency).
4. Parse the output by splitting on the `</think>` token - everything before is the thinking, everything after is the answer.
5. For best accuracy: generate 4–8 responses and take majority vote.
6. Hardware requirement: the 14B weights take roughly 30 GB in BF16, so a single 80 GB A100 leaves room for the KV cache; a 4-bit quantized version (AWQ/GPTQ, roughly 9 GB of weights) fits on a single 24 GB GPU such as an RTX 3090.

:::tip 🎮 Interactive Playground
**Visualize this concept:** Try the **[Monte Carlo Tree Search for LLM Reasoning](/playground/mcts-reasoning)** demo on the EngineersOfAI Playground - no code required.
:::
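The parsing and majority-vote steps from Q6 (splitting on `</think>`, then voting across several samples) can be sketched without any serving code. This is a minimal illustration under stated assumptions: the sample strings are invented, answers are compared by exact match after whitespace stripping, and `split_thinking` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter
from typing import List, Tuple


def split_thinking(output: str) -> Tuple[str, str]:
    """Split raw model output into (thinking, answer) at the last </think> tag."""
    thinking, sep, answer = output.rpartition("</think>")
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", output.strip()
    return thinking.replace("<think>", "", 1).strip(), answer.strip()


def majority_vote(outputs: List[str]) -> str:
    """Return the most common final answer; ties go to the earliest seen."""
    answers = [split_thinking(o)[1] for o in outputs]
    return Counter(answers).most_common(1)[0][0]


# Invented sample outputs for the question "What is 17 * 3 + 4?"
samples = [
    "<think>17 * 3 = 51, plus 4 is 55.</think>\n\n55",
    "<think>17 * 3 = 41? Wait, 51. Plus 4 is 55.</think>\n\n55",
    "<think>17 * 3 = 51, plus 4 is 54.</think>\n\n54",
]
print(majority_vote(samples))
```

In a real pipeline the answers usually need normalization (for example, extracting a boxed value or stripping units) before voting, since exact string match treats "55" and "55.0" as different answers.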
© 2026 EngineersOfAI. All rights reserved.