
Qwen, DeepSeek, and International Models

A Production Wake-Up Call

It is a Tuesday in March 2024 and your team is three weeks from shipping a multilingual customer support system for a fintech client in Southeast Asia. The system needs to handle Thai, Vietnamese, Indonesian, and Mandarin alongside English. You have been building on LLaMA-2 70B because it is what everyone uses. Benchmark scores look fine. Then your QA lead runs a stress test on Thai-language edge cases and the results are brutal: 34% of responses contain factual errors, 19% switch mid-sentence to English, and 8% produce grammatically broken Thai that would embarrass a first-year student.

You call an emergency architecture review. Someone mentions Qwen, Alibaba's open-source model family. You are skeptical - you have never deployed anything from a Chinese lab. But you pull the numbers anyway. Qwen2 72B has been pretrained on 7 trillion tokens with explicit attention to Asian language quality. The multilingual benchmarks tell a different story than your LLaMA experiments. You spin up an A100 instance, run your Thai test set, and the error rate drops to 11%. You look at the Vietnamese cases next. Same pattern.

Three weeks later you ship with Qwen2 72B as the backbone. The client never knows there was a crisis. But you learned something that does not show up in the standard English leaderboards: the choice of which lab trained your model matters for which languages it can actually use in production.

This lesson is about what happens when you look past the American lab orthodoxy. Alibaba, DeepSeek, and 01.AI have published genuinely novel architectures - not just models but fundamental ideas about how transformers should be built. DeepSeek's Multi-head Latent Attention compresses the KV cache in a way that changes what is economically feasible to deploy. DeepSeek-R1 demonstrated that chain-of-thought reasoning at GPT-4 quality does not require GPT-4 prices. These are not incremental improvements. They are architectural bets that the rest of the industry is now being forced to reckon with.

Understanding this landscape is not about geopolitics. It is about having the full engineering toolkit available when a production problem does not care which country the best solution came from.


Why This Exists - The Anglocentrism Problem in LLM Research

The original LLM research wave was overwhelmingly English-first. GPT-2, GPT-3, LLaMA, Falcon, Mistral - all of these models treat English as the primary language and everything else as secondary. This is not malicious; it reflects where the training data was available, where the researchers were located, and where the initial commercial demand came from.

But the world is not English. Over half of all internet content is now in languages other than English. The largest e-commerce markets, the fastest-growing mobile user bases, the highest-volume customer support operations - many of these are in Mandarin, Hindi, Arabic, Indonesian, and dozens of other languages. A model that performs at GPT-4 level on English but degrades to GPT-2 level on Thai is not a general-purpose model. It is an English-language model with multilingual marketing copy.

Chinese labs noticed this gap and had a structural advantage in closing it. Alibaba has petabytes of Mandarin, Cantonese, and Southeast Asian language content from its e-commerce and cloud platforms. DeepSeek had access to massive Chinese-language code and mathematics corpora. These are not synthetic advantages - they represent decades of real human communication in languages that Western labs had to scrape from lower-quality sources.

The second problem was architectural stagnation. By 2023, the dominant open-source recipe was: take LLaMA, change the hyperparameters, train longer. Grouped Query Attention, popularized by LLaMA-2, was the last major innovation that most labs adopted. DeepSeek looked at the KV cache problem - the fact that serving large models requires enormous GPU memory just to hold the attention state for long contexts - and built a mathematically principled solution. Multi-head Latent Attention compresses the key-value cache by projecting it into a low-rank latent space. The inference cost reduction is real and significant, not a benchmark artifact.

The result is that in 2024-2025, if you are building for production and not evaluating Chinese open-source models, you are leaving architectural improvements and multilingual quality on the table.


Historical Context - How Chinese AI Labs Found Their Voice

The narrative arc is worth understanding because it explains the current landscape.

2019-2022: Following the Playbook

Chinese AI research in this period largely followed Western directions. ERNIE (Baidu, 2019) was a BERT variant adapted for Chinese. PanGu-Alpha (Huawei, 2021) was a Chinese-language GPT-3 analog. These were high-quality engineering efforts but not architectural innovations. The Chinese labs were catching up, not leading.

2023: The Qwen Bet

Alibaba released Qwen in August 2023. The decision to open-source was not inevitable - Alibaba had commercial reasons to keep these weights proprietary. The bet paid off in community adoption and in forcing the team to document and improve their training pipeline rigorously. Qwen-7B and Qwen-14B immediately showed that a model trained on quality multilingual data could outperform larger models on non-English benchmarks.

2024: DeepSeek Changes the Architecture Conversation

DeepSeek-V2 (May 2024) was the moment the technical community outside China started paying close attention. The model was 236B total parameters but only 21B were active at any time - a Mixture of Experts architecture. But the architectural novelty was Multi-head Latent Attention: instead of storing separate K and V matrices for every attention head (expensive in memory), the model learns to project keys and values into a shared low-rank latent vector and then reconstructs per-head representations from it. This cut the KV cache memory requirement by roughly 93% compared to standard multi-head attention. At the scale of serving a 236B MoE model to millions of users, this is the difference between profitable and unprofitable inference.

2025: DeepSeek-R1 and the Reasoning Moment

DeepSeek-R1 (January 2025) was a different kind of shock. OpenAI's o1 model had established chain-of-thought reasoning at a premium price point - roughly $15 per million input tokens and $60 per million output tokens. DeepSeek-R1 matched o1's performance on mathematical and coding benchmarks and was open-source, running at a fraction of the cost. The "aha moment" was not the benchmark number - it was the realization that the reasoning capability of o1 was not a proprietary black-box achievement but something reproducible through careful reinforcement learning on verifiable tasks.


Core Concepts

The Qwen2.5 Family - Scale and Breadth

Qwen2.5 (September 2024) is the most comprehensive open-source model family currently available. Understanding the lineup is practical knowledge for model selection:

| Model | Parameters | Context | Primary Use |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | 32K | Edge, embedded |
| Qwen2.5-1.5B | 1.5B | 32K | Mobile, resource-limited |
| Qwen2.5-3B | 3B | 32K | CPU inference |
| Qwen2.5-7B | 7B | 128K | General, fine-tuning base |
| Qwen2.5-14B | 14B | 128K | Production general-purpose |
| Qwen2.5-32B | 32B | 128K | High-quality generation |
| Qwen2.5-72B | 72B | 128K | Frontier open-source |
| Qwen2.5-Coder-32B | 32B | 128K | Code generation |
| Qwen2.5-Math-72B | 72B | 4K | Mathematical reasoning |

The roughly 18-trillion-token Qwen2.5 training corpus is diverse by design: roughly 30% code, 15% mathematics, 55% natural language with strong coverage of Chinese, Japanese, Korean, Arabic, French, Spanish, German, and Southeast Asian languages. This is not accidental - Alibaba runs cloud and e-commerce infrastructure across these regions and the training data quality reflects operational necessity, not research aspiration.

Qwen2.5-Coder deserves special mention. On HumanEval and MBPP benchmarks, Qwen2.5-Coder-32B-Instruct scores 92.7% and 90.2% respectively - competitive with GPT-4o on code tasks and significantly ahead of Code Llama at equivalent parameter counts.

DeepSeek Architecture - Multi-head Latent Attention (MLA)

This is the conceptually richest innovation in recent open-source LLM history. To understand MLA, you first need to understand why the KV cache is expensive.

In standard multi-head attention, for each token in the context, you store a key vector and a value vector for every attention head. If your model has $H$ heads, a head dimension of $d_h$, a context length of $L$ tokens, and a batch size of $B$, your KV cache memory is:

$$\text{KV Cache} = 2 \cdot B \cdot L \cdot H \cdot d_h \cdot \text{bytes\_per\_element}$$

For DeepSeek-V2 with 128 attention heads and 128-dimensional heads, a 4096-token context, batch size 32, in fp16:

$$2 \cdot 32 \cdot 4096 \cdot 128 \cdot 128 \cdot 2 \text{ bytes} \approx 8.6 \text{ GB}$$

That is 8.6 GB just for the KV cache of one layer, and DeepSeek-V2 has 60 layers. Standard multi-head attention at this scale is economically brutal.

MLA's key insight: keys and values across heads are not independent. They can be approximately represented as a low-rank projection from a shared latent vector. Instead of storing $H$ separate K/V pairs, you store a single compressed latent vector $c_{KV}$ of dimension $d_c \ll H \cdot d_h$, and then reconstruct the per-head K and V matrices on the fly:

$$c_{KV} = W^{DKV} h_t$$

$$k_t^h,\; v_t^h = \text{Proj}^h(c_{KV})$$

where $W^{DKV}$ is the "down-projection" that compresses the hidden state to the latent, and $\text{Proj}^h$ reconstructs head-specific K/V from the latent.

The cache stores only $c_{KV}$ (dimension $d_c = 512$ in DeepSeek-V2) instead of the full per-head keys and values ($2 \times 128 \times 128 = 32768$ elements per token). That is a compression ratio of roughly 64x per token.

In practice, DeepSeek reports that MLA cuts the KV cache per token from roughly 81KB (standard MHA) to approximately 5.1KB - a ~93% reduction. For a 100K context window, this is the difference between roughly 8.1GB and 510MB of KV cache per request. At inference scale, this unlocks longer contexts and larger batch sizes on the same hardware.

# Conceptual illustration of MLA vs standard MHA
# This is pedagogical pseudocode, not production code

import torch
import torch.nn as nn


class StandardMHA(nn.Module):
    """Standard Multi-Head Attention - stores full KV per head"""

    def __init__(self, d_model=4096, n_heads=128, d_head=128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.W_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.W_k = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.W_v = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.W_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def kv_cache_size_per_token(self):
        # Keys: n_heads * d_head, Values: n_heads * d_head
        # 2 bytes per element (fp16)
        return 2 * self.n_heads * self.d_head * 2  # bytes


class MultiHeadLatentAttention(nn.Module):
    """
    MLA - compresses KV into a shared latent vector.
    Cache stores the latent (d_c), not full per-head KV.
    """

    def __init__(self, d_model=4096, n_heads=128, d_head=128, d_c=512):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.d_c = d_c  # latent dimension - much smaller than n_heads * d_head

        # Compress hidden state to latent
        self.W_DKV = nn.Linear(d_model, d_c, bias=False)

        # Per-head projections from latent to K and V
        self.W_UK = nn.Linear(d_c, n_heads * d_head, bias=False)
        self.W_UV = nn.Linear(d_c, n_heads * d_head, bias=False)

        # Query projection (standard)
        self.W_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.W_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def kv_cache_size_per_token(self):
        # Only store the latent vector, not per-head K/V
        return self.d_c * 2  # bytes (fp16)

    def forward(self, x, cached_c_kv=None):
        B, L, _ = x.shape

        # Compress to latent for caching
        c_kv = self.W_DKV(x)  # (B, L, d_c) - what gets cached

        # Reconstruct K and V from latent at inference time
        k = self.W_UK(c_kv)  # (B, L, n_heads * d_head)
        v = self.W_UV(c_kv)  # (B, L, n_heads * d_head)
        q = self.W_q(x)      # (B, L, n_heads * d_head)

        # Standard attention computation from here
        return q, k, v, c_kv  # return c_kv for caching


# Compare cache sizes
standard = StandardMHA()
mla = MultiHeadLatentAttention()

seq_len = 100_000 # 100K context window

standard_cache_mb = (standard.kv_cache_size_per_token() * seq_len) / (1024**2)
mla_cache_mb = (mla.kv_cache_size_per_token() * seq_len) / (1024**2)

print(f"Standard MHA KV cache (100K tokens): {standard_cache_mb:.1f} MB")
print(f"MLA KV cache (100K tokens): {mla_cache_mb:.1f} MB")
print(f"Compression ratio: {standard_cache_mb/mla_cache_mb:.1f}x")

# Output:
# Standard MHA KV cache (100K tokens): 6250.0 MB
# MLA KV cache (100K tokens): 97.7 MB (varies with d_c)
# Compression ratio: 64.0x

DeepSeekMoE - Fine-Grained Expert Splitting

Standard Mixture of Experts (as in Mixtral) uses a small number of large experts - typically 8 experts with 2 active per token. DeepSeek's approach splits each expert into finer granularity:

  • Standard MoE: 8 experts, 2 active, each expert has $d_{ffn}$ units
  • DeepSeekMoE: 64 fine-grained experts, 6 active, each expert has $d_{ffn}/4$ units

Same total parameter count, same active parameter count. But the finer granularity gives the router more flexibility to compose specialized knowledge. An expert that handles "Python syntax" and "runtime error reasoning" in a coarse MoE might be split into separate Python-syntax and error-reasoning micro-experts in the fine-grained version.

DeepSeek also adds a small number of "shared experts" that are always active regardless of routing. These capture general-purpose knowledge while the routed experts handle specialization:

$$\text{FFN}(x) = \underbrace{\sum_{i=1}^{K_s} \text{FFN}_i^{\text{shared}}(x)}_{\text{always active}} + \underbrace{\sum_{i \in \text{Top-}K_r} g_i \cdot \text{FFN}_i^{\text{routed}}(x)}_{\text{conditionally active}}$$

The load balancing mechanism in DeepSeekMoE V3 is "auxiliary-loss-free" - instead of adding a regularization term to the training loss to encourage balanced expert usage, DeepSeek uses a bias term on expert scores that is updated separately from the main gradient. This avoids the known problem where auxiliary load-balancing losses degrade model quality by pulling the router away from its natural preferences.
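To make the routing pattern concrete, here is a minimal PyTorch sketch of fine-grained routing with shared experts and a selection-only bias in the spirit of the auxiliary-loss-free approach. The sizes, module shapes, and bias handling are illustrative assumptions, not DeepSeek's actual configuration or update rule.

# Conceptual sketch: fine-grained MoE routing with shared experts.
# Sizes and the bias mechanism are illustrative, not DeepSeek's exact values.
import torch
import torch.nn as nn


class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=1024, d_expert=256, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.top_k = top_k

        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model)
            )

        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.router = nn.Linear(d_model, n_routed, bias=False)
        # Bias used only when *selecting* experts (load balancing); in the
        # auxiliary-loss-free scheme it is updated outside the main gradient.
        # The update step itself is omitted in this sketch.
        self.register_buffer("route_bias", torch.zeros(n_routed))

    def forward(self, x):  # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))                 # per-expert affinity
        _, idx = (scores + self.route_bias).topk(self.top_k, dim=-1)
        gates = torch.gather(scores, -1, idx)                   # unbiased gate values
        gates = gates / gates.sum(dim=-1, keepdim=True)

        out = sum(expert(x) for expert in self.shared)           # always-active experts
        for slot in range(self.top_k):
            for expert_id, expert in enumerate(self.routed):
                mask = idx[:, slot] == expert_id
                if mask.any():
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out


moe = FineGrainedMoE()
print(moe(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])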

DeepSeek-V3 and Multi-Token Prediction

DeepSeek-V3 (December 2024) adds Multi-Token Prediction (MTP). Standard language model training predicts the next single token. MTP trains the model to predict the next $D$ tokens simultaneously:

$$\mathcal{L} = \sum_{t=1}^{T} \sum_{d=1}^{D} \mathcal{L}_{t+d}$$

During training, each position predicts a small number of future tokens beyond the immediate next one (the released DeepSeek-V3 uses a prediction depth of one extra token). This forces the model to maintain longer-range coherence in its representations. At inference time, MTP enables speculative decoding natively: the extra predicted tokens act as drafts that can be verified in parallel, increasing throughput without changing output quality.
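The objective itself is easy to sketch. The toy below just accumulates one cross-entropy term per prediction depth over randomly generated logits; it ignores DeepSeek-V3's dedicated sequential MTP modules and loss weighting, so treat it as bookkeeping for the formula above rather than the real training loop.

# Toy multi-token prediction loss: one cross-entropy term per depth d = 1..D.
# Illustrative only - DeepSeek-V3 uses dedicated MTP modules and a weighting
# coefficient on this term, which are omitted here.
import torch
import torch.nn.functional as F


def multi_token_prediction_loss(logits_per_depth, input_ids):
    """
    logits_per_depth: list of D tensors, each (batch, seq_len, vocab),
        where the d-th tensor predicts the token d positions ahead.
    input_ids: (batch, seq_len) target token ids.
    """
    total = 0.0
    for d, logits in enumerate(logits_per_depth, start=1):
        pred = logits[:, :-d, :]   # positions that still have a target d steps ahead
        target = input_ids[:, d:]
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return total


# Smoke test with random "model outputs": D = 2 prediction depths
batch, seq_len, vocab, D = 2, 16, 100, 2
fake_logits = [torch.randn(batch, seq_len, vocab) for _ in range(D)]
tokens = torch.randint(0, vocab, (batch, seq_len))
print(multi_token_prediction_loss(fake_logits, tokens))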

DeepSeek-V3's training cost was reported at approximately $5.5M - a figure that shocked the industry when Anthropic and OpenAI frontier training runs cost hundreds of millions. The efficiency comes from FP8 mixed-precision training, the MoE architecture reducing active compute per token, and careful pipeline parallelism implementation.

DeepSeek-R1 - Reasoning Through Reinforcement Learning

DeepSeek-R1 is not primarily an architecture paper - it is a training methodology paper. The key finding: you can get o1-level mathematical and coding reasoning through a combination of:

  1. Cold-start supervised fine-tuning on a small set of long chain-of-thought examples
  2. Group Relative Policy Optimization (GRPO) - a variant of PPO that computes advantages relative to a group of samples rather than a learned value function
  3. Verifiable reward signals - for math, check if the final answer is numerically correct; for code, run the code and check if tests pass

The reinforcement learning phase uses only task success (correct/incorrect) as the reward signal. No human preference labeling. No learned reward model that can be gamed. This is philosophically important: it suggests that reasoning capability emerges from practicing on problems with verifiable solutions, not from imitating human explanations of reasoning.
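The group-relative part of GRPO is simple enough to show directly. The sketch below scores a group of sampled completions with a verifiable reward (a hypothetical exact-match check) and normalizes each reward against the group mean and standard deviation - the quantity GRPO uses as the advantage in place of a learned value function. It illustrates the idea only; it is not DeepSeek's training code.

# GRPO-style group-relative advantages from verifiable rewards.
# The reward function here is a hypothetical exact-match check.
import statistics


def verifiable_reward(completion: str, expected_answer: str) -> float:
    # 1.0 if the known-correct answer appears in the completion, else 0.0
    return 1.0 if expected_answer in completion else 0.0


def grpo_advantages(completions: list[str], expected_answer: str) -> list[float]:
    rewards = [verifiable_reward(c, expected_answer) for c in completions]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    # Each sample's advantage is its reward relative to the rest of its group
    return [(r - mean) / std for r in rewards]


# Four sampled solutions to the same problem, two of them correct
group = [
    "... therefore the answer is 72 km/h",
    "... the average of 60 and 90 is 75 km/h",
    "... the harmonic mean gives 72 km/h",
    "... so the answer is 150 km/h",
]
print(grpo_advantages(group, "72 km/h"))  # [1.0, -1.0, 1.0, -1.0]
# Correct samples get positive advantage; the policy update then reinforces
# the token choices that led to them.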

DeepSeek-R1 also demonstrates "thinking tokens" - the model produces a <think>...</think> block before its answer, containing the scratchpad reasoning. The thinking tokens are visible, which makes the model's reasoning auditable.

# Example: calling DeepSeek-R1 via API and parsing thinking tokens
from openai import OpenAI


client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)


def query_deepseek_r1(problem: str) -> dict:
    """
    Query DeepSeek-R1 and separate thinking from answer.
    Returns dict with 'thinking' and 'answer' keys.
    """
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {
                "role": "user",
                "content": problem
            }
        ],
        max_tokens=8192
    )

    full_response = response.choices[0].message.content

    # reasoning_content field contains the thinking tokens
    reasoning = getattr(response.choices[0].message, 'reasoning_content', '')
    answer = full_response

    return {
        "thinking": reasoning,
        "answer": answer,
        "tokens_used": response.usage.total_tokens
    }


# Example problem
result = query_deepseek_r1(
    "A train travels from city A to B at 60 km/h and returns at 90 km/h. "
    "What is the average speed for the entire journey?"
)

print("=== THINKING PROCESS ===")
print(result["thinking"][:500] + "...")

print("\n=== FINAL ANSWER ===")
print(result["answer"])

# The thinking will show: the model realizes it cannot just average 60 and 90,
# uses harmonic mean formula: 2*60*90/(60+90) = 72 km/h

Evaluating Multilingual Model Quality

Standard benchmarks (MMLU, HellaSwag, HumanEval) are English-first. When evaluating models for multilingual production use, you need a different evaluation strategy:

MGSM (Multilingual Grade School Math): Mathematical word problems translated into 10 languages. Tests whether mathematical reasoning degrades in non-English contexts. Models trained on quality multilingual data maintain near-English performance; models that were not show 15-40% degradation on languages like Thai, Bengali, and Swahili.

C-Eval: Chinese comprehensive evaluation across 52 academic subjects. This is the Mandarin equivalent of MMLU. Qwen2.5-72B consistently scores above 90% on C-Eval; most non-Chinese models score 60-75%.

XQuAD / TyDi QA: Cross-lingual question answering. Tests extraction and reading comprehension across languages. Useful for RAG systems that retrieve documents in multiple languages.

# Production multilingual evaluation setup
# Evaluate a model across multiple languages on the same task

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def evaluate_multilingual_consistency(
    model_name: str,
    test_cases: list,
    languages: list
) -> dict:
    """
    Evaluate how consistently a model performs across languages.

    test_cases: list of dicts with 'question_en', 'question_zh', etc.
    languages: list of language codes, e.g. ['en', 'zh', 'th']
    Returns per-language accuracy and consistency score.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    results = {lang: {"correct": 0, "total": 0} for lang in languages}

    for case in test_cases:
        en_answer = generate_answer(model, tokenizer, case["question_en"])

        for lang in languages:
            if lang == "en":
                results[lang]["total"] += 1
                results[lang]["correct"] += int(
                    check_answer(en_answer, case["expected"])
                )
                continue

            lang_question = case.get(f"question_{lang}")
            if lang_question:
                lang_answer = generate_answer(model, tokenizer, lang_question)
                results[lang]["total"] += 1
                results[lang]["correct"] += int(
                    check_answer(lang_answer, case["expected"])
                )

    accuracies = {}
    for lang, data in results.items():
        if data["total"] > 0:
            accuracies[lang] = data["correct"] / data["total"]

    en_accuracy = accuracies.get("en", 0)
    non_en_langs = [l for l in accuracies if l != "en"]
    consistency_gap = sum(
        abs(accuracies[l] - en_accuracy) for l in non_en_langs
    ) / max(len(non_en_langs), 1)

    return {
        "per_language_accuracy": accuracies,
        "consistency_score": 1 - consistency_gap,  # higher is better
        "worst_language": min(accuracies, key=accuracies.get)
    }


def generate_answer(model, tokenizer, question: str, max_tokens: int = 256) -> str:
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.1,
            do_sample=True
        )
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )


def check_answer(generated: str, expected: str) -> bool:
    # Simple exact-match for demonstration
    # In production, use an LLM judge or task-specific parser
    return expected.lower().strip() in generated.lower()

When to Use Non-LLaMA Models in Production

The practical decision framework:

Use Qwen2.5 when:

  • Your workload includes significant Chinese, Japanese, Korean, or Southeast Asian language content
  • You need a strong coder base model (Qwen2.5-Coder)
  • You want broad model size coverage with consistent behavior (0.5B to 72B from the same family)
  • You are building math-intensive applications (Qwen2.5-Math)

Use DeepSeek-V3 when:

  • You are running large-scale inference and memory cost is the binding constraint
  • You need frontier-quality coding assistance at lower cost than GPT-4o
  • You want a model with native multi-token prediction for throughput optimization

Use DeepSeek-R1 when:

  • Your task requires multi-step mathematical or scientific reasoning
  • You want auditable reasoning traces (thinking tokens)
  • You are building a system where you can verify correctness automatically (math, code execution)

Use Yi or Gemma 2 when:

  • Yi-34B: strong bilingual English/Chinese, good for research contexts
  • Gemma 2 9B: Google's architecture research, good for fine-tuning experiments, permissive license

Architecture Diagrams

(Diagrams not reproduced here: DeepSeek-V3 architecture overview; Multi-head Latent Attention memory flow; DeepSeek-R1 training pipeline.)


Production Engineering Notes

Deploying Qwen2.5 in Production

Quantization behavior: Qwen2.5 models maintain quality well under 4-bit quantization (AWQ or GPTQ). The 72B model at 4-bit loads in approximately 40GB VRAM, fitting on a single A100-80GB or two A100-40GB GPUs with tensor parallelism. For the 7B model, 4-bit quantization fits comfortably on a single consumer RTX 4090.

# Load Qwen2.5-72B-Instruct with AWQ quantization
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


model_name = "Qwen/Qwen2.5-72B-Instruct-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto"
)

# Qwen2.5 chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain KV cache compression in simple terms."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1
    )

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

Multilingual system prompts: For multilingual deployments, set the system prompt in the user's language, not in English. Qwen2.5 follows the language of the system prompt when generating responses. A Chinese system prompt produces Chinese responses; an English system prompt produces English responses even when the user writes in Chinese.
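A small sketch of that pattern, assuming the Qwen2.5-7B-Instruct tokenizer for illustration; the Chinese system prompt below is just an example string.

# Match the system prompt language to the desired response language.
# The prompt strings are examples; swap in your own per-locale prompts.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

system_prompts = {
    "en": "You are a helpful customer support assistant.",
    "zh": "你是一个乐于助人的客服助手。请始终用中文回答。",
}


def build_prompt(user_message: str, lang: str) -> str:
    messages = [
        {"role": "system", "content": system_prompts.get(lang, system_prompts["en"])},
        {"role": "user", "content": user_message},
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )


# A Chinese system prompt steers the model toward Chinese output;
# an English system prompt steers it toward English.
prompt = build_prompt("请解释什么是KV缓存压缩。", lang="zh")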

Context window management: Qwen2.5-7B and larger support 128K context natively. However, performance on reasoning tasks degrades for contexts beyond 32K in practice - this is a common issue with RoPE-scaling-based long context extension. Run your specific tasks with both short and long contexts before assuming the full 128K is usable.

Deploying DeepSeek Models

vLLM compatibility: DeepSeek-V3 and DeepSeek-R1 are supported in recent vLLM releases. For MoE models, vLLM handles expert routing and expert parallelism automatically.

# Serve DeepSeek-R1 distilled model with vLLM
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--reasoning-parser deepseek_r1

The --reasoning-parser deepseek_r1 flag tells vLLM to separate reasoning_content from content in the API response, matching the DeepSeek API format.
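On the client side, that separation looks roughly like the sketch below, assuming the server started above is reachable on localhost:8000; the reasoning_content field name follows the DeepSeek-style format the parser emits.

# Read separated reasoning and answer from the vLLM OpenAI-compatible server.
# Assumes the server started above is listening on localhost:8000.
from openai import OpenAI

local_client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = local_client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{"role": "user", "content": "What is 17 * 24? Think step by step."}],
    max_tokens=2048,
)

message = response.choices[0].message
# With --reasoning-parser deepseek_r1 the thinking lands in reasoning_content
# and only the final answer stays in content.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)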

Budget tokens for reasoning: DeepSeek-R1 uses thinking tokens that consume context. For production deployments where cost matters, you can limit the thinking token budget:

# Control thinking token budget for cost management
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": problem}],
    max_tokens=16384,
    # Experimental parameter - confirm against current DeepSeek API docs
    # before relying on it; constrains thinking length to reduce cost
    extra_body={"budget_tokens": 2000}
)

Expert routing inspection: For DeepSeek MoE models, you can inspect which experts activate for different input types. This is useful for debugging unexpected behavior or understanding specialization patterns across domains.
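One way to do this is with forward hooks on the gate/router modules, sketched below. The checkpoint name, the module-name filter, and the handling of the gate's output are assumptions to verify against the modeling code of the model you actually load.

# Sketch: count which routed experts fire, using forward hooks on gate modules.
# The checkpoint, module-name filter, and gate output handling are assumptions -
# inspect model.named_modules() and the modeling code for the model you load.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"  # a smaller MoE checkpoint for local inspection
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

expert_counts = Counter()


def make_hook(module_name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        if out.dtype in (torch.int32, torch.int64):   # gate already returns top-k ids
            top = out
        else:                                         # gate returns scores: take top-k
            top = out.topk(k=min(6, out.shape[-1]), dim=-1).indices
        expert_counts.update((module_name, int(e)) for e in top.flatten().tolist())
    return hook


handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith(".gate")  # router module name - verify for your checkpoint
]

prompt = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
with torch.no_grad():
    model.generate(**prompt, max_new_tokens=32)

for handle in handles:
    handle.remove()

print(expert_counts.most_common(10))  # (gate module, expert id) pairs that fired most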

Compliance and Licensing

Qwen license: Most of the Qwen2.5 family is released under Apache 2.0; the 3B and 72B variants instead use Alibaba's Qwen (Tongyi Qianwen) license, which permits commercial use but requires obtaining a separate license from Alibaba once your products exceed 100 million monthly active users. Check the license file on each model card before deployment.

DeepSeek license: DeepSeek models use the MIT license for weights. This is one of the most permissive open-source licenses available and imposes no restrictions on commercial use, modification, or redistribution.

Data residency considerations: Both Qwen and DeepSeek models are open-weights - you can run inference on your own infrastructure. If you use their API services, data is processed on servers that may present compliance concerns for certain industries (healthcare, finance, government) in certain jurisdictions. Running the models locally eliminates this concern entirely.


Common Mistakes

:::danger Assuming English Benchmark Rankings Transfer to Other Languages MMLU, HellaSwag, and standard benchmarks rank models on English data. A model ranked 3rd on MMLU may rank 8th on C-Eval or MGSM-Thai. Do not select models for multilingual applications based on English benchmark scores alone. Always run your actual languages on your actual tasks before committing to a model. :::

:::danger Using Standard MHA Assumptions for DeepSeek Memory Planning If you plan GPU memory requirements for DeepSeek-V2/V3 based on standard multi-head attention KV cache formulas, you will significantly overestimate memory needs. DeepSeek's MLA stores a 512-dimension latent per token, not the full $H \times d_h$ K/V pairs. Use DeepSeek-specific memory calculators or benchmark empirically before capacity planning. :::

:::warning Treating DeepSeek-R1 Thinking Tokens as Free Thinking tokens consume context window space and count toward your token usage cost when using the API. For problems that do not require extended reasoning (simple lookups, short classification tasks), DeepSeek-R1 is more expensive than DeepSeek-V3 for the same output quality. Reserve R1 for tasks that genuinely benefit from multi-step reasoning. :::

:::warning Ignoring Chat Template Differences Qwen2.5, DeepSeek-V3, and LLaMA-3 use different chat templates. Applying the wrong template produces malformed prompts that the model will partially ignore or misinterpret. Always use tokenizer.apply_chat_template() rather than manually constructing prompts. Check the model card for the exact system/user/assistant token structure. :::

:::warning Fine-Tuning Across the Qwen Family Without Version Pinning Qwen2, Qwen2.5, and their Coder/Math variants use different vocabulary sizes and positional encoding configurations. A fine-tuning pipeline written for Qwen2-7B will not work correctly for Qwen2.5-7B without updates. Pin model versions explicitly in your training configs and test the pipeline end-to-end before starting a long training run. :::


Interview Q&A

Q1: What is Multi-head Latent Attention and why does it matter for production inference?

Multi-head Latent Attention (MLA), introduced in DeepSeek-V2, solves the KV cache memory problem in long-context inference. Standard multi-head attention stores a key and value vector for every attention head for every token in the context. For a model with 128 heads and 128-dimensional heads, that is 16,384 numbers for the keys and another 16,384 for the values - 32,768 numbers per token per layer.

MLA instead learns a "down-projection" that compresses the hidden state to a low-dimensional latent vector (512 dimensions in DeepSeek-V2). The latent is what gets stored in the KV cache. At inference time, the model reconstructs per-head K and V matrices from the latent via learned "up-projections." The cache compression ratio is roughly 64x - from 32,768 stored key/value numbers down to a 512-number latent per token.

In production terms: a 100K context window requires roughly 6.5GB of KV cache per layer with standard MHA. With MLA, it is approximately 100MB. For a 60-layer model, MLA enables 100K-token contexts on GPUs that simply could not serve them with standard attention. This is not a theoretical benefit - it is the reason DeepSeek can economically offer long-context API endpoints at the prices they do.

Q2: How does DeepSeek-R1 achieve reasoning performance competitive with o1 without human preference labels?

DeepSeek-R1 uses Group Relative Policy Optimization (GRPO) with pure task-success reward signals. The key insight is that mathematical and coding tasks have objectively verifiable answers: either the derivation is correct or it is not; either the code passes the tests or it does not. This eliminates the need for a learned reward model (which can be gamed) or human preference labeling (which is expensive and noisy for reasoning tasks).

The training procedure: start with a cold-start SFT phase on a small number of long chain-of-thought examples to teach the model how to use thinking tokens. Then run RL with only correctness rewards. The model spontaneously develops behaviors like self-verification, backtracking, and exploration. These emerged without being explicitly trained - they are strategies that improve the probability of producing a correct final answer.

The resulting model uses visible thinking tokens before its answer, making the reasoning auditable. This is a practical advantage in production: you can debug why the model got something wrong by reading its reasoning trace.

Q3: When would you choose Qwen2.5-72B over LLaMA-3.1-70B in production?

The primary driver is language. For workloads that are predominantly English with some multilingual content, LLaMA-3.1-70B is a reasonable default - it has stronger English-language community support, more fine-tuned variants, and well-established deployment tooling. For workloads with significant Chinese, Japanese, Korean, or Southeast Asian language content, Qwen2.5-72B is the correct choice. The quality gap on non-English languages is large enough to be user-noticeable.

Secondary drivers: code quality (Qwen2.5-Coder variants are excellent for coding-heavy workloads), mathematics (Qwen2.5-Math outperforms LLaMA on mathematical reasoning), and model family consistency (if you want to deploy a 7B and a 72B model that behave similarly, the Qwen2.5 family has a more consistent behavioral profile across sizes than LLaMA which mixes different training configurations).

For licensing: both permit commercial use at the 72B size, but DeepSeek MIT licensing is the most permissive of any frontier-class open-source model.

Q4: Explain the DeepSeekMoE fine-grained expert architecture and why it outperforms standard MoE at the same parameter count.

Standard MoE (as in Mixtral 8x7B) uses a small number of large experts. With 8 experts and 2 active per token, each expert is responsible for a broad range of knowledge. The router must select 2 out of 8, giving it 28 possible combinations. Each expert handles many different types of content because the routing granularity is low.

DeepSeekMoE splits each "standard expert" into multiple smaller experts. Instead of 8 experts with $d_{ffn}$ units each, DeepSeek uses 64 experts with $d_{ffn}/4$ units each, activating 6 at a time. The total active parameter count is equivalent. But the router now has roughly 75 million possible ways to choose 6 experts from 64, compared with 28 ways to choose 2 from 8. This expressivity allows more precise expert specialization.

The shared experts mechanism adds 2 experts that activate for every token. These capture universal knowledge (grammar, basic facts, formatting) while the routed experts handle domain-specific content. Separating "always needed" knowledge from "sometimes needed" knowledge reduces interference in the routed expert pool.

Auxiliary-loss-free load balancing avoids the quality degradation caused by auxiliary losses. Traditional MoE training adds a term to the loss that penalizes uneven expert usage. This term conflicts with the model's natural routing preferences and slightly degrades quality. DeepSeek instead uses a bias adjustment mechanism that shifts expert scores to encourage balanced routing without affecting the gradient of the main loss.

Q5: How do you evaluate a model for multilingual production use before deployment?

The evaluation strategy has three components.

First, task-specific evaluation in each target language. Translate or collect your actual use cases into each language and measure task success directly. English MMLU scores tell you almost nothing about Thai extraction quality.

Second, consistency testing. For factual questions with objective answers, compare the model's response in English with its response in each target language. A high-quality multilingual model should give consistent answers across languages. Track the consistency score (fraction of questions where language variant gives same answer as English) as a quality proxy.

Third, linguistic quality evaluation. Use automated metrics (ChrF for morphologically rich languages, BLEU for translation quality) but also run samples through a native speaker or a separate multilingual model acting as a judge. Common failure modes: code-switching (switching to English mid-sentence), grammatically broken output, factual errors specific to the language variant (wrong currency, wrong holidays, wrong cultural references).

For Chinese-specific tasks, use C-Eval as a benchmark. For mathematical reasoning across languages, use MGSM. For general multilingual understanding, use XNLI or MultiNLI translations. Run at minimum 200 examples per language per task category before making a deployment decision.

Q6: What does DeepSeek-V3's $5.5M training cost claim mean for the broader AI industry?

The $5.5M figure covers compute cost for the main training run - not data collection, not research iteration, not infrastructure amortization. It is a real and remarkable number, but understanding what it means requires context.

DeepSeek achieved this cost through several compounding optimizations: FP8 mixed-precision training (roughly 2x compute efficiency vs BF16), the MoE architecture which reduces active parameters per token (and thus FLOPs per token) by roughly 10x compared to a dense model of equivalent quality, efficient pipeline parallelism that minimizes GPU idle time during forward/backward passes, and access to H800 GPUs at competitive pricing.

For the industry, the implication is structural: frontier-quality models no longer require hundreds of millions of dollars to train. This changes the competitive dynamics (more labs can reach frontier quality), the business models (inference cost becomes more important than training cost amortization), and the research agenda (efficiency innovations compound across training and inference).

The caveat: the $5.5M is the marginal cost of the final training run, not the total cost of building the capability. Years of research, failed experiments, and infrastructure investment precede it. But the fact that it can be done for this cost at all - and that the weights are open - is a genuine inflection point in who can participate in frontier AI development.


The Yi and Gemma 2 Landscape

Beyond Qwen and DeepSeek, two other model families warrant attention for production engineers.

Yi Series - 01.AI

01.AI, founded by Kai-Fu Lee, released the Yi series starting in late 2023. Yi-34B was notable for being one of the first open-source models to consistently outperform LLaMA-2 70B on English benchmarks while operating at half the parameter count. Yi-1.5 (6B, 9B, 34B) refines this with stronger instruction-following.

The Yi series architecture is conservative - a standard LLaMA-style decoder with few architectural innovations. Its strength is training data curation with strong bilingual (English/Chinese) coverage. For teams that want a reliable bilingual model with a smaller footprint than Qwen2.5-72B, Yi-34B remains a solid choice.

Yi licensing has varied by release: the original Yi models shipped under a bespoke "Yi License" with use restrictions that may require legal review for compliance-sensitive deployments, while the later Yi-1.5 releases are published under Apache 2.0. Check the license on the specific checkpoint you deploy.

Gemma 2 - Google DeepMind

Google's Gemma 2 (June 2024) is architecturally interesting because it introduces several innovations at small scales:

Alternating attention: Gemma 2 alternates between local attention (sliding window, 4096 token span) and global attention (full context) in alternating layers. Local attention is computationally cheap; global attention handles long-range dependencies. The alternation reduces the average attention cost while maintaining long-range coherence.
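The alternation itself is just a per-layer schedule. A toy sketch, assuming even layers are local and odd layers are global (the actual assignment is defined by the model config):

# Toy sketch of Gemma 2's alternating attention spans per layer.
# Which layers are local vs. global is an assumption here - check the config.
def attention_span(layer_idx: int, seq_len: int, window: int = 4096) -> int:
    # Even layers: sliding-window (local) attention; odd layers: full (global)
    return min(window, seq_len) if layer_idx % 2 == 0 else seq_len


print([attention_span(i, seq_len=32_000) for i in range(6)])
# [4096, 32000, 4096, 32000, 4096, 32000]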

Soft capping on logits: Gemma 2 applies a soft cap (a scaled $\tanh$ applied to the logits) to prevent attention logit explosion, which is a training stability issue that becomes more severe with longer contexts. This replaces the hard clipping approaches used in some earlier models.
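Soft capping is a one-liner worth seeing concretely; the cap value of 50 below matches the attention-logit cap reported in Gemma 2's configuration (the final output logits use a smaller cap), but treat the exact numbers as config details to verify.

# Logit soft capping as in Gemma 2: squash logits smoothly into (-cap, +cap)
# instead of hard-clipping them.
import torch


def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    return cap * torch.tanh(logits / cap)


x = torch.tensor([1.0, 30.0, 300.0])
print(soft_cap(x, cap=50.0))
# ~[1.0, 26.9, 50.0]: large logits saturate just below the cap,
# small logits pass through almost unchanged.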

Knowledge distillation: The Gemma 2 2B and 9B models are trained with knowledge distillation from a larger teacher model, not just from scratch. This explains why Gemma 2 2B outperforms other 2B models despite its smaller scale - it is carrying capability transferred from a significantly larger teacher.

Gemma 2 uses the Gemma Terms of Use, which is permissive for commercial use. The 2B and 9B variants are particularly useful for fine-tuning experiments where you want Google's distillation quality at low cost, or for deployment scenarios where 2B is the upper bound on hardware.

# Using Gemma 2 2B for fine-tuning - note the Gemma-specific chat format
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


model_id = "google/gemma-2-2b-it" # instruction-tuned variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Gemma 2 was trained in bfloat16
    device_map="auto"
)

# Gemma 2 uses a specific chat template with <start_of_turn> / <end_of_turn>
messages = [
    {"role": "user", "content": "Explain what alternating attention does in one paragraph."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7
    )

# Decode only the newly generated tokens so the prompt is not echoed back
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(answer)

Practical Deployment Recipes

Recipe 1: Multilingual RAG with Qwen2.5-7B

For a RAG pipeline that must handle queries in multiple languages, Qwen2.5-7B is often the sweet spot: strong multilingual quality, fits on a single A40/A10G GPU, and supports 128K context for large document chunks.

# Multilingual RAG setup with Qwen2.5-7B
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch


class MultilingualRAGPipeline:
    def __init__(self, model_name: str = "Qwen/Qwen2.5-7B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def answer_with_context(
        self,
        question: str,
        context_documents: list[str],
        response_language: str = "auto"
    ) -> str:
        """
        Answer a question using retrieved context.
        response_language: 'auto' follows question language,
        or specify 'zh', 'en', 'th', etc.
        """
        context_block = "\n\n".join(
            f"Document {i+1}:\n{doc}"
            for i, doc in enumerate(context_documents)
        )

        lang_instruction = ""
        if response_language != "auto":
            lang_map = {
                "zh": "Please respond in Chinese (Mandarin).",
                "th": "Please respond in Thai.",
                "en": "Please respond in English.",
                "id": "Please respond in Indonesian.",
                "vi": "Please respond in Vietnamese."
            }
            lang_instruction = lang_map.get(response_language, "")

        system_prompt = (
            "You are a helpful assistant that answers questions based on "
            "provided context documents. If the answer is not in the context, "
            f"say so clearly. {lang_instruction}"
        )

        user_prompt = (
            f"Context:\n{context_block}\n\n"
            f"Question: {question}\n\n"
            "Answer based only on the provided context:"
        )

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.3,
                top_p=0.9,
                repetition_penalty=1.05
            )

        return self.tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True
        )

Recipe 2: DeepSeek-R1 for Verified Mathematical Workflows

When you need mathematical reasoning with auditable steps and automatic verification, DeepSeek-R1's combination of visible thinking tokens and final answers creates a natural pipeline for human-in-the-loop or automated verification.

# Automated math verification pipeline using DeepSeek-R1
import re
from openai import OpenAI


client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)


def solve_and_verify(problem: str, expected_answer: float = None) -> dict:
    """
    Solve a math problem with DeepSeek-R1 and optionally verify.
    Returns solution with thinking trace and verification result.
    """
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {
                "role": "user",
                "content": (
                    f"{problem}\n\n"
                    "Provide your final numerical answer on a line starting with 'ANSWER:'"
                )
            }
        ],
        max_tokens=4096
    )

    thinking = getattr(response.choices[0].message, 'reasoning_content', '')
    full_answer = response.choices[0].message.content

    # Extract final answer
    answer_match = re.search(r'ANSWER:\s*([-\d.]+)', full_answer)
    extracted_answer = float(answer_match.group(1)) if answer_match else None

    # Verify if expected answer provided
    verified = None
    if expected_answer is not None and extracted_answer is not None:
        tolerance = abs(expected_answer) * 0.001  # 0.1% tolerance
        verified = abs(extracted_answer - expected_answer) <= max(tolerance, 1e-9)

    return {
        "problem": problem,
        "thinking_trace": thinking,
        "full_response": full_answer,
        "extracted_answer": extracted_answer,
        "expected_answer": expected_answer,
        "verified": verified,
        "thinking_tokens": len(thinking.split()) if thinking else 0
    }


# Example usage
result = solve_and_verify(
    problem="A rectangle has a perimeter of 48 cm. Its length is 3 times its width. "
            "What is the area of the rectangle in square centimeters?",
    expected_answer=108.0
)

print(f"Extracted Answer: {result['extracted_answer']}")
print(f"Verified: {result['verified']}")
print(f"Thinking tokens used: {result['thinking_tokens']}")
# Thinking trace shows: let width = w, length = 3w,
# perimeter = 2(w + 3w) = 8w = 48, w = 6, length = 18,
# area = 6 * 18 = 108

Summary

The international open-source model landscape - centered primarily on Alibaba Qwen and DeepSeek - has contributed architectural innovations that the entire field is now adopting. Multi-head Latent Attention makes long-context inference economically viable. Fine-grained MoE with auxiliary-loss-free routing achieves better quality at the same active parameter budget. DeepSeek-R1's RL-from-verifiable-rewards approach demonstrates that o1-level reasoning can be reproduced openly.

For production engineers, the practical takeaway is straightforward: evaluate these models for your specific use case. If your workload is multilingual, Qwen2.5 is likely your best base. If inference memory cost is your binding constraint, DeepSeek's MLA architecture changes your capacity calculations. If you need auditable chain-of-thought reasoning on mathematical or coding tasks, DeepSeek-R1 is the open-source path to o1-quality results.

The benchmark rankings on standard English leaderboards do not capture these advantages. Build your own evaluation set, run it across model families, and let the numbers on your actual task drive the decision.
