Code and Math Specialized Models
The Day a General Model Failed a Senior Engineer
It was 2 AM when Priya's on-call alert fired. A critical data pipeline had silently corrupted three days of production records. The query responsible was a 400-line SQL transformation she had not written - it was generated by an LLM-assisted tool her team had adopted two months earlier. The model, a capable 13B general-purpose assistant, had confidently produced SQL that looked correct on inspection but subtly mishandled NULL propagation across three joins. The bug only manifested on edge-case data distributions that took three days to appear in production.
Priya spent the next four hours manually auditing every LLM-generated SQL query in their codebase. She found eleven more subtle errors. All of them passed basic correctness checks. All of them looked plausible. None of them were caught by the model because the model was optimized for conversation, general knowledge, and instruction following - not for the precise, adversarial reasoning required to write production-correct code.
Six weeks later, her team switched to a code-specialized model for all code generation tasks. The rate of subtle logic errors dropped significantly. Not because the specialized model was smarter in some general sense - it scored lower on trivia and summarization benchmarks than the model it replaced. It scored dramatically higher on the things that mattered: understanding complex function signatures, handling edge cases in type systems, reasoning about database semantics, generating code that actually ran on the first attempt.
This distinction - between a model that is good at everything and a model that is excellent at one critical domain - is the central argument for specialized models. When your production system depends on code correctness or mathematical accuracy, general capability is not sufficient. You need a model trained on the right data, with the right objectives, evaluated on the right benchmarks.
This lesson explains how code and math specialized models work, why they outperform general models on domain tasks, and how to choose and evaluate them for your specific use case. By the end, you will be able to make an informed decision about whether a specialized model is right for your system - and you will know how to measure the answer empirically rather than relying on benchmark rankings alone.
Why This Exists - The Limits of General Pre-training
The Problem With Generalist Training Data
Large language models learn from internet text. The internet contains code - a lot of it. GitHub, Stack Overflow, documentation sites, and tutorial blogs represent a significant fraction of high-quality web text. So why do general-purpose models struggle with code?
The answer is data composition. A general model trained on 15 trillion tokens might include 500 billion tokens of code - roughly 3% of the total. The model learns code as one category among many. It learns syntax, common patterns, and frequently seen idioms. But it does not develop deep mastery of the specific reasoning patterns code requires: tracking variable state across many lines, reasoning about type constraints, handling edge cases in standard library behavior, or understanding the semantics of database operations.
Compare this to a code-specialized model trained on 3 trillion tokens where 80% is code and code-adjacent text (documentation, commit messages, code review discussions, unit tests). The model develops code as its primary competency. The ratio of signal to noise for code-specific reasoning is dramatically higher.
This is the core argument for specialized pre-training: it is not about removing general capability, it is about shifting the distribution of learned expertise toward the domain that matters for your task.
Why Fine-tuning Alone Is Not Enough
A natural response is: "just fine-tune a general model on code." This works to a degree - instruction fine-tuning on code improves a general model's code outputs considerably. But fine-tuning has a fundamental ceiling set by the base model's capabilities.
Pre-training is where the model learns the underlying representations: how variables relate, how control flow works, how mathematical objects compose. Fine-tuning adjusts the model's behavior on top of those representations. If the base representations for code are shallow - because the base model saw proportionally little code during pre-training - fine-tuning cannot fully compensate.
Code-specialized models solve this by establishing deep code representations at pre-training time. Fine-tuning then shapes behavior on top of a foundation that already understands the domain at a structural level.
The Math Problem is Even More Severe
Mathematical reasoning presents an even sharper version of this challenge. Web text contains mathematics - textbooks, papers, homework solutions - but almost all of it is presented as natural language exposition around notation. Models trained on this data learn to describe math fluently. They learn to reproduce familiar derivations. They do not reliably learn to reason through novel multi-step mathematical problems.
This is because mathematical reasoning requires a kind of systematic correctness that statistical pattern matching struggles to approximate. Every step in a proof must be valid. One invalid inference invalidates the conclusion. General models generate plausible-looking math that often contains subtle errors at exactly the steps where human reasoning is most fragile.
Math-specialized models address this by training heavily on formal mathematical text, verified solutions, chain-of-thought reasoning traces, and synthetic problem sets where every step has been verified. The model is not just learning to produce math-like output - it is learning the reasoning patterns that make math correct.
Historical Context - How Code Models Evolved
The First Generation: GitHub Copilot and Codex (2021)
The modern era of code models begins with OpenAI's Codex, released in 2021. Codex was GPT-3 fine-tuned on 159GB of public GitHub code. The key finding: a model trained this way could complete code from natural language descriptions with accuracy that shocked the research community. The HumanEval benchmark, also introduced alongside Codex, measured pass@k accuracy - the probability that at least one of k generated solutions passes all unit tests. Codex-12B achieved 28.8% pass@1 on HumanEval. At the time, this seemed remarkable.
The "aha moment" for the field was not the number itself but what it implied: code could be treated as a formal language amenable to learned completion in ways that prose cannot. Code has unit tests. Unit tests provide ground-truth correctness signals. This created a uniquely rigorous evaluation framework that does not exist for general text generation.
AlphaCode and the Competition Benchmark (2022)
DeepMind's AlphaCode (2022) pushed the frontier to competitive programming. Trained on a massive code corpus and evaluated on Codeforces problems - problems that require genuine algorithmic reasoning, not just pattern matching - AlphaCode reached approximately the 50th percentile of human competitors. This established that code models could tackle problems requiring creative problem decomposition, not just pattern completion.
The Open-Source Code Model Explosion (2023-2024)
The open-source code model ecosystem took off after the LLaMA release demonstrated that capable models could be trained and distributed openly. Within 18 months, the landscape included:
- Code Llama (Meta, 2023): LLaMA 2 continued-pretrained on 500B code tokens, with FIM training for infilling tasks. Initially released in three sizes (7B, 13B, 34B), with a 70B variant following in early 2024, each with base, instruct, and Python-specialized variants.
- StarCoder and StarCoder2 (BigCode, 2023-2024): Trained on The Stack dataset, a carefully curated multi-language code corpus. StarCoder2 introduced architectural improvements and was released in 3B, 7B, and 15B sizes with strong multilingual code performance.
- DeepSeek-Coder (2024): Series from DeepSeek AI trained from scratch on roughly 87% code, 10% code-related English text, and 3% Chinese natural language. DeepSeek-Coder-V2 (2024) achieved state-of-the-art open-source performance on HumanEval, surpassing GPT-4 on several code benchmarks.
- Qwen2.5-Coder (Alibaba, 2024): 0.5B to 72B parameter series with strong multilingual code support and integrated math reasoning.
Math Models: DeepSeek and Qwen Take the Lead
Math-specialized models emerged as a separate track. DeepSeek-Math (2024) demonstrated that training a model explicitly on mathematical reasoning data - with heavy use of chain-of-thought traces and process reward model feedback - produced dramatically better performance on competition math benchmarks like MATH and AMC. Qwen2.5-Math followed with strong performance across school-level through competition-level mathematics.
Core Concepts
What Makes Code Data Special
Code is unlike natural language in several fundamental ways that explain why specialized training helps.
Formal syntax: Code has an unambiguous parse tree. Every token has a precise role determined by the grammar. This creates strong long-range dependencies that differ from prose dependencies.
Executability: Code is verifiable. You can run it and check if it is correct. This enables training signals and evaluation methods unavailable for general text.
Multi-modal structure: Code contains multiple distinct sub-languages within a single file - Python, SQL in docstrings, Markdown in comments, JSON in configuration literals, shell commands in subprocess calls. A code model must handle all of these coherently.
Token distribution: Programming languages use identifiers (variable names, function names) extensively. These appear with high frequency within a file but low frequency across files. Standard BPE tokenization treats self.attention_weights as a sequence of subword tokens. Code-specialized models often use extended vocabularies or modified tokenization that handles identifiers and operators more efficiently.
Fill-in-the-Middle (FIM) Training
One of the most important techniques in code model training is Fill-in-the-Middle, introduced in the Bavarian et al. (2022) paper. Standard language model training trains the model to predict the next token given preceding context. This is perfect for code completion from a cursor position at the end of a file.
But real code editing is different. You frequently need to fill in a function body given the signature above and the usage below. You need to write the middle of a file, not the end. Standard next-token prediction does not train this capability.
FIM training restructures the training objective. Given a document, a random span is extracted as the "middle". The model receives the prefix, a special <FIM_SUFFIX> token, the suffix, a <FIM_MIDDLE> token, and must predict the middle content. Formally:
$$\mathcal{L}_{\text{FIM}} = -\sum_{t=1}^{|m|} \log P\left(m_t \mid \text{prefix},\ \text{suffix},\ m_{<t}\right)$$

where $m_1, \dots, m_{|m|}$ are the tokens in the middle span. This trains the model to use both preceding and following context when generating - exactly what code completion in an IDE requires.
FIM is now standard in all production code models. Code Llama, StarCoder2, DeepSeek-Coder, and Qwen2.5-Coder all use FIM. When you use a code completion tool in VS Code or Neovim, FIM is what makes the model fill in the gap between your cursor and the next function intelligently.
How Code Models Handle Longer Context
Code tasks require longer context than typical NLP tasks. Understanding a function requires seeing the class definition. Understanding a class requires seeing the imports and module structure. Understanding a bug requires seeing the call stack and potentially multiple files.
Code-specialized models typically extend context length during the continued pre-training phase. Code Llama extended context to 100K tokens using position interpolation techniques. DeepSeek-Coder-V2 supports 128K context. This requires modifications to the position encoding scheme.
The most common approach is RoPE scaling - scaling the base frequency of the Rotary Position Embedding to extend the effective context window without full re-training. For a standard RoPE base frequency $\theta = 10{,}000$, scaling replaces $\theta$ with a larger value (e.g., $\theta = 1{,}000{,}000$, as used in Code Llama) to reduce the rate of rotation and extend the distance over which the model maintains positional coherence.
In practice, code models use a combination of extended position bases, attention sink techniques, and continued pre-training on long-context code documents (large repositories, long files) to develop effective long-range code understanding.
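A minimal sketch of the effect of base scaling, using the standard RoPE inverse-frequency formula in plain PyTorch (not tied to any specific model's implementation): increasing the base slows the rotation of every frequency band, so positions far apart still map to distinguishable angles.

import torch

def rope_inverse_frequencies(dim: int, base: float) -> torch.Tensor:
    # Standard RoPE: one rotation frequency per pair of embedding dimensions.
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

dim = 128  # illustrative head dimension
for base in (10_000.0, 1_000_000.0):
    inv_freq = rope_inverse_frequencies(dim, base)
    # Angle accumulated by the slowest-rotating band after 100K positions:
    # a smaller angle means the band has not "wrapped around" many times,
    # so long-range relative positions remain distinguishable.
    slowest = inv_freq[-1].item()
    print(f"base={base:>9.0f}  slowest-band angle at position 100_000: {slowest * 100_000:.2f} rad")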
The Vocabulary Question
Standard LLM vocabularies of 32K-65K tokens are not optimized for code. Common code constructs like ->, ::, ===, !=, += may be split into multiple tokens. Variable names like load_balancer_config become sequences of 4-6 tokens. This hurts efficiency (more tokens per unit of code) and can hurt quality (the model must learn compositional semantics for constructs that are atomic in practice).
Code-specialized models address this in several ways:
- Extended vocabulary: Include common code-specific tokens as single units. Qwen2.5-Coder uses a 151K token vocabulary that includes common operators and frequently occurring identifiers.
- Byte-level fallback: Ensure rare tokens (unusual identifiers, Unicode operators) are always representable, even as individual bytes, rather than treating them as unknown.
- Operator preservation: Ensure multi-character operators (`:=`, `->`, `**`, `//`) are single tokens rather than split pairs.
The practical effect: code models generate code with fewer token steps than general models for equivalent programs. This improves both speed and, indirectly, quality - the model can "see" more of its own reasoning within the context window.
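You can measure this efficiency difference on your own stack by counting tokens for the same snippet under a general tokenizer and a code-specialized one. A rough sketch - the two model names here are only examples; any pair of Hugging Face tokenizers you have access to works the same way:

from transformers import AutoTokenizer

snippet = "self.attention_weights = load_balancer_config.get('weights', {})  # -> dict"

# Example tokenizers - swap in whichever general and code models you are comparing.
for name in ("meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-Coder-7B-Instruct"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(snippet, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
    print("  pieces:", tok.convert_ids_to_tokens(ids))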
Math Reasoning: Chain-of-Thought as a First-Class Objective
Math-specialized models address the reasoning problem differently. The key insight from DeepSeek-Math and similar work is that the training objective must reward correct reasoning processes, not just correct final answers.
Consider the difference between these two training signals:
Answer-only training: The model generates "The answer is 42." Loss computed only on the final token.
Process-supervised training: The model generates a full chain of thought with each algebraic step labeled. A process reward model scores each step for correctness. Loss is shaped by step-level rewards, not just the final answer.
The second approach trains the model to reason carefully at each step. It learns to check intermediate results. It learns to recognize when a line of reasoning leads to a contradiction. This is fundamentally different from learning to produce plausible-looking mathematical text.
Formally, process reward training uses a reward signal $r_i$ at each reasoning step $s_i$:

$$R(s_1, \dots, s_n) = \sum_{i=1}^{n} r_i, \qquad r_i = \mathrm{PRM}(s_1, \dots, s_i)$$

where $s_i$ is the $i$-th reasoning step. This requires curated training data where human annotators (or verified automated systems) label whether each intermediate step is mathematically valid.
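A schematic sketch of how step-level rewards can shape the training signal - this is a simplified REINFORCE-style weighting, not any particular paper's implementation, and `step_log_probs` and `prm_scores` are hypothetical inputs you would obtain from your policy model and reward model:

import torch

def process_weighted_loss(step_log_probs: list[torch.Tensor],
                          prm_scores: list[float],
                          baseline: float = 0.5) -> torch.Tensor:
    """
    step_log_probs: summed log-probability of the tokens in each reasoning step
                    under the policy model (one scalar tensor per step).
    prm_scores:     process reward model score in [0, 1] for each step.
    Steps judged correct (score > baseline) are reinforced; incorrect steps are
    pushed down - unlike answer-only training, which only sees the final answer.
    """
    loss = torch.tensor(0.0)
    for logp, score in zip(step_log_probs, prm_scores):
        advantage = score - baseline      # positive for good steps, negative for bad ones
        loss = loss - advantage * logp    # per-step policy-gradient-style weighting
    return loss / max(len(step_log_probs), 1)

# Toy usage: three reasoning steps, the second flagged as invalid by the PRM.
fake_logps = [torch.tensor(-4.2), torch.tensor(-3.1), torch.tensor(-5.0)]
fake_scores = [0.9, 0.1, 0.8]
print(process_weighted_loss(fake_logps, fake_scores))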
The Model Landscape
Code Models: Current State
Math Models: Current State
The math model ecosystem is more concentrated. Three families dominate:
- DeepSeek-Math (7B): Trained on 120B math tokens mined from the web, augmented with synthetic problem-solution pairs. Achieves 51.7% on MATH benchmark, competing with GPT-4 on this task.
- Qwen2.5-Math (7B/72B): Strong across school, competition, and olympiad-level problems. The 72B variant reaches near-human performance on AMC/AIME.
- NuminaMath: Community-trained model specifically for competition mathematics.
For practical production use, the math models are typically used in one of two modes: as standalone reasoning engines for mathematical computation tasks, or as components in tool-augmented systems where the model writes code that gets executed by a Python interpreter (a pattern that sidesteps much of the hallucination problem in numerical computation).
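A minimal sketch of the tool-augmented pattern: instead of trusting the model's arithmetic, ask it to emit a short Python program, execute it in a separate process, and read the printed result. The prompt wording and the `generate_with_model` callable are placeholders for whatever model interface you already have.

import subprocess
import sys
import tempfile

def solve_with_code_execution(problem: str, generate_with_model) -> str:
    """generate_with_model: any callable mapping a prompt string to generated text."""
    prompt = (
        "Write a short Python program that computes the answer to the problem below "
        "and prints only the final number.\n\n" + problem
    )
    code = generate_with_model(prompt)
    # Execute the generated program with a hard timeout. In production this should
    # run inside a sandboxed container, not directly on the host.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()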
Evaluation Benchmarks
Understanding which benchmark measures what is critical for choosing a model for your specific use case. Benchmark performance does not automatically translate to production performance.
HumanEval
HumanEval (Chen et al., 2021) contains 164 hand-crafted Python programming problems. Each problem has a function signature, docstring, and a set of unit tests. The metric is pass@k - the probability that at least one of $k$ generated solutions passes all unit tests, estimated with the unbiased formula:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

where $n$ is the total number of samples generated per problem and $c$ is the number of correct samples.
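The estimator is small enough to implement directly; this sketch uses the numerically stable product form rather than computing binomial coefficients explicitly.

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill a set of k
    result = 1.0
    for i in range(n - c + 1, n + 1):
        result *= 1.0 - k / i
    return 1.0 - result

# Example: 200 samples per problem, 37 of them correct.
print(pass_at_k(n=200, c=37, k=1))   # 0.185 (= c/n for k=1)
print(pass_at_k(n=200, c=37, k=10))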
HumanEval measures: basic algorithmic reasoning, standard library familiarity, docstring-to-code translation.
HumanEval does NOT measure: large codebase reasoning, multi-file operations, debugging, code review, complex SQL, or any task requiring more than a single function.
Production relevance: Moderate. A model scoring 85%+ on HumanEval is a reasonable baseline for simple function completion tasks. But this benchmark is now widely considered saturated - most state-of-the-art models score above 80%, yet production quality varies significantly.
MBPP
Mostly Basic Python Problems (Austin et al., 2021) contains 974 crowd-sourced Python programming problems of beginner to intermediate difficulty, each with three unit tests; a 500-problem test split is commonly used for evaluation. MBPP is slightly broader than HumanEval in problem type but similarly limited in scope.
SWE-Bench
SWE-Bench (Jimenez et al., 2024) is a more realistic benchmark: real GitHub issues from popular Python repositories, with the model asked to produce a patch that resolves the issue. Evaluation requires the patch to pass the repository's existing test suite.
This benchmark measures: large codebase navigation, multi-file editing, understanding existing code, debugging from issue descriptions.
SWE-Bench scores are dramatically lower than HumanEval scores for the same models - top models achieve 12-25% resolution rates on the full benchmark, versus 80-90% on HumanEval. This gap illustrates exactly why HumanEval alone is insufficient for evaluating production code models.
MATH Benchmark
The MATH dataset (Hendrycks et al., 2021) contains 12,500 problems from high school mathematics competitions across seven subjects: prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Problems are rated 1-5 for difficulty.
The metric is exact match on the final answer after normalization (handling equivalent forms such as a fraction and its decimal expansion, e.g., $\tfrac{1}{2}$ and $0.5$).
MATH scores correlate strongly with performance on real mathematical reasoning tasks. The difficulty distribution makes it useful for distinguishing models across capability levels - unlike HumanEval, MATH is not yet saturated.
GSM8K
Grade School Math 8K (Cobbe et al., 2021) contains 8,500 grade school math word problems. These are significantly easier than MATH problems and primarily test arithmetic reasoning and word problem comprehension rather than advanced mathematical knowledge.
GSM8K is useful for evaluating models intended for business math, financial calculation, and similar applied numerical tasks.
Benchmark Performance Summary
| Model | HumanEval | MBPP | MATH | GSM8K | Context |
|---|---|---|---|---|---|
| DeepSeek-Coder-V2-Instruct (16B) | 90.2% | 76.2% | 75.7% | 94.9% | 128K |
| Qwen2.5-Coder-7B-Instruct | 88.4% | 83.5% | 67.4% | 91.3% | 128K |
| Qwen2.5-Coder-32B-Instruct | 92.7% | 90.2% | 79.8% | 95.9% | 128K |
| Code Llama 70B Instruct | 72.0% | 62.4% | 29.9% | 74.2% | 100K |
| StarCoder2-15B | 46.3% | 54.4% | 16.1% | 62.1% | 16K |
| LLaMA 3.1 8B Instruct (general) | 72.6% | 68.9% | 51.9% | 84.5% | 128K |
Note: LLaMA 3.1 8B is included as a reference point for a capable general model. The specialized models win on code/math - but the margin varies by task. At the 7-8B scale, specialized models have a clear edge. At larger scales, general models close the gap faster.
Architecture Flow: How a Code Model Processes a Completion Request
Code Examples
Loading and Running DeepSeek-Coder
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# Standard instruction-following format
messages = [
{
"role": "user",
"content": (
"Write a Python function that takes a list of integers and returns "
"all pairs that sum to a target value. Handle edge cases including "
"empty input, duplicate values, and negative numbers."
),
}
]
# Apply chat template - important: each model has its own format
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(
inputs,
max_new_tokens=1024,
temperature=0.1, # Low temperature for code - determinism matters
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens
generated = outputs[0][inputs.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
Fill-in-the-Middle (FIM) Completion
# FIM: complete the function body given signature above and usage below
# StarCoder2 PSM format: <fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>
# Each model has slightly different FIM token names - check the model card
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# The prefix: code before the cursor
prefix = """def calculate_discount(price: float, discount_pct: float) -> float:
\"\"\"
Apply a percentage discount to a price.
Args:
price: Original price in dollars. Must be positive.
discount_pct: Discount as a percentage (0-100).
Returns:
Discounted price. Never returns negative values.
\"\"\"
"""
# The suffix: code after the cursor (the function usage)
suffix = """
# Test cases
assert calculate_discount(100.0, 20.0) == 80.0
assert calculate_discount(50.0, 0.0) == 50.0
assert calculate_discount(100.0, 100.0) == 0.0
assert calculate_discount(100.0, 150.0) == 0.0 # Capped at 0
"""
# StarCoder2 FIM tokens
fim_prefix_token = "<fim_prefix>"
fim_suffix_token = "<fim_suffix>"
fim_middle_token = "<fim_middle>"
fim_input = f"{fim_prefix_token}{prefix}{fim_suffix_token}{suffix}{fim_middle_token}"
inputs = tokenizer(fim_input, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.2,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
completion = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print("Generated function body:")
print(completion)
Benchmarking a Code Model on Your Own Tasks
This is the most important code example in this lesson. Do not rely on published benchmarks alone. Evaluate on your actual distribution of tasks.
import json
import subprocess
import tempfile
import os
from pathlib import Path
from typing import Optional
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
def generate_code_solution(
model,
tokenizer,
problem_prompt: str,
temperature: float = 0.2,
max_new_tokens: int = 1024,
) -> str:
"""Generate a solution for a coding problem using the model."""
messages = [{"role": "user", "content": problem_prompt}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt",
).to(model.device)
with torch.no_grad():
outputs = model.generate(
inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
generated = outputs[0][inputs.shape[1]:]
return tokenizer.decode(generated, skip_special_tokens=True)
def extract_python_code(text: str) -> Optional[str]:
"""Extract Python code from a markdown code block."""
if "```python" in text:
start = text.find("```python") + len("```python")
end = text.find("```", start)
return text[start:end].strip()
elif "```" in text:
start = text.find("```") + 3
end = text.find("```", start)
return text[start:end].strip()
# If no code block, assume the whole response is code
return text.strip()
def run_tests_on_solution(solution_code: str, test_code: str) -> dict:
"""
Run unit tests against a generated solution.
Returns dict with passed, failed, error fields.
"""
with tempfile.TemporaryDirectory() as tmpdir:
solution_file = Path(tmpdir) / "solution.py"
test_file = Path(tmpdir) / "test_solution.py"
solution_file.write_text(solution_code)
# Test file imports from the solution
test_content = f"from solution import *\n\n{test_code}"
test_file.write_text(test_content)
result = subprocess.run(
["python", "-m", "pytest", str(test_file), "--tb=short", "-q"],
capture_output=True,
text=True,
cwd=tmpdir,
timeout=30,
)
passed = "passed" in result.stdout
failed = "failed" in result.stdout or "error" in result.stdout.lower()
return {
"passed": passed and not failed,
"stdout": result.stdout,
"stderr": result.stderr,
"returncode": result.returncode,
}
def evaluate_model_on_task_suite(
model_name: str,
task_suite: list[dict],
n_samples: int = 5,
) -> dict:
"""
Evaluate a model on a list of tasks. Each task has:
- prompt: the coding problem description
- test_code: pytest-style test code to evaluate correctness
- task_id: unique identifier
Returns pass@1 and pass@5 estimates.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
results = []
for task in task_suite:
task_results = []
for sample_idx in range(n_samples):
solution_text = generate_code_solution(
model, tokenizer, task["prompt"], temperature=0.8
)
code = extract_python_code(solution_text)
if code is None:
task_results.append(False)
continue
test_result = run_tests_on_solution(code, task["test_code"])
task_results.append(test_result["passed"])
print(
f" Task {task['task_id']}, sample {sample_idx}: "
f"{'PASS' if test_result['passed'] else 'FAIL'}"
)
results.append({
"task_id": task["task_id"],
"samples": task_results,
"pass_at_1": task_results[0],
"pass_at_k": any(task_results),
})
pass_at_1 = sum(r["pass_at_1"] for r in results) / len(results)
pass_at_k = sum(r["pass_at_k"] for r in results) / len(results)
return {
"model": model_name,
"pass_at_1": pass_at_1,
"pass_at_k": pass_at_k,
"n_samples": n_samples,
"task_results": results,
}
# Example task suite - replace with your actual tasks
example_tasks = [
{
"task_id": "db_query_null_handling",
"prompt": (
"Write a Python function `safe_aggregate(values: list) -> float` that "
"computes the mean of a list of numbers, treating None as missing data "
"to be excluded. Return 0.0 for empty or all-None input."
),
"test_code": """
def test_normal():
assert safe_aggregate([1.0, 2.0, 3.0]) == 2.0
def test_with_none():
assert safe_aggregate([1.0, None, 3.0]) == 2.0
def test_all_none():
assert safe_aggregate([None, None]) == 0.0
def test_empty():
assert safe_aggregate([]) == 0.0
""",
},
]
# Run the evaluation
# results = evaluate_model_on_task_suite(
# "deepseek-ai/deepseek-coder-6.7b-instruct",
# example_tasks,
# n_samples=5,
# )
# print(json.dumps(results, indent=2))
Math Model: Solving Problems with DeepSeek-Math
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re
def solve_math_problem(problem: str, model_name: str = "deepseek-ai/deepseek-math-7b-instruct") -> str:
"""
Solve a math problem using a specialized math model.
The model generates a chain-of-thought solution.
"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# DeepSeek-Math uses a specific prompt format that triggers CoT
prompt = f"Please solve the following math problem step by step:\n\n{problem}\n\nSolution:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=2048,
do_sample=False,  # Greedy decoding for math - determinism is critical
pad_token_id=tokenizer.eos_token_id,
)
generated = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(generated, skip_special_tokens=True)
def extract_final_answer(solution_text: str) -> str:
"""Extract the boxed or labeled final answer from a math solution."""
# Look for \boxed{...} pattern (common in math training data)
boxed_match = re.search(r"\\boxed\{([^}]+)\}", solution_text)
if boxed_match:
return boxed_match.group(1)
# Look for "The answer is X" pattern
answer_match = re.search(
r"(?:the answer is|therefore|=)\s*([0-9\-\+\./]+)",
solution_text,
re.IGNORECASE
)
if answer_match:
return answer_match.group(1)
return "Could not extract final answer"
# Example usage
problem = """
A company's revenue grows by 15% in year 1, declines by 8% in year 2,
and grows by 22% in year 3. If the initial revenue is $500,000,
what is the revenue at the end of year 3? Round to the nearest dollar.
"""
# solution = solve_math_problem(problem)
# print(solution)
# answer = extract_final_answer(solution)
# print(f"\nFinal answer: {answer}")
Deciding: Specialized vs General Model
Production Engineering Notes
Temperature Settings for Code
Code generation requires different temperature settings than conversational tasks. Low temperature (0.0-0.2) produces more deterministic, syntactically correct outputs. High temperature (0.7-1.0) produces more diverse solutions useful when sampling many candidates (pass@k evaluation or best-of-n selection).
In production: use temperature 0.1-0.2 for inline completion (user is waiting), and 0.7-0.8 combined with best-of-3 selection for complex generation tasks where you have time to run multiple samples.
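A sketch of the best-of-n pattern for non-interactive generation: sample a few candidates at higher temperature, then keep the first one that survives cheap checks - here only a syntax parse; in practice you would also run tests. The `generate_candidate` callable is a stand-in for your model call.

import ast

def best_of_n(prompt: str, generate_candidate, n: int = 3) -> str | None:
    """generate_candidate(prompt, temperature) -> code string; hypothetical helper."""
    candidates = [generate_candidate(prompt, temperature=0.8) for _ in range(n)]
    for code in candidates:
        try:
            ast.parse(code)  # cheapest filter: output must at least be valid Python
            return code
        except SyntaxError:
            continue
    return None  # caller can fall back to a low-temperature retry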
Context Window Management for Code
Long context is available but expensive. A 128K context model that processes a full repository on every request will be slow and costly. In practice:
- Use a retrieval step to identify the most relevant files/functions (BM25 or embedding search over your codebase index).
- Pass only the retrieved context plus the active file to the model.
- Keep the effective working context under 16K tokens for latency-sensitive use cases.
Tools like tree-sitter for AST-based chunking and chroma or qdrant for vector search are standard components in code assistant production stacks.
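A deliberately simple sketch of the retrieval step - plain token-overlap scoring rather than a real BM25 or embedding index, so it runs with no dependencies: chunk the repository into functions, score chunks against the query, and pass only the top few into the prompt.

import re
from pathlib import Path

def chunk_repo(repo_root: str) -> list[tuple[str, str]]:
    """Very rough chunking: one chunk per top-level function/class per file."""
    chunks = []
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for block in re.split(r"\n(?=def |class )", text):
            chunks.append((str(path), block))
    return chunks

def retrieve(query: str, chunks: list[tuple[str, str]], top_k: int = 5) -> list[tuple[str, str]]:
    """Rank chunks by shared word count with the query - a crude BM25 stand-in."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    def score(chunk: str) -> int:
        return len(q_tokens & set(re.findall(r"\w+", chunk.lower())))
    return sorted(chunks, key=lambda pc: score(pc[1]), reverse=True)[:top_k]

# context = "\n\n".join(
#     block for _, block in retrieve("where is the retry policy configured?", chunk_repo("."))
# )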
Handling Code Generation Failures
Never trust generated code without validation. A robust production pipeline always includes:
- Syntax check: Parse the output with Python's `ast.parse()` (or the equivalent for other languages) before attempting to run it.
- Static analysis: Run `pyflakes` or `ruff` for Python, `eslint` for JavaScript (a minimal sketch of these first two gates follows this list).
- Sandboxed execution: Run generated code in an isolated container with resource limits if it needs to be executed.
- Human review gate: For any code that will be deployed to production, keep a human in the loop regardless of how good your model is.
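A minimal sketch of the first two gates, assuming `ruff` is installed on the PATH (any linter invocable from the command line works the same way):

import ast
import subprocess

def validate_generated_python(code: str) -> dict:
    """Cheap pre-execution checks for model-generated Python."""
    # Gate 1: must parse. Catches truncated output and outright syntax errors.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"ok": False, "stage": "syntax", "detail": str(exc)}

    # Gate 2: static analysis. `ruff check` exits non-zero when it finds issues.
    lint = subprocess.run(
        ["ruff", "check", "--stdin-filename", "generated.py", "-"],
        input=code, capture_output=True, text=True,
    )
    if lint.returncode != 0:
        return {"ok": False, "stage": "lint", "detail": lint.stdout}

    return {"ok": True, "stage": "passed", "detail": ""}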
Quantization Tradeoffs for Code Models
Code models are large. A 7B model at float16 requires 14GB VRAM. A 34B model requires 68GB VRAM. Quantization reduces this significantly:
- 8-bit (bitsandbytes): ~7GB for 7B, minimal quality loss on code tasks
- 4-bit (GPTQ/AWQ): ~4GB for 7B, 2-3% degradation on HumanEval
- GGUF (llama.cpp): CPU-friendly format, useful for local development tools
For production inference, AWQ 4-bit quantization generally offers the best quality-speed-memory tradeoff. For local developer tools (VS Code extensions), GGUF with Q5_K_M quantization is standard.
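A sketch of loading a code model in 4-bit with bitsandbytes through transformers (AWQ and GPTQ checkpoints load through their own configs; this shows the on-the-fly NF4 path, with the model name reused from the earlier examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"  # any causal LM works here

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 generally degrades code quality least
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16, store weights in 4-bit
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
# Rough memory check: a ~7B model drops from ~14 GB in fp16 to roughly 4-5 GB in 4-bit.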
Common Mistakes
:::danger Using Benchmark Rankings as a Substitute for Evaluation HumanEval and MBPP rankings are published by every model team. They are measured on a fixed test set that the model community has had months to overfit. A model ranking 3rd on HumanEval may outperform the top-ranked model on your specific task. Always evaluate on a sample of your actual task distribution before deploying. :::
:::danger Skipping Syntax Validation on Generated Code
Generated code can contain syntax errors, especially for complex generation tasks or when the model is asked to produce code in an unfamiliar language. Running malformed code in a production environment causes failures that are harder to debug than the original problem. Always run ast.parse() (Python) or equivalent before executing any generated code.
:::
:::warning Using High Temperature for Production Code Completion Temperature 0.8+ produces diverse but often incorrect code. It is appropriate for evaluation (sampling many solutions to measure pass@k) but not for inline completion where the user expects a single, high-quality suggestion. Use temperature 0.1-0.2 for production completion endpoints. :::
:::warning Ignoring the FIM Token Format
Different code models use different special tokens for Fill-in-the-Middle. StarCoder2 uses <fim_prefix>, <fim_suffix>, <fim_middle>. DeepSeek-Coder uses <|fim_begin|>, <|fim_hole|>, <|fim_end|>. Using the wrong format produces garbage output without an obvious error message - the model will just generate confused text. Always check the model card for the specific FIM token format.
:::
:::warning Treating Math Model Output as Ground Truth Even specialized math models make errors, especially on multi-step problems with more than 10-15 algebraic steps. For any mathematically critical computation (financial calculations, scientific results, safety-critical numerics), treat the model output as a draft to be verified, not a ground truth answer. Use the model to generate a solution approach, then verify the final answer with a computer algebra system like SymPy. :::
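As a concrete illustration, the revenue problem from the earlier math-model snippet can be checked independently in a couple of lines with Python's decimal module (SymPy plays the same role for symbolic answers); a cheap recomputation is usually enough to confirm or reject the model's final answer.

from decimal import Decimal

# Independent check of the worked example: 15% growth, 8% decline, 22% growth on $500,000.
revenue = Decimal("500000") * Decimal("1.15") * Decimal("0.92") * Decimal("1.22")
print(round(revenue))  # 645380 - compare against the model's boxed/labeled final answer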
Interview Q&A
Q1: What is Fill-in-the-Middle training and why does it matter for code completion?
Fill-in-the-Middle (FIM) is a training technique where a span of text is extracted from a document, and the model is trained to predict that span given both the text before it (prefix) and the text after it (suffix). For standard language models trained only on next-token prediction, the model can only complete text at the end of a sequence. FIM training enables the model to fill in gaps in the middle of existing code.
This matters enormously for practical code editing. Real-world coding tasks frequently involve writing the body of a function where the signature is already defined above and the calling code is already written below. Without FIM training, a code model is only useful for appending new code at the end of a file - not for the most common editing patterns IDE users actually perform.
FIM also enables infilling tasks like completing a variable name in the middle of an expression, filling in a missing argument to a function call, or completing a conditional branch while preserving the structure around it.
The technical implementation uses three special tokens: a prefix delimiter, a suffix delimiter, and a middle delimiter. During pre-training, a fraction of training examples (typically 50%) are restructured with this format. The model learns to use suffix context as an additional conditioning signal during generation.
Q2: Why do code-specialized models often outperform much larger general models on coding tasks?
The key factor is training data composition, not model size. A general 70B model trained on 2 trillion tokens where 5% is code has seen roughly 100 billion code tokens. A specialized 7B model trained on 3 trillion tokens where 80% is code has seen 2.4 trillion code tokens - 24 times more code data despite being 10 times smaller in parameter count.
The representations learned during pre-training reflect the data distribution. A model that sees 24x more code develops richer internal representations for programming concepts: variable scope, type constraints, algorithm patterns, API semantics. These representations cannot be easily instilled through fine-tuning alone because fine-tuning adjusts the model's behavior, not its fundamental representational capacity.
There is also a token efficiency argument: code-specialized models use code-optimized vocabularies where common code constructs are single tokens. This means the model can "see" more of a codebase within its context window and its attention can operate over more semantically meaningful units.
The practical implication: for code-intensive production tasks at 7-13B scale, a code-specialized model often outperforms a general model 4-5x its size. At 70B+ scale, general models start to close the gap because their absolute data volume becomes large even at low code fractions.
Q3: How would you evaluate two code models to choose between them for your production use case?
Published benchmarks are a starting point, not an answer. The correct evaluation process:
First, define your task distribution. What languages are involved? What types of problems (function generation, debugging, refactoring, SQL, configuration)? What is the typical input length? What is the acceptable output length? This distribution defines what you need to measure.
Second, build a task suite from your actual usage. Take 50-100 representative examples from your existing codebase, documentation, or user requests. For each, define what a correct answer looks like - ideally as unit tests that can be run automatically.
Third, generate multiple samples per task with each model (n=5 is usually sufficient for pass@k estimation). Use temperature 0.8 for sampling, but also evaluate at temperature 0.2 for your primary metric.
Fourth, measure pass@1 (primary metric for production) and pass@5 (ceiling of what best-of-n selection can achieve). Compute confidence intervals - 50 tasks gives you wide confidence intervals, 200 tasks gives reliable estimates.
Fifth, measure latency. A model that scores 5% higher on pass@1 but is 3x slower may be the worse choice in a latency-sensitive application.
Finally, do qualitative review of the failures. Sometimes the failure modes matter more than the pass rate - a model that fails on predictable, edge-case inputs is more deployable than one that fails randomly.
Q4: What is process reward training and why does it improve mathematical reasoning?
Standard training computes loss only on the final answer. A model trained this way learns to produce answers that look correct, which is different from learning to reason correctly. A model can generate plausible-looking derivations with subtle errors in intermediate steps that happen to produce the right final answer on common problem types while failing on less common ones.
Process reward training (also called process supervision) trains a separate model - the process reward model (PRM) - to evaluate the correctness of each step in a reasoning trace. The main model is then trained with reinforcement learning signals that reward correct intermediate steps, not just correct final answers.
The practical effect is that the model learns to check its own reasoning. It learns to recognize when an algebraic step is invalid. It learns to backtrack when a derivation leads to an implausible intermediate result. This produces qualitatively different behavior on hard multi-step problems where the naive "look plausible" heuristic breaks down.
The challenge is data: process reward training requires annotations at the step level, which is expensive to produce. DeepSeek-Math addressed this by using Monte Carlo rollouts to estimate step-level correctness without human annotation - a clever approach that scales to large datasets.
Q5: When should you NOT use a specialized code model?
There are several scenarios where a general model is the better choice:
Mixed-task applications: If your system needs to handle code generation, document summarization, customer support, and general Q&A within the same session or pipeline, a specialized code model will degrade on non-code tasks. A general model handles the full distribution better.
Very large scale with GPT-4-class capability: At 70B+ parameters, general models like LLaMA 3.1 70B have seen enough code data in absolute terms that the gap with specialized 7B models narrows significantly. If you are already deploying a large general model, the incremental gain from switching to a code-specialized variant may not justify the operational complexity.
Rare or niche languages: Code-specialized models are trained predominantly on popular languages (Python, JavaScript, TypeScript, Java, C++, Go, Rust). For specialized domains like COBOL, PL/SQL, hardware description languages, or domain-specific languages, the training data for specialized models may be no richer than for general models - both have seen limited data.
When latency trumps quality: If you need sub-100ms completion latency, the model size that achieves this may be a 1B-3B general model rather than a 7B specialized model. Measure what fits in your latency budget before optimizing for benchmark quality.
The decision is always empirical: measure on your task, under your constraints, before committing.
Q6: How do code model benchmarks like HumanEval relate to real production performance?
HumanEval measures performance on isolated single-function programming problems with known unit tests, in Python, at beginner-to-intermediate difficulty. Production code generation tasks differ in almost every dimension: they involve existing codebases with complex dependencies, multiple files, real bug reports, underdocumented APIs, legacy patterns, and the need to maintain consistency with surrounding code.
The most reliable evidence of this gap is SWE-Bench, where models that score 85%+ on HumanEval typically resolve only 10-25% of real GitHub issues. The correlation between HumanEval ranking and SWE-Bench ranking exists but is imperfect - some models that rank lower on HumanEval outperform higher-ranked models on SWE-Bench because they generalize differently to realistic tasks.
For practitioners, this means HumanEval is useful for quickly filtering obviously weak models but should never be the final selection criterion. The right approach is to treat published benchmarks as a shortlist and build your own evaluation on representative samples of your actual production tasks.
