Content Generation for Education
Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, EdTech Engineer, Curriculum AI Engineer
Opening: The Content Scale Problem
A math teacher creates a great lesson on the chain rule in calculus. She writes five worked examples, twelve practice problems at three difficulty levels, a summary explanation, and a quiz with distractors carefully chosen to reveal common misconceptions. This takes her about four hours. She teaches a class of 30, so her effort works out to roughly eight minutes of material per student for this one concept - and reusing the lesson year after year over a 30-year career only spreads that effort thinner, with none of it personalized to an individual student.
Now consider what it means to build an adaptive learning platform at scale. You need content for 50 mathematical topics across 6 grade levels - worked examples at 3 difficulty levels, practice problems at 4 difficulty levels, explanations at 2 reading levels, quizzes built from 4-option MCQs with carefully designed distractors, and hints at 3 scaffold levels. That is roughly 50 × 6 × (3 + 4 + 2 + 1 + 3) content items - nearly 4,000 items - for mathematics alone. A full K-12 curriculum across subjects would require hundreds of thousands of content items to cover adequately. Human content creation at this scale takes years and costs millions.
LLMs change this equation dramatically. GPT-4 can generate a practice problem for "two-step linear equations at 7th grade level" in under a second. It can generate a worked example with intermediate steps, an explanation of a common misconception, and three distractors for an MCQ - all in one prompt. At scale, this opens the possibility of adaptive systems that generate content dynamically rather than serving from a fixed catalog. A student who has exhausted the available problems at a particular difficulty level can be served freshly generated ones rather than hitting a dead end.
The catch is quality. LLMs hallucinate. They generate mathematically incorrect worked examples. They write explanations at the wrong reading level. They produce MCQ distractors that are obviously wrong to any student who knows anything about the topic. They write questions that have multiple correct answers. They generate "educational content" that is technically accurate but pedagogically ineffective - missing the scaffold that helps a struggling student, or the challenge that pushes a capable one.
This lesson covers the full pipeline for AI-generated educational content: content types and quality criteria, automatic question generation (AQG), Bloom's taxonomy alignment, distractor generation for MCQs, reading level adaptation, hallucination mitigation, and the human review pipeline that sits between generation and delivery to students.
Why This Exists: The Content Bottleneck in Adaptive Systems
Static content catalogs create a ceiling on the personalization adaptive learning systems can achieve. A system can sequence and recommend content from its catalog adaptively, but if the catalog has only 20 practice problems at a given difficulty level, students who need 50 practice problems will run out. The system then either repeats problems (reducing value) or levels the student up prematurely.
Dynamic content generation breaks this ceiling. A student who needs 200 practice problems on two-step equations can receive 200 distinct problems generated on demand, each with worked solutions and hints. A student learning at a 4th-grade reading level can receive explanations of 8th-grade concepts automatically adapted to accessible language. A student preparing for a specific exam can receive questions in the style of that exam, generated from the same topic coverage.
The secondary use case is content creation tools for teachers. A teacher can describe the learning objective they want to assess, specify the student level and cognitive level (from Bloom's taxonomy), and receive 10 candidate questions in seconds. The teacher reviews, edits, and selects rather than writing from scratch. This does not eliminate the teacher's role - it amplifies their capacity.
Copyright and licensing are real constraints in this domain. Human-authored educational content from textbooks is copyrighted. Generating "inspired by" content avoids this, but generated content must be verified to not reproduce copyrighted material verbatim. Training data for LLMs often includes educational content, and output may inadvertently reproduce it.
Historical Context: From Templates to Transformers
Pre-LLM AQG: The field of automatic question generation dates to the 1970s, when rule-based systems transformed sentences into questions by moving wh-words and inverting subject-verb order. "The mitochondrion generates ATP" became "What does the mitochondrion generate?" These systems were brittle and grammar-focused - they created syntactically correct questions but not necessarily pedagogically useful ones.
2013-2018 - Neural AQG: Sequence-to-sequence models (attention-based LSTMs) enabled end-to-end generation of questions from passages. The SQuAD dataset (2016) provided training data: passages and human-generated questions, which could be inverted - given a passage and answer, generate the question. Quality improved but models still struggled with multi-sentence reasoning.
2019-2022 - Pre-trained LMs for Education: T5 and GPT-2/3 fine-tuned on educational content demonstrated strong question generation. The Wiley AQG system (2021) used T5 fine-tuned on textbook question-answer pairs and deployed in production.
2023+ - Instruction-Tuned LLMs: GPT-4, Claude, and Llama-2-chat can generate educational content following complex natural language instructions: "Generate a Bloom's Level 3 (Apply) question about photosynthesis for 8th graders, with an answer and two plausible distractors." This instruction-following capability dramatically lowered the engineering burden for educational content generation.
Core Concepts
Educational Content Types
Educational content spans a hierarchy of cognitive demands. Understanding this is prerequisite to generating it correctly:
Factual recall items: "What is the formula for the area of a circle?" Tests memory. Bloom's Level 1 (Remember). These are easy to generate but have limited educational value if overused.
Explanations: Describe a concept in accessible language with examples. Good explanations move from concrete to abstract, use analogies, and anticipate misconceptions. These are harder for LLMs to generate correctly because they require pedagogical structure, not just factual accuracy.
Worked examples: Show the step-by-step solution to a problem with each step explained. The research on "worked example effect" (Sweller, Cooper 1985) shows these are among the most effective learning tools for novices. LLMs generate the steps but often skip the hardest step or fail to explain the reasoning behind each step.
Practice problems: Problems requiring the student to produce a solution. Quality requires correct answers, appropriate difficulty calibration, and - for constructed response problems - a rubric or model solution.
Multiple Choice Questions (MCQs): One correct answer plus 3-4 distractors. The distractors are the hard part: they must be plausible enough to probe understanding but clearly wrong given mastery. Well-designed distractors target specific misconceptions.
Analogies: Explain a new concept using a familiar one. "Electricity is like water in pipes" - useful for intuition building. Hard for LLMs because effective analogies require knowing what the student already understands.
Socratic questions: Open-ended questions designed to prompt reflection, not test recall. "Why might an increase in supply lower price?" These are rarely generated correctly by LLMs because they require understanding of what a student is likely to be confused about.
Bloom's Taxonomy Alignment
Bloom's Revised Taxonomy (Anderson et al., 2001) provides a six-level hierarchy:
| Level | Verb examples | Example question |
|---|---|---|
| 1 - Remember | recall, identify | What is Newton's 2nd law? |
| 2 - Understand | explain, summarize | Explain what F=ma means in your own words |
| 3 - Apply | solve, use | Given F=10N and m=2kg, find acceleration |
| 4 - Analyze | compare, distinguish | Compare Newton's 2nd law to Hooke's law |
| 5 - Evaluate | assess, justify | Is Newton's 2nd law a good model for subatomic particles? |
| 6 - Create | design, construct | Design an experiment to verify Newton's 2nd law |
LLMs are good at generating Levels 1-3 (recall, explanation, application) and weaker at Levels 4-6 (analysis, evaluation, creation) because higher levels require more nuanced understanding of the domain and the student's cognitive state.
Quality Criteria for Educational Content
Accuracy: The content must be factually and mathematically correct. This is the hardest criterion for LLMs - they can produce confident, fluent incorrect mathematical claims.
Age-appropriateness: Vocabulary, sentence complexity, and abstraction level must match the target age. An explanation of photosynthesis for a 5th grader differs from one for a 10th grader.
Scaffolding: Good educational content does not assume the student already understands the hard part. It builds from what the student knows to what they need to know, explicitly bridging the gap.
Alignment with learning objective: A question about photosynthesis should test photosynthesis, not reading comprehension or vocabulary knowledge (unless those are the learning objectives).
Uniqueness and diversity: In a large set of generated problems, problems should not be trivially similar to each other. "2x + 3 = 7" and "2x + 5 = 9" are the same structural problem with different constants.
Distractors probe misconceptions: For MCQs, wrong answers should not be randomly wrong - they should be plausible to a student with a specific misconception. "32 meters" as a distractor for a free-fall problem is better than "5 meters" because it targets the common error of using the wrong formula.
Automatic Question Generation (AQG)
The dominant paradigm for AQG from text: given a passage $p$ and an answer span $a$, generate a question $q$ such that $q$ is answerable from $p$ and has answer $a$.
Modern approach using instruction-following LLMs:
Passage: {passage text}
Answer: {answer phrase or sentence}
Generate a question for the passage that has this answer.
The question should be at Bloom's Level {level}.
The question should be appropriate for {grade level} students.
The reverse direction also works: given a topic and learning objective, generate both question and answer without a source passage. This is riskier for hallucination but more flexible for synthetic dataset generation.
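A minimal sketch of the forward direction, filling the prompt template above and calling an OpenAI-compatible client in the same style as the later code examples (the client setup and parameter choices are assumptions):

```python
AQG_PROMPT = """Passage: {passage}
Answer: {answer}
Generate a question for the passage that has this answer.
The question should be at Bloom's Level {level}.
The question should be appropriate for {grade_level} students."""

def generate_question_from_passage(passage, answer, level, grade_level, llm_client, model="gpt-4o"):
    """Answer-aware question generation: passage + answer span -> question."""
    prompt = AQG_PROMPT.format(passage=passage, answer=answer,
                               level=level, grade_level=grade_level)
    response = llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```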
Distractor generation is the hardest part of MCQ generation. Approaches:
- Semantic similarity distractors: Find semantically similar but incorrect terms or values from a knowledge base. For "What is the capital of France?", distractors "Berlin", "Madrid", "Rome" are geographically and taxonomically similar.
- Misconception-based distractors: Predict what incorrect answer a student with a specific misconception would give. Requires a misconception catalog for the domain.
- LLM-generated distractors with constraint: Prompt the LLM to generate distractors that are "plausible to a student who has not yet mastered this concept but are clearly wrong to a student who has."
Reading Level Adaptation
Different students need the same concept explained at different linguistic complexity levels. Readability is typically measured by:
Flesch-Kincaid Grade Level:

$$\text{FK} = 0.39 \times \frac{\text{total words}}{\text{total sentences}} + 11.8 \times \frac{\text{total syllables}}{\text{total words}} - 15.59$$

Gunning Fog Index:

$$\text{Fog} = 0.4 \times \left(\frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{complex words}}{\text{words}}\right)$$

where complex words have three or more syllables.
LLMs can rewrite content at a target reading level, but the key challenge is semantic preservation - the rewrite should convey the same meaning at a simpler level, not lose key information. A common failure: simplification removes the very information needed to understand the concept.
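One rough automated signal for semantic preservation is to check that the technical terms of the original survive in the rewrite. The sketch below uses word length as a crude proxy for "technical term" - an assumption, not a standard - and only flags candidates for closer expert review; it cannot confirm that a rewrite is educationally complete.

```python
def key_term_retention(original: str, simplified: str, min_length: int = 8) -> float:
    """Fraction of long (likely technical) terms from the original that reappear in the rewrite."""
    terms = {w.lower().strip(".,;:!?") for w in original.split() if len(w) >= min_length}
    if not terms:
        return 1.0
    simplified_lower = simplified.lower()
    retained = sum(1 for t in terms if t in simplified_lower)
    return retained / len(terms)
```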
Hallucination Mitigation in Educational Content
Hallucination in educational content is more harmful than in general LLM use because students trust educational sources. A student who receives an incorrect worked example and practices from it is building wrong understanding.
Three mitigation layers:
- Constrained generation: Require the LLM to cite the steps of its reasoning and show intermediate calculations inline. Errors in intermediate steps are easier to detect.
- Automated fact-checking: For mathematical content, parse and execute the arithmetic. For factual content, query a knowledge base for entity verification.
- Human review pipeline: All AI-generated content should pass through human review before student-facing deployment. The review workflow should prioritize flagged content (where automated checks found inconsistencies) and sample from the content that passed automated checks.
Mermaid Diagram: Educational Content Generation Pipeline
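The pipeline described in this lesson, end to end - generation from a content spec, automated checks, tiered human review, and versioned delivery:

```mermaid
flowchart LR
    A["Content spec<br/>topic, grade, Bloom's level,<br/>known misconceptions"] --> B["LLM generation"]
    B --> C["Automated checks<br/>arithmetic, readability, format"]
    C -->|flagged| D["Human review<br/>100% of flagged items"]
    C -->|passing| E["Human review<br/>sample of passing items"]
    D --> F{"Approved?"}
    E --> F
    F -->|no| B
    F -->|yes| G["Versioned content catalog"]
    G --> H["Student-facing delivery<br/>+ performance tracking"]
```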
Code Examples
Bloom's Taxonomy Question Generator
import json
from typing import Dict, List, Optional
from dataclasses import dataclass
BLOOMS_LEVELS = {
1: {"name": "Remember", "verbs": ["recall", "identify", "list", "name", "define"]},
2: {"name": "Understand", "verbs": ["explain", "describe", "summarize", "paraphrase", "interpret"]},
3: {"name": "Apply", "verbs": ["solve", "calculate", "use", "demonstrate", "apply"]},
4: {"name": "Analyze", "verbs": ["compare", "contrast", "distinguish", "examine", "break down"]},
5: {"name": "Evaluate", "verbs": ["assess", "judge", "justify", "critique", "recommend"]},
6: {"name": "Create", "verbs": ["design", "construct", "formulate", "develop", "compose"]},
}
QUESTION_GENERATION_PROMPT = """You are an expert curriculum developer creating educational assessment questions.
Topic: {topic}
Subject: {subject}
Grade Level: {grade_level}
Bloom's Taxonomy Level: {blooms_level} - {blooms_name}
Action verbs for this level: {blooms_verbs}
Question Type: {question_type}
Generate {n_questions} questions at exactly Bloom's Level {blooms_level} ({blooms_name}).
The questions must require students to {blooms_action} rather than just recall.
For each question include:
- The question text
- The correct answer
- A brief explanation of why this answer is correct
- If MCQ: three plausible distractors that target common misconceptions
Common misconceptions about this topic to target with distractors:
{misconceptions}
Return a JSON object with a single key "questions" whose value is an array of question objects with keys:
question, answer, explanation, distractors (list of 3 strings, only for MCQ)
JSON object:"""
@dataclass
class GeneratedQuestion:
question: str
answer: str
explanation: str
distractors: List[str]
topic: str
blooms_level: int
grade_level: str
question_type: str
def generate_questions(
topic: str,
subject: str,
grade_level: str,
blooms_level: int,
question_type: str, # "mcq", "short_answer", "worked_example"
n_questions: int,
misconceptions: List[str],
llm_client,
model: str = "gpt-4o"
) -> List[GeneratedQuestion]:
"""
Generate Bloom's taxonomy-aligned educational questions.
Args:
topic: specific topic (e.g., "two-step linear equations")
subject: subject area (e.g., "mathematics")
grade_level: target grade (e.g., "7th grade")
blooms_level: Bloom's level 1-6
question_type: "mcq", "short_answer", or "worked_example"
n_questions: number of questions to generate
misconceptions: known misconceptions to target with distractors
llm_client: initialized OpenAI client
Returns:
list of GeneratedQuestion objects
"""
level_info = BLOOMS_LEVELS[blooms_level]
misconceptions_str = "\n".join(f"- {m}" for m in misconceptions) if misconceptions else "None specified"
# Map Bloom's level to action language
blooms_actions = {
1: "recall and identify specific facts",
2: "explain and summarize concepts in their own words",
3: "apply procedures to solve new problems",
4: "analyze relationships and make comparisons",
5: "evaluate and justify choices among alternatives",
6: "create original work or design novel solutions"
}
prompt = QUESTION_GENERATION_PROMPT.format(
topic=topic,
subject=subject,
grade_level=grade_level,
blooms_level=blooms_level,
blooms_name=level_info["name"],
blooms_verbs=", ".join(level_info["verbs"]),
blooms_action=blooms_actions[blooms_level],
question_type=question_type,
n_questions=n_questions,
misconceptions=misconceptions_str
)
response = llm_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
max_tokens=2000,
response_format={"type": "json_object"}
)
try:
raw = json.loads(response.choices[0].message.content)
items = raw if isinstance(raw, list) else raw.get("questions", [])
except (json.JSONDecodeError, KeyError):
return []
questions = []
for item in items:
if not isinstance(item, dict):
continue
q = GeneratedQuestion(
question=item.get("question", ""),
answer=item.get("answer", ""),
explanation=item.get("explanation", ""),
distractors=item.get("distractors", []),
topic=topic,
blooms_level=blooms_level,
grade_level=grade_level,
question_type=question_type
)
if q.question and q.answer:
questions.append(q)
return questions
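A minimal usage sketch, assuming the openai v1 SDK with an `OPENAI_API_KEY` in the environment; the topic and misconception list are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

questions = generate_questions(
    topic="two-step linear equations",
    subject="mathematics",
    grade_level="7th grade",
    blooms_level=3,
    question_type="mcq",
    n_questions=5,
    misconceptions=[
        "Students subtract before dividing regardless of order of operations",
        "Students forget to apply the operation to both sides of the equation",
    ],
    llm_client=client,
)
for q in questions:
    print(q.question, "->", q.answer)
```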
MCQ Distractor Generation
from typing import List, Tuple
import json
DISTRACTOR_PROMPT = """You are an expert assessment designer creating distractors for a multiple choice question.
Subject: {subject}
Topic: {topic}
Grade Level: {grade_level}
Question: {question}
Correct Answer: {correct_answer}
Generate exactly 3 distractors (wrong answers) for this question.
Rules for good distractors:
1. Each distractor must be WRONG but PLAUSIBLE to a student who has a specific misconception
2. Each distractor should be similar in length and format to the correct answer
3. Do NOT use "All of the above", "None of the above", or obviously silly answers
4. Explain which misconception each distractor targets
Format: JSON object with key "distractors" whose value is an array of objects with keys "distractor" and "misconception_targeted"
JSON object:"""
def generate_distractors(
question: str,
correct_answer: str,
subject: str,
topic: str,
grade_level: str,
llm_client,
model: str = "gpt-4o"
) -> List[Tuple[str, str]]:
"""
Generate MCQ distractors targeting specific misconceptions.
Returns:
List of (distractor_text, misconception_targeted) tuples
"""
prompt = DISTRACTOR_PROMPT.format(
subject=subject,
topic=topic,
grade_level=grade_level,
question=question,
correct_answer=correct_answer
)
response = llm_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.5,
max_tokens=500,
response_format={"type": "json_object"}
)
try:
raw = json.loads(response.choices[0].message.content)
items = raw if isinstance(raw, list) else raw.get("distractors", [])
return [(item["distractor"], item["misconception_targeted"])
for item in items if "distractor" in item]
except (json.JSONDecodeError, KeyError):
return []
def validate_mcq(question: str, correct_answer: str, distractors: List[str]) -> dict:
"""
Validate MCQ quality: check distractors are not correct,
not too similar to each other, not obviously wrong.
"""
issues = []
# Check we have the right number of distractors
if len(distractors) < 3:
issues.append("fewer than 3 distractors")
# Check for answer duplicates (distractor == correct answer)
for d in distractors:
if d.lower().strip() == correct_answer.lower().strip():
issues.append(f"distractor '{d}' matches correct answer")
# Check for obvious placeholder content
placeholders = ["option a", "option b", "none of the above", "all of the above"]
for d in distractors:
if any(p in d.lower() for p in placeholders):
issues.append(f"distractor '{d}' contains placeholder text")
# Check for very short distractors (likely low quality)
for d in distractors:
if len(d.split()) < 2:
issues.append(f"distractor '{d}' is very short (likely low quality)")
return {
"valid": len(issues) == 0,
"issues": issues,
"distractor_count": len(distractors)
}
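A sketch wiring the two functions together - generate distractors, then gate on the automated validation (client setup as in the earlier example; the regeneration policy in the comment is an assumption):

```python
question = "Solve for x: 2x + 3 = 11. What is x?"
correct = "x = 4"

pairs = generate_distractors(
    question=question,
    correct_answer=correct,
    subject="mathematics",
    topic="two-step linear equations",
    grade_level="7th grade",
    llm_client=client,
)
distractors = [text for text, _ in pairs]

report = validate_mcq(question, correct, distractors)
if not report["valid"]:
    # One possible policy: log the issues and regenerate rather than ship the item
    print("Rejected MCQ:", report["issues"])
```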
Reading Level Classifier and Adapter
import re
def count_syllables(word: str) -> int:
"""Approximate syllable count using vowel-group heuristic."""
word = word.lower().strip(".,!?;:'\"")
if len(word) <= 3:
return 1
word = re.sub(r"[^aeiouy]", " ", word)
syllables = len(word.split())
return max(1, syllables)
def flesch_kincaid_grade(text: str) -> float:
"""
Flesch-Kincaid Grade Level formula.
Approximates US grade level needed to understand the text.
"""
sentences = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
words = text.split()
if not sentences or not words:
return 0.0
n_sentences = len(sentences)
n_words = len(words)
n_syllables = sum(count_syllables(w) for w in words)
asl = n_words / n_sentences # Average sentence length
asw = n_syllables / n_words # Average syllables per word
fk = 0.39 * asl + 11.8 * asw - 15.59
return round(fk, 1)
def gunning_fog(text: str) -> float:
"""Gunning Fog readability index."""
sentences = [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]
words = text.split()
if not sentences or not words:
return 0.0
n_sentences = len(sentences)
n_words = len(words)
complex_words = sum(1 for w in words if count_syllables(w) >= 3
and not w.endswith(('es', 'ed', 'ing')))
fog = 0.4 * (n_words / n_sentences + 100 * complex_words / n_words)
return round(fog, 1)
READING_LEVEL_ADAPT_PROMPT = """You are an expert educational content writer.
Rewrite the following educational explanation at a {target_grade} reading level.
Original text (written at approximately grade {current_grade}):
---
{text}
---
Requirements:
1. Preserve ALL factual information - do not simplify by removing key concepts.
2. Use shorter sentences (target: {target_sentences} words average).
3. Replace technical vocabulary with simpler synonyms where possible, but define unavoidable technical terms inline.
4. Use concrete examples and analogies familiar to {target_grade} students.
5. Do not talk down to students - just use accessible language.
Return only the rewritten text, no preamble.
Rewritten text:"""
def adapt_reading_level(
text: str,
target_grade: int,
llm_client,
model: str = "gpt-4o"
) -> dict:
"""
Adapt educational content to a target reading level.
Args:
text: original educational content
target_grade: target US grade level (1-12)
llm_client: initialized OpenAI client
Returns:
dict with adapted_text, original_fk, adapted_fk
"""
original_fk = flesch_kincaid_grade(text)
target_sentences_map = {
range(1, 4): 8, # Grades 1-3: 8 words avg
range(4, 7): 12, # Grades 4-6: 12 words avg
range(7, 10): 16, # Grades 7-9: 16 words avg
range(10, 13): 20 # Grades 10-12: 20 words avg
}
target_sentences = 15
for grade_range, target_len in target_sentences_map.items():
if target_grade in grade_range:
target_sentences = target_len
break
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(target_grade, "th")
    prompt = READING_LEVEL_ADAPT_PROMPT.format(
        target_grade=f"{target_grade}{suffix} grade",  # e.g. "1st grade", "7th grade"
        current_grade=round(original_fk),
        text=text,
        target_sentences=target_sentences
    )
response = llm_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.3,
max_tokens=len(text.split()) * 2
)
adapted_text = response.choices[0].message.content.strip()
adapted_fk = flesch_kincaid_grade(adapted_text)
return {
"adapted_text": adapted_text,
"original_fk_grade": original_fk,
"adapted_fk_grade": adapted_fk,
"target_grade": target_grade,
"adaptation_successful": abs(adapted_fk - target_grade) < 2.0
}
Educational Content Quality Scorer
from typing import Dict, List
import re
class ContentQualityScorer:
"""
Automated quality scorer for AI-generated educational content.
Checks accuracy signals, readability alignment, and structural quality.
"""
def __init__(self):
self.fk_grade = flesch_kincaid_grade # From above
def score_explanation(
self,
text: str,
target_grade: int,
topic: str
) -> Dict:
"""Score an AI-generated explanation for educational quality."""
scores = {}
# 1. Readability alignment
fk = self.fk_grade(text)
grade_error = abs(fk - target_grade)
scores['readability'] = max(0, 1 - grade_error / 5)
# 2. Length appropriateness
word_count = len(text.split())
target_words = 100 + target_grade * 15 # Heuristic: higher grade = longer
length_ratio = word_count / target_words
        scores['length'] = max(0.0, 1 - abs(1 - length_ratio) * 0.5)  # clamp so overly long texts do not score negative
# 3. Structural quality
has_example = any(phrase in text.lower() for phrase in
["for example", "such as", "for instance", "consider"])
has_definition = any(phrase in text.lower() for phrase in
["is defined as", "means that", "refers to", "is when"])
scores['has_example'] = float(has_example)
scores['has_definition'] = float(has_definition)
# 4. Topic mention
topic_words = topic.lower().split()
text_lower = text.lower()
topic_coverage = sum(1 for w in topic_words if w in text_lower) / len(topic_words)
scores['topic_coverage'] = topic_coverage
        # 5. Red flags (hallucination signals in math content):
        # extract simple arithmetic claims like "3 + 4 = 7" and check them
        math_claim_pattern = r'(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)'
        math_errors = 0
        for match in re.finditer(math_claim_pattern, text):
            a, op, b, claimed = (int(match.group(1)), match.group(2),
                                 int(match.group(3)), int(match.group(4)))
            actual = {'+': a + b, '-': a - b, '*': a * b,
                      '/': a / b if b != 0 else float('nan')}[op]
            if actual != claimed:
                math_errors += 1
        scores['math_errors_detected'] = math_errors
# Overall score
scores['overall'] = (
0.25 * scores['readability'] +
0.15 * scores['length'] +
0.20 * scores['has_example'] +
0.15 * scores['has_definition'] +
0.25 * scores['topic_coverage']
)
return scores
def score_mcq(
self,
question: str,
correct_answer: str,
distractors: List[str]
) -> Dict:
"""Score MCQ quality."""
scores = {}
# Distractor count
scores['has_enough_distractors'] = float(len(distractors) >= 3)
# Distractor length consistency
answer_len = len(correct_answer.split())
distractor_lens = [len(d.split()) for d in distractors]
if distractor_lens:
avg_distractor_len = sum(distractor_lens) / len(distractor_lens)
length_consistency = 1 - abs(answer_len - avg_distractor_len) / max(answer_len, 1)
scores['length_consistency'] = max(0, length_consistency)
else:
scores['length_consistency'] = 0.0
# Question has question mark
scores['proper_question_format'] = float(question.strip().endswith('?'))
# No duplicates among options
all_options = [correct_answer.lower()] + [d.lower() for d in distractors]
scores['no_duplicate_options'] = float(len(set(all_options)) == len(all_options))
scores['overall'] = (
0.30 * scores['has_enough_distractors'] +
0.25 * scores['length_consistency'] +
0.20 * scores['proper_question_format'] +
0.25 * scores['no_duplicate_options']
)
return scores
Automated Fact-Checking Pipeline
import sympy
from typing import Tuple, Optional
import re
def validate_math_expression(expression: str) -> Tuple[Optional[bool], Optional[str]]:
    """
    Validate a mathematical equality or expression.
    Returns (is_valid, error_message); is_valid is None when the claim
    cannot be parsed or checked (unverifiable, not necessarily wrong).
    """
    # Remove LaTeX formatting
    clean = expression.replace('$$', '').replace('$', '')
    clean = re.sub(r'\\[a-z]+', '', clean)  # Remove LaTeX commands
    clean = clean.replace('^', '**').replace('×', '*').replace('÷', '/')
    # Try to parse and evaluate
    try:
        if '=' not in clean:
            return None, "No equality to check"
        left, right = clean.split('=', 1)
        left_val = sympy.sympify(left.strip())
        right_val = sympy.sympify(right.strip())
        diff = sympy.simplify(left_val - right_val)
        if diff != 0:
            return False, f"LHS ({left.strip()}) != RHS ({right.strip()}): difference = {diff}"
        return True, None
    except Exception as e:
        return None, f"Parse error: {str(e)}"  # Cannot validate, not necessarily wrong
def extract_math_claims(text: str) -> list:
"""Extract mathematical equalities from educational text."""
# Pattern: numbers and operators around an equals sign
pattern = r'[$]?[^$\n]*[=][^$\n]*[$]?'
candidates = re.findall(pattern, text)
# Filter to ones that look like equations
return [c for c in candidates if any(d.isdigit() for d in c)]
def fact_check_educational_content(text: str) -> dict:
"""
Automated fact-checking for AI-generated educational content.
Currently validates mathematical expressions.
Extend with knowledge base lookups for factual claims.
"""
math_claims = extract_math_claims(text)
validation_results = []
for claim in math_claims:
is_valid, error = validate_math_expression(claim)
if is_valid is False:
validation_results.append({
"claim": claim,
"status": "INCORRECT",
"error": error
})
elif is_valid is True:
validation_results.append({
"claim": claim,
"status": "CORRECT",
"error": None
})
else:
validation_results.append({
"claim": claim,
"status": "UNVERIFIABLE",
"error": error
})
errors = [r for r in validation_results if r["status"] == "INCORRECT"]
return {
"total_math_claims": len(math_claims),
"errors_found": len(errors),
"all_correct": len(errors) == 0,
"error_details": errors,
"requires_human_review": len(errors) > 0
}
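A quick sanity check on a short worked example with a deliberate arithmetic slip (the snippet is illustrative). Note that the rough claim extractor above only reliably validates lines that are bare arithmetic equalities; anything it cannot parse is reported as UNVERIFIABLE rather than incorrect:

```python
worked_example = """First compute the product:
6 * 7 = 42
Then add ten:
42 + 10 = 51"""

report = fact_check_educational_content(worked_example)
print(report["errors_found"])            # 1 -- the claim "42 + 10 = 51" is wrong
print(report["requires_human_review"])   # True
```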
Production Engineering Notes
Human review is not optional before student-facing deployment. AI-generated content should always pass through human expert review before being presented to students. The volume of generated content makes comprehensive review impractical - design a tiered review system: automated quality scoring filters obvious problems, human review covers 100% of flagged items and samples passing items at a fixed rate (e.g., 10%).
Version and track all generated content. When a student encounters a problem, record the exact generated text, the model version, the prompt, and the timestamp. If a content error is discovered, you need to identify all students who encountered it and whether corrective action is needed.
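A minimal sketch of such a provenance record (the field names and status vocabulary are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContentProvenanceRecord:
    """Everything needed to trace a delivered item back to its generation run."""
    content_id: str
    student_id: str
    generated_text: str
    prompt: str
    model_version: str               # exact model string used for generation
    generation_timestamp: str
    delivered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    review_status: str = "pending"   # pending | approved | flagged | retracted
```

With one record per delivery, retracting a faulty item reduces to querying for every student who saw that content_id.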
Set Bloom's level thresholds for each use case. Practice drills are appropriately Bloom's Level 1-3. Formative assessments should include Level 3-4 questions. Summative assessments for higher grades should reach Level 4-5. If your generation system produces only Level 1-2 content regardless of the requested level, your adaptive system will under-challenge capable students.
Monitor content diversity. If your question generator tends to produce structurally similar problems (same formula, different constants), students get less practice value. Track semantic similarity within generated batches and regenerate when diversity falls below a threshold.
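One cheap gate, as a sketch: mask the numbers in each problem so "2x + 3 = 7" and "2x + 5 = 9" compare as the same structure, then reject near-duplicates by token-overlap (Jaccard) similarity. Embedding-based similarity is a stronger signal in practice; the 0.8 threshold here is an assumption to tune.

```python
import re
from typing import List

def normalize_structure(problem: str) -> str:
    """Replace every number with a placeholder so structurally identical problems compare equal."""
    return re.sub(r"\d+(\.\d+)?", "N", problem.lower())

def jaccard_similarity(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def filter_structural_duplicates(problems: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a problem only if its number-masked form differs enough from already-kept problems."""
    kept, kept_norm = [], []
    for p in problems:
        norm = normalize_structure(p)
        if all(jaccard_similarity(norm, k) < threshold for k in kept_norm):
            kept.append(p)
            kept_norm.append(norm)
    return kept
```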
Copyright monitoring matters. Periodically audit generated content against known copyrighted sources using the same plagiarism detection tools you use for student work. LLMs can reproduce training data verbatim.
Common Mistakes
:::danger Deploying LLM-Generated Math Content Without Arithmetic Validation LLMs confidently generate incorrect arithmetic. A worked example with a calculation error is pedagogically harmful: students who learn from it build incorrect understanding, and when they get a problem wrong, they do not know why. Always run a mathematical validation step on any generated content with arithmetic - parse the expressions, execute them, and flag mismatches. This is automatable and should be mandatory before deployment. :::
:::danger Treating Reading Level Adaptation as Synonym Replacement Simplifying text at a lower reading level is not just substituting long words for short ones. A simplified explanation that removes the key mechanism ("photosynthesis converts light to sugar") in favor of a vague statement ("plants make their own food") conveys the outcome but not the process. Good reading level adaptation preserves the informational content while adjusting linguistic complexity. Always test adapted content by having a domain expert verify that the simplified version is still educationally complete. :::
:::warning Generating Questions at the Wrong Bloom's Level LLMs default to generating recall (Bloom's Level 1) questions unless explicitly constrained otherwise. A question like "What is the capital of France?" tests memory, not understanding. If your adaptive system needs Level 3-5 questions to challenge advanced students, validate generated questions against Bloom's level criteria. One approach: have a separate classifier predict the Bloom's level of a generated question and reject items that do not match the target level. :::
:::warning MCQ Distractors That Are Obviously Wrong Poorly generated distractors include answers that are the right type but wrong magnitude by a factor of 1000, answers that are grammatically incompatible with the question stem, or answers that are in a completely different domain. These do not probe understanding - a student can eliminate them without knowing the material. Validate distractor quality by checking: same format as correct answer, plausible magnitude, same grammatical structure as the question stem. :::
Interview Questions and Answers
Q1: How would you design a pipeline to generate high-quality MCQ questions for a K-12 math curriculum at scale?
A production pipeline would have these stages: First, define a content specification - for each topic at each grade level, specify target Bloom's level, known misconceptions (from learning science literature and teacher input), and difficulty range. Second, generate candidates using an instruction-tuned LLM with a structured prompt that includes topic, grade level, Bloom's level, and misconception list. Third, validate automatically - parse and execute any arithmetic in the question and answer, check Bloom's level with a classifier, validate distractor format. Fourth, human expert review of a sample and all flagged items. Fifth, pilot with a small student cohort and track performance (item discrimination index and difficulty) before full deployment.
The key constraint is that math content requires arithmetic validation that text content does not. I would make this validation blocking: no math question reaches students without arithmetic validation passing.
Q2: LLMs generate content at the wrong reading level despite explicit instructions. How do you fix this?
Several approaches: First, add reading level measurement to the generation loop. Generate, measure Flesch-Kincaid grade level, and if more than 2 grade levels off, regenerate with an adjusted temperature or explicit examples of target-level language. Second, few-shot prompting with examples at the target reading level - 3-5 examples of good explanations at the desired grade level dramatically improve calibration. Third, fine-tune on a dataset of grade-level-annotated educational texts - a fine-tuned Llama model on appropriately labeled curriculum data will be more reliably calibrated than a general-purpose instruction-following model. Fourth, two-stage: generate for correctness first, then pass through a separate reading level adaptation step.
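A sketch of the generate-measure-regenerate loop, reusing `adapt_reading_level` and its `adaptation_successful` flag from the earlier code example (the retry budget is an assumption):

```python
def adapt_with_retries(text: str, target_grade: int, llm_client, max_attempts: int = 3) -> dict:
    """Generate, measure, and regenerate until the adapted text lands near the target grade."""
    best = None
    for _ in range(max_attempts):
        result = adapt_reading_level(text, target_grade, llm_client)
        if result["adaptation_successful"]:
            return result
        # Keep the closest attempt so far in case nothing lands within tolerance
        if best is None or (abs(result["adapted_fk_grade"] - target_grade)
                            < abs(best["adapted_fk_grade"] - target_grade)):
            best = result
    return best
```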
Q3: What is the "worked example effect" and how does it inform AI content generation?
The worked example effect (Sweller and Cooper, 1985) is one of the most replicated findings in educational psychology: novices learn more efficiently from studying worked examples than from problem-solving. The reason is cognitive load theory: problem-solving requires searching a large solution space, consuming working memory that could otherwise be used for schema acquisition. Worked examples provide the solution path directly, freeing attention for understanding the underlying structure.
The implication for content generation: early in learning a new concept, generate more worked examples and fewer practice problems. As competence increases, fade the worked examples (worked example fading): first give fully worked examples, then examples with the last step missing, then the last two steps missing, until the student is solving fully independently. LLMs can generate both fully worked examples and partially completed examples if prompted correctly - the key is specifying exactly which steps to omit in the "faded" version.
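The fading itself can be driven from the prompt. One possible template, as a sketch (the wording and parameters are illustrative, not a standard format):

```python
FADED_EXAMPLE_PROMPT = """You are writing a worked example on {topic} for {grade_level} students.

Problem: {problem}

First write the complete step-by-step solution, explaining the reasoning for each step.
Then produce a "faded" version: keep the first {n_worked_steps} steps fully worked,
and replace each remaining step with the line "Your turn: complete this step."
Return only the faded version."""
```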
Q4: How do you handle copyright and intellectual property concerns for AI-generated educational content?
Several layers of protection: First, at the prompt level, instruct the model to generate original content rather than reproduce existing material, and to generate from principles rather than copying from training data. Second, post-generation deduplication against known copyrighted sources - run generated content through the same plagiarism detection system used for student work. Third, legal review of your terms of service and the LLM provider's terms to understand IP ownership of generated content. Fourth, do not use LLMs fine-tuned on specific proprietary curriculum data sets without clearing the licensing, as generated content may reproduce that training data.
The more substantive concern is accuracy, not copyright. A generated explanation that is technically original but pedagogically incorrect causes more harm than one that borrows from a well-written textbook. Copyright and accuracy concerns should both be addressed in the review pipeline, but accuracy should be the higher priority.
Q5: You generate 10,000 practice problems and discover that most of them are structurally identical (same formula, different numbers). How do you increase diversity?
The problem is that LLMs, given a vague prompt like "generate practice problems for linear equations", default to the most common problem structure in their training data. Fix this with explicit diversity constraints: maintain a catalog of problem structures (one-variable linear, two-variable system, word problem, graph interpretation) and distribute generation requests across structures. Use semantic similarity to detect near-duplicate problems in the generated batch and reject them. Vary the prompt systematically: different contexts (science application, real-world scenario, abstract algebra), different operations (add to both sides, subtract, multiply, divide), different surface presentations (standard form, word problem, table). Track problem structure distribution in the deployed catalog and set alerts when any single structure exceeds 40% of the total.
Summary
Educational content generation with LLMs offers a solution to the content scale problem in adaptive learning: the ability to generate diverse, level-appropriate, taxonomy-aligned content on demand rather than from a fixed catalog. The pipeline requires careful attention to quality: Bloom's taxonomy alignment through structured prompting, reading level validation and adaptation, arithmetic validation for math content, distractor quality for MCQs, and a human review layer that every item must pass before reaching students. LLM hallucination in educational content is more harmful than in general use because students trust educational sources - the mitigation layers are not optional optimizations, they are prerequisites for responsible deployment.
