Adaptive Learning Systems

Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, EdTech Engineer, Learning Science Engineer

Opening: The Wrong Way to Teach Everyone

In 2014, a large US school district deployed a tablet-based learning platform to 47,000 students across all grade levels. The platform was praised in press releases, featured in education conferences, and celebrated as a digital transformation success story. Two years later, a randomized controlled trial found that students using the platform performed no better on standardized tests than students in traditional classrooms. In some subgroups they performed worse.

The post-mortem revealed a familiar failure mode. The platform delivered the same sequence of content to every student, just with a nicer interface. A student who already understood fractions spent thirty minutes on fraction basics before moving to something new. A student struggling with the concept of a variable was served algebra problems with no bridging explanation. The technology was modern. The pedagogical model was the same one-size-fits-all approach that classrooms had used for 150 years, digitized.

Adaptive learning is the attempt to fix this. The fundamental insight is that education has always been most effective one-on-one - the tutor who watches a student struggle with one step, explains it differently, and moves on when understanding clicks. The Socratic method is adaptive. The great teachers are adaptive. What has been missing is the infrastructure to deliver that kind of personalization at scale, to millions of simultaneous learners, without a human tutor in the room.

The challenge is hard because student knowledge is not directly observable. You cannot look inside a student's head and see whether they understand the Pythagorean theorem. You can only observe behavior - right answers, wrong answers, time spent, hints requested, retries, quitting. From this incomplete, noisy signal you must infer a latent knowledge state, predict what content will produce the most learning, and sequence accordingly. This is a statistical inference problem wrapped inside a pedagogical theory wrapped inside a product with real students.

Modern adaptive learning systems draw from three intellectual traditions: psychometric theory (how to measure latent ability from test responses), cognitive science (how learning actually works - spacing, interleaving, desirable difficulties), and machine learning (how to estimate parameters and make decisions under uncertainty). Getting all three right simultaneously is what separates systems that produce measurable learning gains from systems that just feel personalized.

This lesson covers the technical underpinnings of adaptive learning - item response theory, computerized adaptive testing, spaced repetition scheduling, knowledge space theory, multi-armed bandits for content selection, and how real systems like Duolingo and ALEKS translate theory into production code.


Why This Exists: The Problem with Static Curricula

Before adaptive systems, educational technology was essentially a content delivery problem. Put a textbook online. Add video. Add a quiz at the end of each chapter. This model fails for a simple reason: the optimal next item to show a student depends entirely on what that student already knows, and static curricula cannot account for that.

Consider two students in the same algebra course. Student A comes in knowing pre-algebra cold but is shaky on negative numbers. Student B is comfortable with negative numbers but has gaps in order of operations. A fixed sequence serves neither well. Student A wastes time on content they know and hits a wall when the hidden gap matters. Student B builds a shaky foundation on top of an unresolved prerequisite.

The learning science literature is clear on the principles that adaptive systems implement. Vygotsky's Zone of Proximal Development (ZPD) states that optimal learning happens at the edge of current competence - not below it (boredom, no growth) and not far above it (frustration, no purchase). This creates a targeting problem: you need to estimate where the ZPD edge is for each student on each concept at each moment, then find content that lands in that window.

Bloom's Taxonomy (1956, revised 2001) provides a hierarchy of cognitive levels - remember, understand, apply, analyze, evaluate, create - and the insight that moving up this hierarchy requires mastery of lower levels. An adaptive system can use this as a sequencing constraint: a student should not be asked to analyze something they cannot yet recall or understand.

Spaced repetition, grounded in Ebbinghaus's forgetting curve research from the 1880s, shows that the optimal time to review material is just before you forget it. This means the sequence of content is not just about introducing new material - it is also about re-surfacing old material at the right intervals. A pure mastery-based system without spaced repetition produces short-term performance gains that evaporate.

The adaptive learning problem, formally stated: given a student's interaction history $H_t = \{(q_1, a_1, t_1), (q_2, a_2, t_2), \ldots\}$ where $q_i$ is an item, $a_i \in \{0,1\}$ is correctness, and $t_i$ is a timestamp, select the next item $q_{t+1}$ from a catalog to maximize expected learning gain over some horizon.


Historical Context: From Programmed Instruction to Intelligent Systems

The idea of adaptive instruction is not new. Sidney Pressey built a mechanical "teaching machine" in 1924 that gave immediate feedback on multiple-choice questions - wrong answers required re-selection before advancing. B.F. Skinner formalized this into "programmed instruction" in the 1950s: break content into small frames, have students respond, reinforce correct responses, branch based on performance.

Item Response Theory was formalized by Lord and Novick in 1968, giving researchers a principled statistical framework for estimating latent ability from item responses, and the first computational adaptive testing systems were built on it through the 1970s. The Armed Services Vocational Aptitude Battery (ASVAB) moved to computerized adaptive testing in 1996 - the first large-scale deployment of CAT for high-stakes assessment.

ALEKS (Assessment and Learning in Knowledge Spaces) launched commercially in 1999, bringing Knowledge Space Theory from Doignon and Falmagne's 1985 mathematical work into a product used by millions of K-12 and college students. ALEKS treats knowledge as a combinatorial structure: which subsets of topics can a student know simultaneously given prerequisites.

The deep learning era began transforming adaptive learning around 2015 when Piech et al. published Deep Knowledge Tracing, replacing the hand-crafted hidden Markov models of Bayesian Knowledge Tracing with LSTMs that could learn knowledge dynamics from interaction data. This opened the door to end-to-end learned student models rather than psychometrically hand-designed ones.

Duolingo published their Half-Life Regression (HLR) paper in 2016, showing how spaced repetition could be learned from data rather than set by fixed formulas. This was one of the first papers to take a familiar cognitive science algorithm and show that its parameters could be optimized from millions of user sessions.


Core Concepts

Item Response Theory (IRT)

IRT models the probability that a student with ability $\theta$ answers item $i$ correctly as a function of item parameters. The family of models ranges from simple to complex.

1PL (Rasch Model)

$$P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}}$$

Only one item parameter: $b_i$, the difficulty. A student's probability of getting item $i$ right depends only on the difference between their ability and the item's difficulty. Elegant and interpretable, but assumes all items discriminate equally well.

2PL Model

$$P(X_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}}$$

Adds $a_i$, the discrimination parameter. High-discrimination items (steep slope) sharply separate students above and below the difficulty threshold. Low-discrimination items provide less information.

3PL Model

$$P(X_{ij} = 1 \mid \theta_j, a_i, b_i, c_i) = c_i + (1 - c_i) \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}}$$

Adds $c_i$, the pseudo-guessing parameter - the probability of a correct response for students of very low ability (i.e., chance on multiple-choice). This is important for high-stakes testing where guessing is a real behavior.

The information function $I(\theta)$ of an item measures how much that item reduces uncertainty about a student's ability at a given $\theta$ level:

$$I_i(\theta) = \frac{[P'_i(\theta)]^2}{P_i(\theta)\, Q_i(\theta)}$$

where $P'_i(\theta)$ is the derivative of the item characteristic curve and $Q_i(\theta) = 1 - P_i(\theta)$. The Fisher information is maximized when $\theta \approx b_i$ - an item is most informative when it is appropriately difficult for the student. This is the fundamental insight behind CAT.

Computerized Adaptive Testing (CAT)

CAT dynamically selects the next item to administer based on the current estimate of student ability. The algorithm:

  1. Initialize $\hat{\theta}_0$ (usually the population mean, e.g., 0)
  2. Select item $i^* = \arg\max_i I_i(\hat{\theta})$ from the item bank
  3. Administer item $i^*$, observe response $x$
  4. Update $\hat{\theta}$ using Maximum Likelihood Estimation or Expected A Posteriori
  5. Compute standard error $SE(\hat{\theta})$; if below threshold, terminate; else go to 2

Maximum Likelihood Estimation of $\theta$ after responses $\mathbf{x}$:

$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_i P_i(\theta)^{x_i} (1 - P_i(\theta))^{1 - x_i}$$

EAP (Expected A Posteriori) incorporates a prior:

$$\hat{\theta}_{EAP} = \frac{\int \theta \cdot L(\theta \mid \mathbf{x})\, \pi(\theta)\, d\theta}{\int L(\theta \mid \mathbf{x})\, \pi(\theta)\, d\theta}$$
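
Since the MLE diverges when a student's responses so far are all correct or all incorrect, EAP is the safer estimator early in a session. Below is a minimal sketch of EAP via grid quadrature over a standard normal prior, reusing the 2PL form; the function name and grid resolution are illustrative, not from any particular system.

```python
import numpy as np

def estimate_ability_eap(responses, item_params, grid=None):
    """EAP estimate of theta: posterior mean over a discretized N(0,1) prior."""
    if grid is None:
        grid = np.linspace(-4, 4, 161)
    prior = np.exp(-grid ** 2 / 2)           # unnormalized standard normal prior
    likelihood = np.ones_like(grid)
    for item_idx, correct in responses:
        a, b = item_params[item_idx]
        p = 1.0 / (1.0 + np.exp(-a * (grid - b)))  # 2PL P(correct) at each grid point
        likelihood *= p if correct else 1 - p
    posterior = likelihood * prior
    posterior /= posterior.sum()
    theta_eap = float(np.sum(grid * posterior))
    posterior_sd = float(np.sqrt(np.sum((grid - theta_eap) ** 2 * posterior)))
    return theta_eap, posterior_sd
```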

A 30-item CAT can achieve the measurement precision of a 100-item fixed test by focusing item selection on the region of $\theta$ where the student actually is. This 3x efficiency gain is why CAT dominates modern standardized testing.

Knowledge Space Theory and ALEKS

Knowledge Space Theory (Doignon and Falmagne, 1985) takes a different view of knowledge. Instead of a continuous latent variable, it represents knowledge as a combinatorial structure. A knowledge state is a subset of topics $K \subseteq Q$ that a student knows. Not all subsets are valid knowledge states - prerequisites create constraints.

A knowledge space is the set of all valid states: $\mathcal{K} \subseteq 2^Q$, where $\mathcal{K}$ is closed under union (the union of two valid states is itself a valid state). The outer fringe of a state $K$ is the set of topics not in $K$ that are learnable given $K$ (all prerequisites satisfied). The inner fringe is the set of topics in $K$ that could be the most recently learned.
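
To make the union-closure axiom concrete, here is a toy check (a hypothetical helper, not ALEKS code) over a five-state space on three topics where multiplication requires addition:

```python
from itertools import combinations

def is_union_closed(states):
    """KST axiom check: the union of any two valid knowledge states is also valid."""
    state_set = {frozenset(s) for s in states}
    return all(a | b in state_set for a, b in combinations(state_set, 2))

# Toy space over Q = {add, sub, mult}, where mult requires add:
space = [set(), {"add"}, {"add", "sub"}, {"add", "mult"}, {"add", "sub", "mult"}]
print(is_union_closed(space))  # True
```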

ALEKS uses this structure to: (1) assess which state the student is in through an adaptive assessment, (2) recommend items from the outer fringe (the ZPD), (3) track mastery as students move through the knowledge space.

Mastery Learning and Prerequisite Graphs

Mastery-based progression draws from Carroll's and Bloom's work in the 1960s: every student can achieve mastery of a concept given sufficient time and appropriate instruction. The key design decisions are the mastery threshold - what constitutes "knowing" a concept - and the prerequisite graph structure.

A prerequisite graph is $G = (V, E)$, where $V$ is the set of concepts and $(u, v) \in E$ means $u$ must be mastered before $v$. Topological sort gives a valid linear ordering, but the DAG structure enables branching paths and remediation loops.

In practice, prerequisite relationships are not binary. A soft prerequisite means the downstream concept is harder without the upstream one but not impossible. Mastery systems must handle this gracefully rather than blocking all forward progress on an unresolved prerequisite.
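
One way to implement soft prerequisites, sketched under an assumed penalty form (this is an illustration, not a published method): treat unmastered prerequisites as a difficulty shift on downstream items rather than a gate.

```python
def effective_difficulty(base_b: float, prereq_mastery: float, penalty: float = 1.0) -> float:
    """Soft prerequisite: an unmastered prereq makes an item act harder, not inaccessible.

    base_b: calibrated IRT difficulty of the downstream item
    prereq_mastery: estimated mastery of the upstream concept, in [0, 1]
    penalty: difficulty shift for a fully unmastered prereq (illustrative value)
    """
    return base_b + penalty * (1.0 - prereq_mastery)

# A b=0.5 item behaves like b=1.3 for a student at 20% mastery of the prerequisite:
print(effective_difficulty(0.5, 0.2))  # 1.3
```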

Spaced Repetition: The SM-2 Algorithm

The SuperMemo SM-2 algorithm (Wozniak, 1987) is the foundation of Anki, Duolingo, and dozens of other systems. It schedules reviews based on a personal ease factor:

  • After each review, rate quality $q \in \{0, \ldots, 5\}$ (0-2 = failed, 3-5 = passed)
  • Update ease factor: $EF' = EF + (0.1 - (5 - q)(0.08 + (5 - q) \cdot 0.02))$, with a minimum EF of 1.3
  • If $q < 3$: reset interval to 1 day (failed - relearn)
  • If first review: interval = 1
  • If second review: interval = 6
  • Else: interval = prev_interval * EF'

The interval grows geometrically: with the default EF of 2.5, an easy card is reviewed at intervals of roughly 1, 6, 15, 38, 95 days, while a hard card stays at shorter intervals. This approximates the optimal review schedule derived from Ebbinghaus's forgetting curve.

Duolingo's Half-Life Regression (HLR)

Duolingo's 2016 contribution was to make the spacing schedule learnable from data. Instead of fixed forgetting curves, HLR fits:

$$p(\text{recall}) = 2^{-\Delta/h}$$

where $\Delta$ is the time since the last review and $h$ is the half-life - how long until recall probability drops to 50%. The half-life is not fixed but is a function of features:

$$\log_2 h = \mathbf{w}^\top \phi(x)$$

where $\phi(x)$ includes lexeme features, the number of previous correct recalls, the number of incorrect recalls, and time-based features. Trained on 13 million Duolingo interactions, HLR outperforms SM-2 because it learns that some words are inherently harder to retain, some users forget faster, and some features predict forgetting more accurately than a fixed formula.
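
A minimal sketch of the HLR forward pass and one gradient step on squared recall error. This is simplified: the published model's loss also includes a half-life error term and L2 regularization, and the feature vector here is illustrative.

```python
import numpy as np

def hlr_recall(w, x, delta):
    """HLR prediction: p = 2^(-delta/h), where log2(h) = w . x."""
    h = 2.0 ** np.dot(w, x)
    return 2.0 ** (-delta / h), h

def hlr_sgd_step(w, x, delta, observed, lr=0.01):
    """One SGD step on (p - observed)^2."""
    p, h = hlr_recall(w, x, delta)
    # dp/dw = p * (ln 2)^2 * (delta / h) * x, by the chain rule through log2 h
    dp_dw = p * (np.log(2) ** 2) * (delta / h) * x
    return w - lr * 2 * (p - observed) * dp_dw

# Features: [bias, # previous correct recalls, # previous incorrect recalls] (illustrative)
w = np.zeros(3)
w = hlr_sgd_step(w, np.array([1.0, 3.0, 1.0]), delta=2.0, observed=1.0)
print(w)
```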

Multi-Armed Bandit for Content Selection

When you do not know in advance which content sequence will produce the best learning outcome, you face an exploration-exploitation tradeoff. Multi-armed bandits provide a principled framework.

Thompson Sampling for content selection:

  • Each content item $i$ has a prior $\text{Beta}(\alpha_i, \beta_i)$ on its "learning reward" probability
  • At each decision: sample $\tilde{p}_i \sim \text{Beta}(\alpha_i, \beta_i)$ for each arm; select $\arg\max_i \tilde{p}_i$
  • Observe reward (e.g., mastery achieved within next N items); update the posterior

The challenge in education is delayed, noisy rewards. Learning gain from item $i$ may only be measurable via a post-test hours or days later. Contextual bandits address this: instead of a single reward estimate per arm, estimate reward as a function of context (student features, current knowledge state).

LinUCB is a practical contextual bandit: estimate reward $r = \mathbf{x}^\top \boldsymbol{\theta}_a$, where $\mathbf{x}$ is the context vector and $\boldsymbol{\theta}_a$ is an arm-specific parameter vector, with a UCB exploration bonus.
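
A compact sketch of the disjoint-arms LinUCB formulation (class name and exploration weight are illustrative): each arm keeps ridge-regression sufficient statistics and is scored by predicted reward plus an uncertainty bonus.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression state for disjoint LinUCB."""
    def __init__(self, d: int, alpha: float = 1.0):
        self.A = np.eye(d)      # accumulates x x^T plus the ridge identity term
        self.b = np.zeros(d)    # accumulates reward-weighted contexts
        self.alpha = alpha      # exploration weight

    def score(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                # ridge estimate of arm parameters
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms: dict, x: np.ndarray) -> str:
    """Pick the arm with the highest UCB score for context x."""
    return max(arms, key=lambda arm_id: arms[arm_id].score(x))
```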

Collaborative Filtering for Educational Content

Students can be grouped by learning patterns, and content effective for similar students is likely effective for the current student. This is the CF insight applied to education.

Matrix factorization operates on the student-content interaction matrix $R$, where $R_{ij} = 1$ if student $i$ achieved mastery after engaging with content $j$:

$$\min_{P,Q} \sum_{(i,j) \in \Omega} (R_{ij} - \mathbf{p}_i^\top \mathbf{q}_j)^2 + \lambda(\|\mathbf{p}_i\|^2 + \|\mathbf{q}_j\|^2)$$

The learned embeddings $\mathbf{p}_i$ and $\mathbf{q}_j$ capture student and content "compatibility" in a latent space. This can be combined with knowledge state information to recommend content that is both well-matched to ability and likely to be effective based on similar students.
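
A minimal SGD sketch of this objective over observed (student, content, mastered) triples; the dimensions and hyperparameters are illustrative.

```python
import numpy as np

def mf_sgd(triples, n_students, n_items, k=8, lr=0.02, reg=0.1, epochs=20, seed=0):
    """Factorize R ~ P Q^T by SGD over the observed entries (the Omega set)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_students, k))  # student embeddings p_i
    Q = rng.normal(0, 0.1, (n_items, k))     # content embeddings q_j
    for _ in range(epochs):
        for i, j, r in triples:              # r = 1 if mastery followed engagement
            err = r - P[i] @ Q[j]
            p_old = P[i].copy()
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return P, Q

# Toy usage: two students, two content items
P, Q = mf_sgd([(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0)], n_students=2, n_items=2)
```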


Mermaid Diagram: Adaptive Learning System Architecture
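
A sketch of the pipeline this lesson describes: interaction logs feed the student model, which combines with the prerequisite graph and the review scheduler to drive item selection, while evaluation runs on the same logs.

```mermaid
flowchart LR
    S["Student"] -->|answers, latency, hints| LOG["Interaction Logs"]
    LOG --> SM["Student Model: IRT theta / knowledge state"]
    SM --> SEL["Item Selection: CAT + bandit + ZPD window"]
    PG["Prerequisite Graph / Knowledge Space"] --> SEL
    SR["Spaced Repetition Scheduler: SM-2 / HLR"] --> SEL
    SEL -->|next item| S
    LOG --> EVAL["Evaluation: learning gain, holdout post-tests"]
```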


Code Examples

IRT 2PL Model Implementation

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # sigmoid

def irt_2pl_probability(theta, a, b):
    """
    Probability of correct response under 2PL IRT model.

    Args:
        theta: student ability (scalar or array)
        a: item discrimination parameter
        b: item difficulty parameter
    Returns:
        P(correct | theta, a, b)
    """
    return expit(a * (theta - b))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability level theta."""
    p = irt_2pl_probability(theta, a, b)
    q = 1 - p
    return (a ** 2) * p * q

def estimate_ability_mle(responses, item_params, theta_init=0.0):
    """
    Maximum Likelihood Estimation of student ability.

    Args:
        responses: list of (item_idx, correct) tuples
        item_params: array of shape (n_items, 2) with [a, b] per item
        theta_init: starting estimate
    Returns:
        theta_hat: MLE ability estimate
        se: standard error
    """
    def neg_log_likelihood(theta):
        ll = 0.0
        for item_idx, correct in responses:
            a, b = item_params[item_idx]
            p = irt_2pl_probability(theta[0], a, b)
            p = np.clip(p, 1e-9, 1 - 1e-9)
            ll += correct * np.log(p) + (1 - correct) * np.log(1 - p)
        return -ll

    result = minimize(neg_log_likelihood, [theta_init], method='L-BFGS-B',
                      bounds=[(-4, 4)])
    theta_hat = result.x[0]

    # SE from observed Fisher information
    info = sum(item_information(theta_hat, item_params[i][0], item_params[i][1])
               for i, _ in responses)
    se = 1.0 / np.sqrt(info) if info > 0 else np.inf

    return theta_hat, se


class ComputerizedAdaptiveTest:
    """
    Simple CAT implementation using maximum information item selection.
    """
    def __init__(self, item_params, se_threshold=0.30, max_items=30):
        """
        Args:
            item_params: np.array of shape (n_items, 2) - [a, b] for each item
            se_threshold: stop when SE of ability estimate drops below this
            max_items: maximum items to administer
        """
        self.item_params = item_params
        self.se_threshold = se_threshold
        self.max_items = max_items
        self.administered = []  # list of item indices
        self.responses = []     # list of (item_idx, correct) tuples
        self.theta_hat = 0.0
        self.se = np.inf

    def select_next_item(self):
        """Select item with maximum Fisher information at current theta estimate."""
        administered_set = set(self.administered)
        best_item = None
        best_info = -1

        for i in range(len(self.item_params)):
            if i in administered_set:
                continue
            a, b = self.item_params[i]
            info = item_information(self.theta_hat, a, b)
            if info > best_info:
                best_info = info
                best_item = i

        return best_item

    def administer_item(self, item_idx, correct):
        """Record a response and update ability estimate."""
        self.administered.append(item_idx)
        self.responses.append((item_idx, correct))
        self.theta_hat, self.se = estimate_ability_mle(
            self.responses, self.item_params, self.theta_hat
        )

    def should_terminate(self):
        return (self.se < self.se_threshold or
                len(self.administered) >= self.max_items)

    def run(self, true_responses_fn):
        """
        Run full CAT. true_responses_fn(item_idx) -> 0 or 1.
        In production, this is replaced by actual student interaction.
        """
        while not self.should_terminate():
            item = self.select_next_item()
            if item is None:
                break
            correct = true_responses_fn(item)
            self.administer_item(item, correct)

        return self.theta_hat, self.se, len(self.administered)


# --- Demo usage ---
np.random.seed(42)
n_items = 200
# Generate item bank: a in [0.5, 2.5], b in [-3, 3]
item_params = np.column_stack([
    np.random.uniform(0.5, 2.5, n_items),
    np.random.uniform(-3, 3, n_items)
])

# Simulate a student with true ability = 1.2
true_theta = 1.2
def simulated_response(item_idx):
    a, b = item_params[item_idx]
    p = irt_2pl_probability(true_theta, a, b)
    return int(np.random.random() < p)

cat = ComputerizedAdaptiveTest(item_params, se_threshold=0.30, max_items=30)
theta_est, se, n_items_used = cat.run(simulated_response)
print(f"True theta: {true_theta:.2f}")
print(f"Estimated theta: {theta_est:.2f} (SE={se:.3f})")
print(f"Items used: {n_items_used} of {n_items}")
```

SM-2 Spaced Repetition Scheduler

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class CardState:
    card_id: str
    ease_factor: float = 2.5   # starts at 2.5
    interval: int = 1          # days until next review
    repetitions: int = 0       # successful reviews in a row
    due_date: datetime = field(default_factory=datetime.now)

    def is_due(self, now: Optional[datetime] = None) -> bool:
        if now is None:
            now = datetime.now()
        return now >= self.due_date


def sm2_update(card: CardState, quality: int, review_time: Optional[datetime] = None) -> CardState:
    """
    Update card state using SM-2 algorithm.

    Args:
        card: current card state
        quality: response quality 0-5
            0-2: failed (forgot), reset
            3: correct but with difficulty
            4: correct with some hesitation
            5: perfect recall
        review_time: when review occurred (default: now)

    Returns:
        updated CardState
    """
    if review_time is None:
        review_time = datetime.now()

    # Clamp quality
    quality = max(0, min(5, quality))

    # Update ease factor
    new_ef = card.ease_factor + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    new_ef = max(1.3, new_ef)  # minimum ease factor

    if quality < 3:
        # Failed - reset to relearning
        new_interval = 1
        new_repetitions = 0
    else:
        # Passed - increase interval
        new_repetitions = card.repetitions + 1
        if new_repetitions == 1:
            new_interval = 1
        elif new_repetitions == 2:
            new_interval = 6
        else:
            new_interval = round(card.interval * new_ef)

    new_due = review_time + timedelta(days=new_interval)

    return CardState(
        card_id=card.card_id,
        ease_factor=new_ef,
        interval=new_interval,
        repetitions=new_repetitions,
        due_date=new_due
    )


class SpacedRepetitionScheduler:
    """
    SM-2 based spaced repetition system for a deck of cards.
    """
    def __init__(self):
        self.cards: dict[str, CardState] = {}

    def add_card(self, card_id: str):
        self.cards[card_id] = CardState(card_id=card_id)

    def get_due_cards(self, now: Optional[datetime] = None, limit: int = 20) -> List[str]:
        """Return IDs of cards due for review, sorted by overdueness."""
        if now is None:
            now = datetime.now()
        due = [c for c in self.cards.values() if c.is_due(now)]
        # Sort: most overdue first
        due.sort(key=lambda c: c.due_date)
        return [c.card_id for c in due[:limit]]

    def record_review(self, card_id: str, quality: int, review_time: Optional[datetime] = None):
        """Record a review result and update the card's schedule."""
        if card_id not in self.cards:
            raise ValueError(f"Unknown card: {card_id}")
        self.cards[card_id] = sm2_update(self.cards[card_id], quality, review_time)

    def retention_rate(self, now: Optional[datetime] = None) -> float:
        """Estimate current retention as fraction of cards not overdue by > 1 day."""
        if now is None:
            now = datetime.now()
        if not self.cards:
            return 1.0
        on_time = sum(1 for c in self.cards.values()
                      if (now - c.due_date).days <= 1)
        return on_time / len(self.cards)
```

Prerequisite Graph Construction and Traversal

```python
from collections import defaultdict, deque
from typing import Dict, List, Set

class PrerequisiteGraph:
    """
    DAG of learning concepts with prerequisite relationships.
    Used for mastery-based sequencing and ZPD computation.
    """
    def __init__(self):
        self.prerequisites: Dict[str, Set[str]] = defaultdict(set)
        self.dependents: Dict[str, Set[str]] = defaultdict(set)
        self.all_concepts: Set[str] = set()

    def add_prerequisite(self, concept: str, prereq: str):
        """prereq must be mastered before concept can be learned."""
        self.prerequisites[concept].add(prereq)
        self.dependents[prereq].add(concept)
        self.all_concepts.add(concept)
        self.all_concepts.add(prereq)

    def get_outer_fringe(self, mastered: Set[str]) -> Set[str]:
        """
        ALEKS outer fringe: concepts not yet mastered whose
        prerequisites are all mastered (learnable next).
        This is the Zone of Proximal Development.
        """
        fringe = set()
        for concept in self.all_concepts:
            if concept in mastered:
                continue
            if self.prerequisites[concept].issubset(mastered):
                fringe.add(concept)
        return fringe

    def get_inner_fringe(self, mastered: Set[str]) -> Set[str]:
        """
        ALEKS inner fringe: mastered concepts that could be
        the most recently learned (no mastered concept depends
        on them within the mastered set).
        """
        inner = set()
        for concept in mastered:
            if not self.dependents[concept].intersection(mastered):
                inner.add(concept)
        return inner

    def topological_sort(self) -> List[str]:
        """Return concepts in valid learning order (prerequisites first)."""
        in_degree = {c: len(self.prerequisites[c]) for c in self.all_concepts}
        queue = deque([c for c, d in in_degree.items() if d == 0])
        order = []

        while queue:
            concept = queue.popleft()
            order.append(concept)
            for dep in self.dependents[concept]:
                in_degree[dep] -= 1
                if in_degree[dep] == 0:
                    queue.append(dep)

        if len(order) != len(self.all_concepts):
            raise ValueError("Prerequisite graph has a cycle!")
        return order

    def remediation_path(self, target: str, mastered: Set[str]) -> List[str]:
        """
        Find the shortest path to mastery of target,
        given the current mastered set.
        """
        if target in mastered:
            return []

        # BFS from target backwards through prerequisites
        needed = []
        queue = deque([target])
        visited = set()

        while queue:
            concept = queue.popleft()
            if concept in visited or concept in mastered:
                continue
            visited.add(concept)
            needed.append(concept)
            for prereq in self.prerequisites[concept]:
                if prereq not in mastered:
                    queue.append(prereq)

        # Return in topological order
        needed_set = set(needed)
        return [c for c in self.topological_sort() if c in needed_set]


# --- Demo: algebra prerequisite graph ---
g = PrerequisiteGraph()
g.add_prerequisite("negative_numbers", "integers")
g.add_prerequisite("fractions", "integers")
g.add_prerequisite("order_of_operations", "integers")
g.add_prerequisite("algebra_basics", "negative_numbers")
g.add_prerequisite("algebra_basics", "order_of_operations")
g.add_prerequisite("linear_equations", "algebra_basics")
g.add_prerequisite("quadratic_equations", "linear_equations")
g.add_prerequisite("quadratic_equations", "fractions")

mastered = {"integers", "negative_numbers", "fractions"}
fringe = g.get_outer_fringe(mastered)
print(f"Current ZPD (learnable next): {fringe}")
# Output: {'order_of_operations'}

path = g.remediation_path("quadratic_equations", mastered)
print(f"Path to quadratic equations: {path}")
```

Multi-Armed Bandit for Content Selection

```python
import numpy as np
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ContentArm:
    content_id: str
    alpha: float = 1.0   # Beta prior successes
    beta: float = 1.0    # Beta prior failures

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def sample(self) -> float:
        return np.random.beta(self.alpha, self.beta)

    def update(self, reward: float):
        """reward: 1 if mastery achieved, 0 otherwise"""
        self.alpha += reward
        self.beta += (1 - reward)


class ThompsonSamplingContentSelector:
    """
    Thompson Sampling for adaptive content selection.
    Selects content that maximizes expected learning reward
    while exploring uncertain options.
    """
    def __init__(self, content_ids: List[str]):
        self.arms: Dict[str, ContentArm] = {
            cid: ContentArm(content_id=cid) for cid in content_ids
        }

    def select(self, available_ids: List[str]) -> str:
        """
        Select content from available items using Thompson Sampling.
        available_ids: subset of content IDs eligible for this student.
        """
        samples = {cid: self.arms[cid].sample()
                   for cid in available_ids if cid in self.arms}
        if not samples:
            raise ValueError("No available content to select from")
        return max(samples, key=samples.get)

    def record_outcome(self, content_id: str, mastery_achieved: bool):
        """Update the arm's posterior based on the observed outcome."""
        if content_id in self.arms:
            self.arms[content_id].update(float(mastery_achieved))

    def get_estimates(self) -> List[Tuple[str, float]]:
        """Return (content_id, mean_reward) sorted descending."""
        return sorted(
            [(cid, arm.mean) for cid, arm in self.arms.items()],
            key=lambda x: -x[1]
        )
```

Learning Gain Evaluation

```python
import numpy as np
from scipy import stats

def normalized_learning_gain(pre_score: float, post_score: float, max_score: float = 1.0) -> float:
    """
    Hake's normalized learning gain: measures how much students improved
    relative to how much they could have improved.

    g = (post - pre) / (max - pre)

    g > 0.7: high gain
    0.3 < g < 0.7: medium gain
    g < 0.3: low gain
    """
    if pre_score >= max_score:
        return 0.0  # already at ceiling
    return (post_score - pre_score) / (max_score - pre_score)

def evaluate_adaptive_system(
    treatment_pre: np.ndarray,
    treatment_post: np.ndarray,
    control_pre: np.ndarray,
    control_post: np.ndarray
) -> dict:
    """
    Evaluate learning gain of adaptive system vs control.

    Args:
        treatment_pre/post: pre/post scores for adaptive system group
        control_pre/post: pre/post scores for control group
    Returns:
        dict of evaluation metrics
    """
    # Normalized learning gains
    treatment_gains = np.array([
        normalized_learning_gain(pre, post)
        for pre, post in zip(treatment_pre, treatment_post)
    ])
    control_gains = np.array([
        normalized_learning_gain(pre, post)
        for pre, post in zip(control_pre, control_post)
    ])

    # Two-sample t-test
    t_stat, p_value = stats.ttest_ind(treatment_gains, control_gains)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt((treatment_gains.std() ** 2 + control_gains.std() ** 2) / 2)
    cohens_d = (treatment_gains.mean() - control_gains.mean()) / pooled_std

    # Time-to-mastery comparison could be added here if timing data is available

    return {
        "treatment_mean_gain": treatment_gains.mean(),
        "control_mean_gain": control_gains.mean(),
        "gain_difference": treatment_gains.mean() - control_gains.mean(),
        "p_value": p_value,
        "cohens_d": cohens_d,
        "significant_at_05": p_value < 0.05,
        # Cohen's conventional cutoffs: 0.2 small, 0.5 medium, 0.8 large
        "effect_size_interpretation": (
            "negligible" if abs(cohens_d) < 0.2 else
            "small" if abs(cohens_d) < 0.5 else
            "medium" if abs(cohens_d) < 0.8 else "large"
        )
    }

# Demo
np.random.seed(42)
n = 200
pre_t = np.random.beta(3, 5, n)   # treatment group pre-scores
post_t = np.clip(pre_t + np.random.beta(4, 4, n) * (1 - pre_t), 0, 1)  # improved
pre_c = np.random.beta(3, 5, n)   # control group pre-scores
post_c = np.clip(pre_c + np.random.beta(2, 5, n) * (1 - pre_c), 0, 1)  # less improvement

results = evaluate_adaptive_system(pre_t, post_t, pre_c, post_c)
for k, v in results.items():
    print(f"{k}: {v:.3f}" if isinstance(v, float) else f"{k}: {v}")
```

Production Engineering Notes

Data collection is the foundation. Before building any adaptive algorithm, instrument every interaction: question ID, response correctness, response latency, hint requests, retries, session timestamp, device. Without rich interaction logs, you cannot train or evaluate any student model. Build the logging infrastructure first.

The cold start problem is real. New students have no history, and new content items have no response data. Address student cold start with a short diagnostic assessment (5-10 CAT items) before the first learning session. Address item cold start by seeding new items into live sessions at low stakes, collecting responses across the ability range until their parameters can be calibrated.

Calibrate item parameters continuously. IRT parameters estimated at launch become stale as curriculum changes. Run periodic re-calibration jobs against fresh response data. Flag items whose empirical difficulty has drifted more than 0.5 standard deviations from their calibrated value.
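
A sketch of that drift check (the function name and SD-unit convention are illustrative): compare each item's freshly estimated difficulty against its calibrated value, in standard-deviation units of the calibrated difficulty distribution.

```python
import numpy as np

def flag_drifted_items(calibrated_b: np.ndarray, fresh_b: np.ndarray,
                       threshold_sd: float = 0.5) -> np.ndarray:
    """Return indices of items whose re-estimated difficulty drifted more than
    threshold_sd standard deviations (of the calibrated b distribution)."""
    sd = np.std(calibrated_b)
    drift = np.abs(fresh_b - calibrated_b) / sd
    return np.where(drift > threshold_sd)[0]
```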

Decouple assessment from learning. The items you use to estimate ability (CAT) should be different from the items you use for learning and practice. Mixing them creates contamination where students who see items repeatedly during learning perform better on "assessment" items they have already seen.

Do not optimize for engagement. This is the hardest constraint to enforce in a commercial edtech product. Engagement metrics (time-on-platform, sessions per week, items completed) are available immediately. Learning gains require delayed post-tests. Your adaptive algorithm must be evaluated on the latter. Build a holdout group that gets post-tests even when the product does not require them.

Prerequisite graph maintenance is ongoing work. Prerequisites change when curriculum changes, when you discover empirically that students can learn concept B without A, or when pedagogical approaches change. Build admin tooling to update the graph and version-control it like code.


Common Mistakes

:::danger Optimizing for Engagement Instead of Learning
This is the most common failure in EdTech AI. Engagement is easy to measure and immediately available. Learning gains require delayed post-tests and experimental controls. If your adaptive system is optimized only on time-on-platform or completion rate, it may produce the opposite of learning - it may keep students comfortable rather than challenged.

Always maintain a holdout group, administer delayed post-tests, and report learning gains as the primary metric. Every other metric is a proxy.
:::

:::danger Setting Prerequisites Too Strictly
A common mistake in knowledge space-based systems is making prerequisite relationships binary and strict. In practice, soft prerequisites are the norm: understanding negative numbers helps with algebra, but its absence does not make algebra impossible. Strict blocking leaves students stuck on a prerequisite they may never fully master, unable to make progress on downstream topics.

Use soft prerequisites: adjust the difficulty of downstream items based on upstream mastery level rather than blocking access entirely.
:::

:::warning IRT Assumes Unidimensionality
Standard IRT models assume a single latent ability trait explains all item responses. In practice, a math course may involve arithmetic ability, reading comprehension, and spatial reasoning as separate dimensions. If your items span multiple dimensions, a unidimensional IRT model will produce ability estimates that are blends of these dimensions, making the estimates hard to interpret.

Consider multidimensional IRT or separate models per knowledge domain when your content spans clearly distinct skill areas.
:::

:::warning Forgetting Is Not Failure
In a mastery-based system, a student who previously answered an item correctly and now gets it wrong can look like a regression. It is normal forgetting. Do not reset mastery completely on a single failed review - use a Bayesian update that requires multiple failures before revising mastery downward.
:::


Interview Questions and Answers

Q1: How would you formalize the Zone of Proximal Development as an optimization objective for item selection?

The ZPD says optimal learning items are just beyond current competence. In IRT terms, an item at difficulty $b \approx \hat{\theta} + \epsilon$ for some small $\epsilon$ is in the ZPD. Too easy ($b \ll \theta$) produces low engagement and no new learning. Too hard ($b \gg \theta$) produces frustration and low success probability.

Formally, the item selection objective combines information (for ability estimation) and expected learning signal. A simple ZPD objective: select item $i^* = \arg\max_i I_i(\hat{\theta}) \cdot \mathbb{1}[P(\text{correct} \mid \hat{\theta}, a_i, b_i) \in (0.6, 0.8)]$. This maximizes Fisher information while constraining to items where the student is likely, but not guaranteed, to succeed. The 0.6-0.8 window is an empirical approximation to the ZPD, as sketched below.
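
A minimal sketch of that windowed selection, reusing the 2PL helpers `irt_2pl_probability` and `item_information` from the code examples above; the 0.6-0.8 bounds are the empirical approximation just described.

```python
def select_zpd_item(theta_hat, item_params, low=0.6, high=0.8):
    """Max-information selection restricted to items the student would answer
    correctly 60-80% of the time at the current ability estimate."""
    best_item, best_info = None, -1.0
    for i, (a, b) in enumerate(item_params):
        p = irt_2pl_probability(theta_hat, a, b)   # defined in the IRT example
        if low < p < high:
            info = item_information(theta_hat, a, b)
            if info > best_info:
                best_item, best_info = i, info
    return best_item   # None if no item lands in the window
```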

A more principled approach uses the expected change in knowledge state as the reward signal, estimated from historical data on similar students.

Q2: What is the tradeoff between exploration and exploitation in adaptive learning, and how does Thompson Sampling handle it?

Exploitation means showing content known to produce learning for students like this one. Exploration means trying new content to discover whether it might work even better. Pure exploitation misses improvements; pure exploration wastes student time on sub-optimal content.

Thompson Sampling maintains a Beta distribution per content item representing uncertainty about its learning reward probability. By sampling from the posterior and selecting the item with the highest sampled value, it automatically explores more when uncertainty is high (wide distribution) and exploits more when uncertainty is low (narrow distribution). This Bayesian approach is well calibrated: the probability of selecting arm $i$ equals the posterior probability that arm $i$ is truly the best arm.

The key challenge in education is delayed rewards - you may not know whether content worked until a post-test days later. Contextual bandits that incorporate student state features (knowledge level, time in session, previous performance) can better propagate the reward signal.

Q3: How do you evaluate an adaptive learning system beyond completion rates?

The gold standard is a randomized controlled trial measuring normalized learning gain: $g = (\text{post} - \text{pre}) / (\text{max} - \text{pre})$. Split students into treatment (adaptive system) and control (fixed curriculum) groups, measure pre and post scores on the same instrument, compute Hake's $g$ for each group, and test for a statistically significant difference with appropriate effect size reporting (Cohen's $d$).

Additional metrics: time-to-mastery (does the adaptive system achieve the same learning in less time?), retention at 30/60/90 days (does spaced repetition prevent forgetting?), equity analysis (do learning gains hold across demographic subgroups?), and the engagement-to-learning ratio (how much time was needed to produce a given learning gain?).

Beware of using assessment performance on the same items used for practice - this inflates apparent learning gain through item exposure effects.

Q4: How would you handle the cold start problem for a new student in a CAT-based adaptive system?

Cold start occurs when a new student has no interaction history to inform the ability estimate. Three strategies:

First, use a short diagnostic assessment - administer 5-10 pre-selected items spanning a wide difficulty range, estimate an initial $\hat{\theta}$ from these responses, then run the full adaptive algorithm. The diagnostic items should be high-discrimination and span the ability range evenly.

Second, use demographic and contextual priors. If you know the student is entering a 9th-grade algebra course, prior ability is approximately $\mathcal{N}(0, 1)$ in that domain. If they self-report prior experience, update the prior accordingly.

Third, use collaborative filtering to initialize the student embedding from the profile of similar students (matched by any available features - grade level, prior course completion, diagnostic performance). This is particularly useful for new content recommendations, not just ability estimation.

Q5: ALEKS uses Knowledge Space Theory instead of IRT. What are the tradeoffs?

Knowledge Space Theory treats knowledge as combinatorial structure - valid subsets of topics - rather than a continuous latent variable. This has several advantages for educational applications: the model is interpretable (you can show students exactly what they know and what they can learn next), it naturally handles prerequisite dependencies, and it maps directly to curriculum structure rather than an abstract ability continuum.

The tradeoffs: KST requires expert knowledge to define the valid knowledge states (or expensive data collection to infer them), it does not naturally handle partial mastery or gradations of understanding (you either know a topic or you do not), and it becomes computationally expensive as the topic space grows (the number of knowledge states can be exponential in the number of topics).

IRT is more statistically principled for assessment (produces calibrated uncertainty estimates, handles noisy responses via probability), generalizes better to new items, and requires less domain expertise to deploy. KST is better for curriculum-structured learning systems where the concept space is well-defined and prerequisite dependencies are strong.

Q6: Duolingo's HLR learns forgetting curves from data. What could go wrong in production?

Several failure modes: distribution shift occurs when the training data (users from a certain time period and demographic) does not reflect future users - HLR may learn forgetting rates optimized for one user population that do not generalize. Feedback loops can form where HLR schedules items more frequently for users who are forgetting quickly, but increased frequency changes the forgetting dynamics in ways the model was not trained on. Simpson's paradox can emerge in aggregate recall statistics - mixing users with different learning paces produces misleading average forgetting curves.

Mitigation: retrain HLR periodically on recent data, monitor per-cohort forgetting curves for drift, hold out a random sample of reviews to validate that actual recall matches predicted recall, and include user-level random effects in the model to capture individual differences in forgetting rates.


Summary

Adaptive learning systems are the technical implementation of a pedagogical insight: the best instruction meets each student where they are. IRT provides the statistical foundation for measuring latent ability and selecting maximally informative items. CAT applies this to dynamic assessment with 3x efficiency gains over fixed tests. Knowledge Space Theory models the combinatorial structure of a curriculum. SM-2 and HLR schedule reviews to counteract forgetting. Multi-armed bandits handle exploration-exploitation when content effectiveness is uncertain. Collaborative filtering leverages patterns across students.

The hardest part is not the algorithm - it is the evaluation. Adaptive systems must be measured on learning gains from controlled experiments, not engagement metrics. Build the measurement infrastructure before you build the adaptive algorithm, or you will optimize for the wrong thing and not know it.
