Knowledge Tracing Models

Reading time: ~42 min · Interview relevance: High · Target roles: ML Engineer, EdTech Engineer, Research Engineer

Opening: The Problem of Invisible Knowledge

You are building an adaptive math tutor. A student just answered eight questions in a row about two-step linear equations. Five were correct, three were wrong. The wrong ones were on questions 3, 5, and 7. Now the student has moved to a section on systems of linear equations, which requires mastery of two-step equations as a prerequisite.

What should you do? The naive approach is a threshold rule: if the student got more than 70% correct, they "know" two-step equations and can proceed. But this misses everything interesting. Which of the eight questions were easy and which were hard? Was the student showing a learning trajectory (wrong early, right later, suggesting active learning) or a performance trajectory (right on easy ones, wrong on hard ones, suggesting superficial knowledge)? How much has the student retained from a previous session three days ago? What does performance on these eight questions predict about performance on the new topic?

Knowledge tracing is the technical discipline of answering these questions. It models student knowledge as a latent variable that evolves over time, observable only through the noisy signal of practice responses. The goal is not just to summarize past performance - it is to infer the current knowledge state and predict future performance as precisely as possible, so that the adaptive system can make the best next content decision.

Knowledge tracing models range from hand-crafted probabilistic models (Bayesian Knowledge Tracing, 1994) to deep sequence models (Deep Knowledge Tracing, 2015) to transformer-based architectures (SAKT, 2019; AKT, 2020). Each generation improved predictive accuracy, but each also introduced new challenges: deep models require more data, are harder to interpret, and raise new questions about what they are actually learning.

This lesson covers the full knowledge tracing model family in depth: BKT's hidden Markov model foundations, DKT's LSTM approach, DKVMN's memory-augmented architecture, SAKT's self-attention approach, AKT's Rasch model integration, evaluation methodology, the major datasets, and production considerations for real-time knowledge state estimation.


Why This Exists: From Score Aggregates to Knowledge State Estimates

Before knowledge tracing, student state was represented with raw score averages: "this student has answered 73% of linear equation questions correctly." This is useful but insufficient. It does not account for question difficulty, temporal dynamics, or the difference between a student who knew the material last month and one who learned it today.

Knowledge tracing provides three things that raw averages cannot:

First, it estimates the probability of correctness on the next unseen problem, not just summarizes past performance. This enables adaptive systems to target items where the student needs improvement.

Second, it models learning dynamics - how knowledge grows with practice and decays with time. A student who got 5/10 questions right while learning scores differently from one who got 5/10 questions right while forgetting.

Third, it provides per-skill estimates, not just overall performance. A student might have strong knowledge of solving one-variable equations but weak knowledge of systems of equations, and a good knowledge tracing model separates these.


Historical Context: Three Decades of Student Modeling

1994 - Bayesian Knowledge Tracing (BKT): Corbett and Anderson introduced BKT in the context of the LISP Tutor. BKT models each skill as a two-state hidden Markov model: unknown or known. Transitions are one-directional (students learn, do not forget). The model has four parameters per skill. Despite its simplicity, BKT remained state of the art in educational deployments for twenty years because it is interpretable, requires minimal data, and performs well on short interaction sequences.

2015 - Deep Knowledge Tracing (DKT): Piech et al. (Stanford) applied LSTMs to knowledge tracing. Instead of hand-crafted transition parameters, DKT learns a continuous hidden state representation from interaction sequences. The original paper reported a large gain over BKT on the ASSISTments dataset (AUC 0.86 vs 0.67); later replications on deduplicated data found a smaller but still meaningful gap. This paper started a wave of deep learning approaches to educational data mining.

2017 - DKVMN: Zhang et al. introduced Dynamic Key-Value Memory Networks (DKVMN), which model knowledge concepts explicitly as memory slots. The key matrix holds concept representations; the value matrix holds mastery levels. This added interpretability back to deep models: you can inspect the value memory to see per-concept mastery.

2019 - SAKT: Pandey and Karypis introduced Self-Attentive Knowledge Tracing. By replacing the LSTM with a transformer self-attention mechanism, SAKT captures long-range dependencies in interaction sequences and parallelizes training. SAKT outperformed DKT on multiple datasets.

2020 - AKT: Ghosh et al. published Attentive Knowledge Tracing, which integrates Item Response Theory parameters (difficulty, discrimination) into a transformer-based model. AKT's attention mechanism is guided by Rasch model item parameters, creating a theoretically grounded deep model.


Core Concepts

Bayesian Knowledge Tracing (BKT)

BKT models each skill s as a binary hidden variable L_t \in \{0, 1\}, where 1 = "knows the skill." The model has four parameters:

  • P(L_0): initial probability of knowing the skill
  • P(T): probability of transitioning from unknown to known after a practice attempt (learning rate)
  • P(G): probability of answering correctly despite not knowing (guess probability)
  • P(S): probability of answering incorrectly despite knowing (slip probability)

The emission probabilities:

P(\text{correct} | L=1) = 1 - P(S)

P(\text{correct} | L=0) = P(G)

The knowledge update after observing a response uses Bayes' theorem. After an interaction at time t with result r_t:

P(L_t | r_{1:t}) = \frac{P(r_t | L_t) \cdot P(L_t | r_{1:t-1})}{\sum_{l \in \{0,1\}} P(r_t | L=l) \cdot P(L=l | r_{1:t-1})}

Then apply the learning transition:

P(L_{t+1} | r_{1:t}) = P(L_t | r_{1:t}) + (1 - P(L_t | r_{1:t})) \cdot P(T)

This is the forward algorithm for a two-state HMM with a one-directional transition (from unknown to known only - BKT assumes no forgetting).
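The two update equations can be verified by hand. A minimal sketch in pure Python, with illustrative (not fitted) parameter values:

```python
# One BKT step: Bayes posterior after a correct response, then learning transition.
# Parameter values below are illustrative, not fitted.
p_know, p_t, p_s, p_g = 0.3, 0.1, 0.1, 0.2

p_correct = p_know * (1 - p_s) + (1 - p_know) * p_g   # P(correct) marginal
posterior = p_know * (1 - p_s) / p_correct            # P(L=1 | correct)
p_know_next = posterior + (1 - posterior) * p_t       # apply P(T)

print(round(p_correct, 3), round(posterior, 3), round(p_know_next, 3))
# → 0.41 0.659 0.693
```

A single correct answer moves the estimate from 0.30 to 0.66 via Bayes, and the learning transition adds a further bump regardless of correctness.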

BKT parameters are fit per skill using Expectation-Maximization. The E-step computes the posterior over the hidden state given observations; the M-step updates parameters to maximize expected log-likelihood.

BKT limitations:

  • No forgetting: in reality, students forget
  • Binary knowledge state: no partial knowledge
  • Ignores item difficulty: a correct answer on a hard problem is evidence of different quality than on an easy one
  • Skill independence: skills are modeled independently, ignoring relationships

Deep Knowledge Tracing (DKT)

DKT treats knowledge tracing as a sequence prediction problem. Given the interaction sequence (q_1, a_1), (q_2, a_2), ..., (q_{t-1}, a_{t-1}), predict the correctness of the next response a_t on item q_t.

Input representation: each interaction (q_i, a_i) is represented as a one-hot vector of size 2|Q|, where |Q| is the number of distinct skills. Index q_i is active if the response was correct; index q_i + |Q| is active if incorrect.

The LSTM processes this sequence:

\mathbf{h}_t = \text{LSTM}(\mathbf{h}_{t-1}, \mathbf{x}_t)

The prediction for correctness on skill q_t is obtained by extracting the component of the output that corresponds to q_t:

\hat{y}_t = \sigma(\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y)_{q_t}

Training uses binary cross-entropy:

\mathcal{L} = -\sum_t \left[ a_t \log \hat{y}_t + (1 - a_t) \log(1 - \hat{y}_t) \right]

DKT advantages over BKT:

  • Learns knowledge dynamics from data, no hand-crafted parameters
  • Handles forgetting naturally (LSTM can represent decaying knowledge)
  • Captures skill interactions (performance on algebra affects prediction on geometry)
  • Scales with data: more interaction data = better predictions

DKT limitations:

  • Not interpretable: the hidden state is a dense vector, not a human-readable knowledge profile
  • Requires substantial data to outperform BKT (short interaction sequences may not provide enough signal)
  • Causality is easy to get wrong: predictions must use only past information, never future observations (enforce with causal masking or a one-step input shift)

DKVMN: Dynamic Key-Value Memory Networks

DKVMN adds explicit memory for knowledge concepts. The architecture has two matrices:

  • Key matrix M^K of shape (N, d_k): N latent concepts, each a d_k-dimensional embedding. These are static.
  • Value matrix M^V of shape (N, d_v): mastery levels for each latent concept. These are dynamically updated.

For each interaction (q_t, a_t):

  1. Read: Compute attention weights over concepts: w_t = \text{softmax}(M^K \mathbf{e}_{q_t}^T)

    Read current knowledge state: \mathbf{r}_t = \sum_n w_t^{(n)} M^V_{(n)}

  2. Predict: \hat{y}_t = \sigma(f(\mathbf{r}_t, \mathbf{e}_{q_t})) where f is a small MLP

  3. Write: Erase and add to value memory based on the interaction: M^V_{(n)} \leftarrow M^V_{(n)}(1 - w_t^{(n)} \mathbf{e}_t) + w_t^{(n)} \mathbf{a}_t

    where \mathbf{e}_t and \mathbf{a}_t are erase and add vectors computed from the interaction.

The value memory can be inspected to understand which knowledge concepts the student has strong vs weak mastery of - this interpretability is a key advantage over pure LSTM models.
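The read/predict/write cycle can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation: the sizes, the random initialization, and the erase/add vectors (which the real model computes from the interaction through learned layers) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_k, d_v = 5, 8, 8            # latent concepts, key dim, value dim (illustrative)

M_K = rng.normal(size=(N, d_k))  # static key matrix: concept embeddings
M_V = rng.normal(size=(N, d_v))  # dynamic value matrix: per-concept mastery

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dkvmn_step(M_V, q_emb, erase, add):
    """One read/write cycle for an interaction with exercise embedding q_emb."""
    w = softmax(M_K @ q_emb)                 # attention over latent concepts
    r = w @ M_V                              # read: weighted mastery summary
    # Write: erase then add, scaled by the same attention weights.
    M_V = M_V * (1 - np.outer(w, erase)) + np.outer(w, add)
    return r, M_V

q_emb = rng.normal(size=d_k)     # embedding of the attempted exercise
erase = rng.uniform(size=d_v)    # real model: sigmoid(linear(interaction))
add = rng.normal(size=d_v)       # real model: tanh(linear(interaction))
r, M_V = dkvmn_step(M_V, q_emb, erase, add)
print(r.shape, M_V.shape)        # → (8,) (5, 8)
```

Inspecting rows of M_V after many steps is what gives DKVMN its per-concept interpretability.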

SAKT: Self-Attentive Knowledge Tracing

SAKT replaces the LSTM with a transformer self-attention mechanism. The key insight: transformer attention can capture which past interactions are most relevant to predicting the current question, regardless of temporal distance.

Input: sequence of past interactions \{(q_1, a_1), ..., (q_{t-1}, a_{t-1})\} and current question q_t.

  1. Embed exercises: \mathbf{e}_{q_i} \in \mathbb{R}^d for each skill
  2. Embed interactions: \mathbf{m}_i \in \mathbb{R}^d for each (q_i, a_i) pair
  3. Self-attention over interaction embeddings, where the query comes from the current exercise:

\mathbf{F} = \text{Attention}(\mathbf{e}_{q_t} \mathbf{W}^Q, \mathbf{M} \mathbf{W}^K, \mathbf{M} \mathbf{W}^V)

  4. Predict: \hat{y}_t = \sigma(\text{FFN}(\mathbf{F}))

SAKT's self-attention weights reveal which past interactions the model considers most informative for the current prediction, providing a form of interpretability: "the model weighted these three past exercises most highly when predicting your performance on this question."

AKT: Attentive Knowledge Tracing with Rasch

AKT (Ghosh et al., 2020) integrates Rasch model parameters into a transformer. The Rasch model says the probability of correctness depends on the difference between student ability and item difficulty. AKT encodes this by:

  1. Learning difficulty embeddings per item \mathbf{d}_i (not just concept embeddings)
  2. Modifying the attention mechanism to weight by Rasch-inspired difficulty information
  3. Using a monotonic attention constraint: attention decays with temporal distance, so more recent interactions tend to receive higher weight

The Rasch-inspired attention weight:

\alpha_{tj} = \frac{\exp\left((\mathbf{q}_t - \mathbf{d}_j)^T (\mathbf{k}_j - \mathbf{d}_j) / \sqrt{d}\right)}{\sum_{j'} \exp\left((\mathbf{q}_t - \mathbf{d}_{j'})^T (\mathbf{k}_{j'} - \mathbf{d}_{j'}) / \sqrt{d}\right)}

This means the model explicitly considers item difficulty when computing attention - a correct answer on a hard item provides stronger evidence of knowledge than a correct answer on an easy item.

AKT consistently outperforms DKT and SAKT on benchmark datasets (on ASSISTments 2009: AKT ~0.79 AUC vs ~0.74 for SAKT and ~0.73 for DKT).
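Since AKT has no code example later in this lesson, here is a minimal NumPy sketch of the Rasch-inspired attention score above. Everything here (dimensions, random embeddings) is illustrative, and the full model adds multi-head projections and the monotonic distance decay.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                          # embedding dim (illustrative)
T = 6                           # number of past interactions

q_t = rng.normal(size=d)        # query embedding for the current item
K = rng.normal(size=(T, d))     # key embeddings of past interactions
D = rng.normal(size=(T, d))     # per-item difficulty embeddings

# Rasch-inspired scores: subtract the difficulty embedding from both query and
# key before the dot product, so the match is difficulty-adjusted.
scores = ((q_t - D) * (K - D)).sum(axis=1) / np.sqrt(d)
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()     # attention weights over past interactions

print(alpha.shape)              # → (6,)
```

The subtraction is the whole point: two interactions with identical key embeddings but different difficulty embeddings receive different attention weights.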

Forgetting in Knowledge Tracing

BKT originally assumed no forgetting (one-directional transitions). This is wrong for most domains. The research on temporal modeling in knowledge tracing shows:

  • Correct responses have decaying evidence value over time (Forgetting Curve)
  • The recency bias: more recent correct responses should weight more in predicting current knowledge
  • DAS3H (Choffin et al., 2019): a logistic-regression model whose time-window features capture practice history and forgetting, in the spirit of the Ebbinghaus curve

For production systems, adding a temporal decay to the knowledge estimate is important for students who return after an absence. A student who was 95% likely to know a concept at last session should be downgraded based on elapsed time.
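One simple way to implement that downgrade is exponential decay of the mastery estimate toward the skill's prior. The `decayed_mastery` helper and its 14-day half-life below are illustrative choices, not values from the literature; in practice the half-life should be fit per skill or per domain.

```python
def decayed_mastery(p_mastery: float, p_prior: float,
                    days_elapsed: float, half_life_days: float = 14.0) -> float:
    """Decay P(mastery) toward the prior P(L0) as time since last practice grows.

    With no elapsed time the estimate is unchanged; as time -> infinity it
    relaxes back to the population prior.
    """
    retention = 0.5 ** (days_elapsed / half_life_days)
    return p_prior + (p_mastery - p_prior) * retention

# A student at 0.95 mastery returning after one half-life drops halfway to the prior.
print(round(decayed_mastery(0.95, 0.30, days_elapsed=14.0), 3))  # → 0.625
```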

Evaluation: AUC on Next-Question Prediction

The standard evaluation task: given the first t-1 interactions, predict correctness of the t-th response.

AUC (Area Under ROC Curve): measures the model's discriminative ability. AUC = 0.5 is random; AUC = 1.0 is perfect. On ASSISTments 2009: BKT ~0.67, DKT ~0.73, SAKT ~0.74, AKT ~0.79.

RMSE: root mean squared error between predicted probability and actual binary outcome.

RMSE = \sqrt{\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2}

Important caveat about DKT evaluation: Khajah et al. (2016) showed that DKT can achieve high AUC by simply memorizing which items are harder, without actually modeling knowledge state. Evaluation should include:

  • AUC on held-out students (not held-out time steps)
  • Ablation showing the model degrades without knowledge state representation
  • Curriculum effect analysis: does predicted mastery increase with practice?
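The first point - holding out whole students rather than time steps - is easy to get wrong in pipeline code. A minimal sketch, assuming a hypothetical `sequences` dict keyed by student id:

```python
import random

def split_by_student(sequences: dict, test_frac: float = 0.2, seed: int = 42):
    """Hold out entire students, not time steps, so the model is evaluated
    on learners it has never seen during training."""
    ids = sorted(sequences)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test_ids = set(ids[:n_test])
    train = {s: sequences[s] for s in ids if s not in test_ids}
    test = {s: sequences[s] for s in test_ids}
    return train, test

seqs = {f"student_{i}": [(0, True), (1, False)] for i in range(10)}
train, test = split_by_student(seqs)
print(len(train), len(test))  # → 8 2
```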


Code Examples

BKT Implementation with Forward Algorithm and Likelihood Fitting

import numpy as np
from scipy.optimize import minimize
from typing import List, Dict

class BayesianKnowledgeTracing:
    """
    Bayesian Knowledge Tracing for a single skill.
    Models knowledge as a two-state HMM: unknown (0) or known (1).
    """
    def __init__(
        self,
        p_l0: float = 0.3,  # P(initial knowledge)
        p_t: float = 0.1,   # P(transition: unknown -> known)
        p_s: float = 0.1,   # P(slip: incorrect given known)
        p_g: float = 0.2    # P(guess: correct given unknown)
    ):
        self.p_l0 = p_l0
        self.p_t = p_t
        self.p_s = p_s
        self.p_g = p_g

    def predict_next(self, knowledge_prob: float) -> float:
        """P(correct on next item) given current knowledge probability."""
        return knowledge_prob * (1 - self.p_s) + (1 - knowledge_prob) * self.p_g

    def update_knowledge(self, knowledge_prob: float, correct: bool) -> float:
        """
        Update knowledge estimate after observing a response.
        Uses Bayes' theorem + learning transition.
        """
        if correct:
            # P(correct | known) * P(known) / P(correct)
            p_correct = self.predict_next(knowledge_prob)
            knowledge_after = ((1 - self.p_s) * knowledge_prob) / p_correct
        else:
            # P(incorrect | known) * P(known) / P(incorrect)
            p_incorrect = 1 - self.predict_next(knowledge_prob)
            knowledge_after = (self.p_s * knowledge_prob) / p_incorrect

        # Apply learning transition: student may have learned from this attempt
        knowledge_next = knowledge_after + (1 - knowledge_after) * self.p_t
        return min(1.0, max(0.0, knowledge_next))

    def trace(self, responses: List[bool]) -> List[float]:
        """
        Trace knowledge over a sequence of responses.

        Args:
            responses: list of bool (True = correct)

        Returns:
            list of knowledge probabilities after each response
        """
        knowledge = self.p_l0
        knowledge_trace = []

        for correct in responses:
            knowledge = self.update_knowledge(knowledge, correct)
            knowledge_trace.append(knowledge)

        return knowledge_trace

    def fit(self, all_sequences: List[List[bool]], verbose: bool = False) -> Dict:
        """
        Fit BKT parameters by direct maximum likelihood (L-BFGS-B on the
        negative log-likelihood). The classic approach is EM; direct
        optimization of the forward likelihood is simpler and works well here.

        Args:
            all_sequences: list of response sequences per student
        Returns:
            fitted parameters dict
        """
        def neg_log_likelihood(params):
            p_l0, p_t, p_s, p_g = params
            # Clamp parameters to valid range
            p_l0 = np.clip(p_l0, 0.01, 0.99)
            p_t = np.clip(p_t, 0.01, 0.99)
            p_s = np.clip(p_s, 0.01, 0.49)  # Slip should be < 0.5
            p_g = np.clip(p_g, 0.01, 0.49)  # Guess should be < 0.5

            model = BayesianKnowledgeTracing(p_l0, p_t, p_s, p_g)
            total_ll = 0.0

            for sequence in all_sequences:
                knowledge = p_l0
                for correct in sequence:
                    p_correct = model.predict_next(knowledge)
                    p_correct = np.clip(p_correct, 1e-9, 1 - 1e-9)
                    total_ll += np.log(p_correct) if correct else np.log(1 - p_correct)
                    knowledge = model.update_knowledge(knowledge, correct)

            return -total_ll

        result = minimize(
            neg_log_likelihood,
            x0=[self.p_l0, self.p_t, self.p_s, self.p_g],
            method='L-BFGS-B',
            bounds=[(0.01, 0.99), (0.01, 0.99), (0.01, 0.49), (0.01, 0.49)]
        )

        self.p_l0, self.p_t, self.p_s, self.p_g = result.x

        if verbose:
            print(f"Fitted: P(L0)={self.p_l0:.3f}, P(T)={self.p_t:.3f}, "
                  f"P(S)={self.p_s:.3f}, P(G)={self.p_g:.3f}")
            print(f"Final log-likelihood: {-result.fun:.2f}")

        return {
            'p_l0': float(self.p_l0),
            'p_t': float(self.p_t),
            'p_s': float(self.p_s),
            'p_g': float(self.p_g),
            'log_likelihood': float(-result.fun)
        }

DKT with PyTorch LSTM

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import roc_auc_score

class KnowledgeTracingDataset(Dataset):
    """
    Dataset for knowledge tracing.
    Each sequence: list of (skill_id, correct) tuples.
    """
    def __init__(self, sequences, n_skills, max_seq_len=200):
        self.sequences = sequences
        self.n_skills = n_skills
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]

        # Truncate or pad
        seq = seq[:self.max_seq_len]
        seq_len = len(seq)

        # Input: (skill_id, correct) -> one-hot of size 2 * n_skills
        #   skill_id correct=1: index = skill_id
        #   skill_id correct=0: index = skill_id + n_skills
        input_seq = torch.zeros(self.max_seq_len, 2 * self.n_skills)
        target_seq = torch.full((self.max_seq_len,), -1.0)  # -1 = padding
        target_skill = torch.zeros(self.max_seq_len, dtype=torch.long)

        for t, (skill, correct) in enumerate(seq[:-1]):
            idx_input = skill if correct else skill + self.n_skills
            input_seq[t, idx_input] = 1.0

        for t, (skill, correct) in enumerate(seq[1:]):
            target_seq[t] = float(correct)
            target_skill[t] = skill

        mask = torch.zeros(self.max_seq_len, dtype=torch.bool)
        mask[:seq_len - 1] = True

        return {
            'input': input_seq,
            'target': target_seq,
            'target_skill': target_skill,
            'mask': mask,
            'length': seq_len - 1
        }


class DeepKnowledgeTracing(nn.Module):
    """
    Deep Knowledge Tracing (Piech et al., 2015).
    LSTM over interaction sequences predicts next response probability per skill.
    """
    def __init__(self, n_skills, hidden_size=200, n_layers=1, dropout=0.2):
        super().__init__()
        self.n_skills = n_skills
        self.hidden_size = hidden_size

        self.lstm = nn.LSTM(
            input_size=2 * n_skills,
            hidden_size=hidden_size,
            num_layers=n_layers,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=True
        )
        self.output = nn.Linear(hidden_size, n_skills)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Args:
            x: (batch, seq_len, 2*n_skills)
        Returns:
            predictions: (batch, seq_len, n_skills) - P(correct) per skill
        """
        h, _ = self.lstm(x)
        h = self.dropout(h)
        logits = self.output(h)
        return torch.sigmoid(logits)


def train_dkt(model, train_loader, val_loader, n_epochs=10, lr=1e-3):
    """Train DKT model."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()

    for epoch in range(n_epochs):
        model.train()
        train_loss = 0

        for batch in train_loader:
            x = batch['input'].to(device)
            targets = batch['target'].to(device)
            target_skills = batch['target_skill'].to(device)
            mask = batch['mask'].to(device)

            optimizer.zero_grad()
            predictions = model(x)  # (batch, seq_len, n_skills)

            # Extract predictions for the specific skill at each time step
            skill_preds = predictions.gather(
                2, target_skills.unsqueeze(-1)
            ).squeeze(-1)  # (batch, seq_len)

            # Apply mask: only train on non-padding positions
            valid_preds = skill_preds[mask]
            valid_targets = targets[mask]

            if valid_preds.numel() == 0:
                continue

            loss = criterion(valid_preds, valid_targets)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        # Validation AUC
        val_auc = evaluate_dkt(model, val_loader, device)
        avg_loss = train_loss / len(train_loader)
        print(f"Epoch {epoch+1}: loss={avg_loss:.4f}, val_AUC={val_auc:.4f}")

    return model


def evaluate_dkt(model, data_loader, device):
    """Evaluate DKT with AUC on next-question prediction."""
    model.eval()
    all_preds, all_targets = [], []

    with torch.no_grad():
        for batch in data_loader:
            x = batch['input'].to(device)
            targets = batch['target'].to(device)
            target_skills = batch['target_skill'].to(device)
            mask = batch['mask'].to(device)

            predictions = model(x)
            skill_preds = predictions.gather(
                2, target_skills.unsqueeze(-1)
            ).squeeze(-1)

            all_preds.extend(skill_preds[mask].cpu().numpy())
            all_targets.extend(targets[mask].cpu().numpy())

    if len(set(all_targets)) < 2:
        return 0.5  # Cannot compute AUC with single class

    return roc_auc_score(all_targets, all_preds)

SAKT Transformer-Based Knowledge Tracing

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]

        Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, V)
        context = context.transpose(1, 2).contiguous().view(
            batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(context), attn


class SAKT(nn.Module):
    """
    Self-Attentive Knowledge Tracing (Pandey & Karypis, 2019).
    Uses transformer attention to capture relevant past interactions.
    """
    def __init__(self, n_skills, d_model=128, n_heads=8, dropout=0.2, max_seq_len=200):
        super().__init__()
        self.n_skills = n_skills
        self.d_model = d_model

        # Embeddings: exercise and interaction (exercise x correctness)
        self.exercise_emb = nn.Embedding(n_skills + 1, d_model, padding_idx=0)
        self.interaction_emb = nn.Embedding(2 * n_skills + 1, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_seq_len + 1, d_model)

        self.attention = MultiHeadAttention(d_model, n_heads)
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(d_model, 1)

    def forward(self, exercise_seq, interaction_seq):
        """
        Args:
            exercise_seq: (batch, seq_len) - current exercise at each step
            interaction_seq: (batch, seq_len) - past interactions
        Returns:
            predictions: (batch, seq_len) - P(correct) for each step
        """
        seq_len = exercise_seq.shape[1]
        device = exercise_seq.device

        # Position encoding
        positions = torch.arange(seq_len, device=device).unsqueeze(0)

        # Exercise query embedding
        exercise_e = self.exercise_emb(exercise_seq) + self.pos_emb(positions)

        # Past interaction key/value embedding
        interaction_e = self.interaction_emb(interaction_seq) + self.pos_emb(positions)

        # Causal mask: each position can only attend to past positions
        causal_mask = torch.tril(torch.ones(seq_len, seq_len, device=device))

        # Self-attention: query from current exercise, key/value from past interactions
        attn_out, attn_weights = self.attention(
            exercise_e, interaction_e, interaction_e, causal_mask)

        # Add & Norm
        out = self.layer_norm1(exercise_e + self.dropout(attn_out))
        out = self.layer_norm2(out + self.dropout(self.ffn(out)))

        predictions = torch.sigmoid(self.output(out)).squeeze(-1)
        return predictions, attn_weights

Real-Time Knowledge State API

from dataclasses import dataclass
from typing import List, Dict, Tuple

@dataclass
class StudentKnowledgeState:
    student_id: str
    skill_mastery: Dict[str, float]   # skill_name -> P(mastery)
    interaction_history: List[Tuple]  # (skill_id, correct, timestamp)
    last_updated: float               # Unix timestamp

class RealTimeKnowledgeTracker:
    """
    Real-time knowledge state estimation using BKT (interpretable)
    with optional DKT for improved accuracy.
    """
    def __init__(self, bkt_params: Dict[str, Dict], use_dkt: bool = False):
        """
        Args:
            bkt_params: {skill_name: {p_l0, p_t, p_s, p_g}}
            use_dkt: whether to use DKT for sequence-level predictions
        """
        self.bkt_params = bkt_params
        self.use_dkt = use_dkt
        self.student_states: Dict[str, StudentKnowledgeState] = {}

    def get_or_create_state(self, student_id: str) -> StudentKnowledgeState:
        if student_id not in self.student_states:
            self.student_states[student_id] = StudentKnowledgeState(
                student_id=student_id,
                skill_mastery={
                    skill: params['p_l0']
                    for skill, params in self.bkt_params.items()
                },
                interaction_history=[],
                last_updated=0.0
            )
        return self.student_states[student_id]

    def record_interaction(
        self,
        student_id: str,
        skill: str,
        correct: bool,
        timestamp: float
    ) -> float:
        """
        Record an interaction and update knowledge state.
        Returns updated P(mastery) for the skill.
        """
        state = self.get_or_create_state(student_id)
        state.interaction_history.append((skill, correct, timestamp))
        state.last_updated = timestamp

        if skill not in self.bkt_params:
            return state.skill_mastery.get(skill, 0.5)

        params = self.bkt_params[skill]
        model = BayesianKnowledgeTracing(**params)
        current_mastery = state.skill_mastery.get(skill, params['p_l0'])
        updated_mastery = model.update_knowledge(current_mastery, correct)
        state.skill_mastery[skill] = updated_mastery

        return updated_mastery

    def predict_performance(self, student_id: str, skill: str) -> float:
        """Predict P(correct) on the next item for a given skill."""
        state = self.get_or_create_state(student_id)
        if skill not in self.bkt_params:
            return 0.5

        params = self.bkt_params[skill]
        mastery = state.skill_mastery.get(skill, params['p_l0'])
        model = BayesianKnowledgeTracing(**params)
        return model.predict_next(mastery)

    def get_zpd_skills(
        self,
        student_id: str,
        mastery_threshold: float = 0.95,
        zpd_lower: float = 0.4
    ) -> Dict[str, str]:
        """
        Classify all skills into mastered, ZPD, and too-hard.

        Returns:
            dict of {skill: category} where category is
            'mastered', 'zpd', or 'too_hard'
        """
        state = self.get_or_create_state(student_id)
        categories = {}

        for skill, mastery in state.skill_mastery.items():
            if mastery >= mastery_threshold:
                categories[skill] = 'mastered'
            elif mastery >= zpd_lower:
                categories[skill] = 'zpd'
            else:
                categories[skill] = 'too_hard'

        return categories

Production Engineering Notes

Use BKT for small data, DKT/SAKT for large data. BKT requires only tens of student-skill interaction sequences to get reasonable parameter estimates. DKT and SAKT need hundreds or thousands of student sequences to learn useful representations. Early in a platform's lifecycle, BKT is more reliable. As data accumulates, switch to deep models with BKT estimates as a fallback.

Real-time latency requirements drive model choice. BKT updates are O(1) per interaction (a few arithmetic operations). DKT requires an LSTM forward pass over the full interaction history, which grows linearly with student tenure. SAKT's full attention computation is quadratic in sequence length. In production, cache the hidden state and run incremental updates rather than recomputing from scratch.

Monitor for knowledge tracing drift. If a model trained on older interaction data is deployed on a newer cohort, there may be distributional shift (curriculum changed, student demographics changed). Monitor prediction calibration: the average predicted probability of correctness should approximately equal the average actual correctness rate in recent data. Recalibrate or retrain when calibration drifts.
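As a sketch, the calibration check reduces to comparing mean predicted probability with observed correctness, overall and per probability bucket. The `calibration_report` helper below is illustrative:

```python
import numpy as np

def calibration_report(pred_probs, outcomes, n_bins: int = 10):
    """Compare mean predicted P(correct) to observed correctness per bin.
    Large gaps in any bin suggest the model needs recalibration or retraining."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((pred_probs * n_bins).astype(int), 0, n_bins - 1)
    report = {}
    for b in range(n_bins):
        sel = bins == b
        if sel.any():
            report[b] = (float(pred_probs[sel].mean()),   # mean predicted
                         float(outcomes[sel].mean()),     # observed rate
                         int(sel.sum()))                  # sample count
    overall_gap = float(pred_probs.mean() - outcomes.mean())
    return report, overall_gap

_, gap = calibration_report([0.9, 0.8, 0.2, 0.7], [1, 1, 0, 0])
print(round(gap, 2))  # → 0.15
```

Running this daily over recent interactions and alerting when the overall gap or any well-populated bin drifts past a threshold is a cheap drift monitor.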

Concept graphs improve model quality. Knowledge tracing models that treat skills as independent miss transfer effects: learning algebra helps with calculus. Incorporating prerequisite graphs as inductive bias improves predictions on concepts with few direct observations.


Common Mistakes

:::danger Using Next-Step AUC as the Only Evaluation Metric High AUC on next-step prediction does not mean the model is correctly inferring knowledge state. A model can achieve 0.80 AUC by simply learning which items are hard and which are easy, without modeling student-level knowledge dynamics at all. Always evaluate whether the model's predictions improve over the course of a student's learning trajectory: does predicted mastery increase as students practice? Does the model distinguish students who learned recently from students who last practiced months ago? :::

:::danger Applying BKT to Multi-Dimensional Skills BKT assumes a single binary latent skill. If a skill is actually multi-dimensional (e.g., "algebra" encompasses equation solving, graphing, word problems, and function notation), BKT's P(mastery) is a meaningless blend of these sub-skills. Either decompose into fine-grained skills before fitting BKT, or use a model that handles multiple knowledge dimensions. :::

:::warning DKT Can Leak Future Information The original DKT paper's training objective did not enforce causal masking - some implementations inadvertently allow the model to use future observations to predict current performance. Always use strictly causal attention (lower triangular mask) or shift input sequences by one time step to ensure the model only uses past interactions to predict the next one. Test for future leakage by evaluating on a sequence reversal: if the model achieves similar AUC on reversed sequences (which makes no causal sense), it is not using temporal order correctly. :::

:::warning Knowledge Tracing Metrics Depend Heavily on Dataset AUC of 0.80 on ASSISTments 2009 is very different from AUC of 0.80 on EdNet - different skill granularities, student populations, and problem types make cross-dataset comparison unreliable. Always include a baseline (predict average correctness rate per skill) and report improvement over baseline in addition to absolute AUC. :::
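The per-skill average baseline mentioned above takes only a few lines; a minimal sketch (function name and data layout are illustrative):

```python
from collections import defaultdict

def skill_mean_baseline(train_interactions, test_interactions, default: float = 0.5):
    """Predict each skill's mean correctness rate from training data.
    Any knowledge tracing model should beat this baseline's AUC."""
    totals = defaultdict(lambda: [0, 0])   # skill -> [n_correct, n_total]
    for skill, correct in train_interactions:
        totals[skill][0] += int(correct)
        totals[skill][1] += 1
    means = {s: c / n for s, (c, n) in totals.items()}
    # Unseen skills fall back to a neutral default.
    return [means.get(skill, default) for skill, _ in test_interactions]

train = [("algebra", True), ("algebra", False), ("geometry", True)]
test = [("algebra", True), ("geometry", False), ("calculus", True)]
print(skill_mean_baseline(train, test))  # → [0.5, 1.0, 0.5]
```

Reporting AUC improvement over this baseline makes cross-dataset numbers far easier to interpret than absolute AUC alone.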


Interview Questions and Answers

Q1: What are the four parameters of BKT and what does each control?

P(L_0) is the initial knowledge probability - the probability a student already knows the skill before any practice. This reflects prior knowledge, set by assessing the student or using population defaults. P(T) is the learning rate - the probability that a student who does not know the skill learns it on any given practice attempt. Higher P(T) means the skill is learned quickly. P(S) is the slip probability - the probability that a student who knows the skill answers incorrectly (a mistake or careless error). It should typically be below 0.1. P(G) is the guess probability - the probability that a student who does not know the skill answers correctly (by guessing, especially relevant for MCQs). For four-option MCQs with no prior knowledge, P(G) = 0.25 is the floor.

These four parameters fully determine the BKT forward algorithm. They are fit per skill via EM, which means each skill has its own learning rate and initial probability - reflecting the fact that some skills are learned quickly and some are learned slowly.

Q2: DKT outperforms BKT by a large margin on benchmarks. Should we always use DKT in production?

Not necessarily. DKT requires sufficient data to learn useful representations - typically hundreds or thousands of student-skill interaction sequences. For a new platform with few students, BKT with domain-expert parameter initialization may outperform DKT because DKT will overfit to the small training set.

DKT's black-box nature also creates practical problems: teachers and students cannot understand why the model estimates low mastery for a concept. BKT's four parameters per skill can be explained to stakeholders. For platforms where model explainability matters for teacher trust, BKT may be preferred even if DKT performs better on held-out AUC.

A practical production architecture: use BKT as the baseline with human-interpretable skill mastery reports, and use DKT in parallel to generate predictions for the adaptive engine. If DKT performs statistically significantly better over a test period, switch primary reliance to DKT for sequencing while keeping BKT for teacher-facing skill reports.

Q3: What is the difference between SAKT and DKT in terms of how they model the interaction sequence?

DKT uses an LSTM that processes interactions sequentially. The hidden state $\mathbf{h}_t$ is a compressed representation of all past interactions. The LSTM has recurrent connections so information propagates through time, but long-range dependencies can be lost due to vanishing gradients.

SAKT uses transformer self-attention. For each position in the sequence, attention computes a weighted sum of all past interactions, where the weights reflect relevance to the current question. SAKT can directly capture that performance on a specific past question is relevant to predicting the current question, even if they are 50 steps apart - something LSTM struggles with. SAKT also trains in parallel (vs sequential LSTM), which is much faster.

The tradeoff: SAKT's attention is quadratic in sequence length ($O(T^2)$ computation), which becomes expensive for very long interaction histories. For students with thousands of interactions, caching or sequence compression techniques are needed.

Q4: How do you handle the cold start problem for a new student with no interaction history?

For BKT: use population-level $P(L_0)$ estimates per skill. These can be refined by grade level, prior course completion, or a brief diagnostic assessment.

For DKT/SAKT: the model needs a few interactions before its predictions are meaningful. Until a student has accumulated roughly 5-10 interactions, fall back to BKT with population priors, then switch to the deep model once sufficient history exists.

A better approach: run a calibration mini-assessment when a new student joins - 5-10 carefully selected items across the skill space (similar to CAT from the Adaptive Learning lesson). Use these responses to establish initial mastery estimates, then proceed with the full adaptive system.

Q5: How would you add forgetting to the BKT model?

Standard BKT assumes no forgetting - knowledge transitions are one-directional from unknown to known. To add forgetting, extend the transition matrix:

$$P(L_{t+1} = 1 \mid L_t = 1) = 1 - P(F)$$
$$P(L_{t+1} = 0 \mid L_t = 1) = P(F)$$

where $P(F)$ is the forgetting rate. This makes the model a full two-state HMM with bidirectional transitions.

A more principled approach is to make $P(F)$ time-dependent:

$$P(F \mid \Delta t) = 1 - e^{-\lambda \Delta t}$$

where $\Delta t$ is the time elapsed since the last practice and $\lambda$ is a decay-rate parameter. This is essentially the Ebbinghaus forgetting curve embedded in BKT. The DAS3H model (Choffin et al., 2019) implements this idea with a parametric forgetting function fit from interaction timestamps.
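The exponential forgetting curve is a two-line computation. A minimal sketch, assuming time is measured in hours and a single global decay rate (in practice $\lambda$ would be fit per skill or per student from timestamps):

```python
import math

def forgetting_prob(delta_t_hours, lam=0.01):
    """P(F | Δt) = 1 - exp(-λ·Δt): probability a known skill
    has been forgotten after delta_t_hours without practice."""
    return 1.0 - math.exp(-lam * delta_t_hours)

def decay_mastery(p_know, delta_t_hours, lam=0.01):
    """Apply the bidirectional transition: mass in the 'known' state
    leaks to 'unknown' in proportion to the forgetting probability."""
    return p_know * (1.0 - forgetting_prob(delta_t_hours, lam))

# With lam=0.01 per hour, the half-life of a known skill is
# ln(2)/0.01 ≈ 69.3 hours: a mastery estimate of 0.9 decays to 0.45
```

This decay step would be applied once at the start of a session, before the regular per-response BKT updates resume.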

Q6: The ASSISTments and EdNet datasets are commonly used for knowledge tracing benchmarks. What are their key differences?

ASSISTments 2009-2010 (Feng et al.) is the most widely used benchmark: roughly 325,000 interactions from US middle school students on math skills, with about 110 distinct knowledge components in the commonly used skill-builder subset. It is relatively small by modern standards and covers a narrow domain.

EdNet (Choi et al., 2020) is substantially larger: over 131 million interactions from 780,000 students on an online SAT preparation platform, with 1,000+ knowledge concepts across math and English. EdNet better reflects production-scale knowledge tracing but introduces challenges from the SAT prep domain specificity.

Junyi Academy (another common benchmark) covers Taiwanese K-12 math with rich temporal information including exact timestamps, enabling research on forgetting effects.

When evaluating a new knowledge tracing model, reporting on multiple benchmarks is important because models that perform best on ASSISTments do not always rank the same on EdNet - the datasets have different skill granularity, interaction density, and student population characteristics.


Summary

Knowledge tracing models estimate the probability that a student has mastered a skill from their interaction history. BKT's hidden Markov model is interpretable and data-efficient but assumes binary knowledge and no forgetting. DKT's LSTM learns continuous knowledge dynamics from data but requires substantial interaction history and is opaque. DKVMN adds interpretable per-concept memory. SAKT uses self-attention for better long-range dependency capture. AKT integrates Rasch model difficulty parameters for theoretically grounded attention weights. The choice of model depends on data availability, interpretability requirements, and latency constraints. Evaluation must go beyond next-step AUC to verify that the model actually tracks learning dynamics and not just item difficulty patterns.

© 2026 EngineersOfAI. All rights reserved.