
AI-Powered Assessment

Reading time: ~42 min · Interview relevance: High · Target roles: ML Engineer, NLP Engineer, EdTech Engineer

Opening: The Grading Bottleneck

A professor at a large state university teaches an introductory economics course. The class has 800 students. Every week she assigns a short-answer response: explain in three to five sentences why rent control reduces housing supply. Every week, 800 responses come in. She has two teaching assistants, each working 20 hours per week on the course. By the time a student gets feedback, the next assignment is already due. The feedback is cursory. Students who would benefit most from detailed correction of their reasoning never get it.

This is the grading bottleneck. It is not unique to large lectures. It afflicts middle school teachers grading writing assignments, online courses with thousands of learners, language learning apps where speaking practice produces open-ended responses, and standardized testing programs that need consistent scoring across millions of test-takers. Human grading is accurate but slow, expensive, and inconsistent - the same essay handed to different graders, or to the same grader at different times, often receives different scores.

Automated assessment is the attempt to break this bottleneck. The first serious attempt came in 1966 when Ellis Page built Project Essay Grade - a system that extracted surface features like word count, comma count, and average word length from essays, then regressed these against human scores. It worked surprisingly well. The insight was uncomfortable: many features of writing quality correlate with mechanical surface features, not just deep semantic content.

The field evolved through rule-based systems, feature engineering, and eventually transformer-based neural models. Today, state-of-the-art automated essay scoring systems achieve human-level agreement on holistic scoring tasks. But "human-level agreement" masks important failures: automated systems can be gamed by fluent nonsense, they inherit biases from their training data, and they remain brittle on out-of-distribution writing styles. Understanding these failure modes is as important as understanding the algorithms.

Modern automated assessment extends beyond essays. Short answer grading requires semantic understanding of key concepts. Code assessment requires execution and static analysis. Feedback generation requires LLMs that can be specific and pedagogically sound without giving away answers. Plagiarism detection has expanded from string matching to semantic similarity and LLM-generated text detection. Each of these is a distinct ML problem with distinct failure modes.


Why This Exists: The Cost of Manual Grading at Scale

The economics are straightforward. A trained human rater can score 10-15 essays per hour for holistic quality. At a fully-loaded labor cost of $25/hour, scoring a 1,000-student midterm costs $1,500-$2,500, takes days of calendar time, and introduces inter-rater reliability problems when multiple graders are involved. Inter-rater correlation for essay scoring by humans is typically $r = 0.7$ to $0.8$.

For standardized testing, the scale is even more extreme. The SAT writing section was scored by human raters for decades at enormous cost. Moving to automated scoring, or using automated scores as a first-pass with human review of borderline cases, fundamentally changes the economics.

The secondary argument is equity. Automated feedback that is immediate and specific can reach students who lack access to private tutors or teachers with time for individual attention. A student from a rural school district with no writing center can get detailed, actionable feedback on an essay within seconds of submission. If the system is good, this reduces rather than amplifies educational inequality.

The risks are real. Automated scores can be unfair to students whose writing styles differ from the training data - non-native English speakers, students using African American Vernacular English, students from educational systems with different rhetorical conventions. Consequential decisions based on automated scores require exceptional fairness standards.


Historical Context: From Project Essay Grade to Transformers

1966 - Project Essay Grade (PEG): Ellis Page regressed surface features against human scores. Demonstrated that automated scoring was feasible for holistic quality. The features were purely mechanical - not semantic.

1998 - e-rater (ETS): Educational Testing Service deployed e-rater for scoring GMAT analytical writing. e-rater used linguistic features: discourse structure, syntactic variety, vocabulary sophistication, usage errors. It was the first production AES system for high-stakes testing.

2012 - ASAP Competition (Kaggle): The Automated Student Assessment Prize released 12,978 essays with human scores across 8 prompts. This dataset democratized AES research. Quadratic Weighted Kappa (QWK) became the standard metric. Winning systems used random forests over hand-crafted features.

2016 - Neural AES: Taghipour and Ng published LSTM-based AES that outperformed feature-engineered systems on the ASAP benchmark. End-to-end learning from raw text beat hand-crafted features.

2019-2020 - BERT-based AES: Fine-tuning BERT on essay scoring tasks pushed performance further. The key insight was that pre-trained language models capture discourse coherence, argumentation quality, and vocabulary sophistication in their representations without feature engineering.

2023+ - LLM-based Feedback: GPT-4 and Claude can generate rubric-aligned feedback indistinguishable from expert feedback in blind evaluations. The challenge shifted from "can we score it?" to "can we score it fairly and generate feedback that is actionable without being harmful?"


Core Concepts

Feature-Based Automated Essay Scoring

Before neural methods, AES relied on hand-crafted features in several categories:

Text Statistics: word count, sentence count, average sentence length, vocabulary size, type-token ratio (TTR = unique words / total words), average word length. These capture verbosity and vocabulary breadth.

Discourse Structure: presence of introduction/conclusion paragraphs, paragraph count, use of discourse connectives ("however", "therefore", "in contrast"). These capture organizational quality.

Syntactic Features: parse tree depth, clause count per sentence, passive voice ratio, subordinate clause frequency. These require a parser and capture syntactic sophistication.

Error Detection: spelling error rate, grammar error count from rule-based systems. Correlated strongly with scores in early systems because surface errors are highly visible to human raters.

Lexical Sophistication: mean word frequency rank (rarer words indicate higher vocabulary level), proportion of Academic Word List terms, Word Maturity scores.

A gradient boosting model over these features achieves QWK $\approx 0.70$ on the ASAP benchmark. The fundamental limitation: features capture correlation with quality but not causation. A student can score well by writing long, complex-sounding sentences that say nothing meaningful.
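
A minimal sketch of a surface-feature extractor of this kind (the feature set, regexes, and connective list are illustrative, not the exact PEG or e-rater features):

import re
import numpy as np

def extract_surface_features(essay: str) -> dict:
    """Illustrative surface features of the kind used in feature-based AES."""
    words = re.findall(r"[A-Za-z']+", essay.lower())
    sentences = [s for s in re.split(r'[.!?]+', essay) if s.strip()]
    paragraphs = [p for p in essay.split('\n') if p.strip()]
    connectives = {'however', 'therefore', 'moreover', 'furthermore', 'consequently'}
    return {
        'word_count': len(words),
        'sentence_count': len(sentences),
        'avg_sentence_length': len(words) / max(len(sentences), 1),
        'type_token_ratio': len(set(words)) / max(len(words), 1),
        'avg_word_length': float(np.mean([len(w) for w in words])) if words else 0.0,
        'paragraph_count': len(paragraphs),
        'connective_count': sum(w in connectives for w in words),
    }

# Features like these would then be fed to a regressor (e.g., gradient boosting)
# trained against human scores.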

Neural Automated Essay Scoring

LSTM-based AES (Taghipour and Ng, 2016) treats the essay as a token sequence:

  1. Embed each word: $\mathbf{e}_t = \text{Embedding}(w_t)$
  2. Encode with bidirectional LSTM: $\mathbf{h}_t = \text{BiLSTM}(\mathbf{h}_{t-1}, \mathbf{e}_t)$
  3. Mean pool over time: $\mathbf{v} = \frac{1}{T}\sum_t \mathbf{h}_t$
  4. Predict score: $\hat{s} = \sigma(\mathbf{w}^T \mathbf{v}) \cdot (s_{max} - s_{min}) + s_{min}$
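
A minimal PyTorch sketch of this architecture, assuming integer token ids as input; the embedding and hidden sizes are illustrative, not the original paper's hyperparameters:

import torch
import torch.nn as nn

class LSTMEssayScorer(nn.Module):
    """Bidirectional LSTM essay scorer with mean-over-time pooling."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256,
                 score_min=0.0, score_max=10.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 1)
        self.score_min, self.score_max = score_min, score_max

    def forward(self, token_ids):                    # (batch, seq_len)
        h, _ = self.lstm(self.embedding(token_ids))  # (batch, seq_len, 2*hidden)
        v = h.mean(dim=1)                            # mean pooling over time
        s = torch.sigmoid(self.out(v)).squeeze(-1)   # score in [0, 1]
        return s * (self.score_max - self.score_min) + self.score_min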

BERT-based AES (dominant since 2019):

  1. Tokenize essay, handle BERT's 512 token limit via truncation or hierarchical encoding
  2. Encode: $\mathbf{H} = \text{BERT}([\text{CLS}] + \text{tokens})$
  3. Take [CLS] token: $\mathbf{v} = \mathbf{H}_0$
  4. Regression head: $\hat{s} = \text{Linear}(\mathbf{v})$

Quadratic Weighted Kappa (QWK) - the standard AES metric:

$$\kappa = 1 - \frac{\sum_{i,j} W_{ij} O_{ij}}{\sum_{i,j} W_{ij} E_{ij}}$$

where $O_{ij}$ is the observed agreement matrix, $E_{ij}$ is expected agreement under independence, and $W_{ij} = \frac{(i-j)^2}{(N-1)^2}$ is the quadratic disagreement weight. QWK of 0.80+ is considered good; human-human QWK on ASAP is typically 0.75-0.85.
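
The code example later in this page computes QWK with scikit-learn's cohen_kappa_score; for intuition, a direct implementation of the formula above might look like this:

import numpy as np

def qwk_from_scratch(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa computed directly from the definition."""
    n = max_rating - min_rating + 1
    a = np.asarray(rater_a) - min_rating
    b = np.asarray(rater_b) - min_rating

    # Observed agreement matrix O: counts of (score_a, score_b) pairs
    O = np.zeros((n, n))
    for i, j in zip(a, b):
        O[i, j] += 1

    # Expected matrix E under independence: outer product of the marginals
    hist_a = O.sum(axis=1)
    hist_b = O.sum(axis=0)
    E = np.outer(hist_a, hist_b) / O.sum()

    # Quadratic disagreement weights W_ij = (i - j)^2 / (n - 1)^2
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (W * O).sum() / (W * E).sum()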

Rubric-Based Multi-Dimensional Scoring

Holistic scoring (a single quality score) is simpler but gives students no diagnostic information. Rubric-based scoring assigns separate scores to multiple dimensions: content, organization, style, grammar. This is more useful for feedback but harder to train.

Multi-task learning with BERT handles this naturally:

  • Shared encoder for all rubric dimensions
  • Separate regression head per dimension
  • Joint loss: $\mathcal{L} = \sum_k \lambda_k \mathcal{L}_k$
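
A minimal sketch of that joint loss, assuming per-dimension predictions and targets are dicts of tensors (the dimension names and weights are illustrative):

import torch
import torch.nn as nn

def rubric_joint_loss(pred_scores, target_scores, dimension_weights):
    """
    Weighted multi-task loss over rubric dimensions.
    pred_scores / target_scores: dicts {dimension: tensor of scores in [0, 1]}
    dimension_weights: dict {dimension: lambda_k}
    """
    mse = nn.MSELoss()
    total = torch.tensor(0.0)
    for dim, weight in dimension_weights.items():
        total = total + weight * mse(pred_scores[dim], target_scores[dim])
    return total

# Example: weight content and organization more heavily than mechanics.
# weights = {'content': 0.4, 'organization': 0.3, 'style': 0.2, 'grammar': 0.1}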

The rubric alignment challenge: a criterion like "develops argument with specific evidence" requires semantic understanding of what counts as "specific evidence." This is where pre-trained language model representations outperform hand-crafted features.

Short Answer Grading

Short answer grading differs from essay scoring: answers are short (1-3 sentences) and there is typically a reference answer or rubric points to compare against.

Semantic Similarity Approach: Embed both student answer and reference answer with a sentence transformer, compute cosine similarity, threshold to assign a score. Simple and fast, works well when paraphrases are close in embedding space.

NLI-based Approach: Frame as textual entailment - does the student answer entail the key points of the reference answer? Fine-tuned cross-encoders (e.g., DeBERTa on NLI) directly compare the two texts and output entailment probability.

Rubric Coverage Approach: Define key concepts that must appear in the correct answer. Score based on how many rubric points are semantically covered. This is more robust to phrasing variation than reference-similarity because it checks for specific information rather than overall similarity.
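
A minimal sketch of the similarity and rubric-coverage approaches with sentence-transformers; the model name, threshold, and example rubric points are illustrative:

from typing import List
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')

def similarity_score(student_answer: str, reference_answer: str) -> float:
    """Reference-similarity grading: cosine similarity between embeddings."""
    emb = model.encode([student_answer, reference_answer], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def rubric_coverage_score(student_answer: str, rubric_points: List[str],
                          threshold: float = 0.6) -> float:
    """Fraction of rubric points semantically covered by the answer."""
    ans_emb = model.encode(student_answer, convert_to_tensor=True)
    point_embs = model.encode(rubric_points, convert_to_tensor=True)
    sims = util.cos_sim(point_embs, ans_emb).squeeze(-1)  # one sim per rubric point
    return float((sims >= threshold).float().mean())

# Example rubric points for the rent-control question from the opening:
# ["price ceiling is set below the market equilibrium price",
#  "quantity supplied falls at the lower price",
#  "shortage: quantity demanded exceeds quantity supplied"]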

Automated Feedback Generation with LLMs

LLMs can generate rubric-aligned feedback, but several constraints must be enforced in the prompt:

  1. Specificity: feedback must reference specific parts of the student's essay
  2. Accuracy: feedback must be factually correct about the essay content
  3. Pedagogical soundness: guide toward the answer, do not give the answer
  4. Grade-level appropriateness: vocabulary matched to student level

Structured JSON output from LLMs allows downstream validation and rendering per rubric dimension.

Feedback quality criteria:

  • Specificity: does it reference a specific paragraph, sentence, or claim?
  • Actionability: does it tell the student what to do differently?
  • Accuracy: is it factually correct about the essay content?
  • Non-revealing: does it avoid giving the student the correct answer to copy?

Plagiarism Detection

Fingerprint-based detection: Hash character n-grams from the document, query a database of known hashes. Fast and scalable, catches exact and near-exact copies. Fails on paraphrased plagiarism.

Semantic similarity detection: Embed candidate document and corpus documents with sentence transformers, find nearest neighbors. Catches paraphrase plagiarism that fingerprinting misses. Computationally more expensive.

LLM-generated text detection: A new problem since 2023. Classifier-based detectors achieve 70-85% accuracy but have 3-8% false positive rates - unacceptable for accusations in high-stakes contexts. Watermarking (embedding statistical patterns in LLM token choices) is more reliable but requires LLM cooperation.

Bias in Automated Scoring

Automated scoring systems inherit biases from their training data:

Demographic bias: Training on essays from native English speakers means systematic underestimation of non-native speakers and writers from different rhetorical traditions.

Topic bias: Models trained on specific prompts may not generalize. A student writing about an unfamiliar topic uses simpler vocabulary; the model may confuse this with lower quality.

Gaming the system: Students who learn what features drive the score can optimize for surface features rather than content quality. Longer essays, more complex vocabulary, and better paragraph structure can inflate scores without improving argument quality.

Audit methods: Compute QWK separately by demographic group. Apply the 4/5ths rule (disparate impact test): if any group's average score is below 80% of the highest-scoring group's average, flag for investigation.
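
The bias-audit code later in this page reports group means, ANOVA, and the disparate impact ratio; per-group QWK, as described here, could be added along these lines (the column names are illustrative):

import pandas as pd
from sklearn.metrics import cohen_kappa_score

def qwk_by_group(df: pd.DataFrame, group_col: str,
                 model_col: str = 'model_score',
                 human_col: str = 'human_score') -> pd.Series:
    """Quadratic Weighted Kappa between model and human scores, per group."""
    return df.groupby(group_col).apply(
        lambda g: cohen_kappa_score(
            g[human_col].round().astype(int),
            g[model_col].round().astype(int),
            weights='quadratic'
        )
    )

# A group whose QWK is well below the overall QWK is a signal that the
# model agrees with human raters less reliably for that population.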


Mermaid Diagram: AI Assessment Pipeline

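A sketch of the pipeline described in this page, in Mermaid syntax (the stage names summarize the sections above, not a prescribed architecture):

flowchart LR
    A[Student submission] --> B[Plagiarism / AI-text checks]
    B --> C[Automated scoring: holistic + rubric heads]
    C --> D{Borderline or flagged?}
    D -- yes --> E[Human review]
    D -- no --> F[LLM feedback generation]
    E --> F
    F --> G[Feedback validation]
    G --> H[Score + feedback to student]
    E --> I[Log corrections for retraining]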

Code Examples

BERT-Based Essay Scoring

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
import numpy as np
from sklearn.metrics import cohen_kappa_score

class EssayDataset(Dataset):
    def __init__(self, essays, scores, tokenizer, max_length=512):
        self.essays = essays
        self.scores = scores
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.essays)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.essays[idx],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'score': torch.tensor(self.scores[idx], dtype=torch.float)
        }


class BERTEssayScorer(nn.Module):
    """
    BERT fine-tuned for essay scoring.
    Outputs a holistic score normalized to [0, 1].
    """
    def __init__(self, bert_model_name='bert-base-uncased', dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        cls_output = self.dropout(cls_output)
        score = torch.sigmoid(self.regressor(cls_output))
        return score.squeeze()


class RubricScoringModel(nn.Module):
    """
    Multi-task BERT for rubric-based scoring.
    Separate head per rubric dimension, shared encoder.
    """
    def __init__(self, rubric_dimensions, bert_model_name='bert-base-uncased', dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(dropout)
        # Separate regression head per dimension
        self.rubric_heads = nn.ModuleDict({
            dim: nn.Linear(self.bert.config.hidden_size, 1)
            for dim in rubric_dimensions
        })

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = self.dropout(outputs.last_hidden_state[:, 0, :])
        scores = {
            dim: torch.sigmoid(head(cls_output)).squeeze()
            for dim, head in self.rubric_heads.items()
        }
        return scores


def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """
    Quadratic Weighted Kappa - standard metric for AES.
    Penalizes large disagreements more than small ones.
    """
    y_true = np.clip(np.round(y_true).astype(int), min_rating, max_rating)
    y_pred = np.clip(np.round(y_pred).astype(int), min_rating, max_rating)
    return cohen_kappa_score(y_true, y_pred, weights='quadratic')


def train_essay_scorer(model, train_loader, val_loader, score_min, score_max,
                       n_epochs=10, lr=2e-5):
    """Fine-tune BERT essay scorer with MSE loss."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    criterion = nn.MSELoss()
    best_qwk = -1

    for epoch in range(n_epochs):
        model.train()
        train_loss = 0
        for batch in train_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            # Normalize scores to [0, 1]
            scores = ((batch['score'] - score_min) /
                      (score_max - score_min)).to(device)

            optimizer.zero_grad()
            pred = model(input_ids, attention_mask)
            loss = criterion(pred, scores)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        all_preds, all_targets = [], []
        with torch.no_grad():
            for batch in val_loader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                pred = model(input_ids, attention_mask).cpu().numpy()
                # Denormalize to original scale
                pred_scores = pred * (score_max - score_min) + score_min
                all_preds.extend(pred_scores)
                all_targets.extend(batch['score'].numpy())

        qwk = quadratic_weighted_kappa(
            np.array(all_targets), np.array(all_preds),
            score_min, score_max
        )
        avg_loss = train_loss / len(train_loader)
        print(f"Epoch {epoch+1}: loss={avg_loss:.4f}, QWK={qwk:.4f}")

        if qwk > best_qwk:
            best_qwk = qwk
            torch.save(model.state_dict(), 'best_essay_scorer.pt')

    return best_qwk

Rubric Alignment with Semantic Similarity

from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np
from typing import Dict, List, Tuple

class RubricAligner:
    """
    Align essay content to rubric criteria using semantic similarity.
    Identifies which rubric criteria are addressed in each essay paragraph.
    """
    def __init__(self, model_name='all-mpnet-base-v2'):
        self.model = SentenceTransformer(model_name)

    def align_to_rubric(
        self,
        essay: str,
        rubric_criteria: Dict[str, str],
        threshold: float = 0.4
    ) -> Dict[str, Dict]:
        """
        For each rubric criterion, find the essay segments that address it.

        Args:
            essay: full essay text
            rubric_criteria: {criterion_name: criterion_description}
            threshold: minimum similarity to count as addressing a criterion

        Returns:
            For each criterion: {
                'addressed': bool,
                'best_match_score': float,
                'best_match_paragraph': str
            }
        """
        paragraphs = [p.strip() for p in essay.split('\n') if len(p.strip()) > 30]
        if not paragraphs:
            paragraphs = [essay]

        # Encode all paragraphs and criteria
        para_embeddings = self.model.encode(paragraphs, convert_to_tensor=True)
        criteria_embeddings = self.model.encode(
            list(rubric_criteria.values()),
            convert_to_tensor=True
        )

        # Compute similarity matrix: (n_criteria, n_paragraphs)
        sim_matrix = util.cos_sim(criteria_embeddings, para_embeddings)

        results = {}
        for i, (criterion_name, criterion_desc) in enumerate(rubric_criteria.items()):
            sims = sim_matrix[i]  # Similarity to each paragraph
            best_idx = int(sims.argmax())
            best_score = float(sims[best_idx])

            results[criterion_name] = {
                'addressed': best_score >= threshold,
                'best_match_score': best_score,
                'best_match_paragraph': paragraphs[best_idx] if paragraphs else '',
                'coverage_score': float((sims >= threshold).float().mean())
            }

        return results

    def compute_rubric_score(
        self,
        alignment_results: Dict[str, Dict],
        criterion_weights: Dict[str, float] = None
    ) -> float:
        """
        Compute overall rubric score from alignment results.
        Returns score in [0, 1].
        """
        criteria = list(alignment_results.keys())
        if criterion_weights is None:
            criterion_weights = {c: 1.0 / len(criteria) for c in criteria}

        score = sum(
            criterion_weights.get(c, 1.0 / len(criteria)) *
            alignment_results[c]['best_match_score']
            for c in criteria
        )
        return min(1.0, score)

LLM Feedback Generation with Structured Prompts

import json
from typing import Dict, Optional

FEEDBACK_PROMPT = """You are an expert writing teacher providing feedback on a student essay.

Essay Prompt: {essay_prompt}

Student Grade Level: {grade_level}

Rubric Dimensions and Current Scores:
{rubric_scores}

Student Essay:
---
{essay_text}
---

Provide specific, actionable feedback for each rubric dimension. Rules:
1. Reference a specific part of the essay (paragraph number, a sentence, a claim).
2. Do NOT rewrite the essay or give the student text to copy.
3. Guide toward improvement with a question or suggestion.
4. Keep each feedback item under 60 words.
5. Use vocabulary appropriate for {grade_level} students.

Return valid JSON with rubric dimension names as keys and feedback strings as values.
Add an "overall" key with a 1-2 sentence encouraging summary.

JSON:"""

def generate_essay_feedback(
    essay_text: str,
    essay_prompt: str,
    rubric_scores: Dict[str, float],
    grade_level: str,
    llm_client,
    model: str = "gpt-4o"
) -> Dict[str, str]:
    """
    Generate rubric-aligned feedback using an LLM.

    Args:
        essay_text: the student's essay
        essay_prompt: the original writing prompt
        rubric_scores: dict of {dimension: score_0_to_4}
        grade_level: e.g., "8th grade"
        llm_client: initialized OpenAI client
        model: LLM model to use

    Returns:
        dict of {dimension: feedback_string, "overall": summary}
    """
    rubric_scores_str = "\n".join(
        f"- {dim}: {score:.1f}/4.0" for dim, score in rubric_scores.items()
    )

    prompt = FEEDBACK_PROMPT.format(
        essay_prompt=essay_prompt,
        grade_level=grade_level,
        rubric_scores=rubric_scores_str,
        essay_text=essay_text[:3000]  # Limit very long essays
    )

    response = llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=600,
        response_format={"type": "json_object"}
    )

    try:
        feedback = json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        feedback = {"overall": "Feedback generation failed - flagged for human review."}

    return feedback


def validate_feedback(feedback: Dict[str, str]) -> Dict[str, bool]:
    """
    Check feedback quality before delivering to student.
    Flags: too short, no specific reference, may be revealing the answer.
    """
    flags = {}
    answer_reveal_phrases = [
        "the answer is", "you should write", "correct answer",
        "you need to say", "the right response"
    ]

    for dimension, text in feedback.items():
        if not text or not isinstance(text, str):
            flags[f"{dimension}_empty"] = True
            continue

        word_count = len(text.split())
        flags[f"{dimension}_too_short"] = word_count < 15

        specific_refs = any(phrase in text.lower() for phrase in [
            "paragraph", "sentence", "when you", "in your",
            "your use", "you wrote", "you mention"
        ])
        flags[f"{dimension}_no_specific_reference"] = not specific_refs

        text_lower = text.lower()
        flags[f"{dimension}_may_reveal_answer"] = any(
            phrase in text_lower for phrase in answer_reveal_phrases
        )

    return {k: v for k, v in flags.items() if v}

Plagiarism Detection with Sentence Transformers

import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Tuple, Dict

class PlagiarismDetector:
    """
    Two-stage plagiarism detection:
    1. Fast n-gram fingerprinting for exact/near-exact copies
    2. Semantic similarity for paraphrase detection
    """
    def __init__(self, model_name='all-MiniLM-L6-v2', semantic_threshold=0.82):
        self.model = SentenceTransformer(model_name)
        self.semantic_threshold = semantic_threshold
        self.corpus_fps: Dict[str, set] = {}
        self.corpus_embeds: Dict[str, np.ndarray] = {}

    def _ngram_fingerprint(self, text: str, n: int = 5) -> set:
        text = text.lower().replace('\n', ' ')
        hashes = set()
        for i in range(len(text) - n + 1):
            gram = text[i:i+n]
            hashes.add(hashlib.md5(gram.encode()).hexdigest()[:8])
        return hashes

    def _jaccard(self, s1: set, s2: set) -> float:
        if not s1 or not s2:
            return 0.0
        return len(s1 & s2) / len(s1 | s2)

    def add_to_corpus(self, doc_id: str, text: str):
        """Add a document to the reference corpus."""
        self.corpus_fps[doc_id] = self._ngram_fingerprint(text)
        sentences = [s.strip() for s in text.split('.') if len(s.strip()) > 20]
        if sentences:
            self.corpus_embeds[doc_id] = self.model.encode(sentences)

    def check(self, text: str) -> List[Tuple[str, float, str]]:
        """
        Check a document for plagiarism.
        Returns list of (doc_id, score, method) sorted by score descending.
        """
        results = []
        candidate_fp = self._ngram_fingerprint(text)

        # Stage 1: fingerprint
        for doc_id, doc_fp in self.corpus_fps.items():
            sim = self._jaccard(candidate_fp, doc_fp)
            if sim > 0.25:
                results.append((doc_id, sim, "fingerprint"))

        already_flagged = {r[0] for r in results}

        # Stage 2: semantic similarity
        candidate_sentences = [s.strip() for s in text.split('.') if len(s.strip()) > 20]
        if candidate_sentences:
            c_embeds = self.model.encode(candidate_sentences)
            for doc_id, doc_embeds in self.corpus_embeds.items():
                if doc_id in already_flagged:
                    continue
                # Cosine similarity between all sentence pairs
                norms_c = np.linalg.norm(c_embeds, axis=1, keepdims=True)
                norms_d = np.linalg.norm(doc_embeds, axis=1, keepdims=True)
                sims = (c_embeds @ doc_embeds.T) / (norms_c * norms_d.T + 1e-9)
                max_sim = float(sims.max())
                if max_sim > self.semantic_threshold:
                    results.append((doc_id, max_sim, "semantic"))

        results.sort(key=lambda x: -x[1])
        return results

Bias Audit for Scoring Model

import pandas as pd
import numpy as np
from scipy import stats
from typing import Dict, List

def demographic_bias_audit(
    df: pd.DataFrame,
    score_col: str,
    human_score_col: str,
    demographic_cols: List[str]
) -> Dict:
    """
    Audit automated scoring system for demographic bias.

    Checks score distribution shift, mean error by group,
    statistical significance, and disparate impact ratio.
    """
    df = df.copy()
    df['error'] = df[score_col] - df[human_score_col]
    df['abs_error'] = df['error'].abs()

    results = {}
    for col in demographic_cols:
        groups = df[col].dropna().unique()
        group_stats = {}

        for group in groups:
            subset = df[df[col] == group]
            group_stats[str(group)] = {
                'n': len(subset),
                'mean_model_score': float(subset[score_col].mean()),
                'mean_human_score': float(subset[human_score_col].mean()),
                'mean_abs_error': float(subset['abs_error'].mean()),
                'mean_bias': float(subset['error'].mean()),
                'std_error': float(subset['error'].std())
            }

        # ANOVA: are mean model scores significantly different across groups?
        group_score_arrays = [df[df[col] == g][score_col].dropna().values
                              for g in groups]
        f_stat, p_value = stats.f_oneway(*group_score_arrays)

        # Disparate impact: ratio of lowest to highest mean score
        mean_scores = [s['mean_model_score'] for s in group_stats.values()]
        disparate_impact = (min(mean_scores) / max(mean_scores)
                            if max(mean_scores) > 0 else 1.0)

        # Most biased group
        max_bias_group = max(
            group_stats.items(),
            key=lambda x: abs(x[1]['mean_bias'])
        )

        results[col] = {
            'group_stats': group_stats,
            'anova_p_value': float(p_value),
            'significant_difference': bool(p_value < 0.05),
            'disparate_impact_ratio': float(disparate_impact),
            'disparate_impact_flag': bool(disparate_impact < 0.8),
            'most_biased_group': max_bias_group[0],
            'max_bias_magnitude': float(abs(max_bias_group[1]['mean_bias']))
        }

    return results

Production Engineering Notes

Always use human-in-the-loop for high-stakes decisions. Automated scores for practice and formative feedback are lower stakes than scores that affect grades or placement. For summative assessment, flag borderline cases - typically defined as within one scoring band of a decision threshold - for human review. The human review rate should be set conservatively and monitored over time.
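
A minimal sketch of that borderline-flagging rule (the one-band margin follows the definition above; the function name and example cutoff are illustrative):

def needs_human_review(model_score: float, decision_threshold: float,
                       band_width: float = 1.0) -> bool:
    """Flag a score that falls within one scoring band of a decision threshold."""
    return abs(model_score - decision_threshold) < band_width

# Example: placement cutoff at 3.0 on a 1-6 scale, one-band margin
# needs_human_review(3.4, decision_threshold=3.0)  -> True  (route to human)
# needs_human_review(5.2, decision_threshold=3.0)  -> False (auto-score stands)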

Build a feedback loop from human corrections. When human reviewers override automated scores, that is labeled data. Log every correction with metadata and retrain periodically. Track correction patterns: if the model is systematically wrong on essays from a particular prompt or demographic, that tells you what training data to collect.

Monitor score distribution shift continuously. The distribution of automated scores should match historical human score distributions. Set alerts when mean automated score for a prompt drifts more than 0.5 points from the historical mean, or when the variance changes significantly.
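
A minimal sketch of such a drift check; the 0.5-point mean threshold follows the text above, and the variance-ratio threshold is illustrative:

import numpy as np

def check_score_drift(current_scores, historical_mean, historical_std,
                      mean_threshold=0.5, var_ratio_threshold=1.5):
    """Flag a prompt whose automated score distribution has drifted."""
    current_scores = np.asarray(current_scores, dtype=float)
    mean_drift = abs(current_scores.mean() - historical_mean)
    var_ratio = current_scores.var() / max(historical_std ** 2, 1e-9)
    return {
        'mean_drift': float(mean_drift),
        'variance_ratio': float(var_ratio),
        'alert': bool(mean_drift > mean_threshold
                      or var_ratio > var_ratio_threshold
                      or var_ratio < 1.0 / var_ratio_threshold),
    }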

Protect against adversarial gaming. After deployment, analyze whether student writing behavior is changing in ways consistent with gaming: increasing length without content improvement, adding academic vocabulary without coherent use, padding paragraphs. Introduce adversarial examples in retraining and hold periodic blind evaluations against human graders.

Store original submissions permanently and immutably. Plagiarism disputes, grade appeals, and fairness audits require access to the exact original submission. Log submissions with cryptographic timestamps before any processing.


Common Mistakes

:::danger Deploying AES for Placement Without Fairness Audit Automated essay scoring trained on standard academic English systematically underestimates writing quality from non-native speakers and speakers of stigmatized dialects. Deploying such a system for placement decisions without a demographic bias audit means students are placed in remedial courses based on linguistic background, not writing quality. The 4/5ths rule is a legal threshold in US employment law and a useful standard for educational fairness: if any group's mean score is below 80% of the highest-scoring group's mean, flag for investigation. :::

:::danger Truncating Long Essays at the BERT Window Without Strategy BERT's 512-token limit means a 1000-word essay gets cut. Naive truncation at the end means the conclusion and final arguments are never scored. Use a principled truncation strategy: always include the introduction paragraph, always include the conclusion paragraph, sample from the middle. Better yet, use hierarchical encoding or Longformer (4096 tokens). Whatever strategy you use, make it explicit and evaluate whether it introduces systematic bias (e.g., always missing arguments that appear in the middle of the essay). :::

:::warning LLM Feedback Can Mislead Confidently LLMs generate fluent, confident-sounding feedback. They can describe a paragraph's argument incorrectly, praise a factual error, or give feedback that directly reveals the correct answer. Always run automated validation (does the feedback reference actual text from the essay? does it contain answer-revealing phrases?) and sample human review of generated feedback before and after deployment. :::

:::warning Quadratic Weighted Kappa Can Mask Directional Bias QWK measures agreement but is symmetric - it does not distinguish between a model that systematically overscores and one that systematically underscores. A model can have good QWK (0.80) while systematically inflating scores for well-formatted essays and deflating scores for content-rich but rough essays. Always report mean score bias (model score minus human score) in addition to QWK. :::


Interview Questions and Answers

Q1: What is Quadratic Weighted Kappa and why is it used instead of accuracy or RMSE for essay scoring?

QWK measures agreement between two ordinal raters weighted so that larger disagreements are penalized more than smaller ones. A model that says 4 when the true score is 1 is penalized much more than a model that says 3 when the true score is 1.

Accuracy is inappropriate because essay scores are ordinal - being off by 1 is much better than being off by 3, but accuracy treats all errors equally. RMSE is better but it is unbounded and harder to compare across datasets with different score ranges. QWK is in [-1, 1], where 1 is perfect agreement, 0 is chance-level, and -1 is perfect systematic disagreement. Human-human QWK on the ASAP dataset is 0.75-0.85; good automated systems reach 0.80+. The metric is widely used in AES competitions and benchmarks.

Q2: How would you handle essays longer than BERT's 512-token context limit?

Several approaches exist, each with tradeoffs. Hierarchical encoding: split the essay into paragraphs, encode each with BERT, then aggregate paragraph representations with a second-level encoder. This preserves essay structure and can encode unlimited length. Sliding windows: process overlapping 512-token chunks and aggregate with attention pooling. Longformer or BigBird: sparse attention mechanisms extend the context to 4096 or 16384 tokens - appropriate for most essays. Truncation: only if essays are rarely longer than 512 tokens and you can show truncation does not systematically affect scores.

For essay scoring specifically, the conclusion and thesis statement are informationally dense. If truncating, always preserve the first 100 tokens and last 100 tokens and sample from the middle rather than hard-cutting at 512.
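
A minimal sketch of that head-plus-tail strategy over a list of token ids; the 100-token head and tail follow the answer above, everything else is illustrative:

def head_tail_truncate(token_ids, max_len=512, head=100, tail=100):
    """Keep the first and last tokens and sample evenly from the middle.

    token_ids: a plain list of integer token ids.
    """
    if len(token_ids) <= max_len:
        return token_ids
    middle_budget = max_len - head - tail
    middle = token_ids[head:-tail]
    # Take evenly spaced tokens from the middle rather than dropping it entirely
    step = max(1, len(middle) // middle_budget)
    sampled_middle = middle[::step][:middle_budget]
    return token_ids[:head] + sampled_middle + token_ids[-tail:]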

Q3: You discover your automated grading system has a systematic 0.3-point negative bias against non-native English speakers. What do you do?

Immediate action: flag submissions from this population for human review while investigating. Do not continue making placement decisions based on automated scores for this group.

Investigation: identify the source. Is it grammar error detection over-penalizing non-native surface errors? Vocabulary scoring rewarding academic words from one cultural context? Discourse structure expectations that differ across rhetorical traditions?

Long-term fix: collect training data that represents non-native English writers. If grammar errors are being over-penalized, train the model to separate surface error count from content quality - these should be separate rubric dimensions. Use adversarial data augmentation: take high-quality essays and introduce non-native English surface patterns while preserving content quality, then train the model not to penalize these patterns.

Monitoring: build a bias dashboard that tracks mean score by native language group and alerts when disparity exceeds 0.2 points. This is operational work, not just a one-time fix.

Q4: What is the difference between holistic and analytic/rubric-based AES, and when do you use each?

Holistic scoring assigns one overall quality score. Simpler to train, faster to compute, and appropriate for screening and ranking. Analytic scoring assigns separate scores to rubric dimensions (content, organization, style, mechanics). Requires more training data, a multi-task model, and more complex output handling - but produces actionable diagnostic information.

For formative feedback, analytic scoring is almost always better: a student who learns they scored 4/4 on content and 1/4 on mechanics knows exactly what to practice. For high-stakes summative assessment where a single score is needed (college placement, standardized testing), holistic scoring may be sufficient.

In practice, many systems use analytic scoring internally but present a holistic score to students, with the rubric scores available for drill-down. This gives you the best of both: simple communication of the overall score and detailed feedback when the student wants to understand why.

Q5: How would you build a system to detect LLM-generated student essays, and what are the ethical concerns?

Current detection approaches: classifier trained on human-written vs LLM-generated text. Models like GPTZero and Turnitin's AI detection reach 85-95% accuracy on held-out test sets. The problem is the false positive rate: 3-8% of human essays are classified as AI-generated. At a school of 1000 students, that is 30-80 false accusations per assessment. For high-stakes decisions, this is unacceptable.

Watermarking is more reliable: the LLM embeds statistical patterns in token choices that are detectable but invisible to the student. This requires the platform to use an LLM that supports watermarking.

Ethical concerns: the false positive rate disparately affects non-native English speakers and students with certain writing styles. Automated detection should never be the sole evidence in an academic integrity investigation. Process-based alternatives - requiring students to submit drafts, revision histories, and in-class writing samples - may be more equitable and more enforceable than post-hoc detection.

Q6: Short answer grading by cosine similarity misses a student who answered correctly but very differently. How do you improve it?

Cosine similarity between sentence embeddings works well for paraphrase equivalence but fails when correct answers are conceptually equivalent but use different frames. For example, "demand exceeds supply" and "supply is insufficient for demand" have low cosine similarity with each other but both correctly answer a question about shortages.

Better approach: decompose the reference answer into key concept requirements and check coverage independently. Define 3-5 concepts that must be present. Use per-concept semantic similarity: is the concept "supply falls" present in the student answer? Is the concept "price ceiling below equilibrium" addressed? Score based on how many rubric concepts are covered, not overall similarity to one reference phrasing.

For higher reliability, use a cross-encoder (fine-tuned DeBERTa or similar) that directly compares the student answer with each rubric requirement and outputs an entailment probability. Cross-encoders are more accurate than bi-encoder cosine similarity but are slower because they cannot pre-compute reference embeddings.
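
A sketch of that cross-encoder check using the sentence-transformers CrossEncoder API; 'cross-encoder/nli-deberta-v3-base' is one publicly available NLI cross-encoder, and the label order should be verified against the model card of whatever model you use:

from typing import List
import numpy as np
from sentence_transformers import CrossEncoder

# One publicly available NLI cross-encoder; any similar model works.
nli_model = CrossEncoder('cross-encoder/nli-deberta-v3-base')

def concept_entailment_scores(student_answer: str,
                              rubric_concepts: List[str]) -> dict:
    """Probability that the student answer entails each rubric concept."""
    pairs = [(student_answer, concept) for concept in rubric_concepts]
    logits = nli_model.predict(pairs)  # one row of NLI logits per pair
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # For this model the label order is (contradiction, entailment, neutral);
    # check the model card if you swap in a different cross-encoder.
    entail_idx = 1
    return {c: float(p[entail_idx]) for c, p in zip(rubric_concepts, probs)}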


Summary

AI-powered assessment spans automated essay scoring (BERT fine-tuned with QWK as the metric), rubric-based multi-dimensional scoring (multi-task learning), short answer grading (semantic similarity and NLI entailment), LLM feedback generation (structured prompts with quality validation), and plagiarism detection (fingerprint plus semantic similarity). The technical challenges are manageable. The hard challenges are fairness (bias auditing is non-negotiable), adversarial gaming (students can and will optimize for the model), and human oversight design (what gets automated vs what requires human review). For any consequential educational decision, the threshold for human review should be set conservatively and reviewed regularly.
