How does learnable work in practice?

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation covers disagreement, learnable, token from first principles with code examples. Free lesson at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-26-not-all-disagreement-is-learnable-token-teachability-in-onpolicy-distillation

What is the difference between disagreement and token?

See the full breakdown at https://engineersofai.com/docs/research/paper-breakdowns/2026-05-26-not-all-disagreement-is-learnable-token-teachability-in-onpolicy-distillation

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

:::info Stub — Full Engineering Breakdown Coming This paper has a linked code implementation and was featured on Hugging Face Papers with 19 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. Subscribe to AI Letters → :::


Authors	Yuanyi Wang et al.
Year	2026
HF Upvotes	19
arXiv	2605.26844
PDF	Download
Code	https://github.com/wyy-code/TA-OPD

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.

Engineering Breakdown

The Problem

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision.

The Approach

Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers.

Key Results

Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
Disagreement

:::tip Subscribe Get weekly breakdowns of papers like this in AI Letters - the newsletter for engineers building production AI systems. :::

Back to Research Lab → · Subscribe to AI Letters →

Abstract​

Engineering Breakdown​

The Problem​

The Approach​

Key Results​

Research Areas​