---
# System prepended metadata

title: 'The Imitation Game: State of Policy Distillation in Language Model training'

---

# The Imitation Game: State of Policy Distillation in Language Model training
## Table of Contents

1. Introduction
2. But what is On-policy Distillation?
3. Why only On-policy Distillation though?
4. A survey on On-Policy Distillation
   * On-Policy Distillation
   * On Policy Self Distillation
5. My opinion on the failure modes of On-Policy Distillation
6. Open Problems
7. OPD in the wild
8. Final thoughts
9. Appendix
## Introduction
Policy distillation (or knowledge distillation), and on-policy distillation (OPD) in particular, is all the hype in the post-training community right now, and for good reason. In 2026 alone, there have been ~40 papers released on the topic.

The inspiration for this post is the sheer volume of work on the topic, and my attempt to simplify it and keep all of my notes and thoughts about the wins and pitfalls of OPD in one place.

## But what is On-policy Distillation?
On-policy distillation is a variant of knowledge distillation, where a teacher model (usually a larger model) is used to distill its knowledge into a smaller student model. This lets the student reach performance similar to the teacher's at a fraction of the training cost.

More formally, knowledge distillation aims to train a student model $\pi_\theta$ to mimic the behavior of a teacher model $\pi_T$, by minimizing the divergence between their output distributions.

**Standard (Off-Policy) KD Objective**:
\begin{equation}
    \mathcal{L}_{KD}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ D_{KL} \left( \pi_T(\cdot \mid x) \parallel \pi_\theta(\cdot \mid x) \right) \right]
\end{equation}

where $\mathcal{D}$ is a fixed offline dataset, $x$ is the input prompt, $\pi_T(\cdot \mid x)$ is the teacher's output distribution and $\pi_\theta(\cdot \mid x)$ is the student's output distribution. Here the dataset contains both $x$ and $y$, and you compare $\pi_T(\cdot \mid x)$ against $\pi_\theta(\cdot \mid x)$ over the next-token distribution at each position of the pre-existing $y$.

On-policy distillation is a variant of the same idea, where the data the student is trained on (and receives supervision on) is generated by the student itself, that is, it comes from the same policy that is being trained. Formally,

**On-Policy KD Objective**: 
\begin{equation}
    \mathcal{L}_{OnKD}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ D_{KL} \left( \pi_T(\cdot \mid x, y) \parallel \pi_\theta(\cdot \mid x, y) \right) \right]
\end{equation}

where the key distinction is that $y \sim \pi_\theta(\cdot \mid x)$ — the responses are sampled from the *student's own distribution* rather than a fixed dataset.
![image](https://hackmd.io/_uploads/H1KH3KeeGe.png)

**Notation:** the rest of the post leans on a lot of math, so it helps to fix the symbols once. Nothing here is new, this is just a cheat sheet to come back to.

- **Prompt and response.** $x$ is the input prompt (say, "what is 12 + 7?") and $y = (y_1, \dots, y_T)$ is the response token sequence (e.g. "the answer is 19"). $y_{<t}$ is everything before position $t$, so if $y$ = "the answer is 19" and $t = 4$, then $y_{<t}$ = "the answer is". At every position, the model outputs a distribution over the full vocabulary $\mathcal{V}$ (roughly 50k–200k tokens for modern LMs), and a single token is sampled from it.
- **Policies.** $\pi_\theta$ is the student we are training. Different papers also write it as $p_S$ or $q_\theta$, all the same thing. $\pi_T$ (or $p_T$, $p$) is the teacher, usually frozen. When a method needs a third anchor model, like a pre-finetuning checkpoint to stay close to, we will call it $\pi_{\mathrm{ref}}$.
- **Dataset.** $\mathcal{D}$ is a set of prompts. In off-policy KD it contains $(x, y)$ pairs, in on-policy KD it usually just contains $x$ because the $y$ part comes from the student.
- **Forward vs reverse KL.** KL divergence measures how different two distributions are over the same vocabulary. Concretely, if the teacher's next-token distribution over $\mathcal{V} = \{A, B, C\}$ is $p_T = (0.7, 0.2, 0.1)$ and the student's is $\pi_\theta = (0.3, 0.6, 0.1)$, then $D_{\mathrm{KL}}(p_T \,\|\, \pi_\theta) = \sum_v p_T(v) \log \frac{p_T(v)}{\pi_\theta(v)} \approx 0.37$. The convention is: the first argument is the "reference", and the divergence blows up wherever the reference has mass but the other does not.
    - **Forward KL** $D_{\mathrm{KL}}(\pi_T \,\|\, \pi_\theta)$ has the teacher in front, so the student gets punished hard at any token the teacher likes but the student does not. The student has to cover every teacher mode. This is called **mode-covering**.
    - **Reverse KL** $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T)$ has the student in front, so the student is only punished where it itself puts mass on tokens the teacher dislikes. It can quietly ignore entire teacher modes as long as the modes it does pick are teacher-approved. This is **mode-seeking**.
    - SFT is forward KL, RL is reverse KL, and most of the OPD literature is a fight over which one to use and when.
- **Privileged information (PI).** Some methods give the teacher something the student doesn't see, like a worked solution or extra context. We write that as $y^{*}$ or $c$, and just say "PI" throughout the self-distillation section.
- **Expectations and on/off-policy.** $\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[\cdot]$ means "average over responses sampled from the student". Whatever sits in the subscript is the policy that generated the data. If that subscript is the student we are training, we are **on-policy**. If it is the teacher or a fixed dataset, we are **off-policy**. That single subscript is the whole game.
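
To make the forward vs reverse KL distinction concrete, here is a minimal sketch that recomputes the three-token example from the cheat sheet in both directions. It uses natural logs (so values are in nats), and the helper name `kl` is just for illustration.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]   # p_T over {A, B, C}
student = [0.3, 0.6, 0.1]   # pi_theta over {A, B, C}

forward_kl = kl(teacher, student)   # teacher in front: mode-covering direction
reverse_kl = kl(student, teacher)   # student in front: mode-seeking direction

print(f"forward KL = {forward_kl:.3f} nats")
print(f"reverse KL = {reverse_kl:.3f} nats")
```

The two numbers differ because KL is not symmetric: swapping the arguments changes which distribution decides where disagreement is counted.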


## Why only On-policy Distillation though? 
A big problem in LLM post-training is catastrophic forgetting: the drastic drop in performance on other tasks as the model gets better at the task it is being finetuned for.

The main reason is distributional shift, wherein the model deviates too far from its original pretrained distribution, undergoes mode collapse and loses performance.

In recent times, this has been studied in detail with excellent explanations in two papers: [RL's Razor](https://arxiv.org/abs/2509.04259) and [Retaining by Doing](https://arxiv.org/abs/2510.18874). 

RL's Razor argues that the KL divergence between the fine-tuned policy and the base policy is a good measure of catastrophic forgetting, since the divergence between two policies captures how far they are from each other in token space.
They further argue that policy gradient methods can be understood as a conservative projection that keeps the policy close to its starting point while reweighting toward higher-reward outcomes. At each step, the policy samples outputs it already finds likely, then re-weights those samples according to reward, shifting probability mass toward higher-reward outcomes while suppressing lower-reward ones. Since the sampled data, and hence the updates, are relative to the policy itself, they stay policy-local, unlike the distant updates that SFT can make.

Retaining by Doing explains this phenomenon in more detail, bringing in the mode-covering and mode-seeking behaviour of SFT and RL respectively.

The SFT objective decomposes into a forward KL plus a fixed entropy term, so up to constants it is exactly forward-KL minimisation:
\begin{equation}
    \mathcal{L}_{\mathrm{SFT}}(\theta; x)
    = \mathrm{KL}\!\left[\pi^*(\cdot \mid x)\,\|\,\pi_\theta(\cdot \mid x)\right] + \mathcal{H}(\pi^*(\cdot \mid x)).
\end{equation}
 

The KL term goes to zero only when $\pi_\theta$ and the target distribution $\pi^*$ (the teacher $\pi_T$, in a distillation setting) overlap completely. Thus, the policy is incentivised **not** to miss any mode: probability mass has to be spread across everything that $\pi^*$ considers likely, which is why SFT is mode-covering.

Similarly, the KL-regularised RL objective rearranges into a reverse KL against the optimal tilted policy $\pi^{*}$:
\begin{equation}
    J_{\mathrm{RL}}(\theta; x)
    = \mathbb{E}_{y \sim \pi_\theta}[r(x,y)] - \beta\,\mathrm{KL}\!\left[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\theta_0}(\cdot \mid x)\right]
    = -\beta\,\mathrm{KL}\!\left[\pi_\theta(\cdot \mid x)\,\|\,\pi^*(\cdot \mid x)\right] + \beta \log Z(x).
\end{equation}

The policy is penalised only where it places high probability on responses that $\pi^*$ considers unlikely, so it can ignore entire modes of $\pi^*$ as long as it covers the desired modes well.
![image](https://hackmd.io/_uploads/BkMJ3tleMx.png)
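
The rearrangement of the KL-regularised RL objective into a reverse KL is easy to check numerically. The sketch below assumes the usual tilted form $\pi^*(y \mid x) \propto \pi_{\theta_0}(y \mid x)\exp(r(x,y)/\beta)$ with normaliser $Z(x)$, and uses made-up numbers for a three-response toy problem. None of the values come from the papers.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

beta = 0.5
ref = [0.5, 0.3, 0.2]       # pi_theta0: the starting policy over 3 candidate responses
reward = [1.0, 0.0, 2.0]    # r(x, y) for each response (hypothetical values)
policy = [0.2, 0.1, 0.7]    # an arbitrary current policy pi_theta

# Tilted optimal policy: pi*(y) = pi_theta0(y) * exp(r(y) / beta) / Z
unnorm = [p * math.exp(r / beta) for p, r in zip(ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Left-hand side: expected reward minus beta * KL to the starting policy
lhs = sum(p * r for p, r in zip(policy, reward)) - beta * kl(policy, ref)
# Right-hand side: negative reverse KL to pi*, plus the constant beta * log Z
rhs = -beta * kl(policy, pi_star) + beta * math.log(Z)

print(lhs, rhs)  # the two sides agree up to floating-point error
```

Since $\beta \log Z(x)$ does not depend on $\theta$, maximising the RL objective is the same as minimising the reverse KL to $\pi^*$.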

SFT works well in a unimodal setting (a mode here refers to a task or skill we care about), for example neural machine translation, and does not forget, since the single mode we optimise for is all that we need. In a multi-modal setting, however, such as modern LMs where multiple skills like code, math and writing are baked into one model, the mode-covering nature of SFT often becomes a cause of distributional drift.

RL works well in this case, as it can seek out just the mode we are finetuning for. Since it is not penalised for failing to cover other modes, it does not try to, which preserves the model's existing behaviour.

**Error Compounding: The Quadratic Tax of Off-Policy Training**

Beyond forgetting, there is a second fundamental reason to prefer on-policy over off-policy distillation that is worth dwelling on: error compounding. This was formalized by Ross et al. (2011) via the DAgger analysis, which showed that training on a fixed offline dataset leads to *quadratic* error accumulation over the sequence length $T$, and that on-policy training reduces it to linear. With $\epsilon$ the per-step error on the training distribution:
\begin{equation}
    \mathbb{E}_{s \sim d_{\pi_\theta}}\!\left[\textstyle\sum_{t=1}^{T} \ell(s_t)\right] \leq O(\epsilon T^2)
    \quad \xrightarrow{\text{OPD}} \quad O(\epsilon T).
\end{equation}

![image](https://hackmd.io/_uploads/ryLb3txgfl.png)


The intuition here is subtle but important. In off-policy training, the student is supervised on sequences generated by the teacher or taken from a fixed dataset, sequences it would never have produced itself. At inference time, however, the student generates its own prefixes, and any early mistake shifts the sequence into a region of token space the model was never trained on. Having never seen this type of prefix, the model is more likely to make further mistakes or undergo mode collapse. Each error makes the next one more likely, and in the worst case the total error grows quadratically with sequence length.

For LLMs generating 1000+ tokens, this might turn out to be catastrophic. A single bad token early in a chain-of-thought can send the entire reasoning trace off the rails, and the model has no mechanism to recover because every subsequent state is increasingly out-of-distribution relative to its training data.

On-policy distillation directly addresses this by closing the **exposure bias gap**, the distributional mismatch between training-time inputs (teacher-generated) and inference-time inputs (student-generated). Since the student is always trained on its own generations, the prefixes it sees during training are exactly the kind of prefixes it will produce at inference. Early errors are therefore in-distribution, the compounding effect is dampened, and cumulative error grows only linearly with sequence length. This matters more and more as modern language models produce longer and longer agentic trajectories.
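
The quadratic vs linear gap can be reproduced with a deliberately pessimistic toy model. This is a sketch under stated assumptions, not the DAgger proof: the off-policy student errs with probability `eps` per step while it is still on the training distribution, and once it has made one error it is assumed to err on every remaining step. The on-policy student is assumed to keep the same per-step error rate even after a mistake. Both function names are made up for this illustration.

```python
def expected_mistakes_off_policy(eps, T):
    """Expected number of mistakes when the first error is never recovered from."""
    total, p_on = 0.0, 1.0
    for _ in range(T):
        p_on *= (1.0 - eps)   # probability of still being on-distribution after this step
        total += 1.0 - p_on   # the step is a mistake unless we are still on-distribution
    return total

def expected_mistakes_on_policy(eps, T):
    """Expected number of mistakes when errors do not change the per-step error rate."""
    return eps * T

eps = 1e-4
for T in (100, 200, 400):
    off = expected_mistakes_off_policy(eps, T)
    on = expected_mistakes_on_policy(eps, T)
    print(f"T={T}: off-policy ~ {off:.3f}, on-policy ~ {on:.3f}")
```

Doubling `T` roughly quadruples the off-policy count (while $\epsilon T$ is small) and only doubles the on-policy one, which is the $O(\epsilon T^2)$ vs $O(\epsilon T)$ behaviour in miniature.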

Now that we have a general idea of what on-policy distillation is, let's look at the latest work in the field, its wins and its failure modes.


## A survey on On-Policy Distillation

We will cover some of the recent works and advances in policy distillation in three main sections:
1. On-Policy Distillation
2. On-policy Self Distillation
3. Enabling On-Policy Distillation (works on cross-tokenisation and on closing knowledge gaps that I find interesting)


### On-Policy Distillation

#### MiniLLM: On-Policy Distillation of Large Language Models

MiniLLM is the foundational paper that explicitly framed LLM knowledge distillation as on-policy distillation, and most methods later in this section are refinements over its setup. The core move is to replace the forward KL in standard KD with reverse KL, and treat distillation as an RL problem where the teacher's log-prob acts as the reward signal. 

The reason this matters is mode coverage. Forward KL forces a small student to cover every mode of a much larger teacher, which it simply does not have the capacity for, and the result is mass spread over regions the student cannot represent well. Reverse KL flips the incentive: the student can ignore parts of the teacher distribution it cannot reach, and focus on the ones it can. This is the **mode-seeking** nature of reverse KL divergence, and the paper argues it is exactly what makes it useful for distilling a large teacher into a smaller student.

The MiniLLM objective minimises reverse KL between the student $q_\theta$ and the teacher $p$, which is equivalent to a policy-gradient setup where the per-token log-ratio acts as the reward:
\begin{equation}
    \theta = \arg\min_\theta \, \mathrm{KL}\!\left[q_\theta \,\|\, p\right],
    \qquad
    r_t = \log \frac{p(y_t \mid y_{<t}, x)}{q_\theta(y_t \mid y_{<t}, x)}.
\end{equation}

If the teacher assigns higher probability to a token than the student does, $r_t > 0$ and the token is encouraged; if the student is overconfident relative to the teacher, $r_t < 0$ and it is discouraged. The trajectory return is just the suffix sum $R_t = \sum_{t'=t}^{T} r_{t'}$, which gives token-level supervision with sequence-level credit assignment.
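
As a small illustration of the reward definition, the sketch below computes $r_t$ and the suffix returns $R_t$ for a four-token rollout. The probabilities are hypothetical, and the function names are not from the MiniLLM codebase.

```python
import math

def token_rewards(teacher_probs, student_probs):
    """r_t = log p(y_t | y_<t, x) - log q_theta(y_t | y_<t, x) for each sampled token."""
    return [math.log(p) - math.log(q) for p, q in zip(teacher_probs, student_probs)]

def suffix_returns(rewards):
    """R_t = sum of r_t' over t' >= t, accumulated from the last token backwards."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    return returns[::-1]

# Hypothetical probabilities each model assigns to the 4 tokens the student sampled
teacher_probs = [0.50, 0.40, 0.05, 0.30]
student_probs = [0.25, 0.40, 0.50, 0.30]

rewards = token_rewards(teacher_probs, student_probs)
returns = suffix_returns(rewards)
print([round(r, 3) for r in rewards])
print([round(R, 3) for R in returns])
```

The third token is one the student is overconfident about, so it gets a negative reward, and that penalty propagates back into the returns of the earlier tokens that led to it.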

On top of this, MiniLLM adds three practical fixes:

1. **Single-step decomposition.** The gradient is split into an immediate token-level term and a long-term future-reward term, $\nabla \mathcal{L} = (\nabla \mathcal{L})_{\mathrm{Single}} + (\nabla \mathcal{L})_{\mathrm{Long}}$. The single-step term computes the reverse-KL signal directly over the vocabulary at each step, so every prefix receives a dense next-token correction rather than a single scalar.

2. **Teacher-mixed sampling.** Pure student rollouts are low-quality early in training, so MiniLLM samples from $\tilde{p} = \alpha\, p + (1-\alpha)\, q_\theta$ and corrects with an approximate importance weight $w_t \approx q_\theta(y_t \mid \cdot) / \tilde{p}(y_t \mid \cdot)$. This keeps trajectories near the teacher's support and damps reward hacking.

3. **Length normalisation.** The long-term return is divided by the remaining length so the model is not biased toward short outputs.

The final update combines the single-step term, the length-normalised long-term term, and an auxiliary pretraining loss $\mathcal{L}_{PT}$ to preserve general language-modeling ability. The whole thing is best read as policy gradient with the teacher acting as the reward model, plus the three fixes above to keep variance under control.
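
Two of the fixes above are simple enough to sketch for a single decoding step. The code assumes a toy three-token vocabulary and uses illustrative names. The per-token weight follows the approximation $w_t \approx q_\theta / \tilde{p}$ stated above, and the exact indexing of "remaining length" in the normalisation is an assumption of this sketch.

```python
import random

def mixed_distribution(teacher, student, alpha):
    """p_tilde = alpha * p + (1 - alpha) * q_theta, computed per vocabulary entry."""
    return [alpha * p + (1 - alpha) * q for p, q in zip(teacher, student)]

def sample_with_weight(teacher, student, alpha, rng):
    """Sample a token from the mixture and return it with its importance weight."""
    mix = mixed_distribution(teacher, student, alpha)
    token = rng.choices(range(len(mix)), weights=mix, k=1)[0]
    weight = student[token] / mix[token]   # w_t ~ q_theta(y_t) / p_tilde(y_t)
    return token, weight

def length_normalised_returns(rewards):
    """Suffix return at each position divided by the number of tokens it sums over."""
    T = len(rewards)
    out, running = [0.0] * T, 0.0
    for t in range(T - 1, -1, -1):
        running += rewards[t]
        out[t] = running / (T - t)
    return out

teacher = [0.7, 0.2, 0.1]
student = [0.3, 0.6, 0.1]
alpha = 0.2
token, weight = sample_with_weight(teacher, student, alpha, random.Random(0))
print(token, round(weight, 3))
```

Because $\tilde{p} \geq (1-\alpha)\, q_\theta$ everywhere, the weight can never exceed $1/(1-\alpha)$, which is part of why mixing in the teacher keeps the correction well-behaved.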

#### On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Where MiniLLM commits to a fully on-policy setup with reverse KL, GKD asks whether we have to pick a side at all. It addresses the same **train-inference distribution mismatch** but with a more flexible fix: interpolate between dataset-supervised KD and on-policy KD with a single knob $\lambda$, and treat the divergence itself (forward KL, reverse KL, JSD) as another knob. 

The motivation is straightforward. If you only train on dataset sequences, the student never sees its own mistakes during training, so at inference time it walks into prefixes it has never been supervised on. If you only train on student rollouts, early in training the rollouts are garbage and the teacher's feedback on garbage is not always useful. GKD mixes them.

The student samples its own output $y \sim p_S^\theta(\cdot \mid x)$, the teacher provides per-token feedback on that trajectory, and GKD combines the on-policy term with a dataset-supervised term via a single mixing weight $\lambda$:
\begin{equation}
    \mathcal{L}_{\mathrm{GKD}}(\theta)
    = (1-\lambda)\,\mathbb{E}_{(x,y) \sim (X,Y)}\!\left[\mathcal{D}(p_T \,\|\, p_S^\theta)(y \mid x)\right]
    + \lambda\,\mathbb{E}_{x \sim X,\, y \sim p_S^\theta(\cdot \mid x)}\!\left[\mathcal{D}(p_T \,\|\, p_S^\theta)(y \mid x)\right].
\end{equation}
The token-level divergence $\mathcal{D}(p_T \,\|\, p_S^\theta)(y \mid x)$ is just the per-prefix divergence averaged over the sequence length, and gradients are not backpropagated through the sampling step. $\lambda = 0$ recovers ordinary supervised KD, $\lambda = 1$ is fully on-policy KD, and intermediate values interpolate. The choice of divergence ($\mathcal{D}$ as forward KL, reverse KL or JSD) is a second knob, and the paper discusses the mode-covering vs mode-seeking trade-off this exposes.
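
The two knobs can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: `jsd` here is the standard symmetric Jensen-Shannon divergence with an even mixture, `pick_source` treats $\lambda$ as a per-example sampling probability, and the per-position distributions are made up.

```python
import math
import random

def kl(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Symmetric Jensen-Shannon divergence with an even mixture (an assumption here)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

DIVERGENCES = {
    "forward_kl": lambda p_t, p_s: kl(p_t, p_s),
    "reverse_kl": lambda p_t, p_s: kl(p_s, p_t),
    "jsd": jsd,
}

def sequence_divergence(teacher_dists, student_dists, name):
    """Per-prefix divergence averaged over the sequence length."""
    d = DIVERGENCES[name]
    return sum(d(pt, ps) for pt, ps in zip(teacher_dists, student_dists)) / len(teacher_dists)

def pick_source(lam, rng):
    """Train on a student rollout with probability lam, else on a dataset sequence."""
    return "student_rollout" if rng.random() < lam else "dataset"

# Hypothetical next-token distributions at two positions of one sequence
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_dists = [[0.3, 0.6, 0.1], [0.2, 0.6, 0.2]]
for name in DIVERGENCES:
    print(name, round(sequence_divergence(teacher_dists, student_dists, name), 4))
```

Sweeping `name` and `lam` independently is exactly the two-dimensional hyperparameter search the unified framing suggests.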

GKD also extends cleanly into RL fine-tuning by adding a distillation regulariser to the RL reward:
\begin{equation}
    \mathbb{E}_{x \sim X,\, y \sim p_S^\theta(\cdot \mid x)}\!\left[(1-\alpha)\,r(y) - \alpha\,\mathcal{D}(p_T \,\|\, p_S^\theta)(y \mid x)\right].
\end{equation}
At $\alpha = 0$ this is pure RL, at $\alpha = 1$ it is pure on-policy KD, and in between OPD acts as a teacher-anchored regulariser on the RL update. The main contribution of GKD is the unified framing: instead of picking off-policy vs on-policy and a divergence up front, you sweep both as hyperparameters. If you are unsure where on the spectrum you should be, GKD lets you find out.


#### DistiLLM: Towards Streamlined Distillation for Large Language Models

GKD treated the on-policy/off-policy choice as a single mixing dial $\lambda$ and the divergence as a separate pick from a fixed menu. DistiLLM ([Ko et al., 2024](https://arxiv.org/abs/2402.03898)) points out that both choices have specific failure modes that hurt OPD in practice, and proposes a fix on each axis.

On the divergence side, the standard reverse KL is mode-seeking and collapses onto whatever mode the student already prefers, while the forward KL is mode-covering and forces the student to spread mass over teacher modes it cannot represent. Neither behaves well early in training, where teacher and student distributions barely overlap and the log-ratio explodes in places. DistiLLM's *skew KL* sidesteps this by interpolating the second argument of the KL toward the first:
\begin{equation}
    \mathcal{D}_{\mathrm{SKL}}^{(\alpha)}(p \,\|\, q)
    :=
    \mathcal{D}_{\mathrm{KL}}
    \left(
    p \,\middle\|\, \alpha p + (1 - \alpha) q
    \right),
\end{equation}
with a symmetric *skew reverse KL* in the other direction. Mixing $p$ into the denominator bounds the log-ratio wherever $p$ has support, so the gradient stays well-behaved even when the two distributions disagree heavily, which is exactly the regime early-training OPD lives in. They also show the resulting objective has a tighter generalisation bound than vanilla KL.
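
The bounding effect is easy to see on a pathological pair of distributions. In the sketch below the student puts zero mass on the teacher's main token, so the plain forward KL is infinite, while the skew KL stays below $-\log \alpha$. The value $\alpha = 0.1$ and the distributions are chosen only for illustration.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats, returning infinity if q misses any of p's support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def skew_kl(p, q, alpha):
    """D_KL(p || alpha * p + (1 - alpha) * q)."""
    mix = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl(p, mix)

teacher = [0.9, 0.1, 0.0]
student = [0.0, 0.1, 0.9]   # no mass at all on the teacher's main token
alpha = 0.1

print("plain KL:", kl(teacher, student))
print("skew KL :", round(skew_kl(teacher, student, alpha), 4))
print("bound   :", round(-math.log(alpha), 4))
```

The bound follows because the mixture is at least $\alpha p$ wherever $p$ has mass, so each log-ratio is at most $\log(1/\alpha)$.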

On the sampling side, instead of GKD's fixed $\lambda$, they use an *adaptive off-policy* scheme that starts heavily off-policy (teacher-generated samples are cheap and behave well) and shifts toward student rollouts as the student stabilises. The point is to avoid burning compute on student rollouts when the student has nothing useful to say yet, and to avoid the exposure-bias plateau once it does. Read in the GKD frame, DistiLLM is what you get when you stop treating $\lambda$ as a hyperparameter and start treating it as a schedule, with skew KL as the divergence that makes that schedule actually trainable.


#### Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

G-OPD picks up where MiniLLM leaves off and pushes the RL analogy further. It brings a third **reference** model $\pi_{\mathrm{ref}}$ into the objective, and through some rearranging the OPD loss starts looking a lot like the implicit-reward form from DPO. Once you have a reference, you can scale the reward term independently of the trust region, and that scaling factor $\lambda$ becomes a dial for how far past the teacher you want to push.

Starting from the reverse-KL OPD objective $\mathcal{J}_{\mathrm{OPD}}(\theta) = \min_\theta \mathbb{E}\left[\mathcal{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi^*)\right]$, adding and subtracting $\log \pi_{\mathrm{ref}}(y \mid x)$ inside the expectation rearranges it into a log-ratio reward against the reference plus a KL trust region:
\begin{equation}
    \mathcal{J}_{\mathrm{OPD}}(\theta)
    = \max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\!\left[
        \log\frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
        \;-\;
        \mathcal{D}_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
    \right].
\end{equation}
The log-ratio term is exactly the implicit-reward shape from DPO ($r(x,y) = \beta \log \pi_\theta/\pi_{\mathrm{ref}} + \beta \log Z(x)$, with the $Z$ term dropping out since it only depends on $x$), which means the teacher acts as dense supervision and the reference acts as the trust-region anchor. The generalised G-OPD objective scales the reward term by $\lambda$:
\begin{equation}
    \mathcal{J}_{\mathrm{G\text{-}OPD}}(\theta)
    = \max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\!\left[
        \lambda\,\log\frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
        \;-\;
        \mathcal{D}_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)
    \right].
\end{equation}

where $0 < \lambda < 1$ gives a student that sits between the reference model and the teacher model. With $\lambda > 1$, G-OPD encourages the student's log-probabilities to go beyond matching the teacher's, and they report pretty interesting results for multi-teacher OPD (for good generalisation) and for the strong-to-weak distillation setting. The teacher is no longer a hard ceiling here, and any decent reference policy lying around (your own SFT checkpoint works) is enough to extrapolate past it.
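
One way to build intuition for $\lambda$ is to look at the maximiser of the G-OPD objective for a single categorical distribution. By the standard closed form for KL-regularised objectives, it is $\pi \propto \pi_{\mathrm{ref}} \cdot (\pi^*/\pi_{\mathrm{ref}})^{\lambda}$. The sketch below applies that to a toy three-token example. Reducing the sequence-level objective to one step is a simplification made here, not something taken from the paper, and the numbers are made up.

```python
def gopd_optimum(teacher, ref, lam):
    """Maximiser of lam * E[log(teacher/ref)] - KL(pi || ref) over one categorical."""
    unnorm = [r * (t / r) ** lam for t, r in zip(teacher, ref)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

ref = [0.5, 0.3, 0.2]       # pi_ref, e.g. an SFT checkpoint (hypothetical numbers)
teacher = [0.2, 0.3, 0.5]   # pi^*, the teacher

for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, [round(p, 3) for p in gopd_optimum(teacher, ref, lam)])
```

At $\lambda = 0$ the optimum is the reference, at $\lambda = 1$ it is the teacher, and at $\lambda = 2$ the third token ends up with more mass than the teacher itself gives it, which is the extrapolation regime.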

#### Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

MiniLLM, GKD and G-OPD all treat positive and negative tokens with the same objective. AOPD's claim is that this symmetry is exactly what causes a lot of the instability people see in OPD runs: the negative-side gradients are noisier, heavier-tailed, and behave qualitatively differently from the positive-side ones. Their fix is to split the loss in two and route each half to a different objective.

They start by separating the OPD objective into per-token positive and negative reinforcement terms, $\mathcal{L}_{\mathrm{OPD}} = \mathcal{L}_{\mathrm{Pos}} + \mathcal{L}_{\mathrm{Neg}}$, each weighted by the advantage $A_t$ on its sign-subset of tokens. The hypothesis is that the two halves should be handled differently, mainly because of three issues in negative reinforcement: 
1. **Heavy tails in negative advantages**: Most tokens cluster near zero advantage, while the negative side exhibits a substantially broader tail. This extreme variance originates directly from the logarithmic form of the advantage. The difference between the log-probabilities of the teacher and the student is amplified sharply as the student probability approaches zero, and since updates are dominated by these tokens, the result is artificially inflated variance and gradients.
2. **Stagnation at zero advantages**: The majority of sampled tokens cluster around zero advantage, and thus give no learning signal.
3. **Exploration black hole**: In the negative reinforcement part, if the teacher suppresses a wrong token, it gets a negative advantage. The probability mass this releases is redistributed among the unsampled tokens, and usually flows to tokens that already have high probability, so low-probability but correct tokens are not promoted.

This leads them to a per-token gated objective that uses a top-K forward KL on the negative/zero-advantage tokens and the standard OPD loss on the rest:
\begin{equation}
    \mathcal{L}_{\mathrm{AOPD}}
    = \mathbb{E}\!\left[\frac{1}{|y|}\sum_{t=1}^{|y|} G_t\,\mathcal{L}^{\mathrm{FKL}}_t + (1-G_t)\,\mathcal{L}^{\mathrm{OPD}}_t\right],
    \quad
    G_t = \mathbb{I}\!\left(P_T(y_t \mid c_t) \leq P_S(y_t \mid c_t)\right),
\end{equation}
with $\mathcal{L}^{\mathrm{FKL}}_t$ a top-K forward KL on $S_t = \operatorname{TopK}(P_T(\cdot \mid c_t), K)$. The gate $G_t$ uses a bounded probability difference instead of the log-ratio because log-ratios explode when probabilities are tiny: $G_t = 1$ promotes the teacher's preferred token even when the student currently assigns it near-zero probability, $G_t = 0$ falls back to full OPD. Nothing in the forward pass changes, only the loss, and they report noticeably more stable runs as a result.
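
The gating logic for one token can be sketched as follows. Several details are assumptions of this sketch rather than the paper's implementation: the top-K forward KL is summed over the teacher's top-K entries without renormalising, the student is assumed to give non-zero probability to those entries, and the standard OPD token loss is passed in as a precomputed number because its exact form is not spelled out above.

```python
import math

def gate(teacher, student, token):
    """G_t = 1 if the teacher's probability of the sampled token is <= the student's."""
    return 1 if teacher[token] <= student[token] else 0

def top_k_forward_kl(teacher, student, k):
    """Forward KL restricted to the teacher's top-k vocabulary entries (unrenormalised)."""
    top = sorted(range(len(teacher)), key=lambda v: teacher[v], reverse=True)[:k]
    return sum(teacher[v] * math.log(teacher[v] / student[v]) for v in top if teacher[v] > 0)

def aopd_token_loss(teacher, student, token, k, opd_token_loss):
    """Route the token to the top-k forward KL or to the supplied OPD loss via the gate."""
    g = gate(teacher, student, token)
    return g * top_k_forward_kl(teacher, student, k) + (1 - g) * opd_token_loss

teacher = [0.6, 0.3, 0.1]
student = [0.1, 0.2, 0.7]

# Sampled token 2: the student is overconfident, so the gate opens the forward-KL branch
print(aopd_token_loss(teacher, student, token=2, k=2, opd_token_loss=0.0))
# Sampled token 0: the teacher likes it more than the student, so the OPD branch is used
print(aopd_token_loss(teacher, student, token=0, k=2, opd_token_loss=-1.5))
```

Note that the gate compares raw probabilities, which are bounded in $[0, 1]$, so it stays stable even where the log-ratio would be huge.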

#### On-Policy Context Distillation

Everything so far has assumed the teacher is a bigger model. On-Policy Context Distillation drops that assumption: the teacher and student can be the *same* weights, the only difference is what is in the prompt. You distill a model *with* extra context into the same model *without* that context. 

This is useful because in-context learning is expensive at inference time, every long context has to be re-read on every call. If you can bake the relevant context into the weights once via OPD, you get the behaviour for free at inference, and because samples come from the no-context student, the alignment between what the model knows internally and what its outputs look like stays tight. 

Their methodology is quite simple. There are two passes for each input prompt $x$: one through the teacher, which sees the context $c$ along with $x$, and one through the student, which sees only $x$. They bridge the knowledge gap between the two by minimising a reverse KL from the no-context student to the teacher-with-context, evaluated on student rollouts:
\begin{equation}
    \mathcal{L}(\theta)
    = \mathbb{E}_{(x,c) \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\!\left[
        \tfrac{1}{|y|}\textstyle\sum_{t=1}^{|y|}
        D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot \mid x, y_{<t}) \,\|\, \pi_{\mathrm{teacher}}(\cdot \mid c, x, y_{<t})\right)
    \right].
\end{equation}

This is again mode-seeking, and interestingly they find that for smaller models this works better than direct ICL. This suggests that on-policy alignment between experiential knowledge and the model that consumes it is also crucial.

Another interesting thing here is that, since it is the context that is being distilled, the choice of teacher can vary, and the policy model itself can serve as the teacher. Once you stop thinking of the teacher as "a bigger model" and start thinking of it as "any source of privileged information", you arrive at self distillation, which is the section we move forward to next.

#### Putting it all together

Before we move to self-distillation, here is a compact view of the methods covered above and where each one is actually useful in practice. 

| Method | Sampling source | Divergence / signal | Extra ingredient | Best used when | Example / demo |
|---|---|---|---|---|---|
| MiniLLM | Student (with teacher mix-in) | Reverse KL as RL reward | Single-step decomposition, length norm, PT loss | You have a strong teacher and a much smaller student, and you want the student to focus on what it can actually represent. | Code summarisation: student writes generic phrases, teacher favors precise verbs like "parses" and "validates", and token rewards gradually sharpen wording on student-generated prefixes. |
| GKD | Mix of dataset and student rollouts ($\lambda$) | Any divergence (FKL / RKL / JSD) | Optional RL reward term | You want one knob to slide between off-policy KD and on-policy KD, or you want to fold KD into an RLHF run. | Staged training: start with mostly dataset KD, then increase rollout KD as student outputs become stable. |
| DistiLLM | Adaptive teacher / student mix (scheduled) | Skew KL / Skew RKL | Adaptive off-policy schedule | Vanilla KL is blowing up early in training, or pure on-policy KD is too slow and pure off-policy KD plateaus. | Early math tuning: rely on cleaner teacher samples first, then shift to student rollouts once chain-of-thought traces become coherent. |
| G-OPD | Student | Log-ratio reward vs reference + KL to reference | A reference model, scaling factor $\lambda$ | You want the student to potentially surpass the teacher, or you have multiple teachers / a weak teacher. | Domain transfer: reference is broadly safe, teacher is strong on symbolic tasks, and tuning $\lambda$ improves symbolic performance without losing baseline behavior. |
| AOPD | Student | OPD on positive tokens, top-K forward KL on negative tokens | Token-level mask $G_t$ | Your OPD runs are unstable, gradients are dominated by negative-advantage tails, or the student gets stuck. | Proof/code correction: overconfident wrong tokens get routed through safer negative-token correction instead of noisy symmetric updates. |
| On-Policy Context Distillation | Student (no-context) | Reverse KL to teacher-with-context | A privileged-info teacher (often same weights + extra context) | You want to internalise long context, tool traces, or rationales into the weights so inference does not pay for them every time. | Policy QA: teacher sees retrieved policy text, student sees question only, and training transfers grounded phrasing into no-retrieval inference. |

A rough mental model: MiniLLM and GKD are the foundational "how do we do OPD at all" papers, DistiLLM refines the divergence and the sampling schedule so the foundational recipe actually trains, G-OPD and AOPD are refinements on the objective itself, and On-Policy Context Distillation is what happens when you stop thinking of the teacher as "a bigger model" and start thinking of it as "any source of privileged information". That last reframing is exactly what motivates the next section. 

### On Policy Self Distillation

On-Policy Context Distillation already nudged us in this direction: once the teacher is just "the same model with extra context", the bigger-teacher assumption is gone. On-Policy Self Distillation takes that all the way. The teacher and the student are the same weights, and the only thing the teacher has is some privileged information (PI) the student does not see. The job of on-policy distillation here is to induce that hidden knowledge into the student.
![image](https://hackmd.io/_uploads/SkKwhFllGg.png)


#### Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

This is the paper that kickstarted the wave of self-distillation work this year. The motivation is how humans learn from reference solutions: you attempt the problem yourself, then look at a worked solution and update what you would do next time. The mechanic mirrors that. You sample a solution $\hat{y}$ from the student for a problem $x$, then forward-pass that same $\hat{y}$ through two copies of the student: one conditioned only on $x$, and one conditioned on $x$ plus a reference solution $y^{*}$ as PI. Nothing is regenerated, you just score the student's own tokens under both views.
\begin{equation}
    \mathcal{L}_{\mathrm{OPSD}}(\theta)
    = \frac{1}{|\mathcal{B}|}\sum_{(x,y^{*}) \in \mathcal{B}} \frac{1}{|\hat{y}|}\sum_{n=1}^{|\hat{y}|}
        D\!\left(p_T(\cdot \mid \hat{y}_{<n}, x, y^{*}) \,\|\, p_S(\cdot \mid \hat{y}_{<n}, x)\right).
\end{equation}
where $D$ can be any divergence (they use JSD). They also find that stylistic tokens have much higher per-token divergence than meaningful reasoning tokens, and add a simple clip at a threshold $\tau$ on each divergence contribution $\ell_{n,v}$ (position $n$, vocabulary entry $v$) to stop a few stylistic outliers from dominating the loss:
\begin{equation}
    D_{\mathrm{clip}}(p_T \,\|\, p_S) = \tfrac{1}{|\hat{y}|}\textstyle\sum_{n,v} \min(\ell_{n,v},\, \tau).
\end{equation}
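
To make the clipped objective concrete, here is a minimal sketch in plain Python. It assumes $\ell_{n,v}$ is the pointwise JSD contribution of vocabulary entry $v$ at position $n$, which is a reading of the equation above rather than something taken from the paper's code. The function names `jsd_terms` and `clipped_opsd_loss` and the toy numbers are made up for illustration.

```python
import math

def jsd_terms(p_teacher, p_student):
    """Pointwise JSD contributions, one per vocabulary entry (each is >= 0)."""
    terms = []
    for pt, ps in zip(p_teacher, p_student):
        m = 0.5 * (pt + ps)  # mixture probability for this vocabulary entry
        term = 0.0
        if pt > 0.0:
            term += 0.5 * pt * math.log(pt / m)
        if ps > 0.0:
            term += 0.5 * ps * math.log(ps / m)
        terms.append(term)
    return terms

def clipped_opsd_loss(teacher_dists, student_dists, tau):
    """Average over positions of the pointwise contributions, each capped at tau."""
    total = 0.0
    for p_t, p_s in zip(teacher_dists, student_dists):
        total += sum(min(term, tau) for term in jsd_terms(p_t, p_s))
    return total / len(teacher_dists)

# Toy rollout: 2 positions, 3-token vocabulary (hypothetical numbers).
teacher = [[0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]
student = [[0.6, 0.3, 0.1], [0.9, 0.05, 0.05]]
unclipped = clipped_opsd_loss(teacher, student, tau=float("inf"))
clipped = clipped_opsd_loss(teacher, student, tau=0.05)
```

The second position is the "outlier" here: the two views disagree sharply, so its contributions hit the cap and `clipped` comes out smaller than `unclipped`, while the mild disagreement at the first position passes through untouched.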

They also derive a policy-gradient form where the per-token log-ratio between PI-conditioned and no-PI views acts as the advantage:
\begin{equation}
    A_n(x, \hat{y}) = \log p_T(\hat{y}_n \mid x, y^{*}, \hat{y}_{<n}) - \log p_S(\hat{y}_n \mid x, \hat{y}_{<n}),
    \qquad
    \mathcal{L}(\theta) = -\mathbb{E}\!\left[\tfrac{1}{|\hat{y}|}\textstyle\sum_n A_n \log p_S(\hat{y}_n \mid x, \hat{y}_{<n})\right].
\end{equation}
The log-ratio replaces the usual reward-minus-baseline, and the policy gradient is now driven by per-token dense supervision instead of a sparse trajectory reward. The gradient does not flow through the advantage, it is treated as a constant. If you let it flow, the advantage would shift with the policy on every update and the variance would blow up.
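
A small sketch of what "treated as a constant" means in practice. Plain Python has no autodiff, so the gradient with respect to the student's log-probs is written out by hand under the assumption that $A_n$ is frozen. The helper names and the toy probabilities are hypothetical.

```python
import math

def log_ratio_advantages(teacher_logps, student_logps):
    """A_n = log p_T(y_n | x, y*, y_<n) - log p_S(y_n | x, y_<n), one per sampled token."""
    return [lt - ls for lt, ls in zip(teacher_logps, student_logps)]

def surrogate_loss(advantages, student_logps):
    """-(1/|y|) * sum_n A_n * log p_S(y_n), with A_n held fixed."""
    n = len(student_logps)
    return -sum(a * lp for a, lp in zip(advantages, student_logps)) / n

def grad_wrt_student_logps(advantages):
    """Gradient of the surrogate w.r.t. each student log-prob when A_n is a constant."""
    n = len(advantages)
    return [-a / n for a in advantages]

# Toy rollout of two sampled tokens (hypothetical numbers).
teacher_logps = [math.log(0.6), math.log(0.1)]  # PI-conditioned view
student_logps = [math.log(0.3), math.log(0.4)]  # no-PI view
adv = log_ratio_advantages(teacher_logps, student_logps)
grads = grad_wrt_student_logps(adv)
```

A gradient-descent step raises the log-prob of the first token (the PI-conditioned view likes it more than the student does) and lowers the second. In an autodiff framework the same effect would come from detaching the advantage before multiplying it into the loss.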

The divergence ablation is also worth flagging: forward KL wins for per-token supervision. That fits the setup. Since the teacher and student share weights, the usual capacity tax of forward KL, smearing student mass over teacher modes it cannot represent, simply does not apply here. The teacher conditional is essentially a sharpened version of the student's own conditional under the PI, and mode-covering is exactly the behaviour you want, since the student should pick up mass at every token the PI lifts. They also find full-vocab logit distillation beats the sampled-token advantage policy gradient.

#### Self-Distillation Enables Continual Learning

SDFT keeps the log-ratio-as-advantage idea from Self-Distilled Reasoner, but reframes it for continual learning and spends most of its effort on what the teacher actually is. The gradient estimator has the same shape as Self-Distilled Reasoner's, with the PI-conditioned view $\pi(\cdot \mid x, c)$ in the denominator of the per-token log-ratio.

The core claim is that a PI-conditioned policy is a good approximation of the optimal policy for the task at hand, under two conditions:
1. **Optimality**: the teacher's expected reward matches that of the unknown optimal policy, i.e. the PI-conditioned policy is actually solving the task.
2. **Minimal deviation**: among all policies that maximise reward, the optimal one is the closest in divergence to the current model.

Where SDFT really differs from Self-Distilled Reasoner is the teacher parametrisation, which they treat as a first-class design choice rather than "same weights, extra context". They ablate three options: a frozen teacher gives stable training but lags because it never reflects student updates, using the live student checkpoint as its own teacher is unstable, and an EMA of the student checkpoint sits in the middle and gives the best of both. They also provide empirical validation for the ICL assumption that underwrites the PI-conditioned-policy-as-teacher story.
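
The three teacher options collapse into one knob if you write the update as an exponential moving average. This is a sketch, not SDFT's code: parameters are represented as a plain dict of floats, and the `decay` value is a hypothetical choice rather than a number from the paper.

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """One EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher_params[name] + (1.0 - decay) * student_params[name]
        for name in teacher_params
    }

teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}

frozen = ema_update(teacher, student, decay=1.0)   # teacher never moves
live = ema_update(teacher, student, decay=0.0)     # teacher is the current student
smoothed = ema_update(teacher, student, decay=0.9) # slow-moving average in between
```

`decay=1.0` recovers the frozen teacher, `decay=0.0` recovers the live-student teacher, and anything in between trades staleness against stability.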

#### SDPO: Reinforcement Learning via Self-Distillation

SDPO keeps the same PI-as-teacher recipe from Self-Distilled Reasoner and SDFT, but stretches what PI is allowed to be. In Self-Distilled Reasoner the PI was a reference solution, in SDFT it was task context. SDPO uses the *environment's own textual feedback*: runtime errors, judge critiques, test traces, whatever the verifier already produces alongside its scalar reward. The argument is that standard RLVR throws this text away and learns from a single outcome scalar, which is a brutal credit-assignment bottleneck on long trajectories. 

The mechanic is the same two-pass trick. For a prompt $x$ and a rolled-out attempt $y$, the model gets a textual feedback string $f$ from the environment (the stack trace, the judge's note, sometimes just a known-good rollout used as implicit feedback for a failed one). The same model conditioned on $(x, f)$ acts as the self-teacher, written $q_\theta(\cdot \mid x, f) := \pi_\theta(\cdot \mid \mathrm{reprompt}(x, f))$, and the SDPO loss is a per-token KL distillation along the student's own trajectory:
\begin{equation}
    \mathcal{L}_{\mathrm{SDPO}}(\theta)
    :=
    \sum_{t=1}^{T}
    D_{\mathrm{KL}}
    \left(
    \pi_{\theta}(\cdot \mid x, y_{<t})
    \,\middle\|\,
    \mathrm{stopgrad}
    \left(
    q_{\theta}(\cdot \mid x, f, y_{<t})
    \right)
    \right).
\end{equation}
The stop-gradient on the teacher is what stops it from regressing toward the student and ignoring $f$ entirely. The gradient has the same log-ratio-as-advantage shape as Self-Distilled Reasoner, just with $f$ in place of $y^{*}$, which makes the connection across the section explicit.
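
As a rough numerical illustration of the objective, the sketch below evaluates the summed per-token KL on toy next-token distributions. Plain Python cannot express the stop-gradient, so the teacher lists are simply treated as fixed targets; in an autodiff framework that corresponds to detaching the teacher's output. The function names and the numbers are illustrative.

```python
import math

def kl(p, q):
    """KL(p || q) over one next-token distribution; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sdpo_loss(student_dists, teacher_dists):
    """Sum over positions t of KL(student_t || teacher_t), teacher held fixed."""
    return sum(kl(p_s, q_t) for p_s, q_t in zip(student_dists, teacher_dists))

# Two positions along the student's own rollout, 3-token vocabulary.
student = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]  # pi_theta(. | x, y_<t)
teacher = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]  # q_theta(. | x, f, y_<t)
loss = sdpo_loss(student, teacher)
```

Only the first position contributes here, because the feedback-conditioned view agrees with the student at the second one. That is the dense-credit-assignment story in miniature: the loss concentrates on the tokens the feedback actually changes.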

The conceptual move is that "in-context retrospection" is treated as a free teacher. The model already knows how to read an error message and adjust, that capability is just locked behind the prompt. It also works in plain RLVR settings with only scalar rewards by using a successful rollout for the same prompt as the "feedback" for a failed one, which is a neat way to recover dense supervision from sparse signals.

The other nice property is that SDPO drops in at test time too. Run it as an inner loop on a single hard question, and the discovery probability per attempt goes up enough that you hit the same success rate as best-of-$k$ with roughly a third of the attempts. So the same objective doubles as both a training-time and an inference-time procedure, which is rare for OPD methods.

#### GATES: Self-Distillation under Privileged Context with Consensus Gating

GATES sits in the same slot as Self-Distilled Reasoner and SDPO, the teacher is the same model with extra context, but it confronts a problem the earlier papers quietly assume away: what if the PI-conditioned teacher is *wrong*? Self-Distilled Reasoner gets a worked solution, so the teacher is trustworthy by construction. SDPO gets a verifier-grounded error trace. GATES targets document-grounded QA where the tutor sees the source document and the student does not, and there are no ground-truth labels or verifiers anywhere.

Their fix is to derive the reliability signal online from the tutor itself. For each prompt $q_i$, sample $k$ document-grounded tutor rollouts, extract the final answer from each, and define a question-level gate $g_i \in \{0,1\}$ which is $1$ when at least $\tau$ of the $k$ rollouts agree on the same answer $a_i^{*}$ and $0$ otherwise. A second rollout-level eligibility $e_{i,j} \in \{0,1\}$ keeps only the tutor rollouts whose answer matches the consensus and that survive a leakage filter. The intuition is the usual self-consistency one: when the model is right it tends to converge on the same answer across samples, and when it is guessing it spreads. So consensus across PI-conditioned rollouts is a proxy for the answer being correct, and gating on it filters out the prompts where distillation would do more harm than good.
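
The gate and the eligibility mask can be sketched in a few lines. The leakage filter is not specified here, so it is stood in for by a list of booleans (`leaks`) that some hypothetical filter would produce. The function names are also made up.

```python
from collections import Counter

def consensus_gate(answers, tau):
    """Return (g, consensus): g is 1 if at least tau of the k answers agree, else 0."""
    answer, votes = Counter(answers).most_common(1)[0]
    if votes >= tau:
        return 1, answer
    return 0, None

def eligibility(answers, consensus, leaks):
    """e_j = 1 only for rollouts that match the consensus and pass the leakage filter."""
    return [
        int(consensus is not None and a == consensus and not leak)
        for a, leak in zip(answers, leaks)
    ]

# k = 4 tutor rollouts for one question, gate threshold tau = 3.
answers = ["Paris", "Paris", "Lyon", "Paris"]
leaks = [False, True, False, False]  # second rollout flagged by the (hypothetical) filter
g, consensus = consensus_gate(answers, tau=3)
e = eligibility(answers, consensus, leaks)
```

With $\tau$ above $k/2$ the consensus answer is unique whenever the gate opens, so the tie-breaking behaviour of `most_common` never matters.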

Once a prompt clears the gate, GATES distils the *full* tutor reasoning trajectory into the document-free student via a token-level NLL on the surviving tutor tokens:
\begin{equation}
    \mathcal{L}_{\mathrm{off}}(\theta) = -\tfrac{1}{N}\textstyle\sum_{i,j} g_i\, e_{i,j} \sum_t \log \pi_\theta\!\left(y_{i,j,t}^{(T)} \,\middle|\, y_{i,j,<t}^{(T)}, q_i\right),
\end{equation}
normalised by the total number of surviving tutor tokens. They also add an on-policy variant: sample student rollouts $y^{(S)}$ and score each token with a clipped log-ratio advantage,
\begin{equation}
    A_t = \mathrm{clip}\!\left(\log \pi_T(y_t^{(S)} \mid y_{<t}^{(S)}, d, q) - \log \pi_\theta(y_t^{(S)} \mid y_{<t}^{(S)}, q),\; [-a, a]\right),
\end{equation}
with $\mathrm{stopgrad}$ on $A_t$, then combine the two terms:
\begin{equation}
    \mathcal{L}(\theta) = \lambda_{\mathrm{off}}\,\mathcal{L}_{\mathrm{off}} + \lambda_{\mathrm{on}}\,\mathcal{L}_{\mathrm{on}},
\end{equation}
with $\lambda_{\mathrm{off}} = 1.0$, $\lambda_{\mathrm{on}} = 0.1$ in their canonical setup. The off-policy trajectory term does most of the empirical work. Trajectories give dense per-token supervision, the same reason MiniLLM and GKD wanted on-policy KD instead of label-only KD, and the same reason Self-Distilled Reasoner wanted log-ratio supervision instead of a final reward. Answer-only distillation would collapse the rich PI-induced behaviour into a single token target and waste the whole point of having a teacher.

The pattern to notice across these three papers is that the PI keeps generalising. Self-Distilled Reasoner used reference solutions, SDPO used environment feedback, GATES uses retrieved documents plus a self-consistency gate. The objective barely changes, what changes is where the privileged signal comes from and how much you trust it.

#### CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

CRISP is the most stripped-down version of the PI idea in this section, and probably the most fun. The PI is not a worked solution, not environment feedback, not a document. It is a single instruction in the prompt that says "be concise". That is it. The teacher is the same model conditioned on that instruction, the student is the same model without it, and the loss is per-token reverse KL on the student's own rollouts:
\begin{equation}
    \mathcal{L}_{\mathrm{CRISP}}(\theta)
    = \mathbb{E}_{x,\, \hat{y} \sim \pi_{\theta}(\cdot \mid x)}\!\left[
        \tfrac{1}{|\hat{y}|}\textstyle\sum_t
        D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot \mid x, \hat{y}_{<t}) \,\|\, \pi_{\theta}(\cdot \mid x, c, \hat{y}_{<t})\right)
    \right],
\end{equation}
where $c$ is the "be concise" instruction. Reverse KL here is the right pick for the same reason MiniLLM picked it for the cross-model case: the student should mode-seek towards the concise modes the PI-conditioned teacher prefers and ignore the verbose ones, not spread mass across both. There are no ground-truth answers in the loop, no token budgets, no difficulty estimator deciding which problems to compress.

The result is more interesting than the setup suggests. CRISP automatically compresses easy problems hard and leaves the deliberation on hard ones mostly intact, which is what you would want a token-budget heuristic to do but without having to write one. 

Two implementation details worth flagging. First, qualitative instructions like "be concise" beat explicit token-count targets, which lines up with the broader observation that LMs are bad at counting and good at following style cues. Second, they refresh the teacher checkpoint periodically rather than holding it frozen or using the live student, which is the same EMA-ish middle ground SDFT landed on. 

#### Self-Distilled RLVR

RLSD is the first paper in this section that pushes back on pure self-distillation rather than extending it. Their critique is that learning signals derived *only* from a PI-conditioned teacher cause "severe information leakage and unstable long-term training". The leakage intuition is the one you would expect: if the teacher's only edge is privileged context, and the student is being trained to match the teacher's next-token distribution everywhere, the student eventually learns to imitate PI-shaped outputs on prompts where it has no PI, which looks like a win on the training distribution and falls apart elsewhere. 

Their fix is to split what the supervision provides. Self-distillation is good at saying *how much* a token should move, since the per-token log-ratio between the PI-conditioned and no-PI views is a fine-grained magnitude signal. RLVR is good at saying *which direction* is correct, since environment correctness is the only signal that is actually grounded in the task. RLSD takes the update magnitudes from self-distillation and the update directions from RLVR, and combines them into a single per-token update.

Concretely, for each token in a student rollout they take the privileged information gain
\begin{equation}
    \Delta_t = \mathrm{sg}\!\left(\log P_T(y_t) - \log P_S(y_t)\right),
\end{equation}
where $P_T = \pi_{\theta}(\cdot \mid x, r, y_{<t})$ is the PI-conditioned view and $P_S = \pi_{\theta}(\cdot \mid x, y_{<t})$ is the no-PI view, build a per-token evidence weight
\begin{equation}
    w_t = \exp\!\left(\mathrm{sign}(A)\cdot \Delta_t\right) = \left(P_T(y_t)/P_S(y_t)\right)^{\mathrm{sign}(A)},
\end{equation}
where $A$ is the trajectory-level GRPO advantage, and plug it into a PPO-style clipped objective:
\begin{equation}
    \mathcal{L}_{\mathrm{RLSD}}(\theta)
    = \mathbb{E}\!\left[\tfrac{1}{G}\textstyle\sum_i \tfrac{1}{|y^{(i)}|}\sum_t
        \min\!\left(w_t A^{(i)},\; \mathrm{clip}(w_t, 1-\epsilon_w, 1+\epsilon_w)\,A^{(i)}\right)
    \right].
\end{equation}
So the teacher log-ratio still drives how aggressively each token gets pushed, but the sign of the push is decided entirely by whether the rollout was actually right. The clipping plays the same trust-region role as in GRPO, just on the credit redistribution magnitude instead of the policy step size. This recovers the dense per-token supervision that made Self-Distilled Reasoner work, while pinning the direction to an external verifier that the model cannot game by imitating its own PI-conditioned outputs. 
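
Here is a minimal per-token sketch of that objective. It treats the expression as a quantity to maximise, PPO-style, which is an assumption about the sign convention. The default `eps_w` is a placeholder rather than a value from the paper, and the function names are hypothetical.

```python
import math

def rlsd_token_objective(teacher_logp, student_logp, advantage, eps_w=0.2):
    """min(w * A, clip(w, 1 - eps_w, 1 + eps_w) * A) for a single token."""
    delta = teacher_logp - student_logp           # privileged information gain (a constant)
    sign = (advantage > 0) - (advantage < 0)      # direction comes from the verifier
    w = math.exp(sign * delta)                    # (P_T / P_S) ** sign(A)
    w_clipped = min(max(w, 1.0 - eps_w), 1.0 + eps_w)
    return min(w * advantage, w_clipped * advantage)

def rlsd_rollout_objective(teacher_logps, student_logps, advantage, eps_w=0.2):
    """Length-normalised sum of per-token terms for one rollout."""
    terms = [
        rlsd_token_objective(lt, ls, advantage, eps_w)
        for lt, ls in zip(teacher_logps, student_logps)
    ]
    return sum(terms) / len(terms)

# A token the PI-conditioned view likes 4x more than the no-PI view.
lt, ls = math.log(0.8), math.log(0.2)
on_correct_rollout = rlsd_token_objective(lt, ls, advantage=1.0)   # w = 4, clipped to 1.2
on_wrong_rollout = rlsd_token_objective(lt, ls, advantage=-1.0)    # w = 0.25, clipped to 0.8
```

The same token gets its credit scaled up on a correct rollout and its blame scaled down on an incorrect one, up to the clip range, so the teacher's log-ratio only redistributes how hard each token is pushed and never changes the direction.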

A nice way to read the section as a whole: Self-Distilled Reasoner set the template, SDFT, SDPO, GATES and CRISP all stretch the definition of PI, and RLSD argues the template needs an external anchor or it eats itself. That tension between "PI is enough" and "PI plus a verifier" is roughly where the OPSD literature sits right now.

Before moving to failure modes, here is the OPSD landscape at a glance. Read the three middle columns as "what is the privileged signal", "what is the loss doing with it", and "what does this paper specifically add over the one before it". The last two columns give a rule of thumb for when to reach for each method and a toy scenario.

| Method | Privileged signal (PI) | Loss / signal | Distinguishing move | Best used when | Example / demo |
|---|---|---|---|---|---|
| Self-Distilled Reasoner | Reference solution $y^{*}$ | Per-token divergence between PI-conditioned and no-PI views on student rollouts | Establishes the OPSD template, log-ratio between two views of the same weights as a soft teacher | You have worked solutions for the task and want to internalise them without an external teacher model. | Algebra tutoring: student attempts first, then PI-conditioned view upweights missing intermediate steps and downweights wrong transformations. |
| SDFT | Task context for the new task | Same log-ratio-as-advantage shape | EMA-updated teacher checkpoint as the PI-conditioned view, framed for continual learning | You are doing continual fine-tuning and want to avoid forgetting without a frozen teacher. | Continual adaptation: EMA teacher avoids stale guidance from frozen checkpoints and noise chasing from fully live checkpoints. |
| SDPO | Environment textual feedback $f$ | Per-token KL with stop-gradient on the self-teacher $q_\theta(\cdot \mid x, f)$ | Treats verifier text (stack traces, judge notes) as a free teacher instead of throwing it away for a scalar | Your verifier already emits useful text and you are burning it by reducing the reward to a single number. | Code repair: traceback text becomes PI, and distillation transfers reusable bug-fixing token patterns. |
| GATES | Retrieved document for QA | Off-policy trajectory NLL on gated tutor traces + on-policy clipped advantage term | Consensus gate $g_i$ filters out prompts where the document-grounded tutor is unreliable | The PI-conditioned teacher is sometimes wrong and you have no verifier to fall back on. | Document QA: only high-consensus tutor trajectories are distilled, so noisy retrieval-conditioned answers are filtered out. |
| CRISP | A single "be concise" instruction $c$ | Per-token reverse KL between with-instruction and without-instruction views | Shows PI can be a one-line style prompt and still drive automatic difficulty-aware compression | You want to compress reasoning length without writing a difficulty estimator or token-budget heuristic. | Explanation compression: concise-instruction teacher drops repetitive scaffolding, and student learns shorter but still correct traces. |
| RLSD | A rationale / PI context $r$ plus a verifier | PPO-clipped objective with $w_t = (P_T/P_S)^{\mathrm{sign}(A)}$ | Magnitude from the self-distillation log-ratio, direction from the verifier's $\mathrm{sign}(A)$ | Pure OPSD runs are drifting or leaking PI shape onto no-PI prompts, and you have a binary verifier available. | Proof verification: PI controls update magnitude token-by-token, verifier correctness flips the sign of the update. |

The column to track across the rows is "distinguishing move". The objective is essentially fixed once Self-Distilled Reasoner writes it down. What changes is what counts as PI and, in the RLSD case, what gets to decide the sign of each update.

## Failure Modes of On-Policy Distillation

The OPD and OPSD literature spends most of its effort on the recipe and considerably less on what breaks when you actually run it on a long task with a real teacher. Three recent papers confront this head-on: a [survey by Song and Zheng](https://arxiv.org/abs/2604.00626) that formalises OPD as f-divergence minimisation over student trajectories and consolidates the recurring failure modes, [Fu et al.](https://arxiv.org/abs/2603.25562) on the empirical failure modes of token-level OPD specifically, and [Zhang et al.](https://arxiv.org/abs/2604.16830) on what privileged information does to a model's calibration. Reading them alongside the threads the earlier sections have already exposed gives a fairly coherent picture of where this family of methods breaks. The failure modes split roughly along four axes, estimator-level, teacher-level, information-gap-level, and multi-turn/agentic, with a fair amount of cross-talk between them. Let's have a look: 
### The token-level KL is a fragile proxy

Almost every OPD and OPSD method in this post optimises a *token-level* divergence on student rollouts, most often a *sampled* reverse KL. That is not the same object as the *sequence-level* reverse KL the theory in MiniLLM and GKD actually wants to minimise. Fu et al. show that the token-level estimator is biased relative to the sequence-level one, and that the bias is not benign: it gets worse with stronger future-reward coupling along the trajectory, which is exactly the regime long-horizon reasoning lives in. A controlled synthetic study in the same paper shows that this coupling also inflates gradient variance and visibly destabilises training. So you get a worse target *and* a noisier estimate of it, on the kind of trajectories the field most wants to train on.
![image](https://hackmd.io/_uploads/r1ON1celze.png)

This is the same instability AOPD was already pushing against from the gradient side, where negative-advantage tokens have heavy-tailed gradients that swamp the positive-advantage ones. The general pattern is the one that always shows up when you replace a population-level objective with a sampled per-token surrogate: you trade a clean target you cannot estimate for a noisy target you can.

### Prefix drift and the unreliable teacher

The on-policy claim is that student rollouts are what matter, since those are the trajectories the student will actually visit at inference. The hidden assumption is that the teacher still has useful things to say on those rollouts. That assumption breaks faster than people expect. Once the student's prefix has drifted a few tokens past the teacher's typical support, the teacher is forced to extrapolate into a region of sequence space it was never trained on. In practice this shows up as the teacher placing near-uniform or weirdly spiky mass on irrelevant tokens, and the student earnestly distilling that noise.
![image](https://hackmd.io/_uploads/HkQXhKllfl.png)

A degenerate boundary case worth naming separately is *gradient SNR collapse*: on prompts where the student's pass rate is near zero, every rollout contains an early error, the teacher's signal on those rollouts is essentially noise on top of noise, and the per-prompt gradient SNR vanishes. The model learns nothing from exactly the prompts it most needs to learn from. This is the dual of prefix drift, but at the prompt level rather than the trajectory level.

### Local teachability collapse

A subtler version of the prefix problem shows up even when the teacher stays well-calibrated along the rollout. The teacher is still right about which token is best, but only just, so the token-level KL gradient carries almost no signal even though nothing has broken in the usual sense. This is called *local teachability collapse*, and the proposed fix is to truncate OPD supervision at a downward change point in the teacher's local margin rather than at a fixed prefix length. The signal degrades gradually, and the optimal supervision window is trajectory-specific rather than positional.
![image](https://hackmd.io/_uploads/r1SbJ9gefl.png)

### Imbalanced supervision, Rock Tokens, and tokenizer mismatch

Three smaller but persistent issues round out the estimator-level failures. The first is that token-level loss is imbalanced by construction. A small number of tokens carry most of the loss and the rest contribute near-zero gradient, so an "average" reduction across a sequence is dominated by a few outliers. The second is what have been called *Rock Tokens*: up to 18 percent of tokens in a trained OPD model exhibit persistently high loss even after the rest of the run has saturated, and they absorb a disproportionate share of the gradient norm because of their high occurrence frequency. The student cannot, and probably should not, internalise them from the teacher. So a real chunk of OPD's optimisation bandwidth is being spent on tokens that neither improve nor inform the student's reasoning capability. The third issue is that the teacher and the student often do not share a tokenizer, or share most of one but disagree on special tokens and chat templates. Naive token alignment then produces silent corruption that looks like a normal training run but degrades performance.
![image](https://hackmd.io/_uploads/By77k9lefg.png)

### Diversity collapse and the precision-recall trade-off

As the student concentrates mass on high-quality outputs, it necessarily shrinks coverage of the teacher's full output distribution: Pass@1 goes up, Pass@k goes down. In aggressive reverse-KL OPD this shows up as the student collapsing onto a single reasoning strategy per prompt, which looks great on greedy decoding and terrible the moment you sample. The on-policy half of OPD partly counterweights this since the student is generating its own diverse rollouts, and temperature-scaled sampling in GKD and G-OPD gives an explicit control knob: higher temperature trades gradient noise for distributional coverage. But the underlying trade-off does not go away, and any OPD recipe that does not measure both ends of it is reporting half the picture.
![image](https://hackmd.io/_uploads/H1XLyceeMx.png)
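
Measuring both ends is cheap. The sketch below uses the standard unbiased pass@k estimator from the code-generation evaluation literature, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The post itself does not say how Pass@k should be computed, so treat that choice as an assumption. The two "students" and their correct-sample counts are invented to show the trade-off.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over prompts, given the number of correct samples per prompt."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 10
collapsed = [10, 0]  # always solves prompt 1, never solves prompt 2
diverse = [4, 4]     # solves each prompt some of the time

collapsed_p1 = mean_pass_at_k(collapsed, n, k=1)
collapsed_p8 = mean_pass_at_k(collapsed, n, k=8)
diverse_p1 = mean_pass_at_k(diverse, n, k=1)
diverse_p8 = mean_pass_at_k(diverse, n, k=8)
```

The collapsed student wins on pass@1 and loses on pass@8, which is exactly the half-picture problem: report only one of the two and either student can be made to look better.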


### The privileged information gap and calibration

The OPSD-specific failure mode is more subtle. The teacher is the same weights as the student, but conditioned on PI the student will not have at inference. So the teacher's *next-token confidence* is being computed under information the deployed model literally cannot see. 

Teacher-conditioned success, $P(\text{correct} \mid x, \mathrm{PI})$, is generally not a valid target for deployment-time confidence, $P(\text{correct} \mid x)$. When the PI is informative those two quantities can differ by a lot, and OPSD methods that distil the first as if it were the second are not just suboptimal on calibration, they are aimed at the wrong distribution. 
![image](https://hackmd.io/_uploads/ryVPk9leze.png)

A related self-distillation pathology is *epistemic suppression*. OPSD methods that compress reasoning traces, CRISP being the most explicit, disproportionately delete hedging phrases and uncertainty markers, since those are exactly the high-entropy tokens that the teacher's PI-conditioned view drops. The result is a shorter trace that is also more confidently wrong on the steps it elided, because the student has lost the calibration signal that uncertainty language was implicitly carrying. 

 
### Teacher capability is doing more work than these methods admit

Every method in this post tacitly assumes a teacher that, given the PI, is meaningfully better than the no-PI student on the metric you care about. If that gap is small the whole pipeline degenerates. A marginally better teacher gives a marginally better per-token signal, and the dense supervision turns into dense noise rather than dense signal. The story stops working as soon as the model is asked to self-distil on tasks where it is bad with the PI too. The student effectively learns a PI-free policy that aggregates PI-conditioned teachers, which works when PI is a *shared latent rule* (a system prompt, a style instruction) but degrades when PI is *instance-specific* (a particular worked solution, a particular document). 

![image](https://hackmd.io/_uploads/SyFKyqlxGx.png)


### My read

The deepest issue, in my read, is proxy mismatch. KL between PI-conditioned and no-PI views is used as a per-token update signal, but it mixes true informational gain with estimator noise from unequal information states. Even when optimisation is stable, the target can still be wrong.

Relatedly, distillation often imitates one specific PI instantiation when the real goal is PI-invariant skill transfer. A better target is the average behaviour over a family of valid PIs,
\begin{equation}
    \bar{\pi}(\cdot \mid x)
    =
    \mathbb{E}_{r \sim p(r \mid x)}
    \left[
    \pi_{\theta}(\cdot \mid x, r)
    \right].
\end{equation}
This object captures skill without overfitting to one hint realization. It also explains the empirical split: shared-rule PI transfers better than instance-specific PI. My expectation is that robust OPSD will be hybrid, keeping self-distilled magnitudes but anchoring direction and calibration with external verifiers.
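
In practice $\bar{\pi}$ would have to be estimated, and the simplest estimator is a Monte Carlo average of the teacher's next-token distribution over a handful of sampled PIs. The sketch below assumes the PIs are sampled uniformly and that all teacher views share a vocabulary. `average_teacher` is a hypothetical name and the distributions are toy numbers.

```python
def average_teacher(teacher_dists):
    """Uniform average of pi_theta(. | x, r_i) over sampled PIs r_1..r_m."""
    m = len(teacher_dists)
    vocab_size = len(teacher_dists[0])
    return [sum(dist[v] for dist in teacher_dists) / m for v in range(vocab_size)]

# Next-token distributions of the same model under three different hints
# for the same prompt.
views = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
]
pi_bar = average_teacher(views)
```

Tokens that every hint agrees on keep their mass in `pi_bar`, while mass that only one particular hint induces gets diluted, which is the sense in which the average is less tied to a single PI instantiation.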


## Open Problems

The failure modes section is the negative version of a roadmap. Reading it through, a few problems stand out as the ones the field would actually have to solve to make OPD and OPSD reliable post-training tools rather than a collection of recipes. None of these are settled.

### From PI-only to PI plus verifier

RLSD's split (magnitude from self-distillation, direction from a verifier) is probably the early version of where the OPSD literature converges, but it leaves the obvious next question open. What does a *soft* verifier look like? A binary correctness signal is a crude object to be controlling the sign of every per-token update with, and most real tasks do not have one anyway. The interesting versions of this problem are partial-credit verifiers for long-form generation, process-reward-model verifiers that can give per-step rather than per-trajectory signal, and verifier-of-verifier schemes where one model's confidence on a rollout acts as the direction signal for another's distillation magnitude. None of these have clean OPD-shaped formulations yet.

### Distilling skills rather than PI shapes

The $\bar{\pi}(\cdot \mid x) = \mathbb{E}_{r}\left[\pi_{\theta}(\cdot \mid x, r)\right]$ formulation from the last section is a target, not a method. The actual open problem is how you build the family $p(r \mid x)$ in the first place. Possible directions: paraphrase ensembles over a reference PI, a small "PI generator" model trained to produce diverse hints for a prompt, or sampling-based estimators that average teacher views across multiple stochastic conditionings of the same source signal. Any of these would push OPSD from "distil one specific PI" toward "distil the conditional distribution over PIs", which is the object you actually want.

### Uncertainty-aware feedback

CaOPD is the first paper to take the *capability vs calibration* split seriously, but it is also explicitly a single intervention rather than a framework. The harder version of the problem is to bake calibration into the OPD objective itself: a loss that penalises the student for being more confident than the teacher would be under the student's deployment-time information, not the teacher's training-time information. Local teachability collapse and Rock Tokens belong here too, both are diagnostics that say "the teacher has nothing more to teach on this token", and OPD currently has no way to act on that other than truncating supervision. A useful future objective would let the teacher's *informativeness* (margin, calibration, agreement across multiple PI samples) actively gate the per-token loss instead of contributing to it uniformly.

### Cross-tokeniser OPD as a structural unlock

The shared-tokeniser assumption sits quietly underneath every method in the main body, and it is also probably the single biggest constraint on how much post-training mileage OPD can actually deliver. If cross-tokeniser OPD worked as cleanly as same-tokeniser OPD, the recipe would stop being "distil from a teacher in your own model family" and start being "distil from whichever model in the world is best at the thing you care about". A small open-weights student could pull reasoning from one frontier model, tool-use from another, multilingual coverage from a third, and code from a fourth, all in the same post-training run. That is a qualitatively different unlock from anything in the current literature, and it is the version of OPD that would make it a genuine successor primitive to RLVR rather than a complement to it.

The current state is that the three families of methods covered in Appendix A (optimal transport, dual-space projection, vocabulary alignment) all *work*, but each comes with a noticeable quality tax compared to the same-tokeniser baseline. The on-policy version of the problem is harder still, since student rollouts have to be re-scored under the teacher's vocabulary at every step and the boundary-mismatch noise compounds along the trajectory. GOLD and CTPD have made this practically accessible inside TRL, but the underlying alignment is still the bottleneck. The interesting open versions of the problem are an OT formulation that is aware of trajectory structure rather than per-token shape only, a dual-space projection learned jointly with the student rather than fixed up front, and a byte-level or character-level interface that removes the tokenizer dependency entirely. Any one of these landing cleanly would change the menu of teachers available to every post-training pipeline at once, which is why this is plausibly the highest-leverage open problem on the list.

## OPD in the wild

This is no longer just a research curiosity. Across reports from [Qwen3](https://arxiv.org/abs/2505.09388), [DeepSeek-V4](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf), [GLM-5](https://arxiv.org/abs/2602.15763), and [MiMo-V2-Flash](https://arxiv.org/abs/2601.02780), OPD appears as a practical consolidation stage after heavier SFT/RL phases. The common industrial pattern is: build capability with pretraining and RL-style optimisation, then use on-policy distillation to preserve behaviour while compressing into a deployable student. The details differ by lab, but the role OPD plays is increasingly consistent.

## Final thoughts

My own bet is that the next generation of post-training pipelines will treat OPD the way the current one treats RLVR, as a default stage that runs after the heavier-cost components and pays for itself in deployment economics. The interesting research questions are not whether OPD works (it does), they are which version of it ends up being the one everyone runs by default, and what the failure modes look like at the next order of magnitude of teacher capability. I expect a fair amount of the early literature in this post to get partially obsoleted by whichever cleaner formulation lands first, and that is mostly a good thing. The field is young enough that the right answer probably is not in any single paper yet, but it is recognisably close.

Thanks for reading! I had a ton of fun researching this one and building my own intuition for what is happening underneath the recipe, including a few small experiments on the side, from which I derived most of the takes here and which I will post about in the coming days. If you got this far, I would genuinely love to talk about any of this. I'm happy to discuss anything in the post, especially the bits where you think I got it wrong.

## Appendix

### A. Cross-tokeniser distillation

Almost everything in the main body assumed teacher and student share a tokenizer. In practice they often do not, which means the teacher's logits and the student's rollouts live in different vocabulary spaces and cannot be compared directly. This breaks every divergence-based OPD method in the post unless you do something about it. Three rough families of fixes have emerged.

The first is *optimal-transport-style alignment*. [ULD](https://arxiv.org/abs/2402.12030) (Universal Logit Distillation) sorts each model's logits by probability and matches them position-by-position via an optimal transport cost, sidestepping vocabulary identity entirely. The intuition is that what matters for distillation is the *shape* of the distribution (entropy, top-K mass, tail behaviour) rather than which specific token IDs carry that shape. This works surprisingly well when the two vocabularies cover comparable surface text, and degrades on languages or domains where one vocabulary is much coarser than the other. [Multi-Level OT](https://arxiv.org/abs/2412.14528) extends the same idea to both token- and sequence-level transport, and [the byte-level interface](https://arxiv.org/abs/2604.07466) sidesteps the alignment problem entirely by converting teacher distributions into byte-level scores and decoding the student at the byte level.
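
To make the sorted-logit idea concrete, here is a minimal NumPy sketch of a ULD-style distance for a single position. It is an illustration of the mechanism described above, not the paper's reference implementation: the function names are mine, and the choice of an L1 distance over zero-padded, descending-sorted probabilities is my reading of the method.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sorted_prob_distance(teacher_logits, student_logits):
    """L1 distance between descending-sorted probability vectors.

    Token identity is discarded on purpose: only the shape of each
    distribution is compared. The smaller vocabulary is zero-padded
    so both vectors have the same length (an assumption of this sketch).
    """
    p_t = np.sort(softmax(teacher_logits))[::-1]
    p_s = np.sort(softmax(student_logits))[::-1]
    size = max(p_t.size, p_s.size)
    p_t = np.pad(p_t, (0, size - p_t.size))
    p_s = np.pad(p_s, (0, size - p_s.size))
    return float(np.abs(p_t - p_s).sum())

# Same shape, different token IDs: the distance is (numerically) zero.
teacher = np.array([2.0, 0.0, 1.0])
student_permuted = np.array([1.0, 2.0, 0.0])
same_shape = sorted_prob_distance(teacher, student_permuted)

# A flat student over a larger vocabulary: the distance is clearly positive.
student_flat = np.zeros(5)
different_shape = sorted_prob_distance(teacher, student_flat)
```

The permuted case is the whole point of the method, and also its limit: two distributions that put their mass on entirely different tokens look identical to this loss.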

The second is *dual-space projection*. [DSKD](https://arxiv.org/abs/2504.11426) and related methods train a small projection from the teacher's logit space into the student's, or both into a shared latent space, and run KD in that shared space. The cost is an extra learnable component and the dependence on a good projection; the benefit is that you keep token identity rather than discarding it.

The third is *vocabulary-level alignment*. Methods in this family (MinED-style edit-distance matching and its successors, [CDM](https://arxiv.org/abs/2502.11104) for contextual dynamic mapping, [DWA-KD](https://arxiv.org/abs/2602.21669) using Soft-DTW differentiable sequence alignment, [SimCT](https://arxiv.org/abs/2605.07711) recovering supervision via short multi-token continuations) try to find an explicit token-to-token correspondence between the two vocabularies, splitting and merging where necessary so that per-token loss can be computed directly. This is the most faithful to the original divergence-based formulation and the most fragile to special tokens and chat templates, which is exactly where Fu et al.'s "tokenizer mismatch distortion" failure mode bites hardest.

For OPD specifically, the cross-tokeniser problem has an extra wrinkle: the student rollouts are in the student's tokens, but the teacher needs to score them. Re-tokenising the student's rollout under the teacher's vocabulary and then scoring per token introduces silent boundary mismatches at every re-tokenisation step. The cleanest current practice is to do the alignment per-rollout rather than per-token, and to restrict the KL to the subset of positions where both tokenisations agree on a boundary. That is enough to make most published recipes work, but it is clearly not the final answer.
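
The boundary-agreement mask is easy to sketch if you treat each tokenisation as a list of string pieces covering the same text. The snippet below is a toy under that assumption (real tokenizers expose character offsets differently, and the helper names are hypothetical): it finds the positions where both tokenisations end a token at the same character offset, which is where the two models are conditioned on exactly the same prefix.

```python
from itertools import accumulate

def end_offsets(tokens):
    """Character offset at which each token ends."""
    return list(accumulate(len(t) for t in tokens))

def aligned_positions(student_tokens, teacher_tokens):
    """Pairs (student_idx, teacher_idx) whose tokens end at the same offset."""
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("tokenisations must cover the same text")
    teacher_ends = {end: j for j, end in enumerate(end_offsets(teacher_tokens))}
    return [
        (i, teacher_ends[end])
        for i, end in enumerate(end_offsets(student_tokens))
        if end in teacher_ends
    ]

def alignment_rate(student_tokens, teacher_tokens):
    """Fraction of student positions that land on a shared boundary."""
    pairs = aligned_positions(student_tokens, teacher_tokens)
    return len(pairs) / len(student_tokens)

# Toy example: the two tokenisations split the number differently.
student = ["x", " =", " 123", "45"]
teacher = ["x", " =", " 1", "2345"]
pairs = aligned_positions(student, teacher)
rate = alignment_rate(student, teacher)
```

Here the only student position that falls off a shared boundary is the one in the middle of the number, which matches the intuition that numerics are where the masks get sparse.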

If you want to actually run cross-tokeniser OPD today rather than reimplement one of the papers above, the easiest entry point is [GOLD](https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation) (HuggingFace H4's *Unlocking On-Policy Distillation for Any Model Family*), which packages a cross-tokeniser OPD recipe directly into TRL. GOLD is not a new objective so much as an engineering integration, but it is the cleanest path from "I want OPD across two different model families" to a working training loop, and it is what most of the recent cross-family distillation walkthroughs in the field point at. [CTPD](https://arxiv.org/abs/2601.11865) is a useful complement when the goal is *preference* distillation across tokenisers, since it adds teacher-anchored DPO with cross-tokeniser importance sampling on top of the aligned-span projection that the other methods leave implicit.

The practical shape of all of this, when teacher and student come from genuinely different families (something GLM-shaped on one side and something Qwen-shaped or Gemma-shaped on the other, in the 4B-to-9B student range), is roughly the same regardless of which exact pair you pick. All three modern families use byte-level BPE with different merges and different chat templates, so the vocabularies overlap on most English-language content and disagree sharply on numbers, code, special tokens, and anything language-specific.

The recipe that holds up across these setups is:

1. Strip the teacher's chat template off the rollouts before doing anything else and re-emit them under the student's chat template, so the teacher's `<|user|>`, `<|assistant|>` and friends never touch the student's tokenizer. This is the single biggest source of silent corruption and also the easiest to fix.
2. Run the rollouts on-policy in the student's tokens.
3. Re-tokenise each rollout under the teacher's vocabulary for scoring.
4. Compute the loss only on positions where the two tokenizations agree on a boundary.

For English prose that boundary-agreement rate sits around 80 to 95 percent of positions, for dense numerics or code it can drop to 40 to 60 percent, and on the unaligned positions the cleanest fallback is a sorted-logit ULD term rather than dropping the supervision entirely. That keeps signal flowing on exactly the regions where the tokenizers disagree the most without faking a one-to-one correspondence that does not exist. GOLD ships most of this out of the box with reasonable defaults, which is most of the reason it has become the default entry point for these setups.

The two diagnostics worth keeping an eye on are a per-position alignment-rate plot (you want it flat across training and not silently drifting) and a held-out evaluation that explicitly mixes numeric, code, and multilingual prompts, since those are the slices where boundary mismatch hits hardest and aggregate metrics will hide it.
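
The template-stripping step in the recipe above is mostly string hygiene, so a rough sketch is enough to show the shape of it. Everything here is illustrative: the marker regex only covers the `<|user|>`-style tags named above, and the `[role] ... [/role]` student format is a made-up stand-in. In a real pipeline you would render the turns with the student tokenizer's own chat template instead.

```python
import re

# Teacher-style role markers as named in the text; real templates vary by model.
TEACHER_MARKER = re.compile(r"<\|(system|user|assistant)\|>")

def parse_teacher_transcript(text):
    """Split teacher-formatted text into (role, content) turns."""
    # re.split with one capture group returns
    # [prefix, role_1, content_1, role_2, content_2, ...]
    parts = TEACHER_MARKER.split(text)
    return [
        (role, content.strip())
        for role, content in zip(parts[1::2], parts[2::2])
    ]

def render_student(turns):
    """Re-emit turns under a HYPOTHETICAL student template."""
    return "".join(
        f"[{role}]\n{content}\n[/{role}]\n" for role, content in turns
    )

raw = "<|user|>\nWhat is 2+2?<|assistant|>\n4"
turns = parse_teacher_transcript(raw)
clean = render_student(turns)

# Cheap guard worth keeping in the data loader: no teacher marker survives.
assert TEACHER_MARKER.search(clean) is None
```

Going through a neutral list of `(role, content)` turns, rather than doing a find-and-replace from one template to the other, means no teacher-specific string can leak through by accident.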

### B. Multi-teacher OPD

The other natural extension of the OPD recipe is to use more than one teacher. G-OPD already half-covers this on the theory side, since the reference model in the G-OPD objective can be anything, and stacking multiple references is a small change to the loss. The simplest multi-teacher OPD is just a weighted average of teacher distributions inside the KL:
\begin{equation}
    \pi^{*}(y \mid x)
    =
    \sum_{i=1}^{K}
    w_i \, \pi_{T_i}(y \mid x),
\end{equation}
with the $w_i$ either fixed by hand, set by validation, or computed per-prompt from each teacher's confidence on that prompt. Specialty-routing variants pick a single teacher per prompt or per token based on which one has the highest pass rate on the relevant slice of the data.
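
As a quick sketch of the averaged target at a single token position, the snippet below builds the weighted mixture and scores a student distribution against it. The function names are mine, and the reverse-KL direction (student first) is an assumption carried over from the usual OPD setup rather than something the equation above fixes.

```python
import numpy as np

def mixture(teacher_probs, weights):
    """Weighted average of K teacher distributions.

    teacher_probs: array of shape (K, V), each row sums to 1.
    weights: length-K non-negative weights, normalised here.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(teacher_probs, dtype=float)

def reverse_kl(student_probs, target_probs, eps=1e-12):
    """KL(student || target) at one position (direction is an assumption)."""
    s = np.asarray(student_probs, dtype=float)
    t = np.asarray(target_probs, dtype=float)
    return float(np.sum(s * (np.log(s + eps) - np.log(t + eps))))

teachers = np.array([
    [0.7, 0.2, 0.1],   # teacher 1
    [0.1, 0.2, 0.7],   # teacher 2
])
target = mixture(teachers, weights=[0.5, 0.5])   # equal hand-set weights

student = np.array([0.7, 0.2, 0.1])
loss_vs_mixture = reverse_kl(student, target)
loss_vs_itself = reverse_kl(target, target)
```

Note that the student here matches teacher 1 exactly and still pays a non-zero loss against the mixture, which is a small preview of the disagreement problem discussed below.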

What is interesting about the current state of multi-teacher OPD is that the academic literature is thinner than the industrial usage. *Multi-Teacher On-Policy Distillation* (MOPD) is the explicit named stage in [MiMo-V2-Flash](https://arxiv.org/abs/2601.02780), and it shows up under different names in [GLM-5](https://arxiv.org/abs/2602.15763), [Nemotron-Cascade 2](https://arxiv.org/abs/2603.19220), [Baichuan-M3](https://arxiv.org/abs/2602.06570), and [DeepSeek-V4](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf), each as the consolidation step that turns a stable of domain-expert checkpoints into a single deployable model. Yumo Xu's [survey post on MOPD](https://yumoxu.notion.site/multi-teacher-on-policy-distillation) is the best practitioner-level synthesis of the pattern across these releases. The shared move is *specialise then unify*: train one expert per domain with RL or SFT, then run multi-teacher OPD on student rollouts with each expert acting as the teacher on its own slice of the prompt distribution. [KAT-Coder-V2](https://arxiv.org/abs/2603.27703) is an explicit articulation of this recipe for coding agents, and [CoPD](https://arxiv.org/abs/2604.27083) is the bidirectional version where experts also co-evolve as mutual teachers during RLVR.

The interesting failure mode here is *teacher disagreement*. When the teachers all agree, multi-teacher OPD behaves like an effective ensemble and is strictly better than any single teacher. When they disagree, the naive averaged distribution is flatter than any individual teacher's, and a naive KL pushes the student toward something none of the teachers would actually have said. The industrial recipes mostly handle this by routing rather than averaging (one teacher per prompt, picked by domain), but the principled per-token version of the problem is still open. Given how heavily MOPD now sits in production pipelines, it is one of the easier places to do useful original work right now: the deployment gap between "MOPD ships at every frontier lab" and "the academic literature has roughly five papers on it" is unusually wide.
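
A toy example makes the flattening visible. The two "teachers" below are invented distributions over a three-token vocabulary, each confident about a different token. Entropy is concave, so the mixture's entropy is always at least the weighted average of the teachers' entropies, and in a disagreement case like this one it exceeds both. The routing function is a deliberately trivial stand-in for the per-prompt domain routing described above.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in nats; higher means a flatter distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def averaged_target(teachers, weights):
    """Naive multi-teacher target: weighted average of all teachers."""
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ np.stack(list(teachers.values()))

def routed_target(teachers, domain):
    """Routing alternative: one teacher per prompt, picked by domain."""
    return teachers[domain]

# Hypothetical experts that disagree on the next token.
teachers = {
    "math": np.array([0.90, 0.05, 0.05]),
    "code": np.array([0.05, 0.05, 0.90]),
}

avg = averaged_target(teachers, [0.5, 0.5])
routed = routed_target(teachers, "math")

teacher_entropies = {name: entropy(p) for name, p in teachers.items()}
avg_entropy = entropy(avg)
routed_entropy = entropy(routed)
```

The averaged target splits its mass between two tokens that neither expert would hedge between, while the routed target keeps the sharpness of whichever expert owns the prompt.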

## References (and suggested full reads)

**Surveys and overviews**

1. Song, Zheng, et al. *A Survey on On-Policy Distillation for Large Language Models.* arXiv preprint, arXiv:2604.00626, 2026. [link](https://arxiv.org/abs/2604.00626)
2. Thinking Machines. *On-Policy Distillation.* Blog post, 2025. [link](https://thinkingmachines.ai/blog/on-policy-distillation/)
3. Yumo Xu. *Multi-Teacher On-Policy Distillation: A New Post-Training Primitive.* Notion essay, 2026. [link](https://yumoxu.notion.site/multi-teacher-on-policy-distillation)
4. [awesome-on-policy-distillation](https://github.com/chrisliu298/awesome-on-policy-distillation) (Awesome curated list, used as the scaffolding for this post).

**Foundational OPD**

5. Gu, Dong, et al. *MiniLLM: Knowledge Distillation of Large Language Models.* arXiv preprint, arXiv:2306.08543, 2023. [link](https://arxiv.org/abs/2306.08543)
6. Agarwal, Vieillard, et al. *Generalized Knowledge Distillation for Auto-Regressive Sequence Models (GKD).* arXiv preprint, arXiv:2306.13649, 2023. [link](https://arxiv.org/abs/2306.13649)
7. Ko, Kim, et al. *DistiLLM: Towards Streamlined Distillation for Large Language Models.* arXiv preprint, arXiv:2402.03898, 2024. [link](https://arxiv.org/abs/2402.03898)
8. Singh, Co-Reyes, et al. *Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models (RestEM).* arXiv preprint, arXiv:2312.06585, 2023. [link](https://arxiv.org/abs/2312.06585)
9. Yang, Liu, et al. *Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation (G-OPD / ExOPD).* arXiv preprint, arXiv:2602.12125, 2026. [link](https://arxiv.org/abs/2602.12125)
10. Ye, Dong, et al. *On-Policy Context Distillation for Language Models (OPCD).* arXiv preprint, arXiv:2602.12275, 2026. [link](https://arxiv.org/abs/2602.12275)

**On-policy self-distillation (OPSD)**

11. Zhao, Xie, et al. *Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (OPSD).* arXiv preprint, arXiv:2601.18734, 2026. [link](https://arxiv.org/abs/2601.18734)
12. Shenfeld, Damani, et al. *Self-Distillation Enables Continual Learning (SDFT).* arXiv preprint, arXiv:2601.19897, 2026. [link](https://arxiv.org/abs/2601.19897)
13. Hübotter, Lübeck, et al. *Reinforcement Learning via Self-Distillation (SDPO).* arXiv preprint, arXiv:2601.20802, 2026. [link](https://arxiv.org/abs/2601.20802)
14. Stein, Huang, and Goldstein. *GATES: Self-Distillation under Privileged Context with Consensus Gating.* arXiv preprint, arXiv:2602.20574, 2026. [link](https://arxiv.org/abs/2602.20574)
15. Sang, Xu, et al. *CRISP: Compressed Reasoning via Iterative Self-Policy Distillation.* arXiv preprint, arXiv:2603.05433, 2026. [link](https://arxiv.org/abs/2603.05433)
16. Yang, Qin, et al. *Self-Distilled RLVR (RLSD).* arXiv preprint, arXiv:2604.03128, 2026. [link](https://arxiv.org/abs/2604.03128)
17. Zhang, Peng, et al. *The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation (CaOPD).* arXiv preprint, arXiv:2604.16830, 2026. [link](https://arxiv.org/abs/2604.16830)
18. Kim, Luo, et al. *Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?* arXiv preprint, arXiv:2603.24472, 2026. [link](https://arxiv.org/abs/2603.24472)
19. Cha and Cho. *Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation.* arXiv preprint, arXiv:2505.13111, 2025. [link](https://arxiv.org/abs/2505.13111)
20. Li, Jiang, et al. *The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs (ListOPD).* arXiv preprint, arXiv:2605.08737, 2026. [link](https://arxiv.org/abs/2605.08737)
21. Jeong. *Healthcare AI Gym for Medical Agents (TT-OPD).* arXiv preprint, arXiv:2605.02943, 2026. [link](https://arxiv.org/abs/2605.02943)
22. Zhu, Ye, et al. *The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes.* arXiv preprint, arXiv:2605.11182, 2026. [link](https://arxiv.org/abs/2605.11182)
23. Jiang, Li, et al. *Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation.* arXiv preprint, arXiv:2605.09253, 2026. [link](https://arxiv.org/abs/2605.09253)
24. Fu, Huang, et al. *Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (tokenizer mismatch distortion).* arXiv preprint, arXiv:2603.25562, 2026. [link](https://arxiv.org/abs/2603.25562)

**Cross-tokeniser distillation**

25. *ULD: Universal Logit Distillation.* arXiv preprint, arXiv:2402.12030, 2024. [link](https://arxiv.org/abs/2402.12030)
26. *DSKD: Dual-Space Knowledge Distillation.* arXiv preprint, arXiv:2504.11426, 2025. [link](https://arxiv.org/abs/2504.11426)
27. *Multi-Level Optimal Transport for Cross-Tokenizer KD.* arXiv preprint, arXiv:2412.14528, 2024. [link](https://arxiv.org/abs/2412.14528)
28. *CDM: Contextual Dynamic Mapping for Cross-Tokenizer KD.* arXiv preprint, arXiv:2502.11104, 2025. [link](https://arxiv.org/abs/2502.11104)
29. *DWA-KD: Differentiable Word Alignment Knowledge Distillation.* arXiv preprint, arXiv:2602.21669, 2026. [link](https://arxiv.org/abs/2602.21669)
30. *SimCT: Multi-Token Continuation Cross-Tokenizer Distillation.* arXiv preprint, arXiv:2605.07711, 2026. [link](https://arxiv.org/abs/2605.07711)
31. *Byte-Level Interface for Cross-Tokenizer Distillation.* arXiv preprint, arXiv:2604.07466, 2026. [link](https://arxiv.org/abs/2604.07466)
32. *CTPD: Cross-Tokenizer Preference Distillation.* arXiv preprint, arXiv:2601.11865, 2026. [link](https://arxiv.org/abs/2601.11865)
33. HuggingFace H4. *GOLD: Unlocking On-Policy Distillation for Any Model Family.* HuggingFace Space, 2025. [link](https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation)

**Industrial post-training reports**

34. Qwen Team. *Qwen3 Technical Report.* arXiv preprint, arXiv:2505.09388, 2025. [link](https://arxiv.org/abs/2505.09388)
35. Google DeepMind. *Gemma 2 Technical Report.* arXiv preprint, arXiv:2408.00118, 2024. [link](https://arxiv.org/abs/2408.00118)
36. Zhipu / GLM Team. *GLM-4.5 and GLM-4.6 Technical Report.* arXiv preprint, arXiv:2508.06471, 2025. [link](https://arxiv.org/abs/2508.06471)
37. Zhipu / GLM Team. *GLM-5 Technical Report.* arXiv preprint, arXiv:2602.15763, 2026. [link](https://arxiv.org/abs/2602.15763)
38. DeepSeek. *DeepSeek-V4 Pro Technical Report.* HuggingFace, 2026. [link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)
39. Xiaomi. *MiMo-V2-Flash Technical Report.* arXiv preprint, arXiv:2601.02780, 2026. [link](https://arxiv.org/abs/2601.02780)
40. NVIDIA. *Nemotron-Cascade 2 Technical Report.* arXiv preprint, arXiv:2603.19220, 2026. [link](https://arxiv.org/abs/2603.19220)
41. Baichuan. *Baichuan-M3 Technical Report.* arXiv preprint, arXiv:2602.06570, 2026. [link](https://arxiv.org/abs/2602.06570)
42. Kuaishou. *KAT-Coder-V2 Technical Report.* arXiv preprint, arXiv:2603.27703, 2026. [link](https://arxiv.org/abs/2603.27703)
43. *CoPD: Bidirectional Co-Policy Distillation.* arXiv preprint, arXiv:2604.27083, 2026. [link](https://arxiv.org/abs/2604.27083)
44. Cursor. *Composer 2.5 Release Notes.* Blog post, 2026. [link](https://cursor.com/blog/composer-2-5)

### Suggested reads

1. [Thinking Machines blog on On-Policy Distillation](https://thinkingmachines.ai/blog/on-policy-distillation/), still the single best practitioner-level intro.
2. [Yumo Xu's MOPD essay](https://yumoxu.notion.site/multi-teacher-on-policy-distillation) for the clearest writeup of multi-teacher OPD as an industrial primitive.
3. The [GOLD HuggingFace Space](https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation), which is the easiest way to actually run cross-tokeniser OPD today.
4. The [Song & Zheng survey](https://arxiv.org/abs/2604.00626), which is the closest thing to a single reference for everything in this post.