Good teachers don’t cheat

Jason Ken Adhinarta — Wed, 03 Jun 2026 04:00:00 GMT

TL;DR: Policy gradient RL, self-distillation techniques like SDFT, and Pedagogical RL can all be viewed as optimizing the same objective, the KL-regularized expected reward $\mathbb{E}_{q}[r] - \beta\,\mathrm{KL}(q \,\|\, \pi_0)$ relative to a base policy $\pi_0$, just with slightly different optimization procedures. The privileged information $z$ that some of these methods feed in context is simply a tool to make the optimization of this objective easier. The punchline is that, at optimality, the teacher’s use of $z$ has to vanish: good teachers don’t cheat.

Background

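Suppose we have a base policy $\pi_0(y \mid x)$ and we are interested in learning the policy

$$\pi^*(y \mid x) \propto \pi_0(y \mid x)\,\exp\big(r(x, y)/\beta\big),$$

where $r(x, y)$ is the reward and $\beta > 0$ is a temperature.

With binary rewards $r(x, y) \in \{0, 1\}$ and $\beta \to 0$, this collapses to

$$\pi^*(y \mid x) \propto \pi_0(y \mid x)\,\mathbb{1}[r(x, y) = 1],$$

i.e., the base policy restricted to correct answers.¹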
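As a quick sanity check of this limit, here is a minimal sketch on a toy problem. The four-answer distribution, the reward vector, and the helper name `tilted_policy` are all hypothetical and chosen only for illustration; the code tilts a base policy by `exp(r / beta)` and compares the result with the base policy renormalized over correct answers.

```python
import math

def tilted_policy(pi0, rewards, beta):
    """Return the distribution proportional to pi0(y) * exp(r(y) / beta)."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(pi0, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy problem: four candidate answers, two of them correct.
pi0 = [0.5, 0.3, 0.15, 0.05]
rewards = [0, 1, 0, 1]

# Base policy restricted to correct answers and renormalized.
mass = sum(p for p, r in zip(pi0, rewards) if r == 1)
restricted = [p / mass if r == 1 else 0.0 for p, r in zip(pi0, rewards)]

for beta in (1.0, 0.1, 0.01):
    print(beta, [round(p, 4) for p in tilted_policy(pi0, rewards, beta)])
print("restricted", [round(p, 4) for p in restricted])
```

As `beta` shrinks, the mass on incorrect answers decays like `exp(-1 / beta)`, while the relative probabilities among correct answers stay exactly those of the base policy.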

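We can characterize RL as an ELBO bound on the partition function $Z(x)$ of the target policy $\pi^* \propto \pi_0 \exp(r/\beta)$, where $\pi_0$ is the base policy, $r$ the reward, and $\beta$ the temperature:

$$\log Z(x) = \log \mathbb{E}_{y \sim \pi_0(\cdot \mid x)}\left[\exp\big(r(x, y)/\beta\big)\right].$$

For any distribution $q(\cdot \mid x)$ whose support covers that of $\pi_0$ (i.e., $q(y \mid x) = 0$ implies $\pi_0(y \mid x) = 0$), we can rewrite this as an importance-weighted expectation under $q$:

$$\log Z(x) = \log \mathbb{E}_{y \sim q(\cdot \mid x)}\left[\frac{\pi_0(y \mid x)}{q(y \mid x)}\,\exp\big(r(x, y)/\beta\big)\right].$$

By Jensen’s inequality and concavity of $\log$,

$$\log Z(x) \ge \mathbb{E}_{y \sim q(\cdot \mid x)}\left[\frac{r(x, y)}{\beta}\right] - \mathrm{KL}\big(q(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big),$$

where equality holds iff $\pi_0(y \mid x)\exp\big(r(x, y)/\beta\big)/q(y \mid x)$ is constant over all $y$ with $q(y \mid x) > 0$; namely, when $q = \pi^*$. It is well known that

$$\pi^* = \arg\max_{q}\; \mathbb{E}_{y \sim q}\big[r(x, y)\big] - \beta\,\mathrm{KL}\big(q \,\|\, \pi_0\big),$$

and this is the objective that policy gradient methods (GRPO, etc.) typically solve.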
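The bound can be checked numerically on a small discrete example. The sketch below assumes a hypothetical four-answer base policy `pi0`, binary `rewards`, and temperature `beta = 0.5`; it evaluates the lower bound (expected reward over `beta`, minus the KL to the base policy) for three choices of `q` and compares each with the log partition function.

```python
import math

def log_partition(pi0, rewards, beta):
    """log of the expectation under pi0 of exp(r / beta)."""
    return math.log(sum(p * math.exp(r / beta) for p, r in zip(pi0, rewards)))

def kl(q, p):
    """KL(q || p) for discrete distributions; terms with q(y) = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def elbo(q, pi0, rewards, beta):
    """Lower bound on log Z: E_q[r] / beta - KL(q || pi0)."""
    expected_reward = sum(qi * r for qi, r in zip(q, rewards))
    return expected_reward / beta - kl(q, pi0)

# Hypothetical toy problem, for illustration only.
pi0 = [0.5, 0.3, 0.15, 0.05]
rewards = [0, 1, 0, 1]
beta = 0.5

log_z = log_partition(pi0, rewards, beta)
weights = [p * math.exp(r / beta) for p, r in zip(pi0, rewards)]
pi_star = [w / sum(weights) for w in weights]  # tilted policy
uniform = [0.25] * 4

print("log Z        ", log_z)
print("ELBO(pi0)    ", elbo(pi0, pi0, rewards, beta))
print("ELBO(uniform)", elbo(uniform, pi0, rewards, beta))
print("ELBO(pi_star)", elbo(pi_star, pi0, rewards, beta))
```

Only the tilted policy `pi_star` closes the gap; the base policy and the uniform distribution both sit strictly below `log_z`.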

As a caveat, some policy gradient methods actually train with no explicit KL penalty ($\beta = 0$) for RLVR problems. However, from RL’s Razor we know that on-policy RL methods implicitly bound KL divergence, making this a reasonable assumption.

Privileged information, and why good teachers don’t cheat

Now suppose we have privileged information $z$ (for example, $z$ is an expert demonstration of a problem $x$) that makes it easier to obtain high rewards. In the simplest case $z = f(x)$ for some deterministic function $f$ (e.g., a lookup table over stored answers).

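Fix $z$ and parameterize a distribution $q_\theta(y \mid x, z)$ with $\theta$. Applying the ELBO bound from above, with base policy $\pi_0$, reward $r$, and temperature $\beta$, we have almost trivially:

$$\log Z(x) \ge \mathbb{E}_{y \sim q_\theta(\cdot \mid x, z)}\left[\frac{r(x, y)}{\beta}\right] - \mathrm{KL}\big(q_\theta(\cdot \mid x, z) \,\|\, \pi_0(\cdot \mid x)\big).$$

That is, for any $z$, the optimal $q_\theta$ that achieves the bound is

$$q_{\theta^*}(y \mid x, z) = \pi^*(y \mid x) \propto \pi_0(y \mid x)\,\exp\big(r(x, y)/\beta\big),$$

which does not depend on $z$!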
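One way to see why the optimum cannot depend on the privileged information is to write the gap in the bound explicitly. The short derivation below is a sketch under assumed notation: $\pi_0$ for the base policy, $q(\cdot \mid x, z)$ for the teacher, $z$ for the privileged information, and $\pi^* = \pi_0 \exp(r/\beta)/Z$ for the tilted policy.

```latex
\begin{aligned}
\mathbb{E}_{y \sim q}\!\left[\tfrac{r(x,y)}{\beta}\right]
  - \mathrm{KL}\big(q(\cdot \mid x, z)\,\|\,\pi_0(\cdot \mid x)\big)
&= \mathbb{E}_{y \sim q}\!\left[\log \frac{\pi_0(y \mid x)\exp\big(r(x,y)/\beta\big)}{q(y \mid x, z)}\right] \\
&= \mathbb{E}_{y \sim q}\!\left[\log \frac{Z(x)\,\pi^*(y \mid x)}{q(y \mid x, z)}\right] \\
&= \log Z(x) - \mathrm{KL}\big(q(\cdot \mid x, z)\,\|\,\pi^*(\cdot \mid x)\big).
\end{aligned}
```

The gap is a KL divergence to a target that only sees $x$, so for every value of $z$ it is zero exactly when the teacher equals $\pi^*(\cdot \mid x)$.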

The KL term is what makes this work, even though its coefficient $\beta$ may be vanishingly small. Suppose $z$ is an expert demonstration containing the final answer, say $y^*$. Then the teacher $q(y \mid x, z) = \mathbb{1}[y = y^*]$ achieves the optimal reward, but it has extremely high KL divergence from the base policy $\pi_0$, and is therefore not feasible to distill into the student. It is difficult to distill a teacher that simply repeats the final answer into the student.
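To make the size of that KL penalty concrete, here is a minimal sketch with a hypothetical three-answer base policy in which the demonstrated answer `"C"` is correct but rare. It compares a teacher that copies the answer from the demonstration with one that keeps the base policy's relative preferences among correct answers. All names and numbers are illustrative assumptions.

```python
import math

def kl(q, p):
    """KL(q || p) over a shared dict of outcomes; q(y) = 0 terms contribute 0."""
    return sum(qy * math.log(qy / p[y]) for y, qy in q.items() if qy > 0)

def expected_reward(q, correct):
    """Probability that a sample from q is a correct answer (binary reward)."""
    return sum(qy for y, qy in q.items() if y in correct)

# Hypothetical base policy: "A" is wrong, "B" and "C" are both correct.
pi0 = {"A": 0.9, "B": 0.099, "C": 0.001}
correct = {"B", "C"}
y_star = "C"  # the final answer revealed by the demonstration

# Teacher that "cheats": puts all mass on the demonstrated answer.
cheating_teacher = {y: 1.0 if y == y_star else 0.0 for y in pi0}

# Teacher that ignores the demonstration: base policy restricted to correct answers.
mass = sum(pi0[y] for y in correct)
honest_teacher = {y: pi0[y] / mass if y in correct else 0.0 for y in pi0}

for name, q in [("cheating", cheating_teacher), ("honest", honest_teacher)]:
    print(name, expected_reward(q, correct), kl(q, pi0))
```

Both teachers earn full reward, but the copying teacher pays a KL of `-log pi0[y_star]`, while the restricted base policy pays only `-log` of the total correct mass, which is the smallest KL any full-reward distribution can achieve in this toy setup.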

RL, self-distillation, and Pedagogical RL are all equivalent

This gives us two ways to optimize the same objective:

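1. Start from the base policy, $q = \pi_0(\cdot \mid x)$. Sparse rewards, but $\mathrm{KL}(q \,\|\, \pi_0) = 0$. GRPO and the like.
2. Start from a teacher $q(\cdot \mid x, z)$ that conditions on the privileged information $z$. Dense rewards, but large KL. Pedagogical RL and on-policy distillation follow this paradigm.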

Importantly, KL divergence is significantly easier to optimize than sparse rewards, since rich gradients can be derived from full per-token logits (see this).

The above motivates the following two-stage procedure:

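1. Train the teacher $q_\theta(y \mid x, z)$ to optimize the KL-regularized objective $\mathbb{E}_{q_\theta}[r] - \beta\,\mathrm{KL}(q_\theta \,\|\, \pi_0)$. At optimality, $q_\theta$ loses its dependence on $z$.
2. Distill the teacher into the student $\pi_\phi(y \mid x)$ via KL minimization, since $z$ is not available at test time. The direction of the KL is exactly the choice between two distillation methods:
   1. Minimizing $\mathrm{KL}(\pi_\phi \,\|\, q_\theta)$ is exactly on-policy distillation.
   2. Minimizing $\mathrm{KL}(q_\theta \,\|\, \pi_\phi)$ is equivalent to maximizing $\mathbb{E}_{y \sim q_\theta}[\log \pi_\phi(y \mid x)]$, which is standard off-policy SFT; Pedagogical RL argues this is more sample-efficient than on-policy distillation.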
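The equivalence between the teacher-to-student KL and SFT follows from the identity KL(teacher || student) = -H(teacher) - E_teacher[log student], where the teacher entropy does not depend on the student. The sketch below checks this on hypothetical three-outcome distributions (the names `teacher` and `students` are illustrative) and also prints the opposite KL direction to show the two are different quantities.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_log_lik(teacher, student):
    """E_{y ~ teacher}[log student(y)]: the SFT objective on teacher samples."""
    return sum(t * math.log(s) for t, s in zip(teacher, student) if t > 0)

# Hypothetical distributions over three outcomes.
teacher = [0.7, 0.2, 0.1]
students = {
    "close": [0.6, 0.3, 0.1],
    "far": [0.1, 0.1, 0.8],
}

for name, student in students.items():
    teacher_to_student = kl(teacher, student)  # minimized by off-policy SFT
    student_to_teacher = kl(student, teacher)  # minimized by on-policy distillation
    via_identity = -entropy(teacher) - expected_log_lik(teacher, student)
    print(name, teacher_to_student, via_identity, student_to_teacher)
```

Because the entropy term is a constant with respect to the student, ranking students by expected log-likelihood under teacher samples is the same as ranking them by KL(teacher || student).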

So policy gradients, self-distillation, and Pedagogical RL can be interpreted as three different optimization procedures for the same underlying objective $\mathbb{E}_{q}[r] - \beta\,\mathrm{KL}(q \,\|\, \pi_0)$, differing in their use of the privileged information $z$ and the direction of KL minimization.

Self-distillation woes

The problem with most on-policy distillation techniques is that the teacher is never trained to optimize the KL-regularized objective: step 1 is skipped! As a result, the teacher is not guaranteed to lose its dependence on the privileged information $z$, and on-policy distillation leaks that information straight into the student. This probably causes the various stability issues observed in on-policy distillation methods, with student and teacher logits diverging irreconcilably.

This post hints at the direction of some ongoing work; more to come!

Footnotes