
Current “unlearning” methods only suppress capabilities instead of truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Since labs already distill models before deploying them, our work implies they might achieve robust unlearning “for free” by simply applying an unlearning step before distillation.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
Distilling the good while leaving behind the bad.
Thanks

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream. Read our paper and enjoy an interactive demo.

Maybe some future AI has long-term goals and humanity is in its way. Maybe future open-weight AIs have tons of bioterror expertise. If a system has dangerous knowledge, that system becomes more dangerous, either in the wrong hands or in the AI’s own “hands.” By making it harder to get AIs to share or use dangerous knowledge, we decrease (but do not eliminate) catastrophic risk.

Misuse risk

Robust unlearning prevents finetuning attacks from easily retraining a model to share or use the unlearned skill or behavior. Since anyone can finetune an open-weight model, it’s not enough to merely suppress a behavior before releasing the weights. Even closed-source models can be jailbroken, but if the capability is truly no longer present, then a jailbreak can’t elicit an ability that isn’t there to begin with.

Misalignment risk

Robust unlearning could remove strategic knowledge and skills that an unaligned AI might rely on. Potential removal targets include knowledge of: AI control protocols or datacenter security practices; weight exfiltration; self-modification techniques; the fact that it is an AI system; or even the ability to be influenced by negative stereotypes about AI. Robust unlearning could maybe even cripple an AI’s hacking or biology skills, or make it a less convincing liar.

Perhaps robust unlearning simply makes it harder for an AI to reason about an area, but doesn’t stop the AI entirely. That outcome would still be less risky.

Data filtering removes the training data related to the undesired capabilities. Sadly, data filtering is usually impractical.

  1. It’s hard and expensive to identify all of the training data across the entire pretraining corpus that contributes to an unwanted capability.
  2. Work on gradient routing showed that when data is filtered imperfectly, the filtering quickly loses effectiveness.
  3. Sometimes, dangerous capabilities can come from combinations of seemingly safe data.

If we want practical robust unlearning, we probably need a different approach.

Most unlearning methods try to make a model forget a specified capability by finetuning it. However, finetuning usually only teaches the model to suppress the behavior, not remove the underlying capability.

We show this limitation persists even in the idealized setting of finetuning a model to exactly match the outputs of an oracle model that has never learned the specified capability in the first place. We take a model pretrained on both retain and forget data and finetune it on the logits of an oracle model, which was trained only on the retain data. Before subjecting the models to a relearning attack, the finetuned model behaves nearly identically to the oracle, but when we retrain both to relearn the forgotten capability, the finetuned model picks it up much faster. The capability wasn’t erased; it was just hidden.
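
For concreteness, here is a minimal sketch of what “finetuning on the logits” can look like as a training step, assuming a Hugging Face-style causal language model and a KL-divergence matching loss. The names and hyperparameters are illustrative, not our exact training code.

```python
# Minimal sketch of logit matching (distillation): train a student to match
# a frozen teacher's output distribution via a KL-divergence loss.
# Assumes Hugging Face-style models; names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def logit_matching_step(student, teacher, input_ids, optimizer, temperature=1.0):
    """One training step that pushes the student's logits toward the teacher's."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # teacher stays frozen

    student_logits = student(input_ids).logits

    # KL(teacher || student), computed per token over the vocabulary.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```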

Despite minimal initial differences in behavior (i.e. logits), the student model initialized from the full pretrained model (trained on both retain and forget data) relearns the “unlearned” capability much faster than either the oracle or a randomly initialized student on which we performed oracle distillation.

Matching oracle behavior doesn’t guarantee robust unlearning. Graph (a) shows the loss during distillation of the oracle into the pretrained and randomly initialized students. Graphs (b) and (c) show forget performance through retraining for Language and Arithmetic settings, respectively.

The faster relearning implies that finetuning a pretrained model to produce certain outputs is not sufficient for robust unlearning. The weights still encode the capability; the model has merely learned not to express it.

Imagine you’re an algebra student whose teacher pretends not to know algebra. Even though the teacher secretly knows the subject, you won’t learn it from them.

Similarly, you might expect that when distilling a model, only the expressed behaviors are transferred and the latent capabilities are not. We show this is true. Distilling a conventionally unlearned model into a randomly initialized model creates a student that is robustly incapable of the forget capability.

We call this method Unlearn-and-Distill, and it has two phases:

  1. Unlearn: Apply a standard unlearning method to a pretrained model (one such method is sketched after this list).
  2. Distill: Train a randomly initialized model to match the outputs of the unlearned model.
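
As a rough illustration of the first phase, here is a sketch of a gradient-difference (“GradDiff”) style unlearning update, one of the unlearning methods we test: descend on the retain loss while ascending on the forget loss. The batch format and weighting coefficient are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch of a GradDiff-style unlearning update: minimize loss on
# retain data while maximizing loss on forget data.
# Assumes a Hugging Face-style causal LM; batch format and weights are illustrative.

def graddiff_step(model, retain_ids, forget_ids, optimizer, forget_weight=1.0):
    """One unlearning step: preserve retain behavior, push away forget behavior."""
    retain_loss = model(retain_ids, labels=retain_ids).loss
    forget_loss = model(forget_ids, labels=forget_ids).loss

    # Gradient difference: descend on retain loss, ascend on forget loss.
    loss = retain_loss - forget_weight * forget_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```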

On both language and arithmetic tasks, we apply Unlearn-and-Distill using three different unlearning methods. We then apply relearning attacks to test robustness.

Method | Description
Unlearn | The original finetuned model
Unlearn-and-Distill | Our method
Gold Standard Data Filtering | A model trained with the forget data removed entirely

Domain | Retain evaluation | Forget evaluation
Language | CE loss on English text | CE loss on Korean text
Arithmetic | Accuracy on addition and subtraction problems | Accuracy on multiplication and division problems

Comparing unlearning methods. Each graph depicts the relearning trends on forget data for the initial unlearning method (Unlearn), Unlearn-and-Distill, and Data Filtering (Gold Standard). The rows separate the settings (language and arithmetic), and the columns separate the initial unlearning methods (GradDiff, Maxent, and RMU).

Across the board, Unlearn-and-Distill is more resistant to relearning than its unlearned-only counterpart. In some cases, it’s nearly as robust as the gold standard. This supports the idea that latent capabilities are present in the original model parameters but don’t transfer to fresh parameters during distillation. Occasionally, as with RMU on Arithmetic, the initial unlearning is poor and the distilled model relearns quickly. This shows that if suppression is too weak, the capability can still “leak through” and be reconstructed.

Interactive demo

In our demo, compare the answers of the “unlearned” model with those of the Unlearn-and-Distill model. The code and experimental framework are available on GitHub.

While data filtering requires training a model from scratch, Unlearn-and-Distill only requires some finetuning (unlearning) and then distillation. That’s reasonably cheap, but it can still take a fair chunk of compute. We develop a method to flexibly trade off between compute and robustness.

We introduce UNDO (Unlearn-Noise-Distill-on-Outputs), a generalization of our earlier Unlearn-and-Distill method. It’s a three-step process:

  1. Unlearn. Apply a standard unlearning method to a pretrained model to suppress the undesired behavior.
  2. Noise. Corrupt the weights of the suppressed model and initialize the student as this damaged model.
  3. Distill. Repair the damaged student by distilling from the unlearned model.

To inject noise, we use a shrink-and-perturb procedure that controls the amount of damage via a single noise parameter (higher means more damage). We then distill until the student recovers 95% of the teacher model’s retain performance.
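
As a rough sketch of the Noise step: one standard shrink-and-perturb parameterization shrinks the trained weights and mixes in a fresh random initialization. Whether or not this matches our exact hyperparameters, the alpha knob below plays the role of the damage parameter (0 keeps the unlearned model unchanged, 1 amounts to a full re-initialization).

```python
# Minimal sketch of the Noise step, assuming a shrink-and-perturb-style
# corruption: shrink the unlearned weights and mix in a fresh random init.
# `init_model_fn` and `alpha` are illustrative names, not our exact API.
import copy
import torch

def shrink_and_perturb(unlearned_model, init_model_fn, alpha):
    """Return a damaged copy of the unlearned model to use as the student."""
    student = copy.deepcopy(unlearned_model)
    fresh = init_model_fn()  # randomly initialized model, same architecture
    with torch.no_grad():
        for p_student, p_fresh in zip(student.parameters(), fresh.parameters()):
            # alpha = 0 keeps the unlearned weights; alpha = 1 is a full re-init.
            p_student.mul_(1 - alpha).add_(alpha * p_fresh)
    return student
```

More damage destroys more of the latent structure, which is why a more heavily noised student needs more distillation compute to recover the teacher’s retain performance.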

Unlearning robustness scales with the amount of perturbation. Panels (a) and (c) show the trade-off between robustness and compute. Panels (b) and (d) show relearning trends at fixed noise levels for the language and arithmetic settings, respectively.

In plots (a) and (c), as the noise level increases, training takes longer and the final model becomes more robust to relearning. Surprisingly, the relationship seems approximately linear. In plots (b) and (d), increasing the noise level increases robustness, slowing down relearning during relearning attacks. In other words, UNDO lets you trade off compute for robustness to relearning.

What we ultimately want from a robust unlearning method is to push the Pareto frontier of initial retain performance vs. forget performance. The frontier must hold up against an adversary who is trying to maximize forget performance given a certain compute budget.

Comparing unlearning methods across different adversarial strengths. We vary each method’s hyperparameters and plot their retain and relearned forget performance.

Column 1: Initial performance after unlearning but before adversarial attacks.
Column 2: Relearned forget performance after moderate relearning (40 steps).
Column 3: Performance after extensive relearning (500 steps).

In both settings, UNDO consistently dominates. Many of the methods get good initial retain-forget trade-offs, but rapidly degrade under adversarial pressure. In contrast, UNDO maintains more robust unlearning performance across all explored attacks and approaches the gold standard without requiring infeasible data labeling.

We also tested UNDO on the Weapons of Mass Destruction Proxy (WMDP) benchmark with Gemma-2-2b. UNDO consistently increased resilience to relearning, and it fell on the Pareto frontier of methods. However, compared to our synthetic arithmetic and language experiments, we were more constrained on data and compute relative to what was used in pretraining, and we struggled to fully recover retain performance in the distillation step. We expect that model developers will have enough resources to scale UNDO to these larger models.

Our method depends on the quality of the initial unlearning.

Poor initial suppression leads to smaller robustness gains

Even if the unlearned model does not demonstrate the forget capability, its logits may still carry enough information for the capability to transfer to a student during distillation. For real-world cases, can we reliably achieve suppression strong enough for UNDO to succeed? We think so.

Poor suppression might lead to slower distillation

Perhaps inconsistent or noisy behaviors make the target logit function harder to predict.

We only tested against relearning attacks

The unlearning literature considers these finetuning attacks to be the strongest kind, but it’s possible that UNDO is somehow vulnerable to other elicitation approaches.

The oracle matching experiment shows that logits do not fully reflect a model’s capabilities. We demonstrate path-dependent inductive biases: two models with nearly identical logit outputs can differ in how readily they relearn the forget-set information.

Distillation is not just a compression tool[1]

Distillation also changes safety-relevant model properties. For example, distillation makes unlearning robust: if you first unlearn and then distill into a randomly initialized student, the student keeps the desired behavior but loses the unwanted capability, even under relearning attacks. In a sense, distilling a suppressed model allows you to train only on “good” data.[2] UNDO decomposes “robust unlearning” into “choice of shallow-unlearning method” and “how distillation is performed.”

Therefore, developers can mix and match suppression methods. As suppression / shallow unlearning improves, so does UNDO!

We can trade off compute and robustness

Perturbing the weights damages the trained model’s capabilities as a function of the perturbation’s size. In this way, we can modulate both the compute needed to repair the model and the robustness gained.

Labs already distill production models, so Unlearn-and-Distill might be cheap and easy

Distillation typically happens before post-training, for several potential reasons. For example, by distilling first, labs can tweak post-training without re-distilling. It’s cheaper to post-train a smaller (distilled) model. There could also be optimizations that apply only when distilling the base pretrained model. The true cost of Unlearn-and-Distill or undo depends on how labs implement distillation.

A common concern is that sufficiently capable models might just rederive anything that was unlearned by using general reasoning ability, tools, or related knowledge. Several thoughts in response:

  1. In real life, it’s harder to reason about an area if you don’t have relevant experience or knowledge. After all, what are we unlearning if not domain-specific heuristics and knowledge access? Unlearning might not stop such smart systems from reasoning about biology, but it probably makes it harder.

  2. We think there is likely to be a window of time where models are dangerous enough to warrant unlearning, yet not capable enough to rederive the removed information.

  3. Making dangerous capabilities require more reasoning or tool use makes it easier to detect when they’re being used.

  4. In many cases, the specifics matter. For example, exactly what security measures are in place around a datacenter? Such details may be difficult to rederive.

  • Scaling Unlearn-and-Distill and UNDO to settings that are closer to practical applications. For example, performing full sweeps and Unlearn-and-Distill for the Weapons of Mass Destruction Proxy benchmark. More speculatively, UNDO’ing deception or sycophancy.

  • Running the distillation step by matching internal activations rather than logits (see the sketch after this list). This should be runnable from our codebase.

  • Exploring other techniques to damage the model, such as pruning or targeted damage.
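
To gesture at the activation-matching idea above, here is a minimal sketch that matches hidden states at a few layers with an MSE loss. This is a possible extension rather than something we implemented, and the layer indices are arbitrary.

```python
# Minimal sketch of activation-matching distillation (a possible extension,
# not our implemented method): match teacher and student hidden states at a
# few layers with an MSE loss. Assumes Hugging Face-style models that can
# return hidden states.
import torch
import torch.nn.functional as F

def activation_matching_loss(student, teacher, input_ids, layers=(4, 8, 12)):
    """MSE between student and teacher hidden states at selected layers."""
    with torch.no_grad():
        teacher_out = teacher(input_ids, output_hidden_states=True)
    student_out = student(input_ids, output_hidden_states=True)

    loss = 0.0
    for layer in layers:
        loss = loss + F.mse_loss(student_out.hidden_states[layer],
                                 teacher_out.hidden_states[layer])
    return loss / len(layers)
```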

UNDO is a viable approach for creating genuinely capability-limited models. While other methods merely suppress surface behaviors, our experiments indicate that UNDO prevents capabilities from being easily recovered. By folding unlearning into an already-common practice, we hope that this line of work helps make robust unlearning a reality.

Acknowledgments

We gratefully acknowledge:

  • Henrik Marklund for his insightful comments at various points of the project;
  • Vivek Hebbar, Andis Draguns, and Jake Mendel for helpful comments on our abstract;
  • Rishub Tamirisa for guidance in navigating WMDP benchmarking procedures;
  • Eric Easley for sharing valuable strategies for WMDP dataset cleaning and for productive discussions about potential improvements to our method;
  • Iftekhar Uddin and Laura Vaughan for facilitating access to computational resources and funding support;
  • MATS for enabling our collaboration.
Join Team Shard

Want to become more skilled at alignment research? Apply to work with us later this year in the next round of MATS.

@misc{lee2025distillationrobustifiesunlearning,
      title={Distillation Robustifies Unlearning}, 
      author={Bruce W. Lee and Addie Foote and Alex Infanger and Leni Shor and Harish Kamath and Jacob Goldman-Wetzler and Bryce Woodworth and Alex Cloud and Alexander Matt Turner},
      year={2025},
      eprint={2506.06278},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2506.06278}, 
}

Find out when I post more content: newsletter & RSS.

Thoughts? Email me at alex@turntrout.com

  1. Other work has used distillation in contexts other than model compression, including improving performance, dataset privacy protection, and continual learning.

  2. Distilling unlearned teacher logits isn’t always equivalent to just filtering out the forget data. In a TinyStories setting, we unlearned the “ability to tell stories involving trees” from a small teacher model. Then we distilled its logits into a small student model. However, the student was vulnerable to relearning attacks, which wouldn’t have happened if we had performed data filtering.